Handling Large Data

Greetings -

I am trying to process the data from an instrument that I sell. The data is saved as CSV files, one file per day. At normal sampling rates, these files are about 30 MB and contain about 10 samples per second * 60 sec/minute * 60 min/hour * 24 hours per day = 86400 carriage-return delimited records. I can get the files copied from the instrument’s SD card onto a computer with no problem.

Each record contains 8 comma-delimited fields (let’s say “A,B,C,D,E,F,G,H”) and I need to convert that into longitudinal arrays, let’s say A(86400), B(86400), and so forth.

Originally, I used TextInputStream.ReadLine on the computer file, then split() on the resulting line, appending the split results, field by field, to the longitudinal arrays. This was painfully slow.

Now, I am wondering if I can use MemoryBlocks to improve the speed. I am pretty certain that I can use TextInputStream.ReadAll to put an entire day-file into a MemoryBlock. But the next question is how to separate that into lines (records). Can I use a second MemoryBlock as the result of a split operation that takes the first MemoryBlock as an argument? If so, how do I “configure” that second MemoryBlock so that it behaves as a string array rather than just a string? I realize that “configure” is not a very good term here, but the question remains: how do I get the MemoryBlock to behave as an array of strings, or is that not possible?

I should add here that this is my first attempt ever at using MemoryBlock, so I may not have all of the concepts fully under control!

I might add that a typical use of this device may result in up to 50 day-files, so processing time adds up fast. I also recognize that string arrays may not be the most efficient approach in the long term and that a database warrants consideration.

Many thanks
Jim

James Wagner
Oregon Research Electronics

This can easily be sped up by using ReadAll instead. Have you done code profiling to see where the slowdowns are? Something like this should be trivial for today’s modern machines.

I made an example project and it parses 1.4 MB of 86400 CSV lines in ~200 ms: parselines.xojo_binary_project
(I just copied your A,B,C,D,E,F,G,H string; you could modify the test-set creation code to provide longer fields.)

I didn’t even want to test what ReadLine does. I would be curious how long “painfully slow” is and whether 200 ms is better than ReadLine.

Update: I just updated the project to write longer lines; with an 11 MB test file the runtime is 278.05 ms.
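
For a quick check without the full profiler, you can bracket the parsing code with System.Microseconds. A minimal sketch, assuming data already holds the ReadAll result (the variable names are placeholders, not anything from the example project):

var startTime As Double = System.Microseconds

var rows() As String = data.Split(&uA)   // split the whole file into lines
// ... per-line field splitting would go here ...

var elapsedMs As Double = (System.Microseconds - startTime) / 1000.0
System.DebugLog("Parsed " + rows.Count.ToString + " rows in " + Format(elapsedMs, "0.00") + " ms")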

“Painfully slow” was on the order of 10+ minutes per 30+ MB file!

I’ll take a look at your project. Thanks for the suggestion!

Jim

If the text is ASCII / UTF-8 then you should see a performance increase by using the B functions.

This is what I would try (a rough sketch follows the list):

  1. Load the entire file into a string
  2. Use SplitB to split the string into an array of lines
  3. Iterate the array splitting each line using SplitB
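
Roughly what that could look like, as a sketch only (untested; it assumes f is the day-file’s FolderItem and that the records really are CR-delimited, as described in the first post):

var tis As TextInputStream = TextInputStream.Open(f)
var data As String = tis.ReadAll
tis.Close

var lines() As String = SplitB(data, Chr(13))   // byte-wise split into records
var A() As Double                               // one longitudinal array per field
A.ResizeTo(lines.LastRowIndex)

for i As Integer = 0 to lines.LastRowIndex
  var fields() As String = SplitB(lines(i), ",")  // byte-wise split into the 8 fields
  if fields.LastRowIndex >= 7 then
    A(i) = Val(fields(0))   // apply gain/offset scaling here as needed
    // ... fill the other arrays (B, C, ...) the same way ...
  end if
next

If the files actually use CRLF or LF line endings, split on that byte sequence instead, or normalize first with ReplaceLineEndings.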

Also look at using pragmas to squeeze out some extra performance.

Yes, it is ASCII data. Will take a look at what Pragmas might do.

Thanks
Jim

The Pragmas should help reduce the loop overhead.

SplitB should be much faster than Split.

What pragma does it need?

There are pragmas to disable background tasks and things connected to the debugger.
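
For reference, the usual candidates look something like this at the top of the parsing method. They trade runtime safety checks for speed, so only use them in loops you trust:

// Turn off cooperative yielding and runtime safety checks inside the hot loop
#Pragma BackgroundTasks False
#Pragma BoundsChecking False
#Pragma NilObjectChecking False
#Pragma StackOverflowChecking False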

I still recommend using code profiling to learn where the actual slowdowns are, because in this case I don’t think it’s BackgroundTasks causing the problem.

2 Likes

As Tim already pointed out: use ReadAll. 30 MB is nothing to a computer with Gigabytes of RAM.

It really baffles me that people still insist on using ReadLine, as it is the biggest source of slowdowns I have ever encountered.

If you are using a Mac then you could use Xcode Instruments to profile your running code. It’s miles better than the Xojo profiler.

I have not used TextInputStream.ReadLine for several years now, thanks (I think) to Norman’s advice. I simply mentioned it as the starting point from some years ago.

I guess that I need to learn how to use the profiler. There are other steps involved beyond what I described in the first message. For example, several of the fields need to be scaled (with gain and offset constants) into doubles. That’s not a high-cost task, but having to repeat it so many times per file adds up. Then the data has to be filtered, DSP-style. All of these tasks could stand improvement, I think, and profiling should help here as well.

But the immediate question is how to get the data from CSV format into longitudinal arrays most efficiently, and I appreciate the comments and ideas. As an aside, it’s not “just” 30 MB; it’s 30 MB repeated possibly more than 50 times (which adds up to 1.5 GB!). That comes with the further task of caching that first-phase result to speed up second-phase graphical presentations for the user.

Thanks, everyone
Jim

Oregon Research Electronics

It’s hard to advise without seeing your code, but I doubt a MemoryBlock would help here.

Off the top of my head, it seems like this should go pretty quickly:

var tis as TextInputStream = TextInputStream.Open( f )
var data as string = tis.ReadAll( Encodings.UTF8 )
tis.Close

data = data.ReplaceLineEndings( &uA )
var rows() as string = data.Split( &uA )

var A() as Integer 
A.ResizeTo( rows.LastRowIndex )
// And other arrays as needed

for rowIndex as integer = 0 to rows.LastRowIndex
  var row as string = rows( rowIndex )
  var fields() as string = row.Split( "," )

  A( rowIndex ) = fields( 0 ).ToInteger
  // etc
next

If you’re already doing this, it would be helpful to see your code.

It appears that the text file should fit completely in memory. However, in cases where the file is too big to read all in one shot, I use the following read-as-chunks algorithm, which is still a lot faster than reading line by line and isn’t much more difficult to use:

Sub ReadChunks(txtIn as TextInputStream)
  Const chunkSize=5000
  dim residual As String = ""
  While not txtIn.EOF
    'The "residual" string will hold the last incomplete line of input text
    'and is prefixed to the front of subsequent reads.
    dim tArray() As string = split(residual+txtIn.Read(chunkSize),EndOfLine)
    dim nLines As Integer = UBound(tArray)-1
    residual=tArray(nLines+1) 'save incomplete line at end of current chunk for next read
    for i As integer = 0 to nLines
      'Process complete lines of text here
      'Eg., split input text into an array.
    Next
  Wend
  'If file doesn't end with an EndOfLine, then the residual string
  'will contain the final partial line which needs to be processed.
  if residual<>"" then
    'Process final incomplete line of text here
  end if
End Sub 

If your data is plain ASCII or UTF-8, as mentioned above, you should definitely use the binary version of Split (SplitB), which should be quite a bit faster. UTF-8 is great because it guarantees that all “low ASCII” (0-127) characters are the same as in ASCII.

I have it working and it is relatively fast. But it also reveals a totally different “issue” that I have had every time I have tried to do something like this: user feedback. This is another aspect of handling larger quantities of data.

I have a TextArea in which I was going to display “bullets”, but nothing is visible until it is all over. This leaves a big blank hole where the user has no idea whether things are still working. It takes about 5 seconds per file at this point, and will probably take a bit longer when I add the scaling and filtering. With 50 files, this makes an awfully long dead time. I try to do a TextArea1.Refresh when each file is finished, before it starts a new one, but that does not seem to help. I could easily stand to pause a half second between each file’s processing, but I don’t see a way to do it.

I recall that there were (still are?) specific code events where it would update the system service loop, if that is the correct term. I’m not sure, though, where to find that information or how to use it. I’ve tried threading and timers, but I seem to be missing something there, also.

Thanks for your help.

Jim

You have to use a thread AND you have to sleep the thread often enough that the app still appears responsive, but not too often or it will take a long time to finish. Don’t rely on Xojo switching context in a timely manner.
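
As a sketch only (ImportThread, ProgressTimer, ProcessOneFile, DayFiles and FilesDone are placeholder names, not anything from the actual project): the thread does the parsing and sleeps regularly, while a Timer reads a shared counter and updates TextArea1, since touching controls directly from a thread raises an exception in current Xojo versions.

// ImportThread.Run event:
Sub Run()
  for fileIndex As Integer = 0 to DayFiles.LastRowIndex  // DayFiles() is a FolderItem array property
    ProcessOneFile(DayFiles(fileIndex))                   // the parse/scale/filter work for one day-file
    FilesDone = FilesDone + 1                             // shared Integer property read by the Timer
    Me.Sleep(50)                                          // give the UI time to redraw between files
  next
End Sub

// ProgressTimer.Action event (Period around 250 ms, multiple mode):
Sub Action()
  TextArea1.Text = FilesDone.ToString + " of " + (DayFiles.LastRowIndex + 1).ToString + " files processed"
End Sub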

Ahhh, thank you!

Jim

I recently wrote a high-speed, general CSV parser (hoping to be able to release it as open source at some point), and the design gave me the opportunity to experiment with different block sizes. Interestingly, I found that simply reading everything at once (with any size file) wasn’t optimal. Rather, a read block size of 100 KB was about optimal. With 100 KB blocks, I got processing speeds of around 7.5 MB/sec regardless of file size (and I tested with files up to 624 MB). However, with ReadAll, performance fell off as file size increased; at 624 MB, performance was half that of the 100 KB buffer.
The parser uses splits roughly like Kem’s example, but also respects quoted CSV fields, including those that contain line breaks, per the CSV “standard” (I use that term loosely…). It uses a read-as-chunks approach similar to Robert’s example.

Thanks for that. I appreciate the observations and suggestions.

Jim Wagner

Do you have a test project + a sample file that you could share?