Reading a Binary Stream

confused…
I need to read a Binary Stream and analyze virtually byte by byte.
I had originally THOUGHT to use a TextInputStream and ReadAll… but that became a memory issue on large files
So then I switched to a BinaryStream… and figured I’d read it in chunks of 16k at a time and was perplexed as to why it was “slow”
so I made the chunks smaller… meaning it had to process MORE chunks to do the same file…
This was FASTER?! (huh?)

  • 1K and 2K chunks… 1 second for the entire file +/- very small fractions
  • 4K … 2 seconds
  • 8K … 3 seconds
  • 16K … 5 seconds

now those times are not “per chunk”; that is for the entire file… so at 1K it processed 48 chunks,
and at 16K it processed 3 chunks

So why does reading larger chunks end up with more processing time… when the total amount of data is the same in all instances… I thought having less disk I/O would be a good thing…

Would it be even better to not use BS.Read(chunksize)… and just go right to ReadUInt8?

Is there some internal Buffer that BinaryStream maintains that I am fighting against by creating my chunks of data (which are strings)?

The idea is to minimize the memory requirement, and increase the speed as fast as possible…
but at some point I do in fact need to look at each and every character (utf-8)

I think your chunk size is way too small to give meaningful results. Think in MB instead of KB.

how are the results not meaningful? the larger my “chunk”, the more time it takes to process the entire file

reading 48 x 1k … total processing time 1 second
reading 1 x 48k … total processing time 14 seconds

How large are your files and how do you process them? My app reads files that can be several GB, and as Tim Hare wrote, MB-sized blocks are better than KB-sized ones for reading speed. So very likely your processing is the culprit and not the reading of the data.

from 1k to 1gig… this is for an app that will process files… not my files, files I may never see…

And I’m not quite sure how much more I can say…

Dim s, c As String
For b As Integer = 1 To fileSize \ blockSize   // one pass per block (simplified; ignores any partial last block)
  s = bs.Read(blockSize)
  For x As Integer = 1 To s.Len
    c = Mid(s, x, 1)
    // do stuff with c
  Next x
Next b

simplified of course… but the point is
for the EXACT SAME FILE… the fewer blocks that are read (i.e. larger blocksize), the SLOWER the whole process is
not just the X loop… but the B loop…

basically my question is… would this be even faster

Dim c As String
For b As Integer = 1 To fileSize
  c = ChrB(bs.ReadUInt8)
  // do stuff with c
Next b

it’s not about the SIZE… it’s about how FAST it can process the size.

If you’re treating the characters as bytes, use MidB and LenB. Mid degrades with string size. That’s probably what you’re seeing. Or stuff it in a Memoryblock and use mb.byte().
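Roughly like this, just swapping the byte-indexed string functions into the same chunk loop (a sketch only; f, blockSize and the surrounding loop are assumptions based on the pseudocode above):

Dim bs As BinaryStream = BinaryStream.Open(f, False)   // f: FolderItem (assumed); False = read-only
While Not bs.EOF
  Dim s As String = bs.Read(blockSize)
  For x As Integer = 1 To LenB(s)
    Dim c As String = MidB(s, x, 1)   // byte x of the chunk; no scan from the start of the string
    // do stuff with c
  Next x
Wend
bs.Close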

Don’t use mid at all. Read your data in large blocks - say 1 MB. Then do a Split on the current block and do your processing on the single bytes of the array.
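If I’m reading that suggestion right, it would look something like this (a sketch; it assumes Split with an empty delimiter puts each character into its own element, and reuses bs/blockSize from above):

Dim s As String = bs.Read(blockSize)      // e.g. blockSize = 1048576 for 1 MB
Dim parts() As String = Split(s, "")      // one element per character
For i As Integer = 0 To parts.Ubound
  Dim c As String = parts(i)
  // do stuff with c
Next i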

my point is: why can I read 1000 1K blocks FASTER than I can read the single 1MB block you keep insisting is better?
my observations are counter to everything you both seem to be advocating

And I THINK it is because there is some type of backing store that a BS uses… making it meaningless and wasteful to MOVE it to a memoryblock or string and then extract it from there…
So I’m betting (and observation has yet to prove this… since I have not yet tested it)…
but I bet BS.ReadUInt8 will be the best overall

BS = bullshit? Or BS = binarystream?

No. You are seeing the mid and not the reading. Larger blocks are faster to read.

I tend to read files into a MemoryBlock and access the bytes of the MemoryBlock directly.
That’s really fast.
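For files that actually fit in memory, that pattern is roughly this (a sketch; f is an assumed FolderItem):

Dim bs As BinaryStream = BinaryStream.Open(f, False)
Dim mb As MemoryBlock = bs.Read(bs.Length)   // the whole file; the returned String converts to a MemoryBlock
bs.Close
For i As Integer = 0 To mb.Size - 1
  Dim b As Integer = mb.Byte(i)              // direct byte access, no string scanning
  // do stuff with b
Next i

For the multi-GB files mentioned above you would obviously keep reading in chunks rather than pulling the whole file in.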

But as everyone above is saying, retry your tests with the

c=mid(s,x,1)

line commented out to get a true idea of the speed differences.

Then, replace your ‘read into string’ with ‘read into a memoryblock’
and access the bytes with

c = m.byte(x)

Mid() cannot index the bytes of the string the way MidB() can. Mid() has to scan from the beginning of the string on every call, so each call gets slower as the string grows, and walking an entire string this way takes time that is quadratic in its length.

As a point of reference, going 100 times through a 100-byte string was 10x faster than 1 time through a 10000-byte string. At 100 x 200 bytes vs. 1 x 20000 bytes, the difference was closer to 15x.
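A quick way to see the Mid vs. MidB effect directly is to time the two loops over the same string with Microseconds (a sketch; the string size is arbitrary):

Dim s As String
Dim x As Integer
Dim c As String
Dim t0, midTime, midBTime As Double

For i As Integer = 1 To 20000
  s = s + "abcdefghij"             // build a 200,000-byte ASCII test string
Next i

t0 = Microseconds
For x = 1 To Len(s)
  c = Mid(s, x, 1)                 // scans from the start of the string on every call
Next x
midTime = Microseconds - t0

t0 = Microseconds
For x = 1 To LenB(s)
  c = MidB(s, x, 1)                // jumps straight to byte x
Next x
midBTime = Microseconds - t0

// midTime should come out far larger than midBTime, and the gap should widen as the string grows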