confused…
I need to read a Binary Stream and analyze virtually byte by byte.
I had THOUGHT originally to use a TextInputStream and READALL… but that became an issue with Memory on large files
So then I switched to a BinaryStream… and figured I’d read it in chunks of 16k at a time and was perplexed as to why it was “slow”
so I made the chunks smaller… meaning it had to process MORE chunks to do the same file…
This was FASTER?! (huh?)
1K and 2K chunks… 1 second for the entire file +/- very small fractions
4K … 2 seconds
8K … 3 seconds
16K … 5 seconds
now those times are not “per chunk” that is for the entire file… so at 1K it processed 48 chunks
and at 16K it processed 3 chunks
So why does reading larger chunks end up with more processing time… when the total amount of data is the same is all instances… I thought having less disk I/O would be a good think…
Would it be even better to not use BS.READ(chunksize)… and just go right to ReadUint8?
Is there some internal Buffer that BinaryStream maintains that I am fighting against by creating my chunks of data (which are strings)?
The idea is to minimize the memory requirement, and increase the speed as fast as possible…
but at some point I do in fact need to look at each and every character (utf-8)
How large are your files and how do you process them? My app can read files that can be several GBs and as Tim Hare wrote MBs are better than KBs for speed of reading. So very likely your processing is the culprit and not the reading of the data.
from 1k to 1gig… this is for an app that will process files… not my files, files I may never see…
And I’m not quite sure how much more I can say…
for b=0 to filesize/blocksize step blocksize
s=bs.read(blocksize)
for x=1 to s.len
c=mid(s,x,1)
do stuff
next x
next b
simplified of course… but the point is
for the EXACT SAVE FILE… the less blocks that are read (ie. larger blocksize) the SLOWER the whole process is
not just the X loop… but the B loop…
basically my question is… would this be even faster
for b=1 to filesize
c=chrb(bs.readUint8)
do stuff
next b
its not about the SIZE… its about how FAST it can process the size.
If you’re treating the characters as bytes, use MidB and LenB. Mid degrades with string size. That’s probably what you’re seeing. Or stuff it in a Memoryblock and use mb.byte().
Don’t use mid at all. Read your data in a large blocks - say 1 MB. Then do a split for the current block and do your processing on the single bytes of the array.
my point is why can I read 1000 1k blocks FASTER than I can read that 1 1MB block you keep insisting is better?
my observations are counter to everything you both seem to be advocating
And I THINK it is because there is sometype of backing store that a BS uses… making it meaningless and wasteful to MOVE it to a memory block or string and then extract it from there…
So I’m betting (and observation has yet to prove this…since I have not yet tested it)…
but I bet BS.READUINT8 will be the best overall
Mid() cannot index the bytes of the string like MibB() can. Mid() has to start at the beginning of the string each time, so it gets exponentially slower the longer the string is.
As a point of reference, going 100 times through a 100-byte string was 10x faster than 1 time through a 10000-byte string. At 100 x 200 bytes vs. 1 x 20000 bytes, the difference was closer to 15x.