Best amount to read/write to a file at once?

Hello,

One of my apps copies files (to put it simply). Until now, I’ve just read the whole file and written it to the other one. Obviously, when the file is large (e.g. above 3 GB), the app takes a lot of RAM (more than the actual amount read, it seems) and even fails often (the resulting file ends up near 170 MB); I’m now considering reading and writing in chunks of data, but I can’t seem to find an answer to these questions: how much is too much and how much is not enough?

By “not enough”, I mean I could read a 5 GB file byte by byte; it would take too much time, but it would work anyway. On the other hand, I don’t want to fill my app’s available RAM by reading too much. And if the user has little RAM (old computer, VM, etc.), maybe 100 MB per read could be too much. Should I compute the computer’s available RAM to decide how much to read per chunk? What is relevant here?

Regards.

Ideally you would read and write files in chunks that are a multiple of the device’s block size. The block size is the amount of data the device transfers in a single read or write to or from the media. For example, on Windows with an NTFS-formatted drive, the typical block size is 4096 bytes (4 KB). On the other hand, you don’t want the read amount to be too small, since that will force shifting between read and write operations too often.

Assuming a block size of 4096 bytes, reading 64K at a time means the device will make 16 reads before writing out the data in 16 writes, with each read or write being of optimum size. Increasing from 64K to 128K increases the number of reads and writes per chunk (32 each) while still maintaining the optimum size for each read and write.

That all being the case, there is really no “correct” answer except to try to determine the block size for the drive(s) and read and write in multiples of that.
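
As a rough illustration, here is a minimal Xojo sketch of rounding a chosen chunk size up to a multiple of the block size (it assumes the common 4096-byte value; the real figure depends on how the target volume is formatted):

[code]
' Round a desired chunk size up to the nearest multiple of the device block size.
' The 4096 value is an assumption; the actual block/cluster size depends on the volume.
Dim blockSize As Integer = 4096
Dim desiredChunk As Integer = 10 * 1024 * 1024   ' roughly 10 MB, as suggested later in this thread

Dim chunkSize As Integer = ((desiredChunk + blockSize - 1) \ blockSize) * blockSize
' Here chunkSize stays 10485760, because 10 MB is already a multiple of 4096.
[/code]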

Mac? Win? Linux?

You need to copy in blocks, e.g. 10 MB at a time.
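
Something along these lines would be a minimal sketch with BinaryStream (src and dst are assumed to be FolderItems you already have; error handling is left out):

[code]
' Copy src to dst in fixed-size chunks instead of reading the whole file at once.
Dim chunkSize As Integer = 10 * 1024 * 1024                   ' 10 MB per read/write
Dim chunk As String

Dim reader As BinaryStream = BinaryStream.Open(src, False)    ' open the source read-only
Dim writer As BinaryStream = BinaryStream.Create(dst, True)   ' create/overwrite the destination

While Not reader.EndOfFile
  chunk = reader.Read(chunkSize)                              ' the last chunk may be shorter
  writer.Write(chunk)
Wend

reader.Close
writer.Close
[/code]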

And are these methods also too slow? I’d guess they use OS API methods:
FolderItem.CopyTo
FolderItem.MoveTo

with a big file they may block

if it is that simple, what is wrong with .copyfileto or .copyto?

or shelling out and having the OS do it.

Are you amending the file before the copy?

[quote=481291:@Dale Arends]Ideally you would read and write files in chunks that are a multiple of the device’s block size. The block size is the amount of data the device transfers in a single read or write to or from the media. For example, on Windows with an NTFS-formatted drive, the typical block size is 4096 bytes (4 KB). On the other hand, you don’t want the read amount to be too small, since that will force shifting between read and write operations too often.

That all being the case, there is really no “correct” answer except to try to determine the block size for the drive(s) and read and write in multiples of that.[/quote]
Thanks. I’ll follow the rule and read in multiples of the block size. But perhaps there’s an in-between “acceptable” value?

All three, actually.

OK, I’ll take 10 MB as a reference. In your opinion, would this be the minimum amount per chunk?
Thanks.

[quote=481299:@Markus Rauch]And are these methods also too slow? I’d guess they use OS API methods:
FolderItem.CopyTo
FolderItem.MoveTo[/quote]
In my case, the destination isn’t a whole file (it’s inside a “bigger” one). Also, the data are being encrypted. Those methods wouldn’t work.
Thanks anyway.

I bet that for a modern computer, 10 MB is fine.

If you do 4 KB per read/write, the management overhead of the API calls may take longer than the read/write itself.

[quote=481309:@Jeff Tullin]if it is that simple, what is wrong with .copyfileto or .copyto?
or shelling out and having the OS do it.[/quote]
Only my description was simple, not the actual code being written. The destination is inside another file (which contains other data) and the data are being encrypted/shrunk. None of these 3 methods would work.

I’m reading it (currently as a whole file; reading by chunks is planned for later) and writing the data inside another file (like an archive).
Thanks.
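
For that kind of target, the chunk loop stays the same; only the writing side changes. A rough sketch, where containerFile is the existing “bigger” file and EncryptChunk is just a placeholder name for the app’s own encryption/compression routine:

[code]
' Append processed chunks to an existing container file instead of making a standalone copy.
Dim reader As BinaryStream = BinaryStream.Open(sourceFile, False)
Dim archive As BinaryStream = BinaryStream.Open(containerFile, True)   ' open read/write
archive.Position = archive.Length                                      ' seek to the end to append

Dim chunk As String
While Not reader.EndOfFile
  chunk = reader.Read(10 * 1024 * 1024)
  archive.Write(EncryptChunk(chunk))   ' EncryptChunk is hypothetical; add your own framing/metadata as needed
Wend

reader.Close
archive.Close
[/code]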

OK, I’ll take that and run my tests. I guess another difficulty is adjusting based on the file size (e.g. reading 10 MB at a time for a 10 GB file might be more problematic and lengthy than 5 MB for a 100 MB file). There are so many things to take into account…

That’s my question, put in better words. At approximately what amount would the API calls be most efficient? (assuming I can obtain the required driver/disk information)

Perhaps I’m just overthinking that?
Thank you.

Well, if you want to update a progress bar, the 10 MB is probably just fine as it is done in a fraction of a second.
So you can in each loop do a read, a write and update the bar.
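
In code, that loop could look like this (assuming it runs on the main thread, e.g. driven by a Timer, since Xojo threads can’t touch UI controls directly; reader, writer and chunkSize come from the earlier sketch, and ProgressBar1 is assumed to have its maximum set to 100):

[code]
' One read, one write, then a progress update per pass.
Dim chunk As String = reader.Read(chunkSize)
writer.Write(chunk)

' Position/Length give the bytes processed so far; scale them to a 0-100 bar.
ProgressBar1.Value = (reader.Position * 100) \ reader.Length
[/code]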

[quote=481341:@Christian Schmitz]I bet that for a modern computer, 10 MB is fine.

If you do 4 KB per read/write, the management overhead of the API calls may take longer than the read/write itself.[/quote]
Right, 10MB on most machines today is probably fine.

To be clear, I wasn’t advocating a 4KB operation. The hardware will read or write in whatever size blocks the drive formatting dictates. That shouldn’t be confused with the amount the program can or should read or write per operation. What it does mean is that if you choose to specify how much to read or write, rather than letting the system deal with it, specifying an amount that is not a multiple of the block size will cause the hardware to make either inefficient or extra accesses to the media. On today’s lightning-fast systems it isn’t generally a problem, but in the old days (yes, I remember them) it could be significant.

This could help, “minimizing the number of system calls and determining the correct sizes for the buffers is crucial to performance.”

SSD is different to HDD: “size of a block can vary between 256 KB and 4 MB”

In my case, for example, using 4 MB (2^20 × 4 bytes) for big files works out the same as 4 KB (2^10 × 4 bytes): there are fewer calls in Xojo code with 4 MB, but the system calls are the same, and the UI progress is still pretty good.

Thanks for your answers. I’ll choose 10 MB.

Dale, what do you call an “old” computer? My everyday computer is a 2008 Mac Pro (3,0), to which I’ve added up to 18 GB of RAM, USB 3.0 support, an SSD (on a PCIe card) and Mojave; although the native hardware can’t be upgraded (CPU, motherboard), I consider it “current”.

While converting my code to read by chunks of 10 MB instead of the whole file, a new question arose.
I’m actually compressing the data before writing them to the destination (various algorithms may be used). By using chunks (necessary for “huge” files), I must compress each chunk instead of a single, bigger one.
As an example, assume the source file is 20 MB. With the former method (reading the file as a whole), the compression would generally be more efficient than compressing 2 chunks of data separately (the more you split a string, the worse the compression works), but the former method doesn’t work for huge files.
The latter method almost makes compression useless.

Should I decide about a size limit where I choose from one method or the other, or what would be best?
Thanks.

[quote=481637:@Arnaud Nicolet]Thanks for your answers. I’ll choose 10 MB.

Dale, what do you call an “old” computer? My everyday computer is a 2008 Mac Pro (3,0), to which I’ve added up to 18 GB of RAM, USB 3.0 support, an SSD (on a PCIe card) and Mojave; although the native hardware can’t be upgraded (CPU, motherboard), I consider it “current”.[/quote]
I consider any system older than, oh, 1990 as old. Back then I was writing timing critical ProDOS 1.1.1 code for the Apple 2e. CPU cycle counting was important so that the reads and writes to the hardware were safely completed before the OS grabbed the data and returned it to the user so the drive could reuse its buffer.

I wonder if you could separate the read/write and compressing/decompressing. Could you do the compression of the whole file into a separate memoryblock and then write it out in chunks? Or reverse that for reads?

helpers

Ah, OK, so mine is a brand new model… :stuck_out_tongue:

Well, the problem is the same whatever I do: be it in the main app or with a helper, I must avoid reading whole files when they are big. Reading by chunks is fine, but if I then concatenate them in any way, the original problem of my process being limited by [allocated] RAM remains (and the data get truncated [a lot]). Keeping the data as chunks makes compression worthless.

Is this again a find-the-best-compromise value thing? (e.g. I concatenate “some chunks” and compress the result, repeating for the whole file. So I’d read 10 MB, concatenate as 100 MB, compress that? What would be better than 100 MB?).
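
That compromise could be sketched like this (CompressData is a hypothetical stand-in for whichever compression is actually used, archive is the destination stream opened as in the earlier sketch, and the 100 MB group size is just the figure from the example):

[code]
' Read 10 MB at a time, accumulate ~100 MB, then compress and write each group.
Dim readSize As Integer = 10 * 1024 * 1024      ' 10 MB per read
Dim groupSize As Integer = 100 * 1024 * 1024    ' compress in ~100 MB groups

Dim reader As BinaryStream = BinaryStream.Open(sourceFile, False)
Dim buffer As String

While Not reader.EndOfFile
  buffer = buffer + reader.Read(readSize)
  If buffer.Bytes >= groupSize Or reader.EndOfFile Then
    archive.Write(CompressData(buffer))         ' plus whatever framing the archive format needs
    buffer = ""
  End If
Wend

reader.Close
[/code]

The usual trade-off applies: bigger groups compress better but hold more data in RAM at once.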

Looks like it’s a never ending problem.

Thank you.