One of my apps copies files (to put it simply). Until now, I just read the whole file and wrote it to the destination. Obviously, when the file is large (e.g. above 3 GB), the app takes a lot of RAM (more than the actual amount read, it seems) and even fails often (the resulting file ends up around 170 MB). I’m now considering reading and writing in chunks of data, but I can’t seem to find an answer to these questions: how much is too much, and how much is not enough?
By not enough, I mean I could read a 5 GB file byte by byte; it would take far too long, but it would work. On the other hand, I don’t want to fill my app’s available RAM by reading too much at once. And if the user has little RAM (old computer, VM, etc.), maybe even 100 MB per read could be too much. Should I compute the computer’s available RAM to decide how much to read per chunk? What is relevant here?
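Whatever the ideal value turns out to be, a fixed, modest chunk size already bounds RAM use for any file size. A minimal Python sketch (the 1 MiB constant is a placeholder, not a recommendation):

```python
import os
import tempfile

CHUNK_SIZE = 1024 * 1024  # 1 MiB placeholder; the "right" value is the question here

def copy_in_chunks(src_path, dst_path, chunk_size=CHUNK_SIZE):
    """Copy src to dst reading at most chunk_size bytes at a time,
    so memory use stays bounded regardless of file size."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # empty read means end of file
                break
            dst.write(chunk)

# Demo on a small scratch file whose size is deliberately not a chunk multiple.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "src.bin")
    dst = os.path.join(d, "dst.bin")
    with open(src, "wb") as f:
        f.write(os.urandom(3 * CHUNK_SIZE + 123))
    copy_in_chunks(src, dst)
    assert os.path.getsize(dst) == 3 * CHUNK_SIZE + 123
```

The final partial read is handled naturally: `read()` simply returns fewer bytes on the last pass and an empty bytes object at end of file.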
Ideally you would read and write files in chunks that are a multiple of the device’s block size. The block size is the amount of data the device transfers in a single read or write to or from the media. For example, on Windows with an NTFS-formatted drive, the typical block size is 4096 bytes (4K). On the other hand, you don’t want the amount read to be too small, since that forces switching between read and write operations too often.
Assuming a block size of 4096 bytes, reading 64K at a time means the device will make 16 block-sized reads before writing out the data in 16 block-sized writes, with each read or write being of optimum size. Increasing from 64K to 128K doubles the data moved per operation (32 block-sized transfers each way) and halves the number of read/write calls, while still keeping every transfer at the optimum size.
That all being the case, there is really no “correct” answer except to try to determine the block size for the drive(s) and read and write in multiples of that.
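To illustrate the suggestion above, here is a hedged Python sketch: on POSIX systems `os.statvfs` reports the filesystem’s preferred I/O block size (`f_bsize`); the 4096 fallback is an assumption for platforms where that call isn’t available.

```python
import os

def device_block_size(path):
    """Best-effort query of the filesystem's preferred I/O block size.
    Falls back to 4096, a common default, where statvfs is unavailable."""
    try:
        return os.statvfs(path).f_bsize
    except (AttributeError, OSError):  # e.g. Windows has no os.statvfs
        return 4096

def round_down_to_block(requested, block_size):
    """Round a requested chunk size down to the nearest multiple of
    block_size, but never below a single block."""
    return max(block_size, (requested // block_size) * block_size)

# A nominal "10 MB" request becomes an exact block multiple:
print(round_down_to_block(10_000_000, 4096))  # 9998336 = 2441 blocks of 4096
```

Any chunk size the program picks can then be snapped to a block multiple before use.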
[quote=481291:@Dale Arends]Ideally you would read and write files in chunks that are a multiple of the device’s block size. The block size is the amount of data the device transfers in a single read or write to or from the media. For example, on Windows with an NTFS-formatted drive, the typical block size is 4096 bytes (4K). On the other hand, you don’t want the amount read to be too small, since that forces switching between read and write operations too often.
That all being the case, there is really no “correct” answer except to try to determine the block size for the drive(s) and read and write in multiples of that.[/quote]
Thanks. I’ll follow the rule and read in multiples of the block size. But perhaps there’s an acceptable in-between value?
[quote=481299:@Markus Rauch]And these methods are also too slow? I just guess they use OS API methods.[/quote]
In my case, the destination isn’t a whole file (it’s inside a bigger one). Also, the data are being encrypted. Those methods wouldn’t work.
[quote=481309:@Jeff Tullin]if it is that simple, what is wrong with .copyfileto or .copyto?
or shelling out and having the OS do it.[/quote]
Only my question was simple, not the actual code being written. The destination is inside another file (which contains other data) and the data are being encrypted/compressed. None of these three methods would work.
I’m reading it (currently as a whole file; later, by chunks) and writing the data inside another file (like an archive).
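For writing into a container file like that, appending chunk by chunk and recording where each entry landed keeps memory flat. A sketch under an assumed layout (the container format here is hypothetical; a real header/index would live elsewhere):

```python
import os
import tempfile

def append_entry(container_path, data_chunks):
    """Append one file's data (supplied as an iterable of chunks) to the
    end of a container file. Returns (offset, length) so an index kept
    elsewhere can locate the entry later."""
    with open(container_path, "ab") as container:
        offset = container.tell()
        length = 0
        for chunk in data_chunks:
            container.write(chunk)
            length += len(chunk)
    return offset, length

# Demo: two entries appended, then the second one read back via its offset.
with tempfile.TemporaryDirectory() as d:
    arc = os.path.join(d, "archive.bin")
    o1, l1 = append_entry(arc, [b"abc", b"def"])
    o2, l2 = append_entry(arc, [b"xyz"])
    with open(arc, "rb") as f:
        f.seek(o2)
        assert f.read(l2) == b"xyz"
```

Because each chunk is written as soon as it is read (or encrypted/compressed), only one chunk needs to be in memory at a time.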
OK, I’ll take that and run my tests. I guess another difficulty is adjusting to the file size (e.g. reading 10 MB chunks of a 10 GB file might be more problematic and lengthy than 5 MB chunks of a 100 MB file). There are so many things to take into account.
That’s my question, in better words: what is the approximate size at which API calls would be most efficient (assuming I can obtain the required drive/disk information)?
[quote=481341:@Christian Schmitz]I bet for modern computer 10 MB is fine.
if you do 4 KB per read/write, the overhead of management for API calls may take longer than read/write itself.[/quote]
Right, 10MB on most machines today is probably fine.
To be clear, I wasn’t advocating a 4KB operation. The hardware will read or write in whatever block size the drive formatting dictates. That shouldn’t be confused with the amount the program can or should read or write per operation. What it does mean is that if you choose to specify how much to read or write, rather than letting the system deal with it, specifying an amount that is not a multiple of the block size will cause the hardware to make either inefficient or extra accesses to the media. On today’s lightning-fast systems it isn’t generally a problem, but in the old days (yes, I remember them) it could be significant.
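Since the sweet spot depends on the machine, one pragmatic option is to measure a few candidate sizes on a scratch file and keep the fastest. A rough Python sketch (the candidate sizes and the 8 MiB test file are arbitrary choices, and real-world results will be noisy because of OS caching):

```python
import os
import tempfile
import time

def time_copy(src, dst, chunk_size):
    """Time one chunked copy of src to dst with the given chunk size."""
    start = time.perf_counter()
    with open(src, "rb") as fi, open(dst, "wb") as fo:
        while True:
            buf = fi.read(chunk_size)
            if not buf:
                break
            fo.write(buf)
    return time.perf_counter() - start

# Measure a few block-size multiples on a scratch file and keep the fastest.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "src.bin")
    with open(src, "wb") as f:
        f.write(os.urandom(8 * 1024 * 1024))  # 8 MiB of test data
    dst = os.path.join(d, "dst.bin")
    results = {size: time_copy(src, dst, size)
               for size in (4096, 65536, 1048576, 10 * 1048576)}
    best = min(results, key=results.get)
```

Running the calibration once at startup (or once per destination volume) would let the app adapt to old and new machines alike without hard-coding a value.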
Dale, what do you call an old computer? My everyday computer is a 2008 Mac Pro (3,0), to which I’ve added up to 18 GB of RAM, USB 3.0 support, an SSD (using PCIe) and Mojave; although the native hardware can’t be upgraded (CPU, motherboard), I consider it current.
While converting my code to read by chunks of 10 MB instead of the whole file, a new question arose.
I’m actually compressing the data before writing them to the destination (various algorithms may be used). By using chunks (necessary for huge files), I must compress each chunk separately instead of a single, bigger one.
As an example, assume the source file is 20 MB. With the former method, reading the file as a whole, I guess the compression would generally be more efficient than compressing two chunks of data separately (the more you divide a string, the less the compression achieves), but the former method wouldn’t work for huge files.
The latter method almost makes compressing useless.
Should I decide on a size threshold where I switch from one method to the other, or what would be best?
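One thing worth checking before picking a threshold: if a compression library offers a streaming interface, chunking doesn’t have to hurt the ratio at all. Python’s `zlib.compressobj`, for instance, keeps its dictionary across calls, so feeding chunks one at a time compresses almost as well as one big pass. A hedged sketch (whether your actual algorithms support streaming is an open assumption):

```python
import zlib

def compress_chunks(chunks, level=6):
    """Compress an iterable of byte chunks with a single streaming
    compressor, so the ratio stays close to compressing the whole
    input at once. Yields compressed pieces as they become available."""
    comp = zlib.compressobj(level)
    for chunk in chunks:
        out = comp.compress(chunk)
        if out:
            yield out
    yield comp.flush()  # emit whatever the compressor is still holding

# Repetitive data compresses well even when fed in small pieces.
data = [b"hello world " * 1000] * 10
compressed = b"".join(compress_chunks(data))
assert zlib.decompress(compressed) == b"".join(data)
assert len(compressed) < len(b"".join(data))
```

With a streaming compressor, the memory cost stays one chunk at a time, and the size-threshold question mostly disappears.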
[quote=481637:@Arnaud Nicolet]Thanks for your answers. I’ll choose 10 MB.
Dale, what do you call an old computer? My everyday computer is a 2008 Mac Pro (3,0), to which I’ve added up to 18 GB of RAM, USB 3.0 support, an SSD (using PCIe) and Mojave; although the native hardware can’t be upgraded (CPU, motherboard), I consider it current.[/quote]
I consider any system older than, oh, 1990 as old. Back then I was writing timing critical ProDOS 1.1.1 code for the Apple 2e. CPU cycle counting was important so that the reads and writes to the hardware were safely completed before the OS grabbed the data and returned it to the user so the drive could reuse its buffer.
I wonder if you could separate the read/write and compressing/decompressing. Could you do the compression of the whole file into a separate memoryblock and then write it out in chunks? Or reverse that for reads?
Well, the problem is the same whatever I do: be it in the main app or in a helper, I must avoid reading whole files when they are big. Reading by chunks is fine, but if I then concatenate them in any way, the original problem of my process being limited by [allocated] RAM remains (and the data get truncated [a lot]). Keeping the data as separate chunks makes compression nearly worthless.
Is this again a find-the-best-compromise-value thing? (e.g. I concatenate some chunks and compress the result, repeating for the whole file; so I’d read 10 MB at a time, accumulate 100 MB, and compress that. Would something other than 100 MB be better?)
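That compromise can be sketched directly. In this Python sketch, the 10 MB read size comes from the thread, while the 100 MB batch size is exactly the open question (an assumption here, to be tuned); each batch is compressed independently, so RAM stays bounded by the batch size:

```python
import zlib

READ_CHUNK = 10 * 1024 * 1024    # 10 MB per read (value from the thread)
BATCH_SIZE = 100 * 1024 * 1024   # compress once ~100 MB has accumulated (assumption)

def compress_file_in_batches(read_chunk_iter, batch_size=BATCH_SIZE):
    """Accumulate small read chunks into a larger batch, compress each
    batch independently, and yield the compressed batches. batch_size
    bounds RAM use; each batch can also be decompressed on its own."""
    buf = bytearray()
    for chunk in read_chunk_iter:
        buf += chunk
        if len(buf) >= batch_size:
            yield zlib.compress(bytes(buf))
            buf.clear()
    if buf:  # compress whatever remains after the last full batch
        yield zlib.compress(bytes(buf))
```

The larger the batch, the closer the ratio gets to whole-file compression, at the cost of more RAM per batch; that trade-off is the knob to experiment with.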