About 8 years ago I built an app for in-house use. Some of our clients require that we deliver our final deliverables to them (film scans and related files) packaged according to the Library of Congress BagIt specification. Easy enough - it's just a structured folder plus an MD5 or SHA hash of each file written to a text manifest. The purpose of BagIt is to provide a standardized way to verify the file integrity of stuff that's (typically) put on an LTO tape or other long-term storage system.
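For anyone who hasn't seen one, a minimal bag looks roughly like this (the tag file names come from the spec; the payload file names and hash here are just made-up examples):

```
my_bag/
  bagit.txt          (BagIt version + tag file encoding declaration)
  manifest-md5.txt   (one line per payload file: <md5 hash>  <relative path>)
  data/              (the actual deliverables)
    scan_0001.dpx
    scan_0002.dpx

a manifest-md5.txt line looks like:
49f68a5c8493ec2c0bf489821c21fc3b  data/scan_0001.dpx
```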
The app I wrote has some annoying bugs and needs a refresh. It was made so long ago that I'm considering redoing it completely. The way it works now, you point it at a folder and it walks that folder's directory tree, hashing every file. It creates a text manifest containing the hash and the relative path of each file. I use threads so that multiple checksums can run at the same time: each thread launches the OS-native MD5 tool via a shell command, so several instances can run simultaneously. Xojo threads are involved only insofar as they manage the calls to the OS-level md5 shell command and report back when done. None of the heavy lifting is done in the app itself; it just manages the shell commands. My app is about twice as fast as the freely available tools out there (a big deal when you're talking 4 hours vs 8), many of which are buggy and unreliable and typically can't process more than one file at a time.
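The current design boils down to this pattern (a minimal sketch, assuming the macOS md5 tool; `file` is a FolderItem for the file being hashed, and MD5Finished is a hypothetical method of mine that reads the Shell's Result and records the hash):

```xojo
' One asynchronous Shell per concurrent checksum; Xojo just dispatches
' commands and collects results while the OS md5 tool does the real work.
' Note: keep a reference to sh (e.g. in a property) so it isn't
' destroyed while the command is still running.
Var sh As New Shell
sh.ExecuteMode = Shell.ExecuteModes.Asynchronous
AddHandler sh.Completed, AddressOf MD5Finished   ' MD5Finished(sender As Shell) parses sender.Result
sh.Execute("md5 -q " + file.ShellPath)           ' -q prints just the 32-char hash
```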
So I'm thinking that while I'm overhauling it, it might be worth revisiting how I do this, and I'm considering using preemptive threading and generating the hashes in Xojo itself instead of farming them out to multiple shells, to take advantage of multi-core machines.
The main consideration here is that the files being hashed vary enormously in size. On one end you might have a folder of 20,000 or more sequentially numbered images at under 200 MB per image; on the other, MOV or MXF files in excess of 1 TB, or some combination of the two. In our case, a bag might be anywhere from about 10 TB if it's going on an LTO-8 tape to 30+ TB if it's going on hard drives and the project is particularly large.
Is it worth trying to do this entirely within Xojo using its built-in tools (obviously reading each file in chunks and feeding them to a digest, along the lines of what @Kem_Tekinay described here), or should I keep relying on OS-native shell commands?
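Concretely, I'm picturing something like this per file (a rough sketch of the chunked-digest idea using Xojo's MD5Digest class; the 8 MB chunk size is an arbitrary guess):

```xojo
' Hash one file in fixed-size chunks so memory use stays flat
' even on multi-TB files.
Function HashFile(f As FolderItem) As String
  Const kChunk = 8388608   ' 8 MB per read; arbitrary, worth tuning

  Var stream As BinaryStream = BinaryStream.Open(f, False)   ' read-only
  Var digest As New MD5Digest

  While Not stream.EndOfFile
    digest.Process(stream.Read(kChunk))   ' chunks must be fed in file order
  Wend
  stream.Close

  ' MD5Digest.Value is the raw 16-byte digest; hex-encode it for the manifest.
  Return EncodeHex(digest.Value).Lowercase
End Function
```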
I feel like with preemptive threading and a digest-style hash, I could write the app so that when it hits a very large file it splits the work among available cores to get through it faster. Right now, with a mix of small and large files, a single very large file can take 10-20 minutes to hash on one core. One thing I'm unsure of: since MD5 chains each block into the next, the chunks of a single file presumably have to be fed to the digest in order, so splitting one file across cores may not even be possible for a standard MD5; at minimum, though, several preemptive threads could each be chewing through a whole file at once. What's described in the link above predates preemptive threading in Xojo, and this seems like a good use case for it.
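Even if intra-file splitting turns out to be a dead end, file-level parallelism would look something like this (a sketch assuming preemptive threads can safely use BinaryStream and MD5Digest, which I'd need to verify; HashWorker is a hypothetical method that pulls paths off a CriticalSection-guarded queue and calls the HashFile function above):

```xojo
' Launch one preemptive worker per core so big files on separate
' threads actually run on separate cores instead of cooperatively.
Const kWorkers = 8   ' assumption: derive this from the machine's core count

Var workers() As Thread   ' keep references (e.g. in a property) so threads aren't destroyed early
For i As Integer = 1 To kWorkers
  Var worker As New Thread
  worker.Type = Thread.Types.Preemptive   ' requires a Xojo version with preemptive threads
  AddHandler worker.Run, AddressOf HashWorker
  worker.Start
  workers.Add(worker)
Next
```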
Or is there an external library I could hand a file path to and have it hash the file directly, ideally quickly and in a way that takes advantage of multiple cores?
(And please, I don't want to hear about MD5 being insecure or outdated. It's just for checking file integrity in what is effectively a closed system, and it's what our clients request. Plus the tool needs to run in reverse as well, to verify previously bagged sets, many of which were done with MD5.)