Running checksum hashes on large files

About 8 years ago I built an app for in-house use. Some of our clients require that we deliver our final deliverables to them (film scans and related files) packaged according to the Library of Congress BagIt specification. Easy enough - it’s just a structured folder and a text manifest listing an MD5 or SHA hash for each file. The purpose of BagIt is to provide a standardized way to verify the integrity of files that are (typically) put on an LTO tape or other long-term storage system.
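For reference, a manifest is just a plain text file (manifest-md5.txt for MD5) with one line per file: the hash, whitespace, then the path relative to the bag. Both the hash and the filename below are made-up examples:

0a1b2c3d4e5f60718293a4b5c6d7e8f9  data/scans/reel_01/frame_000001.dpx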

The app I wrote has some annoying bugs and needs a refresh. It was made so long ago that I’m thinking of completely redoing it. The way it works now, you point it at a folder and it walks that folder’s directory tree, hashing every file and writing a text manifest containing the hash and relative path of each one. I use threads so that multiple checksums can run at the same time: each thread launches the OS-native MD5 tool via a shell command, so several instances run simultaneously. Xojo threads are involved only insofar as they manage the calls to the OS-level md5 shell command and report back when done; none of the heavy lifting happens in the app itself, it’s just managing the shell commands. My app is about twice as fast as the freely available tools out there (a big deal when you’re talking 4 hours vs 8 hours), many of which are buggy, unreliable, and typically unable to process more than one file at a time.
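For context, the core of that per-file approach boils down to something like this minimal sketch (not the actual app code; it assumes macOS’s /sbin/md5 and its "MD5 (path) = hash" output format):

Function HashWithShell(f As FolderItem) As String
  // Launch the OS md5 tool for one file and parse the hash out of its output.
  Var sh As New Shell
  sh.ExecuteMode = Shell.ExecuteModes.Synchronous
  sh.Execute "/sbin/md5 " + f.ShellPath

  If sh.ExitCode <> 0 Then
    Return "" // the real app would report the failure
  End If

  // Output looks like: MD5 (/path/to/file) = <32 hex digits>
  Var parts() As String = sh.Result.Trim.Split(" = ")
  Return parts(parts.LastIndex)
End Function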

So I’m thinking that while I’m overhauling it, it might be worth revisiting how I do this. I’m considering using preemptive threading and generating the hashes in Xojo itself, instead of sending the work out to multiple shells, to take advantage of multi-core machines.
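As a rough sketch of the in-Xojo route (assuming Xojo 2023r2 or later for preemptive threads, one thread per file, and HashThreadRun / HashDone being hypothetical methods on a window or controller that hand results back without touching the UI):

// One file hashed per preemptive thread; several of these can run in parallel.
// In real code, keep a reference to each thread (e.g. in an array property).
Var t As New Thread
t.Type = Thread.Types.Preemptive
AddHandler t.Run, AddressOf HashThreadRun
t.Start

Sub HashThreadRun(sender As Thread)
  Var f As New FolderItem( "/Path/to/file", FolderItem.PathModes.Native )
  Var bs As BinaryStream = BinaryStream.Open( f )
  Var digest As New MD5Digest
  While Not bs.EndOfFile
    digest.Process bs.Read( 1000000 ) // ~1 MB chunks
  Wend
  bs.Close
  HashDone f, EncodeHex( digest.Value ).Lowercase // hypothetical callback
End Sub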

The main consideration here is that the files being hashed vary widely in size. On one hand you might have a folder with 20,000 or more sequentially numbered images, each under 200 MB. Or you might have MOV or MXF files that are in excess of 1 TB in size. Or some combination of the two. In our case, a bag might be anywhere from about 10 TB if it’s going on an LTO-8 tape, to 30+ TB if it’s going on hard drives and is a particularly large project.

Is it worth trying to do this entirely within Xojo using the built-in tools (obviously reading each file in chunks and feeding them to a digest, along the lines of what @Kem_Tekinay described here), or should I keep relying on OS-native shell commands?

I feel like with preemptive threading and a digest-style hash, I could write the app so that when it hits a very large file, it splits the chunks up among available cores to get through it faster. Right now, if you have a mix of small and large files, it can take 10-20 minutes to get through a single very large file on a single core. I would imagine that if I could spread the chunks across several cores, I could process that file faster. What’s described in the link above predates preemptive threading in Xojo, and this seems like a good use case for it.

Or is there an external library I can send a file path to and have it handle that file directly, ideally in a fast way that takes advantage of multiple cores?

(And please, I don’t want to hear about MD5 being insecure or outdated. It’s just for checking file integrity in what is effectively a closed system, and it’s what our clients request. Plus the tool needs to run in reverse as well, to verify a previously bagged set, many of which were done as MD5.)

I’d keep it the way it is unless you have a compelling need to rework it, such as making it work in the macOS sandbox. There is a rule, “never rewrite your software”, that sounds like it applies here. Refactoring parts is OK, but replacing it entirely tends to be a bad idea.

From a high level, I recommend fixing what needs to be fixed, but leave the rest alone.

Well, the whole thing was slapped together pretty quickly. It works, but I’ve been wanting to redo it for a while. And if I can make it faster, which I think preemptive threading will do if I’m right about it, that’s a big deal.

I have my doubts it’d be faster.

I don’t think you can split the hashing of a single file among multiple threads, and you’d have to time how fast Xojo can run an MD5 digest on a large file in chunks vs. using a command line tool like md5.

If it turns out the latter is faster (as I suspect it will be), you can still take advantage of multiple cores in a single app by setting up an array of Shells in Asynchronous mode. Each Shell would pull the next file off a stack, run it, report its findings, then grab the next file. You can monitor progress via a Timer.

The nice thing about this approach is that you won’t have to deal with any issues that preemptive threads might present, but you’ll still get the benefit of concurrent processing.
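A minimal sketch of that Shell-pool setup might look like this, assuming macOS’s /sbin/md5, with PendingFiles() As FolderItem and Workers() As Shell as properties on a window or controller (Xojo has no closures, so the handlers need shared state), and RecordResult as a hypothetical helper that parses the output and writes the manifest line; the progress Timer is left out:

Sub StartHashing()
  Const kWorkerCount = 6
  For i As Integer = 1 To kWorkerCount
    Var sh As New Shell
    sh.ExecuteMode = Shell.ExecuteModes.Asynchronous
    AddHandler sh.Completed, AddressOf ShellCompleted
    Workers.Add sh
    StartNextFile sh // prime each worker with its first file
  Next
End Sub

Sub StartNextFile(sh As Shell)
  If PendingFiles.Count > 0 Then
    Var f As FolderItem = PendingFiles.Pop
    sh.Execute "/sbin/md5 " + f.ShellPath
  End If
End Sub

Sub ShellCompleted(sender As Shell)
  RecordResult sender.ReadAll // parse the hash, write the manifest line
  StartNextFile sender        // then grab the next file off the stack
End Sub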

Or, as Thom said, fix what needs to be fixed and leave the rest.

There is a lot of overhead in spinning up Shells (ask @Sam_Rowlands for details), so just as an experiment, I would be curious to find out if, and where, the threshold lies for reading in and hashing smaller files within the framework.

The Shells can be set to Interactive and that should mitigate such overhead, if it’s a factor.
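Something like this, if I understand the suggestion (a sketch; whether it saves enough to matter would need testing):

// One long-lived Interactive Shell, reused for every file instead of a new process each time.
Var sh As New Shell
sh.ExecuteMode = Shell.ExecuteModes.Interactive
sh.Execute "/bin/sh" // keep one shell session open for the whole run
// Then, for each file, feed the session a command; results arrive via DataAvailable.
sh.Write "/sbin/md5 " + f.ShellPath + EndOfLine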

It is a very good idea!

I’m just marginally interested in these ideas because I use a system of MD5ing to determine if files need to be uploaded.

“Watching from the sidelines” as it were.
Interested in the results, but without enough of a need to invest time into research.

MBS has HashFileMBS(file)

https://www.monkeybreadsoftware.net/encryptionandhash-md5digestmbs-shared-method.shtml

It runs on a preemptive thread already, so it might fit into your existing workflow. No idea about speed. I used the SHA-256 version to hash some 24 GB video files and it took around 5-10 seconds, though that was running off a local SSD.
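If the signature is as the docs describe, usage would be roughly the following (shown from memory, so verify against the MBS documentation; whether the result comes back as hex or raw bytes is worth checking):

Var f As New FolderItem( "/Path/to/file", FolderItem.PathModes.Native )
Var hash As String = MD5DigestMBS.HashFileMBS( f ) // may need EncodeHex depending on the plugin version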

The md5 command line tool can take multiple files, so here is a refinement I might add:

Gather all the files into an array of FolderItem, then sort by size. Set some threshold, like say, 1 MB.

Each Shell would keep pulling files off the stack until the cumulative size exceeds that threshold. It would then process all the files at once, parse the results, then do it again.

Edit: I’d go by size threshold and count, so no more than 1 MB of cumulative file size, and no more than, say, 10 files at a time.
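A rough sketch of that batching logic, assuming macOS’s /sbin/md5, where the pending array is the stack of small files and the returned command string goes to one of the async Shells (thresholds as above):

Function NextBatchCommand(pending() As FolderItem) As String
  // Pull files off the stack until we hit either the size cap or the count cap,
  // then hand the whole batch to a single md5 invocation.
  Const kMaxBatchBytes = 1000000 // ~1 MB cumulative
  Const kMaxBatchCount = 10

  Var batchBytes As Int64
  Var batchCount As Integer
  Var cmd As String = "/sbin/md5"

  While pending.Count > 0 And batchBytes < kMaxBatchBytes And batchCount < kMaxBatchCount
    Var f As FolderItem = pending.Pop
    batchBytes = batchBytes + f.Length
    batchCount = batchCount + 1
    cmd = cmd + " " + f.ShellPath
  Wend

  Return cmd // md5 prints one "MD5 (path) = hash" line per file
End Function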

This is essentially what I do now. If you have an image sequence, say 20k files in a folder, it absolutely flies through them this way, and I’ve had no issues with overhead from launching all those shells. We mostly run this app on an old Xeon Mac Pro and it can process 8-10 files per second, depending on the resolution of those files (you start getting into I/O bandwidth issues with bigger files), using 6 of the 8 cores in the machine.

It’s the single very large files that kind of suck to process, because once the app hits a few of those it gets bogged down for 10+ minutes per file. With each core working on a single file, sometimes it just looks like it’s hung, because all of the cores are tied up on these big files. I’m just looking for a way to get through those faster, and the idea of splitting a file into chunks and sending those out for processing in separate threads seemed like a way to get through a very large file faster.

Unless I misunderstood what you’re trying to do, you can’t hash a single file in chunks.

Edit: That is, where each chunk is handled by a different process.

I just ran a test on a 1.54 GB file using the command line, MD5Digest and MD5DigestMBS. The command line was the fastest at about 1.92s, followed by MBS at around 1.96s, followed by the native at around 2.3s.

For MBS and native, I read the file in ~ 1 MB chunks. For the command-line, I just ran it in a terminal outside Xojo.

All the results matched.

This is on macOS 15.3.1 with an M3 Max, compiled with Aggressive optimization and the time-saving pragmas in place.

#if not DebugBuild
  #pragma BackgroundTasks False
#endif
#pragma NilObjectChecking False
#pragma StackOverflowChecking False
#pragma BoundsChecking False

Var msg As String
Var sw As New Stopwatch_MTC
sw.Start

Var m As New MD5DigestMBS // or MD5Digest for the native class

Var f As New FolderItem( "/Path/to/file", FolderItem.PathModes.Shell )

Var bs As BinaryStream = BinaryStream.Open( f )

// Feed the file to the digest in ~1 MB chunks
While Not bs.EndOfFile
  m.Process bs.Read( 1000000 )
Wend

bs.Close

sw.Stop
msg = Format( sw.ElapsedMicroseconds, "#," ) + " microsecs"
AddToResult msg
AddToResult EncodeHex( m.Value ).Lowercase // hex string of the hash

Ahh, OK. Now that I look at this more closely I see what you mean, since MD5Digest is just a class you’re passing the chunks to.

openssl md5 comes in at about 1.88s, making it the fastest of all.

Is that an external library or a shell command?

I like your idea of using multiple files, so I’m going to look into that as well. But part of what’s nice about doing each file separately is that I can track exactly where I am in the process. Sending batches of files to the same shell command would mess with the accuracy of that, or at least the smoothness of the progress bar as files finish. But that’s probably not a big enough deal to get hung up on.

Which?

Sorry, meant to quote that!

openssl md5

openssl is a utility that comes with the Mac and (I think) most, if not all, Linux distributions. There is also a Windows version, but I don’t know if it comes with the OS.

It handles all manner of hashing and encryption, and is universally relied upon to handle such functions.

On the Mac, it’s /usr/bin/openssl. You can see its various subcommands in Terminal with:

openssl help
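For a single file, the call and output look like this (the hash below is just a placeholder):

openssl md5 /path/to/file
MD5(/path/to/file)= 0a1b2c3d4e5f60718293a4b5c6d7e8f9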

It doesn’t. I was just looking at that. From what I’m reading, it requires installing the C++ redistributable and the OpenSSL Windows packages.