Running encryption hashes on large files

I could live with that, probably, since this is for internal use, but it’d be nice to have a library I can call from Xojo that’s fast and lets me keep everything self-contained. I’m considering moving this over to a Windows machine, so I want to make sure I have cross-platform ability with minimal fuss.

If you stick with a Mac, or Linux, I have a pretty fast solution for you:

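find /path/to/dir -type f -print0 | xargs -0 -n 1 -P 6 openssl md5

This will run up to 6 openssl processes at once (-P 6), one file per process (-n 1), to hash every file within the given directory. Set -P to the number of cores you want to keep busy.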

And this regular expression will pick out the file path and hash for each result:

^MD5\((.*)\)= (.*)$
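To see what the regular expression does, here is a minimal sketch that runs it over a single sample line with sed. The function name `format_md5_line` and the sample path are made up for illustration, and the sketch assumes openssl prints its result in the `MD5(path)= hash` form.

```shell
# Reformat `openssl md5` output lines from "MD5(path)= hash" to "path: hash".
# format_md5_line is a hypothetical helper name, not part of any tool.
format_md5_line() {
  sed -E 's#^MD5\((.*)\)= (.*)$#\1: \2#'
}

# Sample input line; the path is illustrative only.
printf '%s\n' 'MD5(/path/to/file1)= d41d8cd98f00b204e9800998ecf8427e' | format_md5_line
```

The first capture group is the file path and the second is the hash, so the replacement `\1: \2` just swaps the wrapper for a colon.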

Finally, this will create a file “out.txt” with content like:

/path/to/file1: hash1
/path/to/file2: hash2

The shell command:

find ~ -type f -print0 | xargs -0 -n 1 -P 6 openssl md5 | grep MD5 | sed -E 's#^MD5\((.*)\)= (.*)$#\1: \2#' > out.txt

In this case, Xojo may not be the most efficient solution, but anyway, you can stick parts of this into a Shell and parse the results if desired.

Thanks. I’m sticking with Xojo to at least manage things. The BagIt spec requires specific formatting on the manifest file, and the creation of some other sidecar files. It’s probably doable with some shell scripts, but because there are variables like which hash algorithm you use, the list of excluded file types/names, and custom metadata entry, it’s just easier to manage with a GUI. And I want to add some other features as well, like a batch interface for doing multiple jobs sequentially, and a simple tool for verifying received packages.
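For reference, a BagIt payload manifest puts the checksum first on each line, followed by whitespace and the file path relative to the bag's base directory. A minimal sketch of that reordering, assuming openssl prints `MD5(path)= hash` and that the command is run from the bag's top-level directory so paths already start with data/ (the function name `to_manifest_line` is hypothetical):

```shell
# Convert `openssl md5` output lines into manifest-style lines: "checksum  path".
# to_manifest_line is a hypothetical helper name.
to_manifest_line() {
  sed -E 's#^MD5\((.*)\)= (.*)$#\2  \1#'
}

# Sample input line; the path and hash are illustrative only.
printf '%s\n' 'MD5(data/reel1/frame0001.dpx)= d41d8cd98f00b204e9800998ecf8427e' | to_manifest_line

# Possible use from inside the bag directory (not run here):
#   find data -type f -print0 | xargs -0 -n 1 -P 6 openssl md5 | to_manifest_line | sort -k 2 > manifest-md5.txt
```

With parallel hashing the lines arrive in whatever order the processes finish, which is why the usage line sorts by path before writing the manifest.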

As for target platforms, it’s a long story. Our entire storage infrastructure is built around a TigerStore SAN and we use Macs, Windows, and Linux machines here. We have hundreds of TB of very fast storage on a 40GbE network. TigerStore is the SAN metadata server and handles things like file locking and permissions so that multiple users can access the same stuff at the same time (and no, SMB can’t do that with the performance levels we require). With the SAN, network volumes are mounted such that the workstation thinks it’s writing to a natively formatted, direct-attached drive. It’s very fast, without all the overhead that comes with SMB or other network file sharing systems.

But Tiger Soft doesn’t yet support Linux kernels beyond 2.9 and that’s a problem. We have quite a few aging Linux boxes that I’d love to upgrade with newer versions, but we’re stuck until they update their drivers. We can only do Mac and Windows for now because any software running on Linux has to run on hardware from 6-8 years ago. Newer kernels don’t install on new generation hardware (at least not easily). Tiger says they’re working on Linux, but it’s been slow going.

Macs would be great - my Mac Studio is a beast for what it cost. But it can’t access our 40GbE network at speeds faster than about 10Gb/s because Apple knows best. They know what I want and apparently fast networking ain’t it. The older Mac Pros we use now have 10GbE NICs and can get to the 40GbE network that way, but they’re very old (2009-2010) and they max out when we run this software on more than half the available cores. A faster Mac with a current generation CPU could certainly process the files faster, but can’t run on the 40GbE network because of lack of drivers or poor driver support on modern macOS versions. So if I were to make something that’s higher performance, I’d have to have the network speed to go along with it. 10GbE isn’t enough to process several gigantic files at once. Our SAN can move close to 3GB/s (that’s gigabytes) but a 10GbE connection to the network caps out at 1.25GB/s nominally, more like 700-800MB/s in the real world, on a good day.
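The bandwidth figures above are just bits-to-bytes arithmetic: divide the link speed in gigabits per second by 8 to get gigabytes per second. A small sketch of that, plus a rough read-time estimate. The 1000 GB job size is a hypothetical number for illustration, the 0.75 GB/s rate is the midpoint of the real-world range quoted above, and both function names are made up.

```shell
# Convert a nominal link speed in gigabits/s to gigabytes/s (8 bits per byte).
gbit_to_gbyte() {
  awk -v gbit="$1" 'BEGIN { printf "%.2f\n", gbit / 8 }'
}

# Seconds needed to read a given number of gigabytes at a sustained GB/s rate.
transfer_seconds() {
  awk -v gb="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", gb / rate }'
}

gbit_to_gbyte 10            # nominal 10GbE ceiling in GB/s
gbit_to_gbyte 40            # nominal 40GbE ceiling in GB/s
transfer_seconds 1000 0.75  # hypothetical 1000 GB job at a real-world 10GbE rate
transfer_seconds 1000 3     # same job at the SAN's roughly 3 GB/s
```

Since every byte has to cross the link before it can be hashed, the read time is a floor on the job time no matter how fast the CPU is.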

So that leaves Windows, at least until Linux drivers are available for the SAN, then we can do it there and it should be crazy fast.