Improving performance in shell

Is it possible to improve the speed at which a shell process runs by wrapping it in something like nice? That is, instead of calling:

md5 -r /path/to/my/file

what about

nice -20 md5 -r /path/to/my/file

or something similar? Basically, when the md5 command is running it’s using 1.5% of the CPU it’s running on, and some of my files are huge. I’d like to boost its use of the available horsepower. Any way to do that?

I’ve tried this: shell.Execute("nice", "-20 md5 -r /path/to/my/file") but the results I’m getting are garbage (instead of the md5 hash, I’m getting “nice: No such file or directory” as a result).

Try the full path to nice.
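If you’re not sure where it lives, running this in Terminal will tell you (on my Mac it reports /usr/bin/nice):

which nice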

Have you tested the performance of using a shell to run md5 vs the MBS function, which will process a file in chunks and return the hash? It launches its own thread to do it, so I believe you still get the benefit of each call being able to be on another CPU core.

I’ll give that a shot. Thanks.

I have. The native md5 is faster and presumably more reliable. Per the MD5DigestMBS.HashFile docs: “May raise OutOfMemoryException or IOException” – given that it’s entirely possible we’ll regularly be running files that are 1TB in size, I’m not willing to risk this. We’ve run files with the native md5 that are that big, and it works fine.

But in my tests, MD5DigestMBS wasn’t especially fast running one instance at a time, so I’d still need to call it multiple times. I ran a test and it took 9 hours to process about 1000 files totaling 1.5TB. The same test with the built-in md5 ran a bit faster, and in my preliminary tests this afternoon with 4 md5 instances running at once, it was about 30% faster on my machine. But I’d like to make use of all the available clock time more effectively if possible.
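(Back of the envelope, 1.5TB in 9 hours works out to only about 45MB/s of sustained throughput.)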

Turns out it works fine without the full path. I was giving nice a screwy argument which is why it was failing. That said, it doesn’t seem to have made a difference, performance-wise. It’s still only using about 1.5% of CPU per md5 instance.

It sounds like the command line might be I/O bound. Have a look at the command line parameters to see if you can set a buffer / chunk size.

Re: the MBS function - I’m sure you can process a file in chunks, which would avoid any out-of-memory issues. You could try processing the file in 100MB chunks to see if that is faster than the command line.

That MBS function already does process the file in chunks, according to the docs. So @Christian Schmitz would have to comment on the caveat listed here:

[quote]Function: Calculates hash from whole file.

Notes:
Plugin will start a preemptive thread to read in file and process all data in chunks.
Returns hash on success or empty string on failure. May raise OutOfMemoryException or IOException.[/quote]

There is no need to use a plugin. The shell works fine here.
You have to ask where the real bottleneck of the checksum creation is.

You wrote that the md5 call uses only a few percent of CPU. That points to an I/O bottleneck, so it does not make much sense to run several checksums at once.
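A quick way to confirm that is to watch disk throughput with iostat in a second Terminal window while md5 runs (on the Mac, -d shows only device statistics and -w 1 refreshes every second):

iostat -d -w 1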

You can try it yourself in shell. In my example I have a folder with 5 big ZIP files.
Running the checksums one after another gives:

[quote] time (for fi in *.zip; do md5 "$fi" > /dev/null; done)
real 0m41.256s
user 0m13.147s
sys 0m1.986s
[/quote]
But running them in parallel takes much more time!

[quote]for fi in *.zip; do time md5 "$fi" > /dev/null & done
…
real 1m6.632s
user 0m6.883s
sys 0m1.065s
[/quote]
Since each call returns its own time, only the last one is listed here.

This shows that queued processing is faster than parallel processing.

I just ran your tests and confirmed it’s taking longer with concurrent runs. So the question is: where is the bottleneck? My test files were 220MB WAV files (a folder of 5 of them, all the exact same size). They’re on a local drive. Took about 20 seconds per file running concurrently.

Incidentally, calling md5 via ‘nice -20’ actually makes them a bit slower.

So I moved them to our SAN, which is where we’d be doing this work in real life. I was pretty surprised to see how much slower that is (36 seconds, vs 22 seconds off the local drive). The SAN is connected to this machine via a 10GbE network and is a true SAN (the drives appear as local volumes, no SMB). We regularly move files around on this machine at saturation on the 10GbE NIC. And this machine is one of the slower connections: most of the workstations are connected to it via a 40GbE NIC, and the SAN can easily move 1.5GB/s.

I’ll have to test this on one of the Mac Pro Xeon boxes that are connected to the SAN via 40GbE to try to eliminate that bottleneck. The connection I have on this iMac is good enough for the work I do from it, but it’s nothing like the performance on those machines.

It is worth noting that the Python-based command line tool we’ve used in the past to do this same work lets you specify the number of concurrent processes. Past a certain number you start to see slowdowns, but in testing I’ve found that running 4-8 concurrent processes (depending on the machine you’re on - I can do 8 on a Linux box with dual 14-core Xeons in it) is possible without slowdowns, and with a significant increase in overall processing speed. The only reason we’re not using it is that it’s buggy and sometimes refuses to run on certain volumes. Nobody can figure out why, but my software works fine on the volumes where the command line Python app can’t.

[I realized after posting this that I was running a large file copy in the background. I stopped that and revised my SAN speed numbers, and am now much less worried!]

Good luck hunting.

You may use “dd” to see the read performance using a command like this one:
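Something like this, with the path as a placeholder (on the Mac the block size suffix is lowercase):

dd if=/path/to/my/file of=/dev/null bs=1m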

It reads a file and reports the speed when finished.

Perhaps that helps you optimize your setup.

Curious, why do you want/need the hash? What will you be doing with it?

Our primary business is motion picture film scanning and restoration, and most of our clients are film archives and libraries. They like to have digital deliverables “bagged” per the Library of Congress BagIt specification: basically a simple but well-defined packaging system for any kind of digital files. The bag consists of a data folder containing all of the files you’re sending, along with a manifest file with a checksum for each file, and some other sidecar metadata files. These can then be checked on the receiving end to ensure there was no corruption of the files in transit.
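A minimal bag looks roughly like this (the file names here are just an example):

[quote]my_bag/
    bagit.txt
    bag-info.txt
    manifest-md5.txt
    data/
        scan_0001.dpx
        scan_0002.dpx[/quote]

Each line of manifest-md5.txt is just an MD5 checksum followed by the file’s path relative to the bag (e.g. data/scan_0001.dpx).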

I’m running a bag on my software now of 92600 files, each 75MB (a 4k film scan to a DPX image sequence), and it’s running at about 8500 files/hour on my lowly iMac 6-core i7. Not too shabby - certainly faster than the python script we were using previously and so far without the issues we were having.

You said in a different thread:

I am afraid calling a shell does not necessarily run on a different core. Actually, chances are, at least on Mac and perhaps on Windows, that the child process runs on the very core the parent application spawned it from.

I believe you are actually compounding drags on execution. Threads seem to run concurrently. But in fact, in Xojo they simply execute in small intervals taken away from the main thread.

I suspect by spawning multiple shells on the same core, as well as running threads, you load more and more work on a single core, probably effectively slowing down your app.

This can be verified fairly easily by checking execution time with different numbers of threads and shell sessions.

You may want to explore using helper apps that will actually launch on different cores, and communicate with each of them through an IPCSocket.

Ah, so you are calculating the hash on your end to verify that the files match what’s listed in the manifest, right?

I’d bet you could maximize performance by splitting the bags across multiple drives. Start a shell process for each drive and have it calculate the md5 for each file it finds. You should be able to minimize the I/O bottleneck that way, since each process will be loading data from a different drive.
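Something like this sketch, with the volume names and folder layout as placeholders (one background md5 job per drive):

[quote]for d in /Volumes/DriveA /Volumes/DriveB; do
    find "$d/bag/data" -type f -exec md5 -r {} \; > "checksums-$(basename "$d").txt" &
done
wait[/quote]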

Probably, but with a typical bag being around 6TB (the size of an LTO7 tape), it’s impractical to have the files spread around like that. Any savings in time on the bagging side would be negated by the file copy time to set that up.

Looking at the CPU activity it doesn’t seem like it’s flooding a single CPU. In fact, all cores in use are seeing pretty low usage, so an I/O bottleneck makes the most sense.

I get how threads work, but the actual checksumming is being done by the OS’s command line md5 tool. When I spawn 4 threads, each using a shell to launch an md5 instance, I see 4 instances of md5 in Activity Monitor. At that stage, isn’t it up to the OS how those are delegated to the available cores? I honestly don’t know, but it’s not my application that’s doing the heavy lifting, it’s another process; I’m not seeing the CPU load of my application increase appreciably either. How is that different from a helper application?

My delegator thread just monitors how many worker threads are running and if it’s less than the max, it fires up another one. The worker threads just send a command to the OS-level md5 application to do something, and they wait for a response. So the threads themselves are doing no heavy lifting. From what I see in the Activity monitor, my app isn’t under a noticeably larger load when running 4 shells vs 1.

Shell is very different from the Terminal (command prompt under Windows), where you launch several apps that run on whatever core the system decides.

At least under Unix (Mac), shell is a child process of the app, meaning it runs on the very same core. I suspect the same holds true under Windows.

Since you seem to be spawning pretty heavy shells, if they execute on the same core, they may consume enough CPU to drag down the app.

I suspect as well that moving large files concurrently may create big bottlenecks on I/O interfaces.

But within the shell, a launched utility will be assigned a core as determined by the system, no? At least, that’s been my experience.

This was my assumption too. In Activity Monitor, I do see that the md5 instances have my application as their parent process, but when I pile them on I don’t see any appreciable load on my application (that might be because of I/O bottlenecks); I’m seeing no difference in CPU usage for the main app.