Which Hash to Use?

I’m using hashes to see if a bunch of external data has changed. I take half a dozen usually short strings, concatenate them, hash the result, and store the hash on disk as a string. Later I do it again to see if the hash (and thus the related short strings) has changed.
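For illustration, here is a minimal sketch of that workflow in Python rather than Xojo. The `fingerprint` name and the 4-byte length prefix are assumptions of this sketch, not part of the original setup; the prefix is only there so that ("ab", "c") and ("a", "bc") don't concatenate to the same input.

```python
import hashlib

def fingerprint(parts):
    """Return a hex MD5 digest for a sequence of short strings (hypothetical helper)."""
    h = hashlib.md5()
    for part in parts:
        data = part.encode("utf-8")
        # Length prefix keeps ("ab", "c") distinct from ("a", "bc").
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.hexdigest()

stored = fingerprint(["alpha", "beta", "gamma"])    # saved to disk earlier
current = fingerprint(["alpha", "beta", "delta"])   # recomputed later
print("changed" if current != stored else "unchanged")
```

Plain concatenation without the prefix works too if an occasional missed change at a string boundary is acceptable.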

Security is not a worry.

What’s the quickest method that produces a reasonably small hash? I assume one from crypto.hash?

And does the hash care if it gets a long string?

MD5 is fast and small.
SHA512 is secure and long.

Thanks Christian.

I assume that I need to convert it to hexadecimal if I want to save it in a TextOutputStream

[quote=129729:@Stephen Dodd]Thanks Christian.

I assume that I need to convert it to hexadecimal if I want to save it in a TextOutputStream[/quote]

That is correct.

Depends on the function. Some already encode to hex, I bet.
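As a rough illustration of the raw-versus-hex distinction, here is a sketch in Python (not Xojo, so the class names differ). Python's `hashlib` happens to offer both forms: `digest()` returns raw bytes and `hexdigest()` returns text. The file name is just a placeholder.

```python
import hashlib
import os
import tempfile

data = "some short strings".encode("utf-8")
digest = hashlib.md5(data).digest()   # 16 raw bytes, not safe to treat as text
hex_text = digest.hex()               # 32 printable hex characters

# Write the hex form to a text file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hash.txt")  # placeholder file name
    with open(path, "w", encoding="ascii") as f:
        f.write(hex_text + "\n")
    with open(path, "r", encoding="ascii") as f:
        stored = f.readline().strip()

restored = bytes.fromhex(stored)
print(restored == digest)  # True
```

For a simple changed/unchanged check, comparing the two hex strings directly is enough; decoding back to bytes is optional.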

It’s up to you to encode/decode with the native classes. Consider SHA1, which is also small (20 bytes vs. 16 for MD5) and is harder to collide than MD5, though it is no longer considered collision-resistant either. I just tested and it’s also roughly the same speed as MD5 (both about 680 ms for 10,000 short strings).
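To check the sizes and get a rough speed comparison yourself, something like the following Python sketch should do. It is not the Xojo test quoted above, so the timings will not match the 680 ms figure. It also shows that the digest length is fixed per algorithm, no matter how long the input string is.

```python
import hashlib
import timeit

short = b"half a dozen short strings"
long_input = short * 100_000  # a few megabytes of input

sizes = {}
for name in ("md5", "sha1", "sha512"):
    small = hashlib.new(name, short).digest()
    large = hashlib.new(name, long_input).digest()
    # The digest length depends only on the algorithm, not on the input length.
    assert len(small) == len(large)
    sizes[name] = len(small)
    seconds = timeit.timeit(lambda: hashlib.new(name, short).digest(), number=10_000)
    print(f"{name}: {len(small)} bytes, {seconds * 1000:.1f} ms for 10,000 short inputs")
```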

If I don’t care about security, does MD5 have weaknesses?

Google for 5d41402abc4b2a76b9719d911017c592

If you don’t care about security, it’s perfect!

A salt for hashing is always a good idea.
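A quick Python sketch of why that search works and what a salt changes: the hex string above is simply the unsalted MD5 of the word "hello", so it shows up in lookup tables. The salt value below is made up for the example.

```python
import hashlib

plain = hashlib.md5(b"hello").hexdigest()
print(plain)  # 5d41402abc4b2a76b9719d911017c592

salt = b"my-app-salt:"  # hypothetical salt, any fixed private value would do
salted = hashlib.md5(salt + b"hello").hexdigest()
print(salted != plain)  # True: the salted digest no longer matches the well-known one
```

For pure change detection the salt adds nothing; it only matters when someone might try to reverse or precompute the hash.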

@Stephen Dodd I do the same type of thing when I update data in a database table. I take an MD5 hash of the concatenated string of columns (if a column is something other than a string, I use CStr on it). Then I keep the last # digits of the hash.

Then, whenever I want to make sure the data is “secure” (hasn’t been tampered with externally to the app), I compare a freshly generated hash to the hash in the table.

The hash is just used to make sure all the data updates are done via our app and not externally.
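A rough Python sketch of that row-checksum idea follows. The function name, the "|" separator, and keeping 8 hex digits are all assumptions for the example (the post does not say how many digits are kept); `str()` stands in for CStr.

```python
import hashlib

def row_checksum(row, keep=8):
    """Hash the stringified columns and keep the last `keep` hex digits (hypothetical helper)."""
    joined = "|".join(str(value) for value in row)  # str() plays the role of CStr
    full = hashlib.md5(joined.encode("utf-8")).hexdigest()
    return full[-keep:]

row = ("Widget", 42, 19.99)
stored = row_checksum(row)          # saved alongside the row by the app

tampered = ("Widget", 43, 19.99)    # same row edited outside the app
print(row_checksum(row) == stored)
print(row_checksum(tampered) == stored)
```

Keeping fewer digits saves space but raises the chance that an external edit happens to produce the same truncated value.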

It really depends. TextInputStream/TextOutputStream can handle binary data. So it depends on how you want to use that data whether you need to convert it. Just FYI.

If you use ActiveRecord, you can write a method of the object (of the table) to generate the checksum. And with the BeforeSave event handler, you can update the hash every time you save the object.

Someone will correct me if I’m wrong, but I believe the weakness is that MD5 has a possibility of collisions, i.e., two different strings generating the same hash. This is highly unlikely, of course, but since SHA1 is just as fast and only slightly larger…

For a solid hash whose output length you can define, look into PBKDF2.

For security, absolutely. But not for speed as PBKDF2 is specifically designed to be slow.
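To make both points concrete, here is a small Python sketch using `hashlib.pbkdf2_hmac`. The salt, the iteration count, and the `derive` helper are assumptions chosen for the example. The `dklen` argument sets the output length, and the iteration count is what makes it slow on purpose.

```python
import hashlib
import time

password = b"alpha|beta|gamma"
salt = b"fixed-demo-salt"  # hypothetical salt for the example

def derive(length, iterations=100_000):
    """Derive `length` bytes with PBKDF2-HMAC-SHA256 (hypothetical helper)."""
    return hashlib.pbkdf2_hmac("sha256", password, salt, iterations, dklen=length)

for length in (8, 16, 40):
    print(length, derive(length).hex())

# Compare the cost of one PBKDF2 call with one plain MD5 call.
start = time.perf_counter()
derive(16)
pbkdf2_seconds = time.perf_counter() - start

start = time.perf_counter()
hashlib.md5(password).digest()
md5_seconds = time.perf_counter() - start

print(f"PBKDF2: {pbkdf2_seconds * 1000:.2f} ms, MD5: {md5_seconds * 1000:.4f} ms")
```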

Luckily in my case collisions are acceptable.

Thanks all! :)

Depending on the format of your files, you may be able to use a custom hash.

For an app I have I need a hash that isn’t prone to collisions but is absurdly fast to generate. MD5 is unusable for this.

What I use is the hash Gabest came up with many years ago, which is perfect for video files. I had it in Python and C six years ago, and @Tim Hare helped me get it into RB shape:

http://forums.realsoftware.com/viewtopic.php?f=1&t=28911&p=80126&st=0&sk=t&sd=a

I have used this hash in a database that contains a few million hashes and no collisions have happened yet. It wouldn’t work with just any kind of data, though. Video is easier because the byte size or the file bookends change very easily.

With this algorithm I have been able to scan 1TB of video files with several hundred multi-gigabyte files in less than a minute. With MD5 I stopped it when it had been 5 minutes and had barely made any progress.

Very interesting information, thank you gentlemen.