Comparing video files with Crypto

Hi All.

I want to compare some video files to see if I have duplicates. Looking through what I could find in the forum and in the examples that come with Xojo, it seems the way to do it is with Crypto.(something).

However, other than that, I have no blinking idea where to start. Can someone help point me in the right direction?

Regards

You’ll want to use a hash function (MD5, SHA1, SHA2, etc.). These will identify exact byte-for-byte duplicates. If security isn’t a concern, then MD5 is probably the best bet, since it’s likely to be faster than more modern/advanced hashes like SHA. It’s also useful if you’re processing very large files, since it’s the only one Xojo can do incrementally rather than all at once.

Dim file1, file2 As FolderItem ' assume already populated
Dim bs1, bs2 As BinaryStream
bs1 = BinaryStream.Open(file1)
bs2 = BinaryStream.Open(file2)

Dim digest1 As New MD5Digest()
Dim digest2 As New MD5Digest()
Do Until bs1.EndOfFile Or bs2.EndOfFile
  Dim data1 As String = bs1.Read(1024 * 1024)
  Dim data2 As String = bs2.Read(1024 * 1024)
  digest1.Process(data1)
  digest2.Process(data2)
Loop

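' The loop stops as soon as either file ends, so also require equal lengths
If file1.Length = file2.Length And digest1.Value = digest2.Value Then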
  ' files match
Else
  ' files don't match
End If

Thank you, Andrew.
Now I need to look up things like BinaryStream, and MD5Digest. But that is MY job.

Once again, thank you

Regards

Hi Andrew, one small question. The 1024 * 1024 in your code… is that supposed to be the image size?

Regards

It’s simply how many bytes to process at once. Doing too many at once can make your app unresponsive. I used 1024*1024 because that’s 1 megabyte. Nothing to do with the video dimensions.

@Andrew_Lambert - And, for clarity’s sake, you’re multiplying two numbers because “one megabyte = 1024 * 1024 bytes” is easier to remember than “one megabyte = 1048576 bytes”, correct?

(I’m guessing it was having a pair of values that prompted Michael’s question.)

If the code were:

Dim file1, file2 As FolderItem ' assume already populated
Dim bs1, bs2 As BinaryStream
bs1 = BinaryStream.Open(file1)
bs2 = BinaryStream.Open(file2)

Dim digest1 As New MD5Digest()
Dim digest2 As New MD5Digest()
Do Until bs1.EndOfFile Or bs2.EndOfFile
  Dim data1 As String = bs1.Read(1048576)
  Dim data2 As String = bs2.Read(1048576)
  digest1.Process(data1)
  digest2.Process(data2)
Loop

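' The loop stops as soon as either file ends, so also require equal lengths
If file1.Length = file2.Length And digest1.Value = digest2.Value Then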
  ' files match
Else
  ' files don't match
End If

It would function the same, correct?

Anthony

Yeah, it’s just an easier way to say 1MB. I would write 32MB as 1024*1024*32, for example.


Thanks for that explanation.

Thanks to everyone who responded.

Regards

And doing too few at once is not good either. How did you decide that value is “better”?

Yeah. I have found out that little caveat.

Everyone have a great day!

Regards

Trial and error and your judgement of whether to prioritize responsiveness versus efficiency.

If you’re writing a user-facing desktop app, then you’ll probably want to prioritize responsiveness and so choose smaller chunks. It’s less efficient but users won’t complain that the app freezes on large files. OTOH if you’re writing a background service that doesn’t need to respond to the user then you can prioritize efficiency by choosing larger chunks.

This is assuming you’re not running the heavy work on a pre-emptive background thread, which solves the responsiveness problem in a different way.

Hopefully this is obvious, but you don’t need to do the computations on files that couldn’t be duplicates. The first thing to compare is file sizes, which is very computationally inexpensive. If the file sizes are different, the files can’t be duplicates and you can skip them; only if the sizes match do you need to move on to checking hashes.
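A minimal sketch of that size-first check, reusing the BinaryStream and MD5Digest calls from Andrew’s example. The function names HashOf and LooksLikeDuplicate are made up for illustration, the 1 MB chunk size is just the value used earlier in the thread, and error handling (unreadable files, folders) is left out.

```xojo
' Hypothetical helper: MD5 of a whole file, read in 1 MB chunks.
Function HashOf(f As FolderItem) As String
  Dim bs As BinaryStream = BinaryStream.Open(f)
  Dim digest As New MD5Digest()
  Do Until bs.EndOfFile
    digest.Process(bs.Read(1024 * 1024))
  Loop
  bs.Close
  ' Return hex text so an ordinary string comparison is safe
  Return EncodeHex(digest.Value)
End Function

' Hypothetical helper: cheap size check first, hashing only if sizes match.
Function LooksLikeDuplicate(file1 As FolderItem, file2 As FolderItem) As Boolean
  If file1.Length <> file2.Length Then
    Return False ' different sizes can never be byte-for-byte duplicates
  End If
  Return HashOf(file1) = HashOf(file2)
End Function
```

With many files, it would probably pay to group them by size first and only hash the groups that contain more than one file, so each file is read at most once.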

Also worth noting that with media files, you could have two files that differ at the binary level but have practically the same content. The same video encoded with a different codec, or with a single byte of metadata changed, will yield a false negative in duplicate detection.

If you really want to get into the weeds there are perceptual algorithms that actually look at the visuals or audio as they would be perceived by a human, but they tend to be complex and probably out of the scope of this thread.


That’s one goal I’d like to achieve, though, as I have dozens of terabytes of files to compare, for tidying up my disks. A huge task that currently looks like it will take my whole lifetime.
But I probably wouldn’t have time/knowledge to implement the algorithms you’re talking about either, sadly.

Why? You start simple and make that more complex as needed. The filesize filters out a lot. If a hash of the complete file is too slow maybe a hash of the first 10k of the file does the trick.
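As a rough sketch of the “hash only the first 10k” idea, using the same BinaryStream and MD5Digest calls as earlier in the thread. QuickHashOf is a hypothetical name, and 10 * 1024 is simply the 10k mentioned above, not a tuned value.

```xojo
' Hypothetical helper: MD5 of only the first 10 KB of a file.
Function QuickHashOf(f As FolderItem) As String
  Dim bs As BinaryStream = BinaryStream.Open(f)
  Dim digest As New MD5Digest()
  ' Read simply returns fewer bytes if the file is shorter than 10 KB
  digest.Process(bs.Read(10 * 1024))
  bs.Close
  Return EncodeHex(digest.Value)
End Function
```

Two files with the same size and the same quick hash are only candidates. Before deleting anything, a full-file hash (or a byte-for-byte comparison) would still be needed to confirm the match.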

Last year I had to make a simple image comparison because the lovely people at Apple don’t save 1x and 2x images for Messages in the correct dimensions. So I scaled the images down to 4x4 and compared the resulting images.

That might be an idea for comparing videos, too. Do the 4x4 for the first couple of stills of the video.

Until now, I didn’t have the idea to check hashes, so I only relied on the file size, optionally comparing the whole data (not great for movie files…) and watching the files. Comparing MD5 is a great step forward in helping me filter my duplicated files, but I can clearly still see edge cases (e.g. files saved to another container but keeping the same data, or movies I edited where I have to recall what the modification was and which of the two files is best).

Yes, there are many reasons to need to compare pictures (legit or not). I assume you compared pixels by code; did you compare all pixels or just some?

That would be doable, indeed. I have yet to compare those videos that I digitised twice (or more) and so have a different size and content (because the start of the capture isn’t accurate, of course, so they are two distinct copies, but I need to keep only one, and I want to make sure to keep the one with fewer defects).

But it’s nice to see replies supporting confidence that it’s doable.

I couldn’t compare pixels. iMessage has low dpi and high dpi images. But the images don’t have 1x and 2x as names. Nor do they have the correct dimensions like 400x1000 and 800x2000. Even that wouldn’t help because a message can have multiple attachments with the same dimensions.

Here is the function:

Private Function GetImageHash(theAttachment as MessageAttachment) As String
  'do a simple hash to find similar images

  dim options as new Dictionary
  options.Value(CGImageSourceMBS.kCGImageSourceThumbnailMaxPixelSize) = kThumbSize
  options.Value(CGImageSourceMBS.kCGImageSourceCreateThumbnailFromImageAlways) = true
  options.Value(CGImageSourceMBS.kCGImageSourceCreateThumbnailWithTransform) = True

  dim theCGImage as CGImageMBS = CGImageSourceMBS.CreateThumbnailMT(theAttachment.AttachmentData, 0, options)
  dim HashPicture as Picture = theCGImage.Picture

  // Step 1: Calculate brightness values
  Dim brightness() As Double

  For y As Integer = 0 To kThumbSize - 1
    For x As Integer = 0 To kThumbSize - 1
      Dim c As Color = HashPicture.RGBSurface.Pixel(x, y)

      // Simple brightness formula
      Dim b As Double = (c.Red + c.Green + c.Blue) / 3.0
      brightness.Add(b)
    Next
  Next

  // Step 2: Calculate average brightness
  Dim sum As Double
  For Each b As Double In brightness
    sum = sum + b
  Next

  Dim avgBrightness As Double = sum / brightness.Count

  // Step 3: Build bitstring
  Dim bits As String
  For Each b As Double In brightness
    If b >= avgBrightness Then
      bits = bits + "1"
    Else
      bits = bits + "0"
    End If
  Next

  // Step 4: Convert bits to hex string
  Dim hexHash As String
  For i As Integer = 0 To bits.Length - 1 Step 4
    Dim nibble As String = bits.Middle(i, 4)
    Dim value As Integer = Val("&b" + nibble)
    hexHash = hexHash + Hex(value)
  Next

  Return hexHash
End Function

kThumbSize = 16
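If exact equality of the hex hashes turns out to be too strict, one common option with this kind of hash is to count how many bits differ and treat a small count as “similar”. The sketch below is only an illustration of that idea and is not part of the function above; HashDistance is a hypothetical name, and any threshold would have to be found by testing on your own images.

```xojo
' Hypothetical helper: number of differing bits between two hex hashes.
' Returns -1 if the hashes have different lengths and so are not comparable.
Function HashDistance(hashA As String, hashB As String) As Integer
  If hashA.Length <> hashB.Length Then Return -1

  Dim distance As Integer
  For i As Integer = 0 To hashA.Length - 1
    ' Each hex digit holds 4 bits of the hash
    Dim a As Integer = Val("&h" + hashA.Middle(i, 1))
    Dim b As Integer = Val("&h" + hashB.Middle(i, 1))
    Dim diff As Integer = a Xor b
    ' Count the bits that are set in the difference
    While diff <> 0
      distance = distance + (diff And 1)
      diff = diff \ 2
    Wend
  Next
  Return distance
End Function
```

A distance of 0 means the two hashes are identical; with kThumbSize = 16 the hash has 256 bits, so the largest possible distance is 256.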

Thanks for sharing.
FWIW, I’m not sure if it has changed since when I tested (years ago), but calling Picture.RGBSurface used to be way slower than storing the RGBSurface in a variable and calling RGBSurfaceVariable.Pixel() (i.e. accessing the RGBSurface of a picture would take time at each call).
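For reference, the change would look something like the fragment below. It reuses HashPicture, kThumbSize and the brightness array from the function above; only the surface variable is new. Whether it makes a measurable difference on current Xojo versions is something to test, not a given.

```xojo
Dim brightness() As Double

' Fetch the RGBSurface once, outside the loops
Dim surface As RGBSurface = HashPicture.RGBSurface

For y As Integer = 0 To kThumbSize - 1
  For x As Integer = 0 To kThumbSize - 1
    Dim c As Color = surface.Pixel(x, y)
    brightness.Add((c.Red + c.Green + c.Blue) / 3.0)
  Next
Next
```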

I suspect that the bottleneck is the resizing of the images and not the RGBSurface for 16x16 images. But it’s always good to test.

If you want to compare the content of video files, which could detect the “same” (or similar) content in a different format (different codec, bit-rate, image size, etc.), you would need to use perceptual hashing (see the Wikipedia article “Perceptual hashing”).

You really need to vet perceptual hashes with edge cases. I’m a DJ so I often have two copies of a music video - one is the clean version and the other is explicit. They may have visually identical content but their audio differs by a single word, and a lot of the perceptual hash algorithms that don’t sample 100% of the content believe they’re the same file.

But they’re not, and confusing the two is big enough of an issue to possibly cost you your job.