Hi All.
A question as to the “logic” of how to find duplicate files in a folder.
I am working on a program to run through a folder and find duplicate files (pictures, MP3s, etc.) and I currently have it working by comparing an MD5 hash of each file. If two hashes match, I know the files are exactly the same.
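For reference, the exact-match pass described above could look roughly like this in Xojo (API 2). This is only a sketch: the name FindExactDuplicates is made up for illustration, it does not descend into subfolders, and it reads each file fully into memory, which may not be what you want for very large files.

```xojo
// Hypothetical helper: returns every file in the folder whose content matches
// an earlier file in the same folder, judged by its MD5 digest.
Function FindExactDuplicates(folder As FolderItem) As FolderItem()
  Dim seen As New Dictionary        // MD5 hex digest -> first file with that digest
  Dim duplicates() As FolderItem

  For Each f As FolderItem In folder.Children
    If f.IsFolder Then Continue     // only compare files, skip subfolders

    // Read the whole file and hash its bytes
    Dim bs As BinaryStream = BinaryStream.Open(f)
    Dim data As String = bs.Read(bs.Length)
    bs.Close
    Dim digest As String = EncodeHex(Crypto.MD5(data))

    If seen.HasKey(digest) Then
      duplicates.Add(f)             // same bytes as an earlier file
    Else
      seen.Value(digest) = f
    End If
  Next

  Return duplicates
End Function
```

Checking the file length before hashing would skip most of the work, since files of different sizes can never be byte-identical.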
But if I have two of the same picture (one bigger than the other) how do I determine this?
Any ideas?
Regards
You’ll have to do some kind of similarity analysis on the picture data. Perhaps you could start by scaling each picture down to a standard small size - say, 100x100 pixels - and see how similar the pictures are by comparing each pixel.
Be prepared for disappointment, though. This won’t be foolproof. It’s possible that a more advanced analysis library could do a better job (neural network?) but you’ll always have some degree of uncertainty because at a fundamental level, a picture that has been resized is a different image.
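The scale-down-and-compare idea could be sketched like this in plain Xojo, without any plugin. The name PictureSimilarity, the default size of 100, and the scoring (mean absolute difference per color channel, mapped to 0..1) are all assumptions for illustration. Stretching both pictures to a square ignores aspect ratio, and transparency is not handled.

```xojo
// Hypothetical helper: returns 1.0 for identical scaled pictures,
// and values closer to 0.0 the more the pixels differ.
Function PictureSimilarity(a As Picture, b As Picture, size As Integer = 100) As Double
  // Scale both pictures to the same small size
  Dim sa As New Picture(size, size)
  sa.Graphics.DrawPicture(a, 0, 0, size, size, 0, 0, a.Width, a.Height)
  Dim sb As New Picture(size, size)
  sb.Graphics.DrawPicture(b, 0, 0, size, size, 0, 0, b.Width, b.Height)

  Dim ra As RGBSurface = sa.RGBSurface
  Dim rb As RGBSurface = sb.RGBSurface

  // Sum the absolute per-channel differences over all pixels
  Dim totalDiff As Double
  For y As Integer = 0 To size - 1
    For x As Integer = 0 To size - 1
      Dim ca As Color = ra.Pixel(x, y)
      Dim cb As Color = rb.Pixel(x, y)
      totalDiff = totalDiff + Abs(ca.Red - cb.Red) + Abs(ca.Green - cb.Green) + Abs(ca.Blue - cb.Blue)
    Next
  Next

  // Three channels per pixel, each differing by at most 255
  Return 1.0 - totalDiff / (size * size * 3 * 255.0)
End Function
```

Where to draw the line between "same picture" and "different picture" is something you would have to find by testing on your own files.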
+1
Shrink both images to the same size.
If you have the MBS plugins, check out the Picture.CompareMBS function, which will give you a percentage.
You decide the threshold.
They are comparing pixel by pixel? Wow!
Modern OSs, coupled with modern CPUs and GPUs, have fairly amazing, very fast methods for doing analysis and manipulation of large datasets, which can include image data. All that memory and those processor cycles have to be used for something. 
In Messages the great clever people at Apple have images in normal dpi and high dpi. The images are not labelled @1x and @2x and the images don’t have the correct dimensions. Therefore, I had ChatGPT write me something to find/exclude similar images:
Private Function GetImageHash(theAttachment As MessageAttachment) As String
  // Do a simple average-brightness hash to find similar images.
  // kThumbSize is a constant defined elsewhere (thumbnail edge length in pixels).
  Dim options As New Dictionary
  options.Value(CGImageSourceMBS.kCGImageSourceThumbnailMaxPixelSize) = kThumbSize
  options.Value(CGImageSourceMBS.kCGImageSourceCreateThumbnailFromImageAlways) = True
  options.Value(CGImageSourceMBS.kCGImageSourceCreateThumbnailWithTransform) = True
  Dim theCGImage As CGImageMBS = CGImageSourceMBS.CreateThumbnailMT(theAttachment.AttachmentData, 0, options)
  If theCGImage Is Nil Then Return "" // no thumbnail could be created
  Dim HashPicture As Picture = theCGImage.Picture
  Dim surface As RGBSurface = HashPicture.RGBSurface // fetch once instead of once per pixel

  // Step 1: Calculate brightness values
  // Note: this assumes a kThumbSize x kThumbSize thumbnail; a non-square image
  // likely yields a picture that is smaller on one side.
  Dim brightness() As Double
  For y As Integer = 0 To kThumbSize - 1
    For x As Integer = 0 To kThumbSize - 1
      Dim c As Color = surface.Pixel(x, y)
      // Simple brightness formula (you can make it fancier if needed)
      Dim b As Double = (c.Red + c.Green + c.Blue) / 3.0
      brightness.Add(b)
    Next
  Next

  // Step 2: Calculate average brightness
  Dim sum As Double
  For Each b As Double In brightness
    sum = sum + b
  Next
  Dim avgBrightness As Double = sum / brightness.Count

  // Step 3: Build bitstring (1 = pixel at or above average brightness)
  Dim bits As String
  For Each b As Double In brightness
    If b >= avgBrightness Then
      bits = bits + "1"
    Else
      bits = bits + "0"
    End If
  Next

  // Step 4: Convert bits to hex string, four bits per hex digit
  Dim hexHash As String
  For i As Integer = 0 To bits.Length - 1 Step 4
    Dim nibble As String = bits.Middle(i, 4)
    Dim value As Integer = Val("&b" + nibble)
    hexHash = hexHash + Hex(value)
  Next
  Return hexHash
End Function
Because a message usually doesn’t have hundreds of images I only need something really simple. So my thumbnail size is 16.
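To decide whether two of these hashes belong to similar images, one option is to count how many bits differ (the Hamming distance). A rough sketch, where HashDistance is a made-up name and any cutoff is something you would have to tune yourself:

```xojo
// Hypothetical helper: number of differing bits between two hex hashes
// of equal length, or -1 if the lengths differ.
Function HashDistance(hashA As String, hashB As String) As Integer
  If hashA.Length <> hashB.Length Then Return -1

  Dim distance As Integer
  For i As Integer = 0 To hashA.Length - 1
    // Each hex digit carries four bits of the hash
    Dim a As Integer = Val("&h" + hashA.Middle(i, 1))
    Dim b As Integer = Val("&h" + hashB.Middle(i, 1))
    Dim diff As Integer = a Xor b

    // Count the bits that are set in the difference
    While diff <> 0
      distance = distance + (diff And 1)
      diff = diff \ 2
    Wend
  Next

  Return distance
End Function
```

With a thumbnail size of 16, each hash has 256 bits (64 hex characters), so identical thumbnails give a distance of 0 and the largest possible distance is 256. Comparing the strings for equality only catches thumbnails that hash identically, while a small distance also tolerates minor scaling differences.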