Find duplicate files in a folder

Hi All.

A question as to the “logic” of how to find duplicate files in a folder.

I am working on a program that runs through a folder and finds duplicate files (pictures, MP3s, etc.). I currently have it working by comparing the MD5 hash of each file: if two hashes match, I know the files are exactly the same.
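For reference, the exact-match approach described here can be sketched as follows. This is a minimal illustration in Python rather than Xojo, and the names `file_md5` and `find_exact_duplicates` are made up for the example. It assumes a flat folder (no recursion) and only hashes files that share a size with another file, since a file with a unique size cannot have an exact duplicate.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(folder):
    """Return lists of paths in `folder` whose size and MD5 digest both match."""
    # Group by size first: it is cheap and rules out most files without hashing.
    by_size = defaultdict(list)
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_md5(path)].append(path)
        groups.extend(group for group in by_hash.values() if len(group) > 1)
    return groups
```

Each returned group is a list of two or more paths with identical content; a resized picture will never show up here, which is exactly the limitation raised next.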

But if I have two copies of the same picture, one larger than the other, how do I detect that?

Any ideas?

Regards

You’ll have to do some kind of similarity analysis on the picture data. Perhaps you could start by scaling each picture down to a standard small size - say, 100x100 pixels - and see how similar the pictures are by comparing each pixel.
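As a rough illustration of that idea, here is a minimal sketch in Python (not Xojo) that works on plain 2D lists of grayscale values so it needs no image library. The helper names `downscale` and `similarity`, the nearest-neighbour sampling, and the 0.95 threshold are all assumptions made for the example; a real implementation would decode the picture files and probably use a smoother scaling method.

```python
def downscale(pixels, size):
    """Shrink a 2D grid of grayscale values (0-255) to size x size by nearest-neighbour sampling."""
    height, width = len(pixels), len(pixels[0])
    return [
        [pixels[y * height // size][x * width // size] for x in range(size)]
        for y in range(size)
    ]


def similarity(pixels_a, pixels_b, size=100):
    """Return a 0.0-1.0 score; 1.0 means the downscaled images match pixel for pixel."""
    small_a = downscale(pixels_a, size)
    small_b = downscale(pixels_b, size)
    total_diff = sum(
        abs(a - b)
        for row_a, row_b in zip(small_a, small_b)
        for a, b in zip(row_a, row_b)
    )
    return 1.0 - total_diff / (255.0 * size * size)


# Usage: a 4x4 gradient and the same gradient enlarged to 8x8
small = [[x * 60 for x in range(4)] for _ in range(4)]
large = [[(x // 2) * 60 for x in range(8)] for _ in range(8)]
score = similarity(small, large, size=4)
is_probable_duplicate = score >= 0.95  # the threshold is a judgment call
```

Because both inputs are reduced to the same grid before comparison, the original dimensions no longer matter; only the threshold decides what counts as "the same picture".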

Be prepared for disappointment, though. This won’t be foolproof. It’s possible that a more advanced analysis library could do a better job (neural network?) but you’ll always have some degree of uncertainty because at a fundamental level, a picture that has been resized is a different image.

+1
Shrink both images to the same size.
If you have the MBS plugins, check out the Picture.CompareMBS function, which will give you a percentage.
You decide the threshold.

They are comparing pixel by pixel? Wow!

Modern OSs, coupled with modern CPUs and GPUs, have fairly amazing, very fast methods for doing analysis and manipulation of large datasets, which can include image data. All that memory and those processor cycles have to be used for something. :grin:

In Messages the great clever people at Apple have images in normal dpi and high dpi. The images are not labelled @1x and @2x and the images don’t have the correct dimensions. Therefore, I had ChatGPT write me something to find/exclude similar images:

Private Function GetImageHash(theAttachment as MessageAttachment) As String
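  // Compute a simple average hash so that visually similar images get similar hashes
  
  Dim options As New Dictionary
  options.Value(CGImageSourceMBS.kCGImageSourceThumbnailMaxPixelSize) = kThumbSize
  options.Value(CGImageSourceMBS.kCGImageSourceCreateThumbnailFromImageAlways) = True
  options.Value(CGImageSourceMBS.kCGImageSourceCreateThumbnailWithTransform) = True
  
  Dim theCGImage As CGImageMBS = CGImageSourceMBS.CreateThumbnailMT(theAttachment.AttachmentData, 0, options)
  If theCGImage Is Nil Then Return "" // the data could not be decoded as an image
  Dim thumb As Picture = theCGImage.Picture
  If thumb Is Nil Then Return ""
  
  // The thumbnail keeps its aspect ratio, so stretch it onto a square canvas;
  // otherwise the loops below would read outside a non-square thumbnail
  Dim HashPicture As New Picture(kThumbSize, kThumbSize)
  HashPicture.Graphics.DrawPicture(thumb, 0, 0, kThumbSize, kThumbSize, 0, 0, thumb.Width, thumb.Height)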
  
  // Step 1: Calculate brightness values
  Dim brightness() As Double
  
  // Fetch the RGBSurface once instead of on every pixel access
  Dim surface As RGBSurface = HashPicture.RGBSurface
  For y As Integer = 0 To kThumbSize - 1
    For x As Integer = 0 To kThumbSize - 1
      Dim c As Color = surface.Pixel(x, y)
      
      // Plain average of the three channels (a luminance-weighted formula would also work)
      Dim b As Double = (c.Red + c.Green + c.Blue) / 3.0
      brightness.Add(b)
    Next
  Next
  
  // Step 2: Calculate average brightness
  Dim sum As Double
  For Each b As Double In brightness
    sum = sum + b
  Next
  
  Dim avgBrightness As Double = sum / brightness.Count
  
  // Step 3: Build bitstring
  Dim bits As String
  For Each b As Double In brightness
    If b >= avgBrightness Then
      bits = bits + "1"
    Else
      bits = bits + "0"
    End If
  Next
  
  // Step 4: Convert bits to hex string
  Dim hexHash As String
  For i As Integer = 0 To bits.Length - 1 Step 4
    Dim nibble As String = bits.Middle(i, 4)
    Dim value As Integer = Val("&b" + nibble)
    hexHash = hexHash + Hex(value)
  Next
  
  Return hexHash
  
End Function

A message usually doesn’t contain hundreds of images, so I only need something really simple. My thumbnail size (kThumbSize) is 16, which gives a 256-bit hash, i.e. 64 hex characters.
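Comparing two of these hashes with plain string equality only catches images whose bit patterns are identical. A common way to allow some tolerance is to count the differing bits (the Hamming distance). The following is a small sketch in Python rather than Xojo; the function names and the 5% cutoff are assumptions for illustration, not part of the code above.

```python
def hamming_distance(hash_a, hash_b):
    """Count the bits that differ between two equal-length hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def looks_similar(hash_a, hash_b, max_fraction=0.05):
    """Treat two hashes as similar if at most `max_fraction` of their bits differ."""
    total_bits = 4 * len(hash_a)  # each hex character encodes 4 bits
    return hamming_distance(hash_a, hash_b) <= max_fraction * total_bits
```

With a 16x16 thumbnail, a 5% cutoff tolerates up to 12 differing bits out of 256; the right value depends on the images and has to be found by experiment.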