
Finding duplicate photos by comparing only the pure image data, and by image similarity?

I have approximately 600GB of photos collected over 13 years, now stored on a FreeBSD ZFS server.

The photos come from family computers, from several partial backups to different external USB HDDs, from images reconstructed after disk disasters, and from different photo management software (iPhoto, Picasa, HP and many others :( ), spread over several deep subdirectories. In short: a TERRIBLE MESS with many duplicates.

So as a first step I:

  • searched the tree for files of the same size (fast) and computed md5 checksums for those.
  • collected the duplicate images (same size + same md5 = duplicate)
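
A minimal sketch of that first pass, assuming Python is available on the server. The function names `file_md5` and `find_exact_duplicates` are made up for illustration; only files that share a size are ever hashed, which is what keeps the pass fast:

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Group files under root that have the same size and the same MD5."""
    # Pass 1: bucket every regular file by its size (cheap, no file reads).
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: checksum only the files whose size is shared with another file.
    by_key = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            by_key[(size, file_md5(path))].append(path)

    return [sorted(paths) for paths in by_key.values() if len(paths) > 1]
```

Each returned group is a list of paths with identical content; nothing is deleted, so the groups can be reviewed before any cleanup.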

This helped a lot, but there are still MANY MANY duplicates:

  • photos that differ only in the exif/iptc data added by some photo management software, while the image itself is the same (or at least "looks the same" and has the same dimensions)
  • or they are only resized versions of the original image
  • or they are "enhanced" versions of the originals, etc.

Now the questions:

  • How can I find duplicates by checksumming only the "pure image bytes" in a JPG, without the exif/IPTC and similar meta information? I want to filter out the photo duplicates that differ only in their exif tags while the image is the same (so file checksumming doesn't work, but image checksumming could). This is (I hope) not very complicated, but I need some direction.
  • Which perl module can extract the "pure" image data from a JPG file in a form usable for comparison/checksumming?

More complex

  • How can I find "similar" images that are only
    • resized versions of the originals
    • "enhanced" versions of the originals (from some photo manipulation programs)
  • Is there already an algorithm available as a unix command or perl module (XS?) that I can use to detect these special "duplicates"?

I'm able to write complex scripts in BASH and "+-" :) know perl. I can use FreeBSD/Linux utilities directly on the server, and over the network I can use OS X (but working with 600GB over the LAN is not the fastest way)...

My rough idea:

  • delete images only at the end of the workflow
  • use an Image::ExifTool script to collect duplicate image data based on image-creation date and camera model (maybe other exif data too).
  • make a checksum of the pure image data (or extract a histogram; identical images should have the same histogram) - not sure about this
  • use some similarity detection to find duplicates that differ by resizing or photo enhancement - no idea how to do this...
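
The "checksum of pure image data" step could be sketched roughly as follows, in Python rather than perl. This is only an illustration under stated assumptions: the hypothetical helper `jpeg_image_digest` walks the JPEG marker segments, skips the APPn segments (where JFIF/Exif/IPTC/XMP/ICC data usually lives) and COM comments, and hashes everything else. It assumes well-formed baseline or progressive files with no metadata after the first scan and no trailing junk after the image; note that dropping APPn also drops colour-related data such as ICC profiles.

```python
import hashlib
import struct


def jpeg_image_digest(data: bytes) -> str:
    """MD5 of a JPEG byte stream with APPn and COM segments left out."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    digest = hashlib.md5()
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:  # optional fill byte before a marker
            pos += 1
            continue
        if marker == 0xDA:
            # Start of scan: the rest of the stream is compressed image data.
            digest.update(data[pos:])
            return digest.hexdigest()
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if length < 2:
            raise ValueError(f"corrupt segment length at offset {pos}")
        is_metadata = 0xE0 <= marker <= 0xEF or marker == 0xFE
        if not is_metadata:
            # Keep tables and frame headers (DQT, DHT, SOF, DRI, ...).
            digest.update(data[pos:pos + 2 + length])
        pos += 2 + length
    raise ValueError("no SOS marker found")
```

Two files whose digests match should differ only in metadata, so the digest can replace the plain file md5 as the grouping key. It will not match images that were re-encoded, resized or enhanced, because their compressed data differs.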

Any ideas, help, or (software/algorithm) hints on how to bring order to the chaos?

PS:

Here is a nearly identical question: Finding Duplicate image files, but I'm already done with that answer (md5) and am looking for more precise checksumming and image comparison algorithms.

jm666 asked Jun 22 '13 11:06

1 Answer

Assuming you can work with a locally mounted FS:

  • rmlint : the fastest tool I've ever used to find exact duplicates
  • findimagedupes : automates the whole ImageMagick approach (as does Randal Schwartz's script, it seems, which I haven't tested)
  • Detecting Similar and Identical Images Using Perceptual Hashes goes all the way (a great reference post)
  • dupeguru-pe (gui) : a dedicated tool that is fast and does an excellent job
  • geeqie (gui) : I find it fast and excellent for finishing the job, using the granular deduplication options. You can also generate an ordered collection of images such that similar images are next to each other, allowing you to "flip" between the two to see the changes.
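
To give an idea of what a perceptual hash does, here is a minimal sketch of one simple variant, the "average hash", in plain Python. It is not what any of the tools above necessarily implement. The names `shrink`, `average_hash` and `hamming_distance` are made up for illustration, and the input is assumed to be an already decoded grayscale image given as a list of rows of integers, at least 8x8 pixels (decoding the JPEG would be done with an image library, which is not shown):

```python
def shrink(pixels, size=8):
    """Box-average a grayscale image (list of rows) down to size x size."""
    height, width = len(pixels), len(pixels[0])
    small = []
    for row in range(size):
        top, bottom = row * height // size, (row + 1) * height // size
        out_row = []
        for col in range(size):
            left, right = col * width // size, (col + 1) * width // size
            block = [pixels[y][x]
                     for y in range(top, bottom)
                     for x in range(left, right)]
            out_row.append(sum(block) / len(block))
        small.append(out_row)
    return small


def average_hash(pixels, size=8):
    """One bit per thumbnail cell: 1 if the cell is brighter than the mean."""
    flat = [value for row in shrink(pixels, size) for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two hashes."""
    return bin(hash_a ^ hash_b).count("1")
```

Resized copies and uniformly brightened copies tend to produce the same or nearly the same 64-bit value, so candidates are pairs whose Hamming distance is below a small threshold. The threshold is something to tune on a sample of the collection, and matches should still be checked by eye before deleting anything.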
tuk0z answered Oct 02 '22 16:10