
In shell, how do I delete numbered duplicate files?

Tags:

bash

awk

I've got a directory with a few thousand files in it, named things like:

filename.ext
filename (1).ext
filename (2).ext
otherfile.ext
otherfile (1).ext
etc.

Most of the files with bracketed numbers are duplicates of the original, but in some cases they're not.

How can I keep my original files, delete the duplicates, but not lose the files that are different?

I know that I could rm *\).ext, but that obviously doesn't make sure that files match the original.

I'm using OS X, so I have an md5 program that functions sort of like md5sum in Linux, though it puts the hash at the end of the line instead of the beginning. I was thinking I could use an awk script on the output of md5 *.ext | awk 'some script' to find duplicates by md5 and delete them, but the command line is too long (bash: /sbin/md5: Argument list too long).
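As an aside, the "Argument list too long" error can be sidestepped by batching the file names rather than expanding the whole glob on one command line. A minimal sketch, assuming find and xargs are available (the hasher variable is only for portability of the illustration: OS X ships md5, Linux ships md5sum):

```shell
# Let xargs split the file list into batches that fit the kernel's
# argument-length limit, instead of one giant command line.
hasher=$(command -v md5 || command -v md5sum)   # pick whichever hasher exists
find . -maxdepth 1 -name '*.ext' -print0 | xargs -0 "$hasher"
```

The -print0/-0 pair keeps file names with spaces and parentheses intact.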

And I don't know what to write in the script. I was thinking of storing things in an array with this:

awk '{a[$NF]++} a[$NF]>1{sub(/).*/,""); sub(/.*(/,""); system("rm " $0);}'

But that always seems to delete my original.

What am I doing wrong? How do I do it right?

Thanks.

Graham asked Oct 03 '12
1 Answer

Your awk script deletes original files because of glob ordering: . (period) sorts after (space) in the shell's expansion. So the first file that's seen is a numbered copy, not the original, and every subsequent check (including the one against the original) compares files to that first numbered one.
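The ordering is easy to see in an empty scratch directory; this demonstration pins the locale with LC_ALL=C so the collation is deterministic:

```shell
# In the C locale, space and "(" sort before ".", so numbered copies
# expand ahead of the original in a glob.
export LC_ALL=C
cd "$(mktemp -d)"                       # scratch directory
touch 'filename.ext' 'filename (1).ext'
printf '%s\n' *.ext
# filename (1).ext    <- seen first: the numbered copy
# filename.ext        <- the original comes second
```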

Not only does rm *\).ext fail to check whether a numbered file actually matches its original, it also deletes numbered files that may not have an original in the first place.

I wouldn't do this quite this way. Rather than checking every numbered file and verifying whether it matches an original, you can go through your list of originals, then delete the numbered files that match them.

Instead:

$ for file in *[^\)].ext; do echo "-- Found: $file"; rm -v "$(basename "$file" .ext)"\ \(*\).ext; done

You can expand this to check MD5s along the way. But it's more code, so I'll break it into multiple lines, in a script:

#!/bin/bash

shopt -s nullglob              # Expand globs that match nothing to an empty list

for file in *[^\)].ext; do
  md5=$(md5 -q "$file")        # The -q option gives you only the message digest
  echo "-- Found: $file ($md5)"
  for duplicate in "$(basename "$file" .ext)"\ \(*\).ext; do
     if [[ "$md5" = "$(md5 -q "$duplicate")" ]]; then
        rm -v "$duplicate"
     fi
  done
done

As an alternative, you can probably get away with doing this a little more simply, with less CPU overhead than calculating MD5 digests. Unix and Linux have a shell tool called cmp, which compares two files byte by byte and reports the result through its exit status (pass -s to suppress its short "differ" message). So:

#!/bin/bash

shopt -s nullglob

for file in *[^\)].ext; do
  for duplicate in "$(basename "$file" .ext)"\ \(*\).ext; do
    if cmp -s "$file" "$duplicate"; then
      rm -v "$duplicate"
    fi
  done
done
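For reference, cmp signals its result through the exit status alone: 0 when the files are identical, 1 when they differ. A quick illustration in a scratch directory:

```shell
# cmp exits 0 for identical files, 1 for different ones;
# -s keeps it quiet so only the exit status is used.
cd "$(mktemp -d)"
printf 'abc' > a
printf 'abc' > b
printf 'abd' > c
cmp -s a b && echo "a and b are identical"
cmp -s a c || echo "a and c differ"
```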
ghoti answered Oct 27 '22