
I need help comparing files in a directory recursively to find duplicates

Tags: java, file-io

I'm working on a program that will (hopefully) compare all files in a given directory, identify duplicates, add them to a list, then display the list to the user so they can verify they want those files deleted before deleting them, and I'm seriously stuck. So far I've been able to recursively list all the files, and I've been messing around with comparing them to find the duplicates. I'm quickly realizing that to accomplish what I want I'm going to need to compare more than one file attribute. Not all files will be text files, and comparing text is mostly what I've found as far as example code on the internet goes. I'm trying to learn more about the binary data, because comparing byte arrays and file names is the best I could come up with.

Specifically, I'm asking: which attributes would be best to compare in order to balance accuracy in finding duplicates with being able to handle a reasonably sized directory? And, if you don't mind, how could I implement that in my code? Hopefully my question wasn't too terrible; I'd really appreciate any help I can get.

Here's what I have. Yes, a couple of the methods and the second file I did find here, in case you were wondering. P.S. I'm really sorry about any pointless variables I missed; I tried to clean up the code a little before posting it.

ListFilesInDir.java

import java.io.*;
import java.nio.file.Files;
import java.nio.file.attribute.*;
import java.security.*;
import java.util.*;

public final class ListFilesInDir {

static File startingDir;

static List<File> files;
static List<File> dirs;
static TreeMap<Integer, File> duplicates;
static ArrayList<File> duplicateList = new ArrayList<File>();

static File out = new File("ListDuplicateFiles.txt");
static PrintWriter output;

static int key = 0;
static TreeMap<Integer, File> tMap = new TreeMap<Integer, File>();

static int num1 = 0;
static int num2 = 0;
static File value1 = null;
static File value2 = null;

public static void main(String[] args) throws FileNotFoundException {
    new ListFilesInDir(args[0]);
}

public ListFilesInDir(String string) throws FileNotFoundException {
    startingDir = new File(string);
    dirs = new ArrayList<File>();
    duplicates = new TreeMap<Integer, File>();
    output = new PrintWriter(out);

    getFiles(startingDir);
    compareFiles();
    writeDuplicateList();
}

public void getFiles(File root) throws FileNotFoundException {
    System.out.println("Adding files to list...");
    ListFilesInDir.files = getFileList(root);
    for (File file : files) {
        if (!file.isFile()) {
            System.out.println("Adding DIR: " + key + " name: " + file);
            dirs.add(file);
        } else {
            System.out.println("Adding FILE: " + key + " name: " + file);
            tMap.put(key, file);
        }
        key++;
    }
    System.out.println(dirs.size());
    System.out.println("Complete");
}

public static void compareFiles() throws FileNotFoundException {
    System.out.println("Preparing to compare files...");
    for (num1 = 0; num1 < files.size(); num1++) {
        for (num2 = 0; num2 < files.size(); num2++) {

            if (num1 != num2) {
                value1 = files.get(num1);
                value2 = files.get(num2);
                // only regular files can be duplicates of each other
                if (!value1.isFile() || !value2.isFile()) {
                    continue;
                }

                System.out.println(num1 + "|" + num2 + " : " + value1
                        + " - " + value2);
                if (CompareBinaries.fileContentsEquals(
                        value1.getAbsolutePath(), value2.getAbsolutePath())) {
                    addDuplicate(num1, value1);
                    System.out.println("added(binary): " + num1 + ":"
                            + value1);
                    // removing the current file shifts every later index
                    // down by one, so step back and move on to the next file
                    files.remove(num1);
                    num1--;
                    break;

                } else if (value1.getName().equalsIgnoreCase(
                        value2.getName())) {
                    addDuplicate(num1, value1);
                    System.out.println("added(name): " + num1 + ":"
                            + value1);
                    files.remove(num1);
                    num1--;
                    break;
                }
            }
        }
    }
    System.out.println("Complete");

}

public static void writeDuplicateList() {
    int printKey = 0;
    for (File file : duplicateList) {
        output.printf("%03d | %s\n", printKey, file);
        System.out.printf("%03d | %s\n", printKey, file);
        printKey++;
    }

    output.append(docsInfo());
    output.flush();
    output.close();

    System.out.println("\n"+files.size()+" files in "+startingDir.getAbsolutePath() +", "+duplicateList.size()+" duplicate files.");
}

static public String docsInfo() {
    String s = "\n\n" + files.size() + " files in "
            + startingDir.getAbsolutePath() + ", " + duplicates.size()
            + " duplicate files.";
    return s;
}

static public List<File> getFileList(File file)
        throws FileNotFoundException {
    validateDir(file);
    List<File> result = getUnsortedFileList(file);
    Collections.sort(result);
    return result;
}

static private List<File> getUnsortedFileList(File file)
        throws FileNotFoundException {
    List<File> result = new ArrayList<File>();
    File[] filesAndDirs = file.listFiles();
    // listFiles() returns null for unreadable directories
    if (filesAndDirs == null) {
        return result;
    }

    for (File entry : filesAndDirs) {
        result.add(entry);
        if (!entry.isFile()) {
            result.addAll(getUnsortedFileList(entry));
        }
    }
    return result;
}

static private void validateDir(File dir) throws FileNotFoundException {
    if (dir == null)
        throw new IllegalArgumentException("Directory is null!");
    if (!dir.exists())
        throw new FileNotFoundException("Directory doesn't exist: " + dir);
    if (!dir.isDirectory())
        throw new IllegalArgumentException(dir + " is not a directory!");
    if (!dir.canRead())
        throw new IllegalArgumentException("Directory cannot be read: "
                + dir);
}

public static void addDuplicate(int i, File file) throws FileNotFoundException {
    // check the file itself: list indexes get reused after removals,
    // so keying on the index alone could skip a genuine duplicate
    if (!duplicateList.contains(file)) {
        duplicates.put(i, file);
        duplicateList.add(file);
    }
}
}

CompareBinaries.java

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;


public class CompareBinaries {

private final static int BUFFSIZE = 1024;
private static byte buff1[] = new byte[BUFFSIZE];
private static byte buff2[] = new byte[BUFFSIZE];

public static boolean inputStreamEquals(InputStream is1, InputStream is2) {
    if(is1 == is2) return true;

    if(is1 == null && is2 == null) {
        System.out.println("both input streams are null");
        return true;
    }

    if(is1 == null || is2 == null) return false;
    try {
        int read1 = -1;
        int read2 = -1;

        do {
            int offset1 = 0;
            while (offset1 < BUFFSIZE
                    && (read1 = is1.read(buff1, offset1, BUFFSIZE - offset1)) >= 0) {
                offset1 += read1;
            }

            int offset2 = 0;
            while (offset2 < BUFFSIZE
                    && (read2 = is2.read(buff2, offset2, BUFFSIZE - offset2)) >= 0) {
                offset2 += read2;
            }
            if (offset1 != offset2) return false;
            if (offset1 != BUFFSIZE) {
                Arrays.fill(buff1, offset1, BUFFSIZE, (byte) 0);
                Arrays.fill(buff2, offset2, BUFFSIZE, (byte) 0);
            }
            if (!Arrays.equals(buff1, buff2)) return false;
        } while (read1 >= 0 && read2 >= 0);
        if(read1 < 0 && read2 < 0) return true; // both at EOF
        return false;

    } catch (Exception ei) {
        return false;
    }
}

public static boolean fileContentsEquals(File file1, File file2) {
    InputStream is1 = null;
    InputStream is2 = null;
    if(file1.length() != file2.length()) return false;

    try {
        is1 = new FileInputStream(file1);
        is2 = new FileInputStream(file2);

        return inputStreamEquals(is1, is2);

    } catch (Exception ei) {
        return false;
    } finally {
        try {
            if(is1 != null) is1.close();
            if(is2 != null) is2.close();
        } catch (Exception ei2) {}
    }
}

public static boolean fileContentsEquals(String fn1, String fn2) {
    return fileContentsEquals(new File(fn1), new File(fn2));
}

}

asked Nov 03 '12 by Kevin Bigler


1 Answer

You could use a hash function to compare two files - two files (in different folders) can have the same name and attributes (e.g. length) but different content. For example, you can create a text file, copy it to a different folder, and change one letter in the copy: same name and size, different content.

A hash function does some clever maths on the file content and ends up with a number; even a small difference in content will end up with two very different numbers.

Take the MD5 hash function, for example: it produces a 16-byte number out of a byte array of any length. While it is theoretically possible to create two files with the same MD5 but different content, the probability is very low (whereas two files with the same name and size but different content is a relatively high-probability event).
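As a rough sketch of the hashing step in Java, here is one way to compute the MD5 of a file with the JDK's java.security.MessageDigest; the helper name md5Of is made up for this example:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Example {

    // Reads the file in chunks, feeds it to an MD5 digest and returns
    // the 16-byte result as a 32-character hex string.
    static String md5Of(File file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        InputStream in = new FileInputStream(file);
        try {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        } finally {
            in.close();
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }
}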

The point is, you can build a table of MD5 hashes of the file contents; each hash has to be calculated only once, and hashes are quick to compare - if the MD5s are different, the files are different with 100% confidence. Only in the unlikely event that the MD5s are the same will you have to resort to a byte-by-byte comparison to be completely sure.
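To sketch how that could plug into the code in the question (this wiring is an assumption, not something from the post): hash every file once, group files by their hash string in a HashMap, and only for groups with more than one entry confirm the match byte-by-byte with the existing CompareBinaries.fileContentsEquals. The md5Of helper is the hypothetical one from the previous sketch.

import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FindDuplicatesByHash {

    // Returns every file that duplicates an earlier file in the list.
    public static List<File> findDuplicates(List<File> files) throws Exception {
        // first pass: hash each file once and group files by hash
        Map<String, List<File>> byHash = new HashMap<String, List<File>>();
        for (File f : files) {
            if (!f.isFile()) {
                continue; // skip directories
            }
            String hash = Md5Example.md5Of(f);
            List<File> group = byHash.get(hash);
            if (group == null) {
                group = new ArrayList<File>();
                byHash.put(hash, group);
            }
            group.add(f);
        }

        // second pass: only same-hash files ever need a byte-by-byte check
        List<File> duplicates = new ArrayList<File>();
        for (List<File> group : byHash.values()) {
            for (int i = 1; i < group.size(); i++) {
                if (CompareBinaries.fileContentsEquals(group.get(0), group.get(i))) {
                    duplicates.add(group.get(i));
                }
            }
        }
        return duplicates;
    }
}

Hashing each file once and comparing hashes turns the O(n²) pairwise comparison in compareFiles() into a single pass over the files plus cheap map lookups; the byte-by-byte check only runs on genuine duplicates and on the rare hash collisions.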

answered Oct 01 '22 by thedayofcondor