I want to remove the duplicate lines from a text file. To do that, I put every line into a HashSet and then write them out to another file, and it works fine. But when it comes to big files (180 MB, 5 million lines) it doesn't work very well. Assuming that it is not possible to store 5 million strings in a HashSet or any other collection, I made a loop so that I store the first 100 000 lines, write them to the file, clear the HashSet and do it again until there are no more lines in the file. Unfortunately, this won't remove all the duplicates, but I think it can remove about 70-90% of them. However, it doesn't work: when I test it with the 180 MB file with 5 million lines, I count about 300 000 duplicates, but the new file has about 3 million lines instead of roughly 5 million minus 300 000. And when I count the iterations, they should be 5 million, but they are 3.4 million.
public File removeDuplicates(File file) {
    System.out.println("file opened");
    Scanner sc;
    HashSet<String> set = new HashSet<String>();
    JFileChooser chooser = new JFileChooser();
    File createdFile = null;
    int returnVal = chooser.showSaveDialog(parent);
    if (returnVal == JFileChooser.APPROVE_OPTION) {
        BufferedWriter bufferedWriter = null;
        createdFile = chooser.getSelectedFile();
        try {
            if (!createdFile.exists()) {
                createdFile.createNewFile();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    try {
        sc = new Scanner(file);
        boolean hasMore = true;
        while (hasMore) {
            hasMore = false;
            while (sc.hasNextLine() && set.size() < PERIOD) {
                set.add(sc.nextLine());
                repeated++;
            }
            createdFile = this.writeToFile(set, createdFile);
            set.clear();
            hasMore = true;
            if (sc.hasNextLine() == false)
                hasMore = false;
            set.clear();
        }
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return createdFile;
}
private File writeToFile(HashSet<String> set, File f) {
    BufferedWriter bufferedWriter = null;
    try {
        Writer writer = new FileWriter(f, true);
        bufferedWriter = new BufferedWriter(writer);
        for (String str : set) {
            bufferedWriter.write(str);
            bufferedWriter.newLine();
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (bufferedWriter != null) {
            try {
                bufferedWriter.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    return f;
}
repeated is the variable that counts the iterations. Is this caused by something in the code, or by RAM consumption? And is there any way to make it work?
De-duplicate
Let's assume for a moment, that you simply want to de-duplicate that file. I'd say the fastest, no-hassle method would be good old unix utils:
cat myfile.txt | sort -u > sorted.txt
Improving your Solution
(TL;DR Increase JVM Heap Size, initialize HashSet size and use the last solution in this answer!)
In case you need to do this in Java, let's first try to make this more efficient. Like many people have mentioned, 180 MB isn't all that much. Just load the whole file; there is no need to chunk it (and the chunking is the reason you don't eliminate all duplicates). Take this line for example:
HashSet<String> set = new HashSet<String>();
This will create a HashSet with the default initial capacity (16 elements) and a load factor of 0.75, meaning that as you add lines, it will have to re-allocate memory and copy everything over again and again. Here is something useful to read, especially the "Performance" section.
So let's increase that size to avoid allocation:
Set<String> set = new HashSet<String>(5000000);
I left the load factor as is, but that means it will reallocate once it's 75% full. If you know the size of your file for sure, you can adjust those settings.
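For example (using a purely hypothetical line count here), you could derive a capacity that never triggers a rehash under the default load factor:
long expectedLines = 5_000_000L;                      // rough number of lines you expect (hypothetical)
int capacity = (int) Math.ceil(expectedLines / 0.75); // with load factor 0.75, the set resizes once it's 75% full
Set<String> set = new HashSet<>(capacity);            // ~6.7 million buckets, no rehash while adding 5 million lines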
Alright, I had to learn it the hard way - always measure first! That is rule number one of performance work. I wrote all that and then tested my own implementation on my fast workstation (with 16GB RAM and a fast multi-core CPU) and summed all that up in my edit. Now I was curious to try your solution (which I should have done right away). So I re-ran it on my notebook at home (8GB RAM, 4+ year old CPU).
Alright, here is the simplified code:
import java.io.*;
import java.util.*;

public class SortTest {

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Pass filename as argument!");
            System.exit(1);
        }

        Set<String> set = new HashSet<String>();
        File createdFile = new File("./outfile");
        createdFile.createNewFile();

        try (BufferedReader br = new BufferedReader(new FileReader(new File(args[0])))) {
            for (String line = br.readLine(); line != null; line = br.readLine()) {
                set.add(line);
            }
        } catch (IOException ex) {
            throw new RuntimeException("Fatal Error.", ex);
        }

        try (BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(createdFile, true))) {
            for (String line : set) {
                bufferedWriter.write(line);
                bufferedWriter.newLine();
            }
        }
    }
}
Changes: I removed the chunking and load the whole file at once. I'm using a BufferedReader, because a Scanner is more useful for parsing (reading integers etc.) and might incur overhead. I also moved the writing of the file to the end, so I don't need to recreate the BufferedWriter every time. Also note that File.createNewFile() will only create a file if it doesn't exist and returns whether it did, so your exists() check is superfluous. (Note that I have omitted proper error handling for brevity.)
I used name.basics from https://datasets.imdbws.com/. That is a 509 MB file (unzipped), containing 8,837,960 lines. Those lines are actually all unique, so the end result is the same as the input.
It actually consumed a heck of a lot of resources and my system became rather slow. At first, I even got an OutOfMemoryError! But running it with more heap space worked:
time java -Xmx4g SortTest ./name.basics.tsv
Gives me:
real 0m44.289s
user 1m23.128s
sys 0m2.856s
So around 44 seconds, not bad. Now let's avoid the reallocations and set:
Set<String> set = new HashSet<String>(9000000, 0.9f);
Result:
real 0m38.443s
user 1m12.140s
sys 0m2.376s
Well, that looks better. I have to say, though, that I reran those tests multiple times and the times can vary by up to 5 seconds, so in reality the results are very close.
Just for fun, I'll also show my own little implementation, which uses more modern and succinct Java (again, no proper error handling):
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class SortTest2 {

    public static void main(String[] args) throws Exception {
        Set<String> uniq = new HashSet<>(100000, 0.9f);
        try (Stream<String> stream = Files.lines(Paths.get(args[0]))) {
            stream.forEach(uniq::add);
        }
        Files.write(Paths.get("./outfile2"), (Iterable<String>) uniq::iterator);
    }
}
Results:
real 0m38.321s
user 1m16.452s
sys 0m2.828s
Less code, but the result is pretty much the same. Note: if you replace the HashSet with a LinkedHashSet, it will preserve the ordering of your lines! This is a good example of why you should declare your variables and arguments with the most generic type possible. If you use Set<String> uniq, you only have to change that one line to switch implementations (HashSet vs. LinkedHashSet).
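For instance, keeping the same sizing as above, the order-preserving variant would only differ in the construction:
Set<String> uniq = new LinkedHashSet<>(100000, 0.9f);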
I actually wanted to have a look at it with a profiler, but the run time was so short, I didn't even get results before the program terminated.
If the file fits in your RAM and you use the appropriate max heap argument (-Xmx), it shouldn't be a problem.
By the way: I re-tested the cat | sort -u version - it took 55 seconds!
Note: heavily edited post after more testing
EDIT
Following user DodgyCodeException's suggestion, I removed the superfluous .stream() call in the second version.
OK, this is the best solution™ - I would say it was a collaborative effort, thanks to users Hulk and vlaz.
import java.nio.file.*;
import java.util.stream.*;

public class SortTest3 {

    public static void main(String[] args) throws Exception {
        try (Stream<String> stream = Files.lines(Paths.get(args[0]))) {
            Files.write(Paths.get("./outfile3"), (Iterable<String>) stream.distinct()::iterator);
        }
    }
}
Not only is this solution very succinct (possibly too much so) and as fast as the other one, but best of all it preserves order. All thanks to .distinct().
Alternative Solutions
I think the above solution should suffice for most use-cases and is rather simple. But let's say you need to deal with a file, which does not fit into RAM, or you need to preserve line ordering. We can take the idea behind this solution and change it up a bit.
You read the file line by line, so you'll only ever have one line in memory at a time, let's say of average length m. You then need some identifier to store and compare later, preferably with a constant size k where k << m. So you need a hash function, and not a fast one with many collisions, but a cryptographic hash function, which is more collision resistant (e.g., SHA-1, SHA-2 or SHA-3). But note: the more collision resistant, the larger the hash and the more computational work you need to put in.
You'll need a collection to hold those identifiers, for example a linked list to keep insertion cheap (and that list has to grow as you go). The list can be kept ordered by your insertion strategy, and the output file preserves the original line order because you write each line out immediately the first time its identifier is seen.
This would take up approximately n * k + m in space, but calculating the hash function will be computationally expensive.
Note that this does not deal with collisions. If you use a good hash function, you can just pretend they won't happen (as they are very unlikely). If it is critical, you might need add another mechanism to confirm uniqueness, e.g., store the line number alongside the hash and fetch the previously seen line for a comparison. You'll then need to find a scheme to store the lines with collided hashes.
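A minimal sketch of that idea, assuming SHA-256 digests as the identifiers and a plain HashSet of digests in place of the linked list described above (file names are chosen for illustration):
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

public class HashDedup {

    public static void main(String[] args) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        Set<String> seen = new HashSet<>();  // holds only the constant-size digests, not the lines

        try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
             BufferedWriter out = new BufferedWriter(new FileWriter("./outfile4"))) {
            for (String line = in.readLine(); line != null; line = in.readLine()) {
                byte[] digest = md.digest(line.getBytes(StandardCharsets.UTF_8));
                // Write the line immediately the first time its digest is seen,
                // so the output keeps the original order.
                if (seen.add(Base64.getEncoder().encodeToString(digest))) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }
}
Base64-encoding a SHA-256 digest keeps a constant 44 characters per line in memory, independent of how long the lines themselves are; storing the raw digest bytes (e.g., wrapped in a ByteBuffer) would be even more compact.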