As part of a project I'm working on, I'd like to remove duplicate lines from a file I generate. These duplicates often won't occur near each other, however. I came up with a method of doing so in Java (which basically made a copy of the file, then used a nested while-statement to compare each line in one file with the rest of the other). The problem is that my generated file is pretty big and text heavy (about 225k lines of text, and around 40 MB). I estimate my current process would take 63 hours! This is definitely not acceptable.
I need an integrated solution for this, however. Preferably in Java. Any ideas? Thanks!
The uniq command removes duplicate lines from a text file on Linux. By default it discards all but the first of adjacent repeated lines, so duplicates that are not next to each other survive unless the input is sorted first. Optionally, it can instead print only the duplicate lines.
We can remove duplicate elements from an array in two ways: using a temporary array or using a separate index. Both approaches assume the array is sorted, so that equal elements sit next to each other. If the array is not sorted, you can sort it first by calling Arrays.sort(arr).
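As a rough illustration of the separate-index approach on a sorted array, something like the following should work. The class and method names (ArrayDedup, removeDuplicates) are made up for this sketch, and it works on a copy so the caller's array is left alone.

```java
import java.util.Arrays;

class ArrayDedup {
    // Returns a new array with the distinct values of input, in ascending order.
    // Hypothetical helper for illustration; not from the original thread.
    static int[] removeDuplicates(int[] input) {
        if (input.length == 0) {
            return new int[0];
        }
        int[] arr = input.clone();   // leave the caller's array untouched
        Arrays.sort(arr);            // duplicates become adjacent
        int last = 0;                // index of the last unique element kept
        for (int i = 1; i < arr.length; i++) {
            if (arr[i] != arr[last]) {
                arr[++last] = arr[i]; // compact unique values to the front
            }
        }
        return Arrays.copyOf(arr, last + 1);
    }

    public static void main(String[] args) {
        int[] result = removeDuplicates(new int[] {3, 1, 3, 2, 1});
        System.out.println(Arrays.toString(result)); // prints [1, 2, 3]
    }
}
```

Note that sorting changes the order of the elements, which is why this is a poor fit when the original line order matters.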
Hmm... 40 megs seems small enough that you could build a Set of the lines and then print them all back out. This would be way, way faster than doing O(n²) I/O work.
It would be something like this (ignoring exceptions):
    import java.io.*;
    import java.util.*;

    public void stripDuplicatesFromFile(String filename) throws IOException {
        // Read every line into a set; duplicates are dropped automatically.
        BufferedReader reader = new BufferedReader(new FileReader(filename));
        Set<String> lines = new HashSet<String>(10000); // maybe should be bigger
        String line;
        while ((line = reader.readLine()) != null) {
            lines.add(line);
        }
        reader.close();

        // Overwrite the original file with the unique lines (order not preserved).
        BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
        for (String unique : lines) {
            writer.write(unique);
            writer.newLine();
        }
        writer.close();
    }
If the order is important, you could use a LinkedHashSet instead of a HashSet. Since the elements are stored by reference, the overhead of an extra linked list should be insignificant compared to the actual amount of data.
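A minimal sketch of the order-preserving variant might look like this. The names OrderedDedup and stripDuplicatesKeepingOrder are invented for the example, and it assumes the file is UTF-8 and small enough to read into memory in one go (40 MB should be fine).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class OrderedDedup {
    // Rewrites the file in place, keeping only the first occurrence of each line.
    // Assumes UTF-8 text; the method name is hypothetical.
    static void stripDuplicatesKeepingOrder(Path file) throws IOException {
        List<String> all = Files.readAllLines(file, StandardCharsets.UTF_8);
        // LinkedHashSet iterates in insertion order, so first occurrences win.
        Set<String> unique = new LinkedHashSet<>(all);
        Files.write(file, unique, StandardCharsets.UTF_8);
    }
}
```

Calling it would be something like OrderedDedup.stripDuplicatesKeepingOrder(Paths.get("generated.txt")), where the file name is just a placeholder.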
Edit: As Workshop Alex pointed out, if you don't mind making a temporary file, you can simply print out the lines as you read them, writing each line only the first time you see it. This allows you to use a simple HashSet instead of LinkedHashSet. But I doubt you'd notice the difference on an I/O bound operation like this one.
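Here is roughly how the temporary-file version could go. Again the names (StreamingDedup, stripDuplicates) are made up, the file is assumed to be UTF-8, and putting the temp file next to the original is just one reasonable choice. It leans on the fact that Set.add returns false when the element is already present.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashSet;
import java.util.Set;

class StreamingDedup {
    // Copies first occurrences to a temp file as they are read, then swaps it in.
    // Assumes UTF-8 text; the method name is hypothetical.
    static void stripDuplicates(Path file) throws IOException {
        Path temp = Files.createTempFile(file.toAbsolutePath().getParent(), "dedup", ".tmp");
        Set<String> seen = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // add() returns true only the first time a line is seen.
                if (seen.add(line)) {
                    writer.write(line);
                    writer.newLine();
                }
            }
        }
        // Both streams are closed here, so the original can be replaced.
        Files.move(temp, file, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Order is preserved because lines are written in the order they are read, so the set only has to answer "seen before or not".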