Deleting duplicate lines in a file using Java

As part of a project I'm working on, I'd like to clean up a file I generate by removing duplicate line entries. These duplicates often won't occur near each other, however. I came up with a method of doing so in Java: it basically made a copy of the file, then used a nested while-loop to compare each line in one file with every line of the other. The problem is that my generated file is pretty big and text heavy (about 225k lines of text, around 40 megs), and I estimate my current process would take 63 hours! That is definitely not acceptable.

I need an integrated solution for this, however, preferably in Java. Any ideas? Thanks!

asked Jun 15 '09 by Monster


People also ask

How do I delete duplicate lines in files?

The uniq command is used to remove duplicate lines from a text file in Linux. By default, this command discards all but the first of adjacent repeated lines, so that no output lines are repeated. Optionally, it can instead only print duplicate lines.

How do you remove duplicates in Java?

We can remove duplicate elements from an array in two ways: using a temporary array or using a separate index. For either approach, the array must be in sorted order; if it is not, you can sort it by calling Arrays.sort(arr).
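For illustration, here is a minimal sketch of the separate-index approach on a sorted int array (the class and method names are made up for this example):

import java.util.Arrays;

public class DedupeArray {

    // Removes duplicates from a sorted array in place using a separate
    // write index; returns the new logical length of the array.
    static int removeDuplicates(int[] arr) {
        if (arr.length == 0) return 0;
        int writeIndex = 1;
        for (int i = 1; i < arr.length; i++) {
            if (arr[i] != arr[writeIndex - 1]) {
                arr[writeIndex++] = arr[i];
            }
        }
        return writeIndex;
    }

    public static void main(String[] args) {
        int[] data = {3, 1, 2, 3, 2, 1};
        Arrays.sort(data); // the array must be sorted first
        int n = removeDuplicates(data);
        System.out.println(Arrays.toString(Arrays.copyOf(data, n))); // [1, 2, 3]
    }
}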


1 Answer

Hmm... 40 megs seems small enough that you could build a Set of the lines and then print them all back out. This would be way, way faster than doing O(n²) I/O work.

It would be something like this (ignoring exceptions):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public void stripDuplicatesFromFile(String filename) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader(filename));
    // Use a LinkedHashSet here instead if you need to preserve line order.
    Set<String> lines = new HashSet<String>(10000); // maybe should be bigger
    String line;
    while ((line = reader.readLine()) != null) {
        lines.add(line); // the Set silently drops duplicate lines
    }
    reader.close();

    // Overwrite the original file with the unique lines.
    BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
    for (String unique : lines) {
        writer.write(unique);
        writer.newLine();
    }
    writer.close();
}

If the order is important, you could use a LinkedHashSet instead of a HashSet. Since the elements are stored by reference, the overhead of an extra linked list should be insignificant compared to the actual amount of data.

Edit: As Workshop Alex pointed out, if you don't mind making a temporary file, you can simply print out the lines as you read them. This allows you to use a simple HashSet instead of LinkedHashSet. But I doubt you'd notice the difference on an I/O bound operation like this one.
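A minimal sketch of that temporary-file variant (the method name and temp-file naming are my own; it assumes the same java.io and java.util imports as above, plus java.io.File):

public void stripDuplicatesWithTempFile(String filename) throws IOException {
    // A temporary file alongside the original; the ".tmp" suffix is arbitrary.
    File tempFile = new File(filename + ".tmp");
    Set<String> seen = new HashSet<String>(10000);
    BufferedReader reader = new BufferedReader(new FileReader(filename));
    BufferedWriter writer = new BufferedWriter(new FileWriter(tempFile));
    String line;
    while ((line = reader.readLine()) != null) {
        // add() returns false for lines we've already seen, so each line
        // is written exactly once, in its original order.
        if (seen.add(line)) {
            writer.write(line);
            writer.newLine();
        }
    }
    reader.close();
    writer.close();
    // Swap the deduplicated copy in place of the original.
    // Note: on some platforms renameTo() fails if the target exists,
    // in which case you may need to delete the original first.
    tempFile.renameTo(new File(filename));
}

Because it writes as it reads, this version never holds more than the set of unique lines in memory, and a plain HashSet suffices since the output file, not the set, carries the ordering.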

answered Sep 25 '22 by Michael Myers