
How to find common strings among two very large files?

I have two very large files (neither of them fits in memory). Each line of each file contains a single string, which has no spaces in it and is 99, 100, or 101 characters long.

Update: The strings are not in any sorted order.
Update2: I am working with Java on Windows.

Now I want to figure out the best way to find all the strings that occur in both files.

I have been thinking about using external merge sort to sort both files and then comparing them, but I am not sure that would be the best way to do it. Since the strings are mostly around the same length, I was also wondering whether computing some kind of hash for each string would be a good idea, since that should make comparisons between strings cheaper. But then I would have to store the hashes computed for the strings encountered so far, so that they can be compared against strings from the other file later. I am not able to pin down what exactly would be the best way, so I am looking for your suggestions.
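To make the hashing idea concrete, here is a rough sketch of the direction I was considering, in Java: partition each file into bucket files by hash, so that equal strings always land in same-numbered buckets, and then intersect each pair of buckets in memory. The bucket count and file names below are just placeholders.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class HashPartition {
    static final int K = 256;   // bucket count, chosen so each bucket fits in memory

    // Split one input file into K bucket files named prefix0 .. prefix(K-1).
    static void partition(String input, String prefix) throws IOException {
        PrintWriter[] buckets = new PrintWriter[K];
        for (int i = 0; i < K; i++) {
            buckets[i] = new PrintWriter(Files.newBufferedWriter(Paths.get(prefix + i)));
        }
        try (BufferedReader in = Files.newBufferedReader(Paths.get(input))) {
            String line;
            while ((line = in.readLine()) != null) {
                buckets[Math.floorMod(line.hashCode(), K)].println(line);
            }
        } finally {
            for (PrintWriter w : buckets) {
                if (w != null) {
                    w.close();
                }
            }
        }
    }

    // Intersect one pair of bucket files in memory; each bucket is small.
    static void intersectBucket(String fileA, String fileB) throws IOException {
        Set<String> seen = new HashSet<>();
        try (BufferedReader a = Files.newBufferedReader(Paths.get(fileA))) {
            String line;
            while ((line = a.readLine()) != null) {
                seen.add(line);
            }
        }
        try (BufferedReader b = Files.newBufferedReader(Paths.get(fileB))) {
            String line;
            while ((line = b.readLine()) != null) {
                if (seen.remove(line)) {     // remove so each match prints once
                    System.out.println(line);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        partition("file1.txt", "a_");        // placeholder input names
        partition("file2.txt", "b_");
        for (int i = 0; i < K; i++) {
            intersectBucket("a_" + i, "b_" + i);
        }
    }
}

It seems this would also extend to more than two files, by keeping, for each bucket index, only the strings present in every file's bucket.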

When you suggest a solution, please also state whether it would work if there were more than 2 files and the strings occurring in all of them had to be found.

Asked by Skylark on Mar 18 '09.

People also ask

How do I find the common strings in two files?

Use comm -12 file1 file2 to get the lines common to both files. Both files need to be sorted for comm to work as expected. Alternatively, grep needs the -x option to match the whole line, and the -F option tells grep to treat the pattern as a fixed string rather than a regex.

Which command is used to find similar lines in two files Linux?

Use the comm command; it compares two sorted files line by line.


3 Answers

I would sort each file, then use a Balanced Line algorithm, reading one line at a time from one file or the other.
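A minimal sketch of that merge pass in Java, assuming both inputs have already been externally sorted (and deduplicated) into the hypothetical files file1.sorted and file2.sorted:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BalancedLine {
    public static void main(String[] args) throws IOException {
        try (BufferedReader a = Files.newBufferedReader(Paths.get("file1.sorted"));
             BufferedReader b = Files.newBufferedReader(Paths.get("file2.sorted"))) {
            String x = a.readLine();
            String y = b.readLine();
            while (x != null && y != null) {
                int cmp = x.compareTo(y);
                if (cmp == 0) {              // common string: report it, advance both sides
                    System.out.println(x);
                    x = a.readLine();
                    y = b.readLine();
                } else if (cmp < 0) {        // left side is behind: advance it
                    x = a.readLine();
                } else {                     // right side is behind: advance it
                    y = b.readLine();
                }
            }
        }
    }
}

The same idea handles any number of files with one reader per file: advance whichever reader currently holds the smallest line, and report a string only when every reader agrees on it.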

Answered by mbeckish.


You haven't said what platform you're working on, so I assume you're working on Windows, but in the unlikely event that you're on a Unix platform, standard tools will do it for you.

sort file1 | uniq > output     # deduplicate each file so every string counts at most once per file
sort file2 | uniq >> output
sort file3 | uniq >> output
...
sort output | uniq -d          # print the lines that now appear more than once
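Note that uniq -d prints the lines that occur at least twice, so with more than two files this reports strings found in at least two of them, not in all. To require a string to appear in every one of N files, count occurrences instead, e.g. sort output | uniq -c | awk '$1 == 3' for three files.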
Answered by Leonard.


I'd do it as follows (for any number of files):

  • Sort just one file (#1).
  • Walk through each line of the next file (#2) and do a binary search on file #1 (based on its number of lines); see the sketch after this list.
  • If you find the string, write it to another temp file (#temp1).
  • After you finish with #2, sort #temp1, then go to #3 and do the same search, but this time against #temp1 rather than #1; that pass should take much less time than the first, since #temp1 only holds the strings repeated so far.
  • Repeat this process with new temporary files, deleting the previous #temp files. Each iteration should take less and less time, as the number of repeated strings diminishes.
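A sketch of the binary-search step in Java, using byte offsets rather than line numbers (a plain text file can't be indexed by line number directly), and assuming ASCII lines so that String.compareTo agrees with the file's sort order. Each probe seeks to the middle of the remaining byte range, skips the partial line it lands in, and compares the next complete line; the first line of the file is checked separately, since a probe can never yield it.

import java.io.IOException;
import java.io.RandomAccessFile;

public final class SortedFileSearch {

    // Returns true if the sorted file contains target as a complete line.
    public static boolean contains(RandomAccessFile f, String target) throws IOException {
        f.seek(0);
        String first = f.readLine();
        if (first == null) {
            return false;               // empty file
        }
        if (first.equals(target)) {
            return true;                // lineAfter() below never yields the first line
        }
        long lo = -1;                   // invariant: lineAfter(lo) < target (lo == -1 is a sentinel)
        long hi = f.length();           // invariant: lineAfter(hi) >= target, or hi is past end of file
        while (hi - lo > 1) {
            long mid = (lo + hi) / 2;
            String line = lineAfter(f, mid);
            if (line == null || line.compareTo(target) >= 0) {
                hi = mid;
            } else {
                lo = mid;
            }
        }
        return target.equals(lineAfter(f, hi));
    }

    // Seeks to an arbitrary byte offset, discards the (possibly partial) line
    // found there, and returns the next complete line, or null past end of file.
    private static String lineAfter(RandomAccessFile f, long offset) throws IOException {
        if (offset >= f.length()) {
            return null;
        }
        f.seek(offset);
        f.readLine();                   // skip forward to the next line boundary
        return f.readLine();
    }
}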
Answered by Seb.