
How to find common strings among two very large files?

I have two very large files (neither of them fits in memory). Each line of each file contains a single string, which has no spaces in it and is 99, 100, or 101 characters long.

Update: The strings are not in any sorted order.
Update2: I am working with Java on Windows.

Now I want to figure out the best way to find all the strings that occur in both files.

I have been thinking about using external merge sort to sort both files and then comparing them, but I am not sure that would be the best way to do it. Since the strings are mostly around the same length, I was also wondering whether computing some kind of hash for each string would be a good idea, since that should make comparisons between strings cheaper. But then I would have to store the hashes computed for the strings encountered so far, so that they can be compared against strings from the other file later. I am not able to pin down what exactly would be the best way, so I am looking for your suggestions.
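To make the hashing idea concrete, here is a rough sketch of the direction I was considering, in Java: partition each file into bucket files by hash, so that equal strings always land in same-numbered buckets, and then intersect each pair of buckets in memory. The bucket count and file names below are just placeholders.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class HashPartition {
    static final int K = 256;   // bucket count, chosen so each bucket fits in memory

    // Split one input file into K bucket files named prefix0 .. prefix(K-1).
    static void partition(String input, String prefix) throws IOException {
        PrintWriter[] buckets = new PrintWriter[K];
        for (int i = 0; i < K; i++) {
            buckets[i] = new PrintWriter(Files.newBufferedWriter(Paths.get(prefix + i)));
        }
        try (BufferedReader in = Files.newBufferedReader(Paths.get(input))) {
            String line;
            while ((line = in.readLine()) != null) {
                buckets[Math.floorMod(line.hashCode(), K)].println(line);
            }
        } finally {
            for (PrintWriter w : buckets) {
                if (w != null) {
                    w.close();
                }
            }
        }
    }

    // Intersect one pair of bucket files in memory; each bucket is small.
    static void intersectBucket(String fileA, String fileB) throws IOException {
        Set<String> seen = new HashSet<>();
        try (BufferedReader a = Files.newBufferedReader(Paths.get(fileA))) {
            String line;
            while ((line = a.readLine()) != null) {
                seen.add(line);
            }
        }
        try (BufferedReader b = Files.newBufferedReader(Paths.get(fileB))) {
            String line;
            while ((line = b.readLine()) != null) {
                if (seen.remove(line)) {     // remove so each match prints once
                    System.out.println(line);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        partition("file1.txt", "a_");        // placeholder input names
        partition("file2.txt", "b_");
        for (int i = 0; i < K; i++) {
            intersectBucket("a_" + i, "b_" + i);
        }
    }
}

It seems this would also extend to more than two files, by keeping, for each bucket index, only the strings present in every file's bucket.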

When you suggest a solution, please also state whether it would work if there were more than 2 files and the strings occurring in all of them had to be found.

Asked by Skylark on Mar 18 '09.

People also ask

How do I find the common strings in two files?

Use comm -12 file1 file2 to get the lines common to both files. Both files need to be sorted for comm to work as expected. Alternatively, grep needs the -x option to match the whole line, and the -F option tells grep to treat the pattern as a fixed string rather than a regex.

Which command is used to find similar lines in two files Linux?

Use the comm command; it compares two sorted files line by line.


3 Answers

I would sort each file, then use a Balanced Line algorithm, reading one line at a time from one file or the other.
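A minimal sketch of that merge pass in Java, assuming both inputs have already been externally sorted (and deduplicated) into the hypothetical files file1.sorted and file2.sorted:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BalancedLine {
    public static void main(String[] args) throws IOException {
        try (BufferedReader a = Files.newBufferedReader(Paths.get("file1.sorted"));
             BufferedReader b = Files.newBufferedReader(Paths.get("file2.sorted"))) {
            String x = a.readLine();
            String y = b.readLine();
            while (x != null && y != null) {
                int cmp = x.compareTo(y);
                if (cmp == 0) {              // common string: report it, advance both sides
                    System.out.println(x);
                    x = a.readLine();
                    y = b.readLine();
                } else if (cmp < 0) {        // left side is behind: advance it
                    x = a.readLine();
                } else {                     // right side is behind: advance it
                    y = b.readLine();
                }
            }
        }
    }
}

The same idea handles any number of files with one reader per file: advance whichever reader currently holds the smallest line, and report a string only when every reader agrees on it.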

Answered by mbeckish.


You haven't said what platform you're working on, so I assume you're working on Windows, but in the unlikely event that you're on a Unix platform, standard tools will do it for you.

sort file1 | uniq > output     # deduplicate each file so every string counts at most once per file
sort file2 | uniq >> output
sort file3 | uniq >> output
...
sort output | uniq -d          # print the lines that now appear more than once
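Note that uniq -d prints the lines that occur at least twice, so with more than two files this reports strings found in at least two of them, not in all. To require a string to appear in every one of N files, count occurrences instead, e.g. sort output | uniq -c | awk '$1 == 3' for three files.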
Answered by Leonard.


I'd do it as follows (for any number of files):

  • Sort just one file (#1).
  • Walk through each line of the next file (#2) and do a binary search on file #1 (based on its number of lines); see the sketch after this list.
  • If you find the string, write it to another temp file (#temp1).
  • After you finish with #2, sort #temp1, then go to #3 and do the same search, but this time against #temp1 rather than #1; that pass should take much less time than the first, since #temp1 only holds the strings repeated so far.
  • Repeat this process with new temporary files, deleting the previous #temp files. Each iteration should take less and less time, as the number of repeated strings diminishes.
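A sketch of the binary-search step in Java, using byte offsets rather than line numbers (a plain text file can't be indexed by line number directly), and assuming ASCII lines so that String.compareTo agrees with the file's sort order. Each probe seeks to the middle of the remaining byte range, skips the partial line it lands in, and compares the next complete line; the first line of the file is checked separately, since a probe can never yield it.

import java.io.IOException;
import java.io.RandomAccessFile;

public final class SortedFileSearch {

    // Returns true if the sorted file contains target as a complete line.
    public static boolean contains(RandomAccessFile f, String target) throws IOException {
        f.seek(0);
        String first = f.readLine();
        if (first == null) {
            return false;               // empty file
        }
        if (first.equals(target)) {
            return true;                // lineAfter() below never yields the first line
        }
        long lo = -1;                   // invariant: lineAfter(lo) < target (lo == -1 is a sentinel)
        long hi = f.length();           // invariant: lineAfter(hi) >= target, or hi is past end of file
        while (hi - lo > 1) {
            long mid = (lo + hi) / 2;
            String line = lineAfter(f, mid);
            if (line == null || line.compareTo(target) >= 0) {
                hi = mid;
            } else {
                lo = mid;
            }
        }
        return target.equals(lineAfter(f, hi));
    }

    // Seeks to an arbitrary byte offset, discards the (possibly partial) line
    // found there, and returns the next complete line, or null past end of file.
    private static String lineAfter(RandomAccessFile f, long offset) throws IOException {
        if (offset >= f.length()) {
            return null;
        }
        f.seek(offset);
        f.readLine();                   // skip forward to the next line boundary
        return f.readLine();
    }
}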
Answered by Seb.