I have two very large files, neither of which would fit in memory. Each file has one string per line; the strings contain no spaces and are 99, 100, or 101 characters long.
Update: The strings are not in any sorted order.
Update2: I am working with Java on Windows.
Now I want to figure out the best way to find all the strings that occur in both files.
I have been thinking about using an external merge sort to sort both files and then comparing them, but I am not sure that would be the best way to do it. Since the strings are all about the same length, I also wondered whether computing some kind of hash for each string would be a good idea, since that should make comparisons between strings cheaper. But then I would have to store the hashes of the strings I have seen so far so that they can be compared against later strings. I can't pin down what exactly would be the best way, and I am looking for your suggestions.
When you suggest a solution, please also state whether it would still work with more than two files, where the strings that occur in all of them have to be found.
Use comm -12 file1 file2 to get the lines common to both files. Both files need to be sorted for comm to work as expected. Alternatively, with grep, add the -x option so that a pattern must match the whole line; the -F option tells grep to treat each pattern as a fixed string rather than a regular expression.
Use the comm command; it compares two sorted files line by line.
I would sort each file, then use a Balanced Line algorithm, reading one line at a time from one file or the other.
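Since the question mentions Java, here is a minimal sketch of that line-by-line comparison. It assumes both inputs have already been sorted (for example by an external merge sort) in the same order that String.compareTo uses; the class and method names are made up for illustration. Only one line per file is held in memory at a time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

public class SortedIntersect {
    // Writes each line that occurs in both inputs, once.
    // Assumption: both inputs are sorted in String.compareTo order.
    static void intersect(Reader left, Reader right, Writer out) throws IOException {
        BufferedReader a = new BufferedReader(left);
        BufferedReader b = new BufferedReader(right);
        String x = a.readLine();
        String y = b.readLine();
        String lastWritten = null;
        while (x != null && y != null) {
            int c = x.compareTo(y);
            if (c < 0) {
                x = a.readLine();          // left is behind, advance it
            } else if (c > 0) {
                y = b.readLine();          // right is behind, advance it
            } else {
                if (!x.equals(lastWritten)) {  // skip repeats of the same match
                    out.write(x);
                    out.write('\n');
                    lastWritten = x;
                }
                x = a.readLine();
                y = b.readLine();
            }
        }
        out.flush();
    }
}
```

With real files, the readers and the writer could come from java.nio.file.Files.newBufferedReader and Files.newBufferedWriter. If the files were sorted by an external tool, its collation has to agree with String.compareTo, otherwise matches can be missed.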
You haven't said what platform you're working on, so I assume you're working on Windows, but in the unlikely event that you're on a Unix platform, standard tools will do it for you.
sort file1 | uniq > output
sort file2 | uniq >> output
sort file3 | uniq >> output
...
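sort output | uniq -d
Note that uniq -d prints every line that occurs more than once in output, that is, every string present in at least two of the files. With exactly two files that is the intersection. With more files, count occurrences with uniq -c instead and keep only the lines whose count equals the number of files.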
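For Java on Windows, where sort and uniq may not be available, the following is a rough sketch of a different technique from the pipeline: an N-way merge that only emits strings present in every file. It assumes each input is already sorted in String.compareTo order; the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.util.List;

public class MultiIntersect {
    // Writes each line that occurs in every input, once.
    // Assumption: every input is sorted in String.compareTo order.
    static void intersectAll(List<BufferedReader> inputs, Writer out) throws IOException {
        int n = inputs.size();
        if (n == 0) return;
        String[] head = new String[n];
        for (int i = 0; i < n; i++) {
            head[i] = inputs.get(i).readLine();
            if (head[i] == null) return;   // an empty input means an empty intersection
        }
        String last = null;
        while (true) {
            // The largest current line is the smallest string that can still be common.
            String max = head[0];
            for (int i = 1; i < n; i++) {
                if (head[i].compareTo(max) > 0) max = head[i];
            }
            boolean allEqual = true;
            for (int i = 0; i < n; i++) {
                // Skip lines that sort before the candidate.
                while (head[i] != null && head[i].compareTo(max) < 0) {
                    head[i] = inputs.get(i).readLine();
                }
                if (head[i] == null) { out.flush(); return; }
                if (!head[i].equals(max)) allEqual = false;
            }
            if (allEqual) {
                if (!max.equals(last)) {   // skip repeats of the same match
                    out.write(max);
                    out.write('\n');
                    last = max;
                }
                for (int i = 0; i < n; i++) {
                    head[i] = inputs.get(i).readLine();
                    if (head[i] == null) { out.flush(); return; }
                }
            }
        }
    }
}
```

Memory use is one line per file regardless of file size, so the cost is dominated by sorting the inputs beforehand.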
I'd do it as follows (for any number of files):