
Algorithm for matching strings between two large files

I have a question regarding a search algorithm. I currently have 2 files in plain text, each one of them has at least 10 million lines. For now, each line is a string, and I want to find each string in the first file that also appears in the second file. Is there a good way to do this efficiently? Any suggestions from either algorithm or a special language feature is appreciated.

asked Sep 01 '11 by Shang Wang

1 Answer

If you don't know anything about the structure of the files (such as whether or not they're sorted), there are many different approaches you could take to solve the problem; depending on your constraints on time, memory, and disk space, one of them may be what you're looking for.

If you have gobs of free RAM available, one option would be to create a hash table in memory to hold strings. You could then load all of the strings from the first file into the hash table. Then, load each of the strings from the second file one at a time. For each string, check if it's in the hash table. If so, report a match. This approach uses O(m) memory (where m is the number of strings in the first file) and takes at least Ω(m + n) time and possibly more, depending on how the hash function works. This is also (almost certainly) the easiest and most direct way to solve the problem.
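For concreteness, here is a minimal Python sketch of this first approach (the file names are placeholders, and it assumes one string per line):

    def find_common_lines(path_a, path_b):
        # Load every line of the first file into a hash set: O(m) memory.
        with open(path_a, encoding="utf-8") as f:
            seen = {line.rstrip("\n") for line in f}

        # Stream the second file one line at a time and report matches.
        matches = []
        with open(path_b, encoding="utf-8") as f:
            for line in f:
                if line.rstrip("\n") in seen:
                    matches.append(line.rstrip("\n"))
        return matches

    if __name__ == "__main__":
        for s in find_common_lines("file1.txt", "file2.txt"):
            print(s)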

If you don't have much RAM available but time isn't much of a constraint, you can use a modified version of this first algorithm. Choose some number of lines to load in from the first file, and load just those strings into a hash table. Once you've done this, scan over the entire second file to find any matches. Then evict all of the lines from the hash table, load in the next block of lines from the first file, and repeat. This has runtime Ω(mn/b), where b is the block size (since there are O(m/b) iterations, each doing a complete linear scan of all n lines of the second file). Alternatively, if you know that one file is much smaller than the other, you might want to consider just loading that whole file into memory and then scanning the other file.
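A sketch of this blocked variant, with the block size left as a tunable parameter (the value below is only illustrative):

    from itertools import islice

    def find_common_lines_blocked(path_a, path_b, block_size=1_000_000):
        matches = []
        with open(path_a, encoding="utf-8") as f_a:
            while True:
                # Load the next block of lines from the first file into a hash set.
                block = {line.rstrip("\n") for line in islice(f_a, block_size)}
                if not block:
                    break
                # Rescan the entire second file against this block.
                with open(path_b, encoding="utf-8") as f_b:
                    for line in f_b:
                        if line.rstrip("\n") in block:
                            matches.append(line.rstrip("\n"))
        return matches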

If you don't have much RAM available but do have the ability to use up more disk space, one option might be to use an external sorting algorithm to sort each of the two files (or, at least, to construct an index listing the lines of each file in sorted order). Once you have the files in sorted order, you can scan across them in parallel, finding all matches. This uses the more general algorithm for finding all duplicated elements in two sorted ranges, which works like this (a short sketch in code follows the steps):

  1. Keep track of two indices, one into the first list and one into the second list, that both start at zero.
  2. While both lists have items left:
    1. If the items at the corresponding indices match, report a match and increase both indices.
    2. Otherwise, if the item in the first list is smaller than the item in the second list, increase the index into the first list.
    3. Otherwise, increase the index of the second list.
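Here is a rough Python sketch of that parallel scan, assuming both files have already been put into sorted order (for example with an external sort such as the Unix sort utility) and that every line ends with a newline:

    def common_sorted(path_a, path_b):
        matches = []
        with open(path_a, encoding="utf-8") as f_a, \
             open(path_b, encoding="utf-8") as f_b:
            a, b = f_a.readline(), f_b.readline()
            # readline() returns "" only at end of file, so "" means "no items left".
            while a and b:
                if a == b:
                    # Report a match and advance both files.
                    matches.append(a.rstrip("\n"))
                    a, b = f_a.readline(), f_b.readline()
                elif a < b:
                    # The current line of the first file is smaller: advance it.
                    a = f_a.readline()
                else:
                    # Otherwise advance the second file.
                    b = f_b.readline()
        return matches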

This algorithm would take roughly O(n log n) time to sort the two files and would then make a total of O(n) comparisons to find the common items across the lists. However, since string comparisons do not necessarily run in O(1) time (in fact, they often take much longer), the actual runtime might be much greater. If we assume that each file consists of n strings of length m, then the sorting step would take O(mn log n) time, since each comparison takes O(m) time. Similarly, the comparison step might take O(mn) time, because each string comparison could take up to O(m) time as well. As a possible optimization, you may want to consider computing a small hash code (say, 16 or 32 bits) for each string and comparing hash codes before falling back to a full string comparison. Assuming that the hash function gives good uniformity, this can dramatically cut down on the time required to compare strings, since most strings that aren't identical will have different hash codes and can be distinguished in O(1) time.
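One way to realize that hash-first optimization in Python (the CRC-32 below is just an illustrative choice of cheap hash) is to sort by a (hash, string) pair, so that tuple comparison usually stops at the integer and only falls back to the full string when two hashes collide:

    import zlib

    def sort_key(line):
        # Tag each line with a 32-bit CRC; tuples compare element by element,
        # so lines with different hashes are ordered in O(1) without ever
        # comparing the strings themselves.
        return (zlib.crc32(line.encode("utf-8")), line)

    lines = ["banana", "apple", "apple", "cherry"]
    lines.sort(key=sort_key)

Note that this orders lines by hash value rather than lexicographically, which is fine for the intersection step as long as both files are sorted with the same key.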

Finally, if each line of the file is reasonably long (say, at least 8 characters), one option might be to compute a 64-bit or larger hash value for each of the lines of the files. You could then use any of the above techniques to see if any hash codes are repeated across the two files (holding everything in a hash table, using external sorting, etc.). Assuming that you have enough bits in your hash code, the number of collisions should be low, and you should be able to find matches quickly with far less memory usage.
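A sketch of this fingerprinting idea, using the first 8 bytes of a BLAKE2b digest as the 64-bit hash (the helper names are illustrative, and a rare collision would produce a false match, so you may want to verify candidates against the original files afterwards):

    import hashlib

    def fingerprint(line):
        # 64-bit fingerprint: the first 8 bytes of a BLAKE2b digest.
        digest = hashlib.blake2b(line.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def common_fingerprints(path_a, path_b):
        with open(path_a, encoding="utf-8") as f:
            hashes_a = {fingerprint(line.rstrip("\n")) for line in f}
        with open(path_b, encoding="utf-8") as f:
            # Intersect 8-byte fingerprints instead of full strings.
            return hashes_a & {fingerprint(line.rstrip("\n")) for line in f}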

Hope this helps!

Woohoo! This is my 1000th answer on Stack Overflow! :-)

answered Sep 22 '22 by templatetypedef