I have a sorted list of 1,000,000 protein names (strings of at most 256 characters), each with an associated ID. I have another, unsorted list of 4,000,000,000 words taken from articles, also strings of at most 256 characters, each with an ID.
I want to find all matches between the list of protein names and the list of words from the articles. Which algorithm should I use? Should I use some prebuilt API?
It would be good if the algorithm runs on a normal PC without special hardware.
An estimate of the algorithm's running time would be nice but not obligatory.
4 billion strings is a lot of strings to search.
You may be able to fit the entire data structure into an in-memory hash for fast lookups. More likely, though, you'd want to store the list on more spacious (but slower) disk, in which case a sorted list lends itself to the relatively efficient binary search algorithm.
If your binary search (or similar) function were called find_string_in_articles(), the pseudocode would be:
foreach $protein_name ( @protein_names ) {
    if ( $article_id = find_string_in_articles( $protein_name ) ) {
        print( "$protein_name matches $article_id\n" );
    }
}
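A minimal Python sketch of the same loop, assuming the article words have been sorted and their IDs are held in a parallel list (the toy data and the function name find_string_in_articles are illustrative, not the asker's real files):

```python
import bisect

# Toy stand-ins for the real data: sorted article words with parallel IDs,
# and the protein names to look up.
article_words = ["actin", "insulin", "level", "rose", "the"]  # must stay sorted
article_ids = [101, 102, 103, 104, 105]
protein_names = ["insulin", "kinase", "actin"]

def find_string_in_articles(name):
    """Binary-search the sorted article words; return the word's ID or None."""
    i = bisect.bisect_left(article_words, name)
    if i < len(article_words) and article_words[i] == name:
        return article_ids[i]
    return None

for protein_name in protein_names:
    article_id = find_string_in_articles(protein_name)
    if article_id is not None:
        print(f"{protein_name} matches {article_id}")
```

In practice the 4 billion words would live on disk (e.g. a sorted file searched by seeking, or a memory-mapped array), but the lookup logic is the same.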
You could sort both lists and then do a "merge sort"-style pass that doesn't actually merge, but instead finds the duplicates/overlaps between them. Wikipedia has good references on that.
Sorting that amount of data probably requires more memory than you have available. I don't know whether Unix sort (available on Windows and macOS too) can handle it, but any decent SQL database can.
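Once both lists are sorted, the duplicate-finding pass is a merge join: walk the two sequences in lockstep and emit every pair of equal strings. A sketch under the assumption that each side is a sorted sequence of (string, id) pairs (streamed from files in the real case; the toy lists here are just for illustration):

```python
def merge_join(sorted_proteins, sorted_words):
    """Walk two sorted (string, id) sequences in lockstep and yield
    (string, protein_id, word_id) for every match -- a merge join,
    not an actual merge sort."""
    pi = iter(sorted_proteins)
    wi = iter(sorted_words)
    p = next(pi, None)
    w = next(wi, None)
    while p is not None and w is not None:
        if p[0] < w[0]:
            p = next(pi, None)
        elif p[0] > w[0]:
            w = next(wi, None)
        else:
            yield (p[0], p[1], w[1])
            # Advance the words side only: one protein may match many words.
            w = next(wi, None)

proteins = [("actin", 1), ("insulin", 2)]
words = [("actin", 10), ("insulin", 11), ("insulin", 12), ("zebra", 13)]
matches = list(merge_join(proteins, words))
```

Because each input is read sequentially exactly once, this pass is O(n + m) and streams comfortably from disk; the external sort beforehand dominates the cost.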
Another possibility is to use a radix binning of your protein names (those starting with A go to bin A, those starting with B to bin B, and so on). Then just loop over the 4 billion words and locate overlaps (you will probably need more than one level of radix binning to discard more proteins at a time).
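A rough sketch of that binning idea, assuming a two-character prefix as the bin key (the depth of 2 is an arbitrary illustrative choice; in reality you would tune it, and each bin could live in its own file if the names don't all fit in memory):

```python
from collections import defaultdict

def build_bins(protein_names, depth=2):
    """Bucket protein names by their first `depth` characters
    (a shallow radix binning)."""
    bins = defaultdict(set)
    for name in protein_names:
        bins[name[:depth]].add(name)
    return bins

def scan_words(words, bins, depth=2):
    """Stream the article words; only a word whose prefix hits a non-empty
    bin is checked against that bin's (much smaller) set of proteins."""
    for word in words:
        bucket = bins.get(word[:depth])
        if bucket is not None and word in bucket:
            yield word

bins = build_bins(["actin", "insulin"])
hits = list(scan_words(["the", "actin", "dog", "insulin"], bins))
```

With everything in one in-memory structure a plain hash set would do the same job; the binning pays off when the name list is too large for a single structure and each bin must be loaded or searched separately.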