
Prefix search against half a billion strings

I have a list of 500 million strings. They are alphanumeric, ASCII, of varying length (usually 2-30 characters), and each is a single word or a concatenation of words without spaces (like 'helloiamastring').

What I need is a fast way to check against a target, say 'hi'. The result should be all strings from the 500 million list which start with 'hi' (e.g. 'hithere', 'hihowareyou', etc.). This needs to be fast because a new query is issued each time the user types something: if they type "hi", all strings starting with "hi" should be shown; if they type "hey", all strings starting with "hey", and so on.

I've tried a trie, but the memory footprint for storing 300 million strings is huge: it would require over 100 GB of RAM. And I'm fairly sure the list will grow to a billion.

What is a fast algorithm for this use case?

P.S. If there's no fast option, the best alternative would be to require users to enter at least, say, 4 characters before results show up. Is there a fast way to retrieve the results in that case?

asked Jan 15 '17 by anemaria20


2 Answers

You want a Directed Acyclic Word Graph or DAWG. This generalizes @greybeard's suggestion to use stemming.

See, for example, the discussion in section 3.2 of this.
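To make the suggestion concrete, here is a minimal sketch of a DAWG in Python, using the classic incremental construction for lexicographically sorted input (Daciuk et al.). All class and method names here are my own invention, not from the linked paper; a production version for 500M strings would use a compact packed representation rather than Python objects.

```python
class DawgNode:
    """A state in the DAWG; equal right languages get merged."""
    next_id = 0

    def __init__(self):
        self.id = DawgNode.next_id
        DawgNode.next_id += 1
        self.final = False
        self.edges = {}  # char -> DawgNode

    def key(self):
        # Hashable signature: two nodes with the same signature are equivalent,
        # provided their children have already been minimized.
        return (self.final, tuple(sorted((c, n.id) for c, n in self.edges.items())))


class Dawg:
    def __init__(self):
        self.root = DawgNode()
        self.prev_word = ""
        self.unchecked = []   # (parent, char, child) not yet minimized
        self.minimized = {}   # signature -> representative node

    def insert(self, word):
        # Words must be inserted in sorted order for this algorithm.
        assert word >= self.prev_word
        # Length of the common prefix with the previous word.
        i = 0
        while i < len(word) and i < len(self.prev_word) and word[i] == self.prev_word[i]:
            i += 1
        # The suffix of the previous word can no longer change: minimize it.
        self._minimize(i)
        node = self.unchecked[-1][2] if self.unchecked else self.root
        for c in word[i:]:
            child = DawgNode()
            node.edges[c] = child
            self.unchecked.append((node, c, child))
            node = child
        node.final = True
        self.prev_word = word

    def finish(self):
        self._minimize(0)

    def _minimize(self, down_to):
        # Pop deepest-first so children are minimized before their parents.
        while len(self.unchecked) > down_to:
            parent, c, child = self.unchecked.pop()
            k = child.key()
            if k in self.minimized:
                parent.edges[c] = self.minimized[k]  # merge equivalent states
            else:
                self.minimized[k] = child

    def with_prefix(self, prefix, limit=10):
        """Enumerate up to `limit` words starting with `prefix`."""
        node = self.root
        for c in prefix:
            if c not in node.edges:
                return []
            node = node.edges[c]
        out = []

        def dfs(n, acc):
            if len(out) >= limit:
                return
            if n.final:
                out.append(prefix + acc)
            for c in sorted(n.edges):
                dfs(n.edges[c], acc + c)

        dfs(node, "")
        return out
```

Because equivalent suffixes are shared, a DAWG over word-like strings is typically far smaller than the corresponding trie, while prefix enumeration works the same way: walk the prefix, then DFS from the node you land on.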

answered Sep 22 '22 by Jim D.


If the strings are sorted, then a binary search is reasonable. As a speedup, you could maintain a dictionary of all possible bigrams ("aa", "ab", etc.) where the corresponding values are the first and last index of the strings starting with that bigram (if any do), letting you zero in, in O(1) time, on a much smaller sublist that contains the strings you are looking for. Once you find a match, do a linear scan to the left and right to collect all other matches.
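A minimal sketch of this idea in Python, using the standard-library `bisect` module; function names are my own. Instead of a linear scan around one match, the prefix range is found directly with two binary searches bounded by the bigram's slice, which is equivalent and simpler to express with `bisect`:

```python
import bisect

def build_bigram_index(words):
    """words must be sorted; maps each leading bigram to its [first, last] index."""
    index = {}
    for i, w in enumerate(words):
        bg = w[:2]
        if bg not in index:
            index[bg] = [i, i]
        else:
            index[bg][1] = i
    return index

def prefix_matches(words, index, prefix):
    if not prefix:
        return list(words)
    if len(prefix) >= 2:
        bounds = index.get(prefix[:2])
        if bounds is None:
            return []
        lo_bound, hi_bound = bounds[0], bounds[1] + 1  # restrict to the bigram's slice
    else:
        lo_bound, hi_bound = 0, len(words)
    # All strings starting with `prefix` lie between `prefix` itself and the
    # smallest string whose last prefix character is bumped by one (ASCII only).
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect.bisect_left(words, prefix, lo_bound, hi_bound)
    hi = bisect.bisect_left(words, upper, lo_bound, hi_bound)
    return words[lo:hi]
```

For 500M strings the sorted list can live in a flat memory-mapped file on disk, with the bigram index held in memory; each query then touches only the O(log n) probes of the binary search plus the matching range.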

answered Sep 20 '22 by John Coleman