Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Search String By SubWords

What Kind of algorithms + data structures that would help me to do that?

Having a file contains like 10000~ lines loaded in memory in a ordered set. With a given search string I want to be able to get all the lines that have words prefixed with words found in search string. Well let me give an example to clarify this:

Lines:

  1. "A brow Fox flies."
  2. "Boxes are full of food."
  3. "Cats runs slow"
  4. "Dogs hates eagles"
  5. "Dolphins have eyes and teath"

Cases 1:

search string = "fl b a"

"A brow Fox flies."

  • Explanation: search string have three words "fl", "b", and "a" and the only string that have some words that are prefixed with words from the search string is line 1.

Case 2:

search string "e do ha"

"Dogs hates eagles", "Dolphins have eyes and teath"

Solution

(fast enough for me it took about 30ms~(including sorting the final result) on my pc on a set of 10k lines 3 words each line)

  • I used a trie suggested in answer.
  • And some other hacky methods to be able to filter out duplicate and false positive results (mainly used hash sets for this).
like image 942
ToddlerXxX Avatar asked Dec 18 '13 19:12

ToddlerXxX


1 Answers

I think what you're probably wanting is a trie. Construct one for the set of all words in your document, and have each leaf point to a hashset containing the indices of the lines in which the key of the leaf appears.

To execute a search, you'd use each fragment of the search string to navigate to a node in the tree and take the union over the hashsets of all leaves in that node's subtree. Then you'd take the intersection of those unions over the set of fragments to get the list of lines satisfying the search string.

like image 94
Andy Jones Avatar answered Oct 22 '22 14:10

Andy Jones