Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Substring algorithm suggestion

I have a large set (100k) of short strings (not more than 100 chars) and I need to quickly find all those who have a certain substring.

This will be used as a search box where the user starts typing and the system immediately gives "suggestions" (the strings that have as a substring the text that the user typed). Something similar to the "Tag" box in StackOverflow.

As this will be interactive, it should be pretty fast. What algorithm or data structure do you recommend for this?

BTW, I'll be using Delphi 2007.

Thanks in advance.

like image 570
cfischer Avatar asked Sep 16 '10 15:09

cfischer


3 Answers

I wrote out a long blurb, doing a bunch of complexity calculations and xzibit jokes (tree in a tree so you can lookup when you lookup), but then realized that this is easier than I thought. Browsers do this all the time and they never precompute big tables every time you're loading a page.

http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

what it means is that you take your 100k strings and combine them into one long string. you then take your query substring, and iterate over your big string, looking for your matches. but you're not jumping by character (which would mean you're looking at 100k*100 iterations). you're jumping by the length of your substring, so the longer your substring gets, the faster this goes.

here's a great example: http://userweb.cs.utexas.edu/users/moore/best-ideas/string-searching/fstrpos-example.html

they're searching for the string EXAMPLE.

this is the kind of stuff browsers and text editors do, and they dont really build giant prefix tables every time you load a page.

like image 100
Oren Mazor Avatar answered Nov 14 '22 09:11

Oren Mazor


The data structure you'll likely want to use is a Trie, specifically a suffix trie. Read this article for a good explanation of how they work and how to write one for your problem.

like image 44
Mike Axiak Avatar answered Nov 14 '22 10:11

Mike Axiak


While you certainly can speed things up with a better data structure, this is a time that brute force might be perfectly adequate. Doing a quick test:

[Edit: changed code to search for substrings, edited again to shorten the substring it searches for compared to the ones it searches in.]

#include <algorithm>
#include <iostream>
#include <vector>
#include <string>
#include <time.h>

std::string rand_string(int min=20, int max=100) { 
    size_t length = rand()% (max-min) + min;
    std::string ret;

    for (size_t i=0; i<length; i++)
        ret.push_back(rand() % ('z' - 'a') + 'a');
    return ret; 
}

class substr {
    std::string seek;
public:
    substr(std::string x) : seek(x) {}

    bool operator()(std::string const &y) { return y.find(seek) != std::string::npos; }
};

int main() { 
    std::vector<std::string> values;

    for (int i=0; i<100000; i++)
        values.push_back(rand_string());

    std::string seek = rand_string(5, 10);

    const int reps = 10;

    clock_t start = clock();
    std::vector<std::string>::iterator pos;
    for (int i=0; i<reps; i++)
         pos = std::find_if(values.begin(), values.end(), substr(seek));
    clock_t stop = clock();

    std::cout << "Search took: " << double(stop-start)/CLOCKS_PER_SEC/reps << " seconds\n";
    if (pos == values.end())
        std::cout << "Value wasn't found\n";
    else
        std::cout << "Value was found\n";
    return 0;
}

On my machine (around 4 years old -- hardly a speed demon by current standards) this runs in around 3 10 milliseconds per search. That's fast enough to appear instantaneous to an interactive user -- and with 10 times as many strings, it would still be fine.

like image 6
Jerry Coffin Avatar answered Nov 14 '22 08:11

Jerry Coffin