Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Looking for a faster way to perform string searches

I have to recognize a large list of URLs (few million lines) as belonging to a particular category or not. I have another list that has sub-strings that if present in the URL belongs to that category. Say, Category A.

The list of sub-strings to check has around 10k such sub-strings. What I did was simply go line by line in the sub-string file and look for the match and if found the URL belongs to Category A. I found in tests that this was rather time consuming.

I'm not a computer science student so don't have much knowledge about optimizing algorithms. But is there a way to make this faster? Just simple ideas. The programming language is not a big issue but Java or Perl would be preferable.

The list of sub-strings to match will not change much. I will however receive different lists of URLs so have to run this each time I get it. The bottleneck seems to be the URLs as they can get very long.

like image 878
sfactor Avatar asked Apr 13 '11 07:04

sfactor


People also ask

What is the fastest string search algorithm?

The Aho-Corasick string searching algorithm simultaneously finds all occurrences of multiple patterns in one pass through the text. On the other hand, the Boyer-Moore algorithm is understood to be the fastest algorithm for a single pattern.

What is meant by string searching?

A search string is the combination of all text, numbers and symbols entered by a user into a search engine to find desired results. Search strings are used to find files and their content, database information and web pages. A search string may include keywords, numeric data and operators.

How strings are compared by searching using any search algorithm?

In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger string or text.

How do you find the substring of a string efficiently?

Run a loop from start to end and for every index in the given string check whether the sub-string can be formed from that index. This can be done by running a nested loop traversing the given string and in that loop running another loop checking for sub-strings starting from every index.


1 Answers

Yes, I implemented the Aho-Corasick algorithm algorithm in java for the problem you are suggesting and it showed a consistent improvement of about x180 on the naive implementation (what you are doing). There are several implementations available online, although I would tweak them for better performance. Note that the solutions complexity is bounded by the length of the word (in your case the URL) and not the number of sub-strings. furthermore it only requires one pass on average for matching.

P.S - we used to give this question to people in job interviews, so there are many ways to solve it. The one I offer is the one we use in production code, which (for now) beats all other solutions.

Edit: wrote the wrong algorithm name previously, fixed...

like image 63
Asaf Avatar answered Sep 23 '22 08:09

Asaf