I have to classify a large list of URLs (a few million lines) as belonging to a particular category or not. I have another list of sub-strings; if any of them is present in a URL, that URL belongs to the category. Say, Category A.
The list has around 10k such sub-strings. What I did was simply go line by line through the sub-string file, look for a match, and if one is found, mark the URL as Category A. I found in tests that this was rather time-consuming.
I'm not a computer science student, so I don't have much knowledge of algorithm optimization. But is there a way to make this faster? Just simple ideas. The programming language is not a big issue, but Java or Perl would be preferable.
The list of sub-strings to match will not change much. I will, however, receive different lists of URLs, so I have to run this each time I get one. The bottleneck seems to be the URLs, as they can get very long.
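For reference, the approach described above might look like the following sketch in Java (the method and pattern names are illustrative, not from the question):

```java
import java.util.List;

public class NaiveMatcher {
    // Returns true if the URL contains any of the category's sub-strings.
    // This scans the whole sub-string list for every URL, so the total work
    // grows with (number of URLs) x (number of patterns) x (URL length).
    static boolean isCategoryA(String url, List<String> substrings) {
        for (String s : substrings) {
            if (url.contains(s)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> patterns = List.of("shop", "cart", "checkout");
        System.out.println(isCategoryA("https://example.com/cart/item?id=1", patterns));
        System.out.println(isCategoryA("https://example.com/about", patterns));
    }
}
```

With ~10k sub-strings and millions of URLs, that inner scan over the whole pattern list is exactly where the time goes.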
The Aho-Corasick string-searching algorithm finds all occurrences of multiple patterns simultaneously in one pass through the text. The Boyer-Moore algorithm, on the other hand, is generally considered the fastest for searching a single pattern.
In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger string or text.
The naive approach runs a loop from the start to the end of the text and, for every index, checks whether the sub-string starts at that index. This check is done with an inner loop that compares the pattern to the text character by character.
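As a sketch, the naive character-by-character check just described could be written like this (illustrative only; in practice `String.contains` does this for you):

```java
public class SubstringSearch {
    // Naive check: does `text` contain `pattern`?
    // The outer loop picks a start index; the inner loop compares the
    // pattern against the text character by character from that index.
    static boolean contains(String text, String pattern) {
        for (int i = 0; i + pattern.length() <= text.length(); i++) {
            int j = 0;
            while (j < pattern.length() && text.charAt(i + j) == pattern.charAt(j)) {
                j++;
            }
            if (j == pattern.length()) {
                return true; // full pattern matched starting at index i
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(contains("https://example.com/cart", "cart"));
        System.out.println(contains("https://example.com/cart", "shop"));
    }
}
```

In the worst case this does work proportional to text length times pattern length, and repeating it for every one of the 10k sub-strings is what makes the original approach slow.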
Yes, I implemented the Aho-Corasick algorithm in Java for the problem you describe, and it showed a consistent speedup of about 180x over the naive implementation (what you are doing). There are several implementations available online, although I would tweak them for better performance. Note that the solution's complexity is bounded by the length of the text (in your case the URL) and not by the number of sub-strings. Furthermore, it requires only one pass on average for matching.
P.S.: we used to give this question to people in job interviews, so there are many ways to solve it. The one I offer is the one we use in production code, which (for now) beats all other solutions.
Edit: wrote the wrong algorithm name previously, fixed...
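To make the idea concrete, here is a minimal Aho-Corasick sketch in Java: build the automaton once from the ~10k sub-strings, then test each URL in a single pass. This is an illustrative implementation, not the production code the answer refers to, and the pattern strings are made up:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class AhoCorasick {
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;        // deepest node whose path is a proper suffix of this one
        boolean terminal; // true if some pattern ends here (or at a suffix via fail)
    }

    private final Node root = new Node();

    public AhoCorasick(List<String> patterns) {
        // 1. Build a trie of all patterns.
        for (String p : patterns) {
            Node cur = root;
            for (char c : p.toCharArray()) {
                cur = cur.next.computeIfAbsent(c, k -> new Node());
            }
            cur.terminal = true;
        }
        // 2. BFS to set failure links (KMP-style fallback, generalized to many patterns).
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = node.fail;
                while (f != null && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                child.fail = (f == null) ? root : f.next.get(c);
                child.terminal |= child.fail.terminal; // a suffix match counts too
                queue.add(child);
            }
        }
    }

    // One pass over the text; returns true as soon as any pattern matches.
    public boolean containsAny(String text) {
        Node cur = root;
        for (char c : text.toCharArray()) {
            while (cur != root && !cur.next.containsKey(c)) {
                cur = cur.fail;
            }
            cur = cur.next.getOrDefault(c, root);
            if (cur.terminal) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        AhoCorasick ac = new AhoCorasick(List.of("shop", "cart", "checkout"));
        System.out.println(ac.containsAny("https://example.com/cart/item")); // matches "cart"
        System.out.println(ac.containsAny("https://example.com/about"));     // no match
    }
}
```

The key property is the one the answer points out: after the one-time build, classifying a URL costs a single walk over its characters, regardless of whether the pattern list holds 10 or 10,000 sub-strings.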