I have to classify a large list of URLs (a few million lines) as belonging to a particular category or not. I have another list of sub-strings; if any of them is present in a URL, that URL belongs to the category. Say, Category A.
The list has around 10k such sub-strings. What I did was simply go line by line through the sub-string file, look for a match, and if one is found, mark the URL as Category A. I found in tests that this was rather time-consuming.
I'm not a computer science student, so I don't have much knowledge of algorithm optimization. But is there a way to make this faster? Just simple ideas. The programming language is not a big issue, but Java or Perl would be preferable.
The list of sub-strings to match will not change much. I will, however, receive different lists of URLs, so I have to run this each time I get one. The bottleneck seems to be the URLs, as they can get very long.
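For reference, the approach described above might look like the following sketch in Java (the method and pattern names are illustrative, not from the question):

```java
import java.util.List;

public class NaiveMatcher {
    // Returns true if the URL contains any of the category's sub-strings.
    // This scans the whole sub-string list for every URL, so the total work
    // grows with (number of URLs) x (number of patterns) x (URL length).
    static boolean isCategoryA(String url, List<String> substrings) {
        for (String s : substrings) {
            if (url.contains(s)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> patterns = List.of("shop", "cart", "checkout");
        System.out.println(isCategoryA("https://example.com/cart/item?id=1", patterns));
        System.out.println(isCategoryA("https://example.com/about", patterns));
    }
}
```

With ~10k sub-strings and millions of URLs, that inner scan over the whole pattern list is exactly where the time goes.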
The Aho-Corasick string-searching algorithm finds all occurrences of multiple patterns simultaneously in one pass through the text. The Boyer-Moore algorithm, on the other hand, is generally considered the fastest for searching a single pattern.
In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger string or text.
The naive approach runs a loop from the start to the end of the text and, for every index, checks whether the sub-string starts at that index. This check is done with an inner loop that compares the pattern to the text character by character.
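As a sketch, the naive character-by-character check just described could be written like this (illustrative only; in practice `String.contains` does this for you):

```java
public class SubstringSearch {
    // Naive check: does `text` contain `pattern`?
    // The outer loop picks a start index; the inner loop compares the
    // pattern against the text character by character from that index.
    static boolean contains(String text, String pattern) {
        for (int i = 0; i + pattern.length() <= text.length(); i++) {
            int j = 0;
            while (j < pattern.length() && text.charAt(i + j) == pattern.charAt(j)) {
                j++;
            }
            if (j == pattern.length()) {
                return true; // full pattern matched starting at index i
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(contains("https://example.com/cart", "cart"));
        System.out.println(contains("https://example.com/cart", "shop"));
    }
}
```

In the worst case this does work proportional to text length times pattern length, and repeating it for every one of the 10k sub-strings is what makes the original approach slow.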
Yes, I implemented the Aho-Corasick algorithm in Java for the problem you describe, and it showed a consistent speedup of about 180x over the naive implementation (what you are doing). There are several implementations available online, although I would tweak them for better performance. Note that the solution's complexity is bounded by the length of the text (in your case the URL) and not by the number of sub-strings. Furthermore, it requires only one pass on average for matching.
P.S.: we used to give this question to people in job interviews, so there are many ways to solve it. The one I offer is the one we use in production code, which (for now) beats all other solutions.
Edit: wrote the wrong algorithm name previously, fixed...
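To make the idea concrete, here is a minimal Aho-Corasick sketch in Java: build the automaton once from the ~10k sub-strings, then test each URL in a single pass. This is an illustrative implementation, not the production code the answer refers to, and the pattern strings are made up:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class AhoCorasick {
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;        // deepest node whose path is a proper suffix of this one
        boolean terminal; // true if some pattern ends here (or at a suffix via fail)
    }

    private final Node root = new Node();

    public AhoCorasick(List<String> patterns) {
        // 1. Build a trie of all patterns.
        for (String p : patterns) {
            Node cur = root;
            for (char c : p.toCharArray()) {
                cur = cur.next.computeIfAbsent(c, k -> new Node());
            }
            cur.terminal = true;
        }
        // 2. BFS to set failure links (KMP-style fallback, generalized to many patterns).
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = node.fail;
                while (f != null && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                child.fail = (f == null) ? root : f.next.get(c);
                child.terminal |= child.fail.terminal; // a suffix match counts too
                queue.add(child);
            }
        }
    }

    // One pass over the text; returns true as soon as any pattern matches.
    public boolean containsAny(String text) {
        Node cur = root;
        for (char c : text.toCharArray()) {
            while (cur != root && !cur.next.containsKey(c)) {
                cur = cur.fail;
            }
            cur = cur.next.getOrDefault(c, root);
            if (cur.terminal) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        AhoCorasick ac = new AhoCorasick(List.of("shop", "cart", "checkout"));
        System.out.println(ac.containsAny("https://example.com/cart/item")); // matches "cart"
        System.out.println(ac.containsAny("https://example.com/about"));     // no match
    }
}
```

The key property is the one the answer points out: after the one-time build, classifying a URL costs a single walk over its characters, regardless of whether the pattern list holds 10 or 10,000 sub-strings.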