Would Java indexOf (brute force method) be more practical for me or some other substring algorithm?

Tags: java, string, substring, algorithm

I'm looking at finding very short substrings (pattern, needle) in many short lines of text (haystack). However, I'm not quite sure which method to use outside the naive, brute force method.

Background: I'm doing a side project for fun where I receive text messaging chat logs of multiple users (anywhere from 2000-15000 lines of text and 2-50 users), and I want to find all the various pattern matches in the chat logs based on predetermined words that I've come up with. So far I have about 1600 patterns that I'm looking for, but I may look for more.

So for example, I want to find the number of food related words that are used in an average text message log such as "hamburger", "pizza", "coke", "lunch", "dinner", "restaurant", "McDonalds". While I gave out English examples, I will actually be using Korean for my program. Each of these designated words will have their own respective score, which I put in a hashmap as key and value separately. I then show the top scorers for food related words as well as the most frequent words used by those users for food words.

My current method is to split each line of text on whitespace and process each individual word from the haystack, using the contains method (which uses indexOf and the naive substring search algorithm) to check whether the word contains the pattern.

wordFromInput.contains(wordFromPattern);

To give an example, with 17 users in chat, 13000 lines of text, and the 1600 patterns, I've found that this whole program took 12-13 seconds with this method. And on the Android app that I'm developing, it took 2 minutes and 30 seconds to process, which is far too slow.

Originally, I tried to use a hash map and simply get the pattern instead of searching for it in the ArrayList, but I then realized that this is not possible with a hash table for what I am trying to do with a substring.
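To make that limitation concrete, here is a small sketch. The class name, method names, and scores are made up for illustration and are not taken from the actual program. A hash lookup only succeeds when the whole word equals a pattern, whereas a substring check has to test every pattern against the word.

```java
import java.util.HashMap;
import java.util.Map;

class ExactVersusSubstring {
    // Hypothetical pattern scores; the real program uses about 1600 Korean patterns.
    static Map<String, Integer> scores() {
        Map<String, Integer> m = new HashMap<>();
        m.put("pizza", 3);
        m.put("lunch", 1);
        return m;
    }

    // Exact lookup: a single hash probe, but it only hits when word equals a pattern.
    static Integer exactScore(Map<String, Integer> scores, String word) {
        return scores.get(word);
    }

    // Substring lookup: every pattern must be tested against the word.
    static int substringScore(Map<String, Integer> scores, String word) {
        int total = 0;
        for (Map.Entry<String, Integer> e : scores.entrySet()) {
            if (word.contains(e.getKey())) {
                total += e.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> s = scores();
        System.out.println(exactScore(s, "pizza"));      // 3
        System.out.println(exactScore(s, "pizzas"));     // null, no exact key
        System.out.println(substringScore(s, "pizzas")); // 3, "pizza" is a substring
    }
}
```

The second method is essentially the loop that becomes slow once it runs for every word of every line against every pattern.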

I've looked around Stack Overflow and found a lot of helpful and related questions, such as these two: 1 and 2. I'm now somewhat more familiar with the various string algorithms (Boyer-Moore, KMP, etc.).

I initially assumed that the naive method would be the worst algorithm for my case, but having found this question, I've realized that my case (short pattern, short text) might actually be better served by the naive method. Still, I wanted to know whether there was something that I was neglecting completely.

Here is a snippet of my code though if anyone wants to see my issue more concretely.

While I removed large parts of the code to simplify it, the primary method that I use to actually match substrings is there in the method matchWords().

I know that's really ugly and bad code (5 for loops...), so if there are any suggestions for that, I'm happy to hear them as well.

So to sum up:

Haystack: lines of text from chat logs (2,000-10,000+)
Needles: 1,600+ patterns
Mostly Korean characters, although some English is included
The brute force naive method is simply too slow, but I'm debating whether there are other alternatives and, even if there are, whether they are practical given how short the patterns and text are.

I just want some input on my thought process, and possibly some general advice. But additionally, I would like some specific suggestion for a particular algorithm or method if that is possible.

asked Mar 01 '14 21:03

Nopiforyou

2 Answers

You can replace the hashtable with a Trie.

Split the line of text into words using white space to separate words. Then check if the word is in the Trie. If it is in the Trie, update a counter associated with the word. Ideally, the counter would be integrated into the Trie.

This approach is O(C), where C is the number of characters in the text. It's highly unlikely that you can avoid checking each character at least once, so this approach should be as good as you can get, at least in terms of big O.
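A minimal sketch of such a counting Trie might look like the following. The class and method names (CountingTrie, addPattern, countLine, countOf) are invented for illustration. It keys children by char, which also works for Korean syllables, and it avoids newer Map helpers so it should run on older Android versions too.

```java
import java.util.HashMap;
import java.util.Map;

class CountingTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if a pattern ends at this node
        int count;      // how often that pattern has been seen
    }

    private final Node root = new Node();

    // Inserts one pattern, creating nodes along its path as needed.
    void addPattern(String pattern) {
        Node node = root;
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            Node child = node.children.get(c);
            if (child == null) {
                child = new Node();
                node.children.put(c, child);
            }
            node = child;
        }
        node.isWord = true;
    }

    // Walks the path for word; returns the final node or null if the path breaks.
    private Node find(String word) {
        Node node = root;
        for (int i = 0; i < word.length() && node != null; i++) {
            node = node.children.get(word.charAt(i));
        }
        return node;
    }

    // Splits the line on whitespace and bumps the counter of every word that is a pattern.
    void countLine(String line) {
        for (String word : line.trim().split("\\s+")) {
            Node node = find(word);
            if (node != null && node.isWord) {
                node.count++;
            }
        }
    }

    int countOf(String pattern) {
        Node node = find(pattern);
        return (node != null && node.isWord) ? node.count : 0;
    }

    public static void main(String[] args) {
        CountingTrie trie = new CountingTrie();
        trie.addPattern("pizza");
        trie.addPattern("lunch");
        trie.countLine("pizza for lunch and more pizza");
        System.out.println(trie.countOf("pizza")); // 2
        System.out.println(trie.countOf("lunch")); // 1
    }
}
```

Note that this only counts words that equal a pattern exactly. A word that merely contains a pattern (for example "pizzas") is not counted, so it is not a drop-in replacement for contains.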

However, it sounds like you may not want to list all of the possible words you are searching for. In that case, you could simply build a counting Trie from all of the words. If nothing else, that will probably make things easier for any pattern matching algorithm you use, although it might require some modifications to the Trie.

answered Nov 02 '22 22:11

Nuclearman

What you're describing sounds like an excellent use case for the Aho-Corasick string-matching algorithm. This algorithm finds all matches of a set of pattern strings inside of a source string and does so in linear time (plus the time to report the matches). If you have a fixed set of strings to search for, you can do linear preprocessing work up front on the patterns to search for all matches very quickly.

There's a Java implementation of Aho-Corasick available here. I haven't tried it out, but it might be a good match.
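If you would rather not pull in a library, the core of the algorithm is fairly compact. The following is a minimal sketch, not the linked implementation; the class name AhoCorasickSketch and the method countMatches are made up for illustration, and it assumes the patterns are distinct and non-empty.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class AhoCorasickSketch {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;                                        // longest proper suffix that is also a trie path
        final List<Integer> outputs = new ArrayList<>();  // indices of patterns ending here
    }

    private final Node root = new Node();
    private final String[] patterns;

    AhoCorasickSketch(String... patterns) {
        this.patterns = patterns.clone();
        // Step 1: build a trie of all patterns.
        for (int p = 0; p < patterns.length; p++) {
            if (patterns[p].isEmpty()) continue; // empty patterns are ignored
            Node node = root;
            for (char c : patterns[p].toCharArray()) {
                Node child = node.next.get(c);
                if (child == null) {
                    child = new Node();
                    node.next.put(c, child);
                }
                node = child;
            }
            node.outputs.add(p);
        }
        // Step 2: breadth-first pass to set failure links and merge outputs.
        Queue<Node> queue = new ArrayDeque<>();
        root.fail = root;
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(c)) f = f.fail;
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    // Scans text once and counts every occurrence of every pattern, overlaps included.
    Map<String, Integer> countMatches(String text) {
        Map<String, Integer> counts = new HashMap<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (node != root && !node.next.containsKey(c)) node = node.fail;
            Node step = node.next.get(c);
            node = (step != null) ? step : root;
            for (int p : node.outputs) {
                Integer old = counts.get(patterns[p]);
                counts.put(patterns[p], old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        AhoCorasickSketch ac = new AhoCorasickSketch("pizza", "lunch", "coke");
        Map<String, Integer> counts = ac.countMatches("pizzalunch pizza");
        System.out.println(counts.get("pizza")); // 2
        System.out.println(counts.get("lunch")); // 1
    }
}
```

The automaton is built once from all patterns and then reused for every line, so each line is scanned a single time no matter how many patterns there are. Unlike splitting on whitespace and looking up whole words, this also finds patterns embedded inside longer words.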

Hope this helps!

answered Nov 02 '22 22:11

templatetypedef
