Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Matching substrings from a dictionary to other string: suggestions?

Tags:

Hellow Stack Overflow people. I'd like some suggestions regarding the following problem. I am using Java.

I have an array #1 with a number of Strings. For example, two of the strings might be: "An apple fell on Newton's head" and "Apples grow on trees".

On the other side, I have another array #2 with terms like (Fruits => Apple, Orange, Peach; Items => Pen, Book; ...). I'd call this array my "dictionary".

By comparing items from one array to the other, I need to see in which "category" the items from #1 fall into from #2. E.g. Both from #1 would fall under "Fruits".

My most important consideration is speed. I need to do those operations fast. A structure allowing constant time retrieval would be good.

I considered a Hashset with the contains() method, but it doesn't allow substrings. I also tried running regex like (apple|orange|peach|...etc) with case insensitive flag on, but I read that it will not be fast when the terms increase in number (minimum 200 to be expected). Finally, I searched, and am considering using an ArrayList with indexOf() but I don't know about its performance. I also need to know which of the terms actually matched, so in this case, it would be "Apple".

Please provide your views, ideas and suggestions on this problem.

I saw Aho-Corasick algorithm, but the keywords/terms are very likely to change often. So I don't think I can use that. Oh, I'm no expert in text mining and maths, so please elaborate on complex concepts.

Thank you, Stack Overflow people, for your time! :)

like image 562
Inf.S Avatar asked Jan 06 '10 15:01

Inf.S


2 Answers

If you use a multimap from Google Collections, they have a function to invert the map (so you can start with a map like {"Fruits" => [Apple]}, and produce a map with {"Apple" => ["Fruits"]}. So you can lookup the word and find a list of categories for it, in one call to the map.

I would expect I'd want to split the strings myself and lookup the words in the map one at a time, so that I could do stemming (adjusting for different word endings) and stopword-filtering. Using the map should get good lookup times, plus it's easy to try out.

like image 181
Nathan Hughes Avatar answered Oct 11 '22 21:10

Nathan Hughes


Would a suffix tree or similar data structure work for your application? It offers O(m) string lookup, where m is the length of the search string, after an O(n2)--or better with some trickery--initial setup, and, with some extra effort, you can associate arbitrary data, such as a reference to a category, with complete words in your dictionary. If you don't want to code it yourself, I believe the BioJava library includes an implementation.

You can also add strings to a suffix tree after initial setup, although the cost will still be around O(n2). That's probably not a big deal if you're adding short words.

like image 27
MattK Avatar answered Oct 11 '22 22:10

MattK