
Implementing keyword comparison scheme (reverse search)

I have a constantly growing database of keywords. I need to parse incoming text inputs (articles, feeds etc) and find which keywords from the database are present in the text. The database of keywords is much larger than the text.

Since the database is constantly growing (users add more and more keywords to watch for), I figure the best option will be to break the text input into words and compare those against the database. My main dilemma is implementing this comparison scheme (PHP and MySQL will be used for this project).

The most naive implementation would be to create a simple SELECT query against the keywords table, with a giant IN clause listing all the found keywords.

SELECT user_id,keyword FROM keywords WHERE keyword IN ('keyword1','keyword2',...,'keywordN');

Another approach would be to create a hash-table in memory (using something like memcache) and to check against it in the same manner.
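To illustrate the hash-table idea, here is a minimal sketch (in Python rather than PHP, purely for illustration): once the keyword set is in memory, each word from the text is a single average-O(1) membership test. In production, an in-memory store such as memcache could hold the set instead of a local variable.

```python
def match_keywords(words, keyword_set):
    """Hash-table variant: average O(1) membership test per word.

    `keyword_set` stands in for an in-memory keyword cache
    (e.g. one warmed from the keywords table at startup).
    """
    return sorted({w for w in words if w in keyword_set})
```

A quick usage example: `match_keywords(["alpha", "beta", "gamma"], {"beta", "gamma", "delta"})` returns `["beta", "gamma"]`.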

Does anyone have experience with this kind of searching, or any suggestions on how to better implement it? I haven't tried either approach yet; I'm just gathering ideas at this point.

asked Dec 05 '25 by Eran Galperin


2 Answers

The classic way of searching a text stream for multiple keywords is the Aho-Corasick finite automaton, which uses time linear in the text to be searched. You'll want minor adaptations to recognize strings only on word boundaries, or perhaps it would be simpler just to check the keywords found and make sure they are not embedded in larger words.
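As a concrete illustration of Aho-Corasick (a sketch in Python, not the fgrep or Noweb code mentioned below): the automaton is a trie over the keywords plus BFS-computed failure links, and the search is a single linear pass over the text. The `word_matches` helper applies the simpler boundary check described above, rejecting hits embedded in larger words.

```python
from collections import deque

def build_automaton(keywords):
    """Build an Aho-Corasick automaton: a trie plus BFS-computed failure links."""
    trie = [{}]        # trie[state] maps a character to the next state
    out = [set()]      # out[state]: keywords that end at this state
    fail = [0]         # fail[state]: longest proper suffix that is also a trie path
    for kw in keywords:
        state = 0
        for ch in kw:
            if ch not in trie[state]:
                trie.append({}); out.append(set()); fail.append(0)
                trie[state][ch] = len(trie) - 1
            state = trie[state][ch]
        out[state].add(kw)
    queue = deque(trie[0].values())  # depth-1 states keep fail = 0
    while queue:
        s = queue.popleft()
        for ch, t in trie[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in trie[f]:
                f = fail[f]
            fail[t] = trie[f].get(ch, 0)
            out[t] |= out[fail[t]]   # inherit matches ending at the fallback state
    return trie, fail, out

def search(text, trie, fail, out):
    """One linear pass over the text; yields (start_index, keyword) pairs."""
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in trie[state]:
            state = fail[state]
        state = trie[state].get(ch, 0)
        for kw in out[state]:
            yield (i - len(kw) + 1, kw)

def word_matches(text, trie, fail, out):
    """Keep only hits on word boundaries, per the simpler check above."""
    hits = []
    for start, kw in search(text, trie, fail, out):
        end = start + len(kw)
        before_ok = start == 0 or not text[start - 1].isalnum()
        after_ok = end == len(text) or not text[end].isalnum()
        if before_ok and after_ok:
            hits.append((start, kw))
    return hits
```

With the classic example set `["he", "she", "his", "hers"]`, searching `"ushers"` yields `she` at 1 and `he` and `hers` at 2; `word_matches` would reject all three, since each is embedded in the larger word "ushers".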

You can find an implementation in fgrep. Even better, Preston Briggs wrote a pretty nice implementation in C that does exactly the kind of keyword search you are talking about. (It searches programs for occurrences of 'interesting' identifiers.) Preston's implementation is distributed as part of the Noweb literate-programming tool. You could find a way to call this code from PHP, or you could rewrite it in PHP: the recognizer itself is about 220 lines of C, and the main program is another 135 lines.

All the proposed solutions, including Aho-Corasick, have these properties in common:

  • A preprocessing step that takes time and space proportional to the number of keywords in the database.

  • A search step that takes time and space proportional to the length of the text plus the number of keywords found.

Aho-Corasick offers considerably better constants of proportionality on the search step, but if your texts are small, this won't matter. In fact, if your texts are small and your database is large, you probably want to minimize the amount of memory used in the preprocessing step. Andrew Appel's DAWG data structure from the world's fastest Scrabble program will probably do the trick.

answered Dec 07 '25 by Norman Ramsey


In general,

  1. break the text into words:

    a. convert words back to canonical root form

    b. drop common conjunction words

    c. strip duplicates

  2. insert the words into a temporary table, then do an inner join against the keywords table, or (as you suggested) build the words into one large query criterion
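The steps above can be sketched as follows (in Python for illustration; the project itself would use PHP). The stop-word list is an illustrative subset, and the "strip a plural s" rule is a crude stand-in for real stemming such as a Porter stemmer. `build_query` shows the IN-clause variant of step 2 with one placeholder per word, suitable for a prepared statement; the table and column names are taken from the question's example query.

```python
import re

# illustrative stop-word subset; a real list would be much larger
STOP_WORDS = {"and", "or", "but", "the", "a", "an", "of", "to", "in"}

def extract_candidates(text):
    """Steps 1a-1c: tokenize, reduce to a rough root form, drop stop words, dedupe."""
    words = re.findall(r"[a-z0-9]+", text.lower())        # break the text into words
    seen, result = set(), []
    for w in words:
        if w in STOP_WORDS:
            continue                                      # drop common conjunction words
        # crude stand-in for real stemming: strip a trailing plural "s"
        root = w[:-1] if w.endswith("s") and len(w) > 3 else w
        if root not in seen:                              # strip duplicates
            seen.add(root)
            result.append(root)
    return result

def build_query(words):
    """Step 2, IN-clause variant: a parameterized query, one placeholder per word."""
    placeholders = ", ".join(["%s"] * len(words))
    return "SELECT user_id, keyword FROM keywords WHERE keyword IN (%s)" % placeholders
```

For example, `extract_candidates("The cats and the dogs")` returns `["cat", "dog"]`, and the resulting query string would then be executed with those words bound to the placeholders.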

It may be worthwhile to cache a hash of 3- or 4-letter keyword prefixes with which to pre-filter candidate words before hitting the database; you will have to experiment to find the best tradeoff between memory size and effectiveness.
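One way to realize that pre-filter (a sketch in Python, with n=3 as an arbitrary choice): cache the first n letters of every keyword in a set, and discard any word whose prefix is absent, since it cannot match anything in the database. Words that pass still need the exact check against the keywords table.

```python
def build_prefix_set(keywords, n=3):
    """Cache the first n letters of every keyword in the database."""
    return {kw[:n] for kw in keywords}

def prefilter(words, prefixes, n=3):
    """Cheap in-memory pass: a word whose n-letter prefix is missing
    from the cache cannot match any keyword; survivors still need
    the exact database check."""
    return [w for w in words if w[:n] in prefixes]
```

For example, against keywords `["keyword", "apple"]`, the words `["keynote", "banana", "apples"]` are filtered to `["keynote", "apples"]`: "banana" is ruled out cheaply, while "keynote" passes the prefilter but would be rejected by the exact lookup.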

answered Dec 07 '25 by Hugh Bothwell


