Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm to search for a list of words in a text

I have a list of words, fairly small about 1000 or so. I want to check if any of the words in that list occur in an input text. If so I would like know which ones occur. The input text is a few hundred words each and these are text paragraphs from the web - meaning there a lot of them from different sites. I am trying to find the best algorithm for it.

I can see two obvious ways to do this --

  1. A brute force way of searching for each word from the list in the text.

  2. Create a hash table of words from the input text and then search for each word from the list in the hash table. This is fast.

Is there a better solution?

I am using python though I am not sure if that changes the algorithm anyway.

Also as an optimization to the solution 2 above, I would like to store the hash table generated to persistent storage (DB) so that if the list of words changes I can re-use the hash table without having to create it again. Of course if the input text changes I have to generate the hash table. Is it possible to save a hash table to a DB? Any recommendations? I am currently using MongoDB for my project and I can only store json documents in it. I am a new to MongoDB and have only just started working with it and still do not fully understand the full potential of it.

I have searched SO and see two questions along similar lines and one of them suggests a hash table but I would like to get any pointers towards the optimization I have in mind.

Here are the previously asked questions on SO -

Is there an efficient algorithm to perform inverted full text search?

Searching a large list of words in another large list

EDIT: I just found another question on SO which is about the same problem.

Algorithm for multiple word matching in text

I guess there is no better solution than a hash table. But I would really like to optimize it so that changes to the word list can let me run the algorithm on all the text I have stored up quickly. Should I change the tags added to the question to also include some database technologies?

like image 405
user220201 Avatar asked Jan 15 '14 00:01

user220201


People also ask

Which searching algorithm can be used to perform searching on long list?

Linear Search This type of searching algorithms sequentially checks each element of the list until a match is found or the whole list has been searched. A linear search runs in at worst linear time and makes at most n comparisons, where n is the length of the list.

What is the best searching algorithm?

This type of searching algorithm is used to find the position of a specific value contained in a sorted array. The binary search algorithm works on the principle of divide and conquer and it is considered the best searching algorithm because it's faster to run.

How will you implement word based search algorithm?

A word search algorithm takes a text T of length n and a pattern P of length m as the input. The text is then scanned using a window that has length equal to the size of the pattern. The leftmost ends of the pattern and window are aligned.

What is text search explain different text search algorithms?

The techniques are full text scanning (streaming), word inversion and multi-attribute retrieval (Faloutsos-85, Salton-83). In addition to using the indexes as a mechanism for searching text in information systems, streaming of text was frequently found in the systems as an additional search mechanism.


1 Answers

There is a better solution than a hash table. If you have a fixed set of words that you want to search for over a large body of text, the way you do it is with the Aho-Corasick string matching algorithm.

The algorithm builds a state machine from the words you want to search, and then runs the input text through that state machine, outputting matches as they're found. Because it takes some amount of time to build the state machine, the algorithm is best suited for searching very large bodies of text.

You can do something similar with regular expressions. For example, you might want to find the words "dog", "cat", "horse", and "skunk" in some text. You can build a regular expression:

"dog|cat|horse|skunk"

And then run a regular expression match on the text. How you get all matches will depend on your particular regular expression library, but it does work. For very large lists of words, you'll want to write code that reads the words and generates the regex, but it's not terribly difficult to do and it works quite well.

There is a difference, though, in the results from a regex and the results from the Aho-Corasick algorithm. For example if you're searching for the words "dog" and "dogma" in the string "My karma ate your dogma." The regex library search will report finding "dogma". The Aho-Corasick implementation will report finding "dog" and "dogma" at the same position.

If you want the Aho-Corasick algorithm to report whole words only, you have to modify the algorithm slightly.

Regex, too, will report matches on partial words. That is, if you're searching for "dog", it will find it in "dogma". But you can modify the regex to only give whole words. Typically, that's done with the \b, as in:

"\b(cat|dog|horse|skunk)\b"

The algorithm you choose depends a lot on how large the input text is. If the input text isn't too large, you can create a hash table of the words you're looking for. Then go through the input text, breaking it into words, and checking the hash table to see if the word is in the table. In pseudo code:

hashTable = Build hash table from target words
for each word in input text
    if word in hashTable then
        output word

Or, if you want a list of matching words that are in the input text:

hashTable = Build hash table from target words
foundWords = empty hash table
for each word in input text
    if word in hashTable then
        add word to foundWords
like image 157
Jim Mischel Avatar answered Sep 28 '22 02:09

Jim Mischel