Checking if a string contains an English sentence

Tags:

As of right now, I decided to take a dictionary and iterate through the entire thing. Every time I see a newline, I make a string containing from that newline to the next newline, then I do string.find() to see if that English word is somewhere in there. This takes a VERY long time, each word taking about 1/2-1/4 a second to verify.

It is working perfectly, but I need to check thousands of words a second. I can run several windows, which doesn't affect the speed (Multithreading), but it still only checks like 10 a second. (I need thousands)

I'm currently writing code to pre-compile a large array containing every word in the English language, which should speed it up a lot, but still not get the speed I want. There has to be a better way to do this.

The strings I'm checking will look like this:

"hithisisastringthatmustbechecked"

but most of them contained complete garbage, just random letters.

I can't check for impossible compinations of letters, because that string would be thrown out because of the 'tm', in between 'thatmust'.

390

asked Aug 02 '13 17:08

Nicholas Pipitone

2 Answers

You can speed up the search by employing the Knuth–Morris–Pratt (KMP) algorithm.

Go through every dictionary word, and build a search table for it. You need to do it only once. Now your search for individual words will proceed at faster pace, because the "false starts" will be eliminated.

160

answered Sep 20 '22 02:09

Sergey Kalinichenko

There are a lot of strategies for doing this quickly.

Idea 1

Take the string you are searching and make a copy of each possible substring beginning at some column and continuing through the whole string. Then store each one in an array indexed by the letter it begins with. (If a letter is used twice store the longer substring.

So the array looks like this:

a - substr[0] = "astringthatmustbechecked"
b - substr[1] = "bechecked"
c - substr[2] = "checked"
d - substr[3] = "d"
e - substr[4] = "echecked"
f - substr[5] = null // since there is no 'f' in it
... and so forth

Then, for each word in the dictionary, search in the array element indicated by its first letter. This limits the amount of stuff that has to be searched. Plus you can't ever find a word beginning with, say 'r', anywhere before the first 'r' in the string. And some words won't even do a search if the letter isn't in there at all.

Idea 2

Expand upon that idea by noting the longest word in the dictionary and get rid of letters from those strings in the arrays that are longer than that distance away.

So you have this in the array:

a - substr[0] = "astringthatmustbechecked"

But if the longest word in the list is 5 letters, there is no need to keep any more than:

a - substr[0] = "astri"

If the letter is present several times you have to keep more letters. So this one has to keep the whole string because the "e" keeps showing up less than 5 letters apart.

e - substr[4] = "echecked"

You can expand upon this by using the longest words starting with any particular letter when condensing the strings.

Idea 3

This has nothing to do with 1 and 2. Its an idea that you could use instead.

You can turn the dictionary into a sort of regular expression stored in a linked data structure. It is possible to write the regular expression too and then apply it.

Assume these are the words in the dictionary:

arun
bob
bill
billy
body
jose

Build this sort of linked structure. (Its a binary tree, really, represented in such a way that I can explain how to use it.)

a -> r -> u -> n -> *
|
b -> i -> l -> l -> *
|    |              |
|    o -> b -> *    y -> *
|         |
|         d -> y -> *
|
j -> o -> s -> e -> *

The arrows denote a letter that has to follow another letter. So "r" has to be after an "a" or it can't match.

The lines going down denote an option. You have the "a or b or j" possible letters and then the "i or o" possible letters after the "b".

The regular expression looks sort of like: /(arun)|(b(ill(y+))|(o(b|dy)))|(jose)/ (though I might have slipped a paren). This gives the gist of creating it as a regex.

Once you build this structure, you apply it to your string starting at the first column. Try to run the match by checking for the alternatives and if one matches, more forward tentatively and try the letter after the arrow and its alternatives. If you reach the star/asterisk, it matches. If you run out of alternatives, including backtracking, you move to the next column.

This is a lot of work but can, sometimes, be handy.

Side note I built one of these some time back by writing a program that wrote the code that ran the algorithm directly instead of having code looking at the binary tree data structure.

Think of each set of vertical bar options being a switch statement against a particular character column and each arrow turning into a nesting. If there is only one option, you don't need a full switch statement, just an if.

That was some fast character matching and really handy for some reason that eludes me today.

answered Sep 20 '22 02:09

Lee Meador

Related questions
                            
                                implicit constexpr?
                            
                                how to make each thread use its own RNG in C++11
                            
                                How do I express "<array-of-true-or-false> = <array> <= <scalar>" using Eigen 3?
                            
                                What is application footprint and how to calculate it?
                            
                                Sorting a list of objects holding a vector of unique_ptr
                            
                                Google glog rolling files
                            
                                Good hash function for pair of primitive types
                            
                                Complexity of an algorithm with two recursive calls
                            
                                C++ Most efficient way for storing, loading and looking up a lexicon
                            
                                How can I increment std::bitset
                            
                                Collection of std::functions with different arguments
                            
                                How can I create a partial (hybrid) mock in googlemock?
                            
                                static member function inside class template specialization
                            
                                What is the purpose of ldexp?
                            
                                lvalue binding to rvalue reference
                            
                                Does OpenGL prevent unnecessary API calls?
                            
                                Are there performance risks for using static_cast to deal with a vector of mixed (base & derived) objects? (aka "it this a dumb idea?")
                            
                                SIGPROF kills my server when using google perftools
                            
                                Text Based Adventure Game
                            
                                kivy: python for android + C++

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Checking if a string contains an English sentence

Tags:

c++

string

linguistics

Nicholas Pipitone

People also ask

2 Answers

Sergey Kalinichenko

Lee Meador

Recent Activity

Donate For Us