 

How can I improve spell check time in a C program?

As one of the assignments in Harvard's CS50 course, students are tasked with creating a spell checking program. The main thrust of the assignment is speed - pure speed - and I've gotten to the point where I'm beating the staff's implementation, but I feel that I can do better, and am looking for a push in the right direction.

Here is my pseudocode:

// read the dictionary word list
Read entire dictionary in one fread into memory
rawmemchr through and pick out the words
send each word through the hash function
create chain links for any index where collisions occur

// accept the incoming test words
Run the test word through the hash function
compare to the existing table / linked list
return the result of the comparison
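
In C, the check step boils down to something like this (a simplified sketch, not my exact code; murmur3_32() is a stand-in for the MurmurHash3 call, and the table size and word length are just examples):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>
    #include <strings.h>

    #define TABLE_SIZE 65536            // example size, not my real value

    typedef struct node {
        struct node *next;
        char word[46];                  // big enough for the longest dictionary word, plus NUL
    } node;

    static node *table[TABLE_SIZE];

    // Stand-in for MurmurHash3_x86_32: returns a 32-bit hash of the word.
    // Link against the MurmurHash3 reference implementation or any 32-bit hash.
    uint32_t murmur3_32(const void *key, int len, uint32_t seed);

    bool check(const char *word)
    {
        uint32_t h = murmur3_32(word, (int)strlen(word), 0) % TABLE_SIZE;
        for (const node *n = table[h]; n != NULL; n = n->next)
            if (strcasecmp(n->word, word) == 0)   // dictionary match is case-insensitive
                return true;
        return false;
    }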

With a dictionary of 150K words, and input text up to 6MB, I'm able to accurately spell check in about a half second.

However, when I look at the words that are coming from the input text, it's pretty clear that a large percentage of those words are common (like "the", "and", "for"), and that most of the misspelled words are also checked multiple times.

My intuition says I should be able to "cache" the "good hits" and "bad hits" so that I'm not hashing the same words over and over for table lookups. Even though the current result is very close to O(1), I feel like I should be able to shave a few microseconds off the time by re-evaluating my approach.

For example, after I have loaded the dictionary, the text input could be 8MB of nothing but this: "missspeling". So rather than hash / check the same word over and over (at a computational expense), I would like to understand if there is a way to programmatically discard words which have already been hashed and rejected, but in a way that is more efficient than the hash / check itself. (I'm using MurmurHash3, fwiw).
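
To make the idea concrete, the kind of cache I'm imagining sits in front of the table lookup and remembers the last result per slot (purely illustrative names; note it still pays for the hash, which is part of what I'm unsure about):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>
    #include <strings.h>

    #define CACHE_SIZE 4096                  // small, direct-mapped cache of recent results

    static struct {
        bool valid;                          // slot filled yet?
        bool in_dict;                        // result of that word's lookup
        char word[46];                       // the word that produced it
    } cache[CACHE_SIZE];

    uint32_t murmur3_32(const void *key, int len, uint32_t seed);  // as above
    bool chain_lookup(const char *word, uint32_t hash);            // existing chain walk

    bool cached_check(const char *word)
    {
        uint32_t h = murmur3_32(word, (int)strlen(word), 0);
        unsigned s = h % CACHE_SIZE;
        if (cache[s].valid && strcasecmp(cache[s].word, word) == 0)
            return cache[s].in_dict;         // repeat word, good or bad: no chain walk
        bool found = chain_lookup(word, h);  // first time this word has landed here recently
        cache[s].valid   = true;
        cache[s].in_dict = found;
        strncpy(cache[s].word, word, sizeof cache[s].word - 1);
        cache[s].word[sizeof cache[s].word - 1] = '\0';
        return found;
    }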

I realize that the theoretical performance improvement would be limited to situations where the input text is long, and there are a high number of repeat misspellings. Based on some of the input texts I've evaluated, here are some of the results:

Unique Misspellings: 6960
Total Misspellings:  17845
Words in dictionary: 143091
Words in input text: 1150970
Total Time:          0.56 seconds

Unique Misspellings: 8348
Total Misspellings:  45691
Words in dictionary: 143091
Words in input text: 904612
Total Time:          0.83 seconds 

In the second sample run, you can see that I'm having to go back to the hash table approximately 5.5 times for each misspelled word! That seems nuts to me, and I feel there must be a more efficient way to address this circumstance, since most of my program's time is spent in the hash function.

I could implement POSIX threads (this runs on an 8-core system) to improve the program's elapsed time, but I am more interested in improving my approach and thought process around the problem.

Sorry this is so long-winded, but this is my first Stack Overflow post and I'm trying to be thorough. I searched before posting, but most other "spell check" posts are about "how" rather than "improve". I'm grateful for suggestions that get me pointed in the right direction.

http://github.com/Ganellon/spell_check

Asked May 17 '16 by Drew



2 Answers

In both of your trials, what is noticeable is that the majority of words are correctly spelled. Consequently, you should focus on optimizing lookup of words which are in the dictionary.

In your first trial, for example, only 1.5% of all words are misspelled. Suppose it takes twice as long on average to look up a word which is not in the dictionary (because every word in the bucket needs to be checked). Even if you reduced that to 0 (the theoretical minimum :) ), you would speed your program up by less than 3%.

A common hashtable optimization is to move the key you find to the beginning of the bucket chain, if it is not already there. That will tend to reduce the number of hash entries checked for commonly-used words. It's not a huge speed-up, but in cases where some keys are looked up much more often than others, it can definitely be noticed.
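
In code, move-to-front is just a pointer splice during the chain walk (field names here are placeholders, not your actual structures):

    #include <stdbool.h>
    #include <string.h>

    struct node {
        struct node *next;
        char word[46];
    };

    // Walk the bucket; on a hit, splice the node to the front of the chain
    // so the most frequently looked-up words are found on the first probe.
    static bool lookup_mtf(struct node **bucket, const char *word)
    {
        struct node **link = bucket;
        for (struct node *n = *bucket; n != NULL; link = &n->next, n = n->next) {
            if (strcmp(n->word, word) == 0) {
                if (n != *bucket) {           // not already at the front
                    *link = n->next;          // unlink from its current position
                    n->next = *bucket;        // push onto the head of the chain
                    *bucket = n;
                }
                return true;
            }
        }
        return false;
    }

The splice only runs on hits that aren't already at the head, so misses and already-hot words pay nothing extra.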

Reducing chain length by decreasing the hashtable occupancy may help more, at the cost of more memory.

Another possibility, since you are not going to modify the dictionary once it is built, is to store each bucket chain in contiguous memory, without pointers. Not only will that reduce memory consumption, it will also improve cache performance: since most words are short, most buckets will fit in a single cache line.
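
For instance, a pointer-free bucket might pack its words back to back into a fixed array sized to one cache line (the layout and field names are just one way to do it; handling of buckets that overflow the line is not shown):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    // One bucket = one 64-byte cache line: a small header plus the bucket's
    // words packed back to back, each NUL-terminated.
    struct packed_bucket {
        uint16_t count;        // number of words stored in this bucket
        uint16_t used;         // bytes of 'words' actually in use
        char     words[60];    // e.g. "the\0them\0theme\0..."
    };

    static bool bucket_contains(const struct packed_bucket *b, const char *word)
    {
        const char *p = b->words;
        for (uint16_t i = 0; i < b->count; i++) {
            if (strcmp(p, word) == 0)
                return true;
            p += strlen(p) + 1;    // skip past the NUL to the next packed word
        }
        return false;
    }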

And since words tend to be quite short, you may well be able to find a way to optimize the comparison. strcmp() is well optimized but it is generally optimized for larger strings. If you're allowed to use it, the SSE4.2 PCMPESTRI opcode is amazingly powerful (but figuring out what it does and how to use it to solve your problem can be a huge time-waster). More simply, you should be able to compare four eight-byte prefixes simultaneously with 256-bit comparison operations (and you might even have 512-bit operations available to you), so with clever data arrangement, you may well be able to do an entire bucket comparison in parallel.
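
As a concrete example of the 256-bit idea (this assumes AVX2, compiled with -mavx2, and a bucket layout that stores four zero-padded 8-byte word prefixes contiguously):

    #include <immintrin.h>
    #include <stdint.h>

    // Compare the query's 8-byte prefix against four stored prefixes at once.
    // Returns a 4-bit mask; bit i is set if slot i's prefix matches exactly.
    static int prefix_match_mask(const uint64_t prefixes[4], uint64_t query)
    {
        __m256i slots = _mm256_loadu_si256((const __m256i *)prefixes);
        __m256i probe = _mm256_set1_epi64x((long long)query);
        __m256i eq    = _mm256_cmpeq_epi64(slots, probe);
        return _mm256_movemask_pd(_mm256_castsi256_pd(eq));
    }

A set bit only means the 8-byte prefix matched, so you still confirm with a full compare, but most non-matching slots are eliminated in one instruction.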

That's not to say that hashtables are necessarily the optimal datastructure for this problem. But remember that the more you can do in a single cache line, the faster your program will run. Linked-list-intensive datastructures can turn out to be suboptimal even if they look good on paper.


After thinking about this problem for a couple of days and actually writing some code, I came to the conclusion that optimizing for successful hashtable lookup speed is probably not correct for a real-world spellchecker. It's true that most words in the text being checked are usually correctly spelled -- although that depends on the spellcheck user -- but the algorithm which attempts to suggest correct spellings is likely to do a lot of unsuccessful lookups as it cycles through possible misspellings. I know that's probably out of scope for this problem, but it does make a difference for optimization, because you end up with two quite different strategies.

If you're trying to reject quickly, you need lots of possibly empty bucket chains, or a Bloom filter or its moral equivalent, so you can reject most errors on the first probe.

For example, if you have a good hash algorithm which yields more bits than you need -- and you almost certainly do, because spellcheck dictionaries aren't that big -- then you can just use some otherwise-unused bits in the hash for a secondary hash. Without even going to the trouble of implementing the entire Bloom filter, you can just add, say, a 32-bit mask to each bucket header representing the possible values of five secondary hash bits in the values stored in that bucket. Combined with a sparse table -- I used 30% occupancy for the experiment, which is not that sparse -- you should be able to reject 80-90% of lookup failures without going beyond the bucket header.
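
Concretely, the bucket-header mask can look something like this (the 5-bit choice follows the description above; which hash bits you steal and the exact header layout are up to you):

    #include <stdbool.h>
    #include <stdint.h>

    struct node;                       // the usual chain node, unchanged

    struct bucket {
        uint32_t     seen;             // bit i set if some stored word's 5-bit
                                       // secondary hash equals i
        struct node *chain;            // collision chain, walked only if needed
    };

    // Five otherwise-unused bits of the full hash become a secondary hash.
    static inline uint32_t sec_bit(uint64_t full_hash)
    {
        return UINT32_C(1) << ((full_hash >> 27) & 31);
    }

    // On insert: b->seen |= sec_bit(h); then add to the chain as usual.
    // On lookup: reject most absent words without touching the chain.
    static bool maybe_present(const struct bucket *b, uint64_t full_hash)
    {
        return (b->seen & sec_bit(full_hash)) != 0;
    }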

On the other hand, if you're trying to optimize for success, then it might turn out that largish buckets are better, because they cut down on the number of bucket headers and that improves cache usage. As long as the entire bucket fits into a cache line, the speed of multiple comparisons is so high that you won't notice the difference. (And since words tend to be short, it's reasonable to expect five or six to fit into a 64-byte cache line.)

Anyway, without too much work, I managed to do a million lookups in 70 milliseconds of CPU time. Multiprocessing could speed up elapsed time quite a bit, particularly as no locking is required given that the hash table is immutable.

The morals I want to draw from this:

In order to optimize:

  • you need to understand your data

  • you need to understand your expected usage pattern

  • you need to adapt your algorithms based on the above

  • you need to do a lot of experimenting.

Answered by rici


This is a pretty well-solved problem. ;-) You should look into a data structure called a trie. A trie is a tree built of individual characters, so that the path from the root spells out a word or prefix. Each node records which letters can legitimately extend the current prefix, and whether the prefix accumulated so far is itself a valid word.

For four words:

root-> [a]-> [a]-> [r]-> [d]-> [v]-> [a]-> [r]-> [k*]->[s*]
             [b]
                \> [a]-> [c]-> [i*]
                               [u]-> [s*]

This would represent "aardvark", "aardvarks", "abaci", and "abacus." The nodes are vertically contiguous, so 2nd letter [ab] is a node, and 5th letter [i*u] is a node.

Traverse the trie character by character, and check for a valid word when you hit a space. If there is no branch for the character you have, it's a bad word. If the node you land on at the end of the word isn't marked as a valid word, it's also a bad word.

This is O(n) to process (n = word length) and it's very, very fast. Building the trie is going to consume a bunch of RAM, but you don't care about that I think.
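
A minimal version of that structure and check might look like this (assuming the word list is lowercase letters plus apostrophes; error handling omitted):

    #include <ctype.h>
    #include <stdbool.h>
    #include <stdlib.h>

    #define ALPHA 27                       // 'a'..'z' plus apostrophe

    typedef struct trienode {
        bool is_word;                      // path from the root spells a whole word
        struct trienode *child[ALPHA];
    } trienode;

    static int child_index(char c)
    {
        return c == '\'' ? 26 : tolower((unsigned char)c) - 'a';
    }

    static void trie_insert(trienode *root, const char *word)
    {
        for (; *word; word++) {
            int i = child_index(*word);
            if (root->child[i] == NULL)
                root->child[i] = calloc(1, sizeof(trienode));   // check for NULL in real code
            root = root->child[i];
        }
        root->is_word = true;
    }

    static bool trie_check(const trienode *root, const char *word)
    {
        for (; *word; word++) {
            root = root->child[child_index(*word)];
            if (root == NULL)
                return false;              // no branch for this character: bad word
        }
        return root->is_word;              // branch exists, but is it a whole word?
    }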

Answered by aghast