Generate misspelled words (typos)

I have implemented a fuzzy matching algorithm and I would like to evaluate its recall using some sample queries with test data.

Let's say I have a document containing the text:

{"text": "The quick brown fox jumps over the lazy dog"}

I want to see if I can retrieve it by testing queries such as "sox" or "hazy drog" instead of "fox" and "lazy dog".

In other words, I want to add noise to strings to generate misspelled words (typos).

What would be a way of automatically generating words with typos for evaluating fuzzy search?

asked Jun 28 '18 09:06

2 Answers

I would just write a program that randomly alters letters in your words. You can adapt it to the specific requirements of your case, but the general idea goes like this.

Say you have a phrase

phrase = "The quick brown fox jumps over the lazy dog"

Then define a probability for a word to change (say 10%)

p = 0.1

Then loop over the words of your phrase and draw a sample from a uniform distribution on [0, 1) for each one. If the sample is at or below your threshold, replace one randomly chosen letter of that word:

import random
import string

new_phrase = []
words = phrase.split(' ')
for word in words:
    # Alter this word with probability p; skip empty strings left by repeated spaces
    if word and random.random() <= p:
        # Pick one position and overwrite it with a random ASCII letter
        # (the replacement can occasionally equal the original letter)
        ix = random.randrange(len(word))
        new_word = word[:ix] + random.choice(string.ascii_letters) + word[ix + 1:]
        new_phrase.append(new_word)
    else:
        new_phrase.append(word)

new_phrase = ' '.join(new_phrase)

In my case, I got the following result:

print(new_phrase)
# The quick brown fWx jumps ovey the lazy dog
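
The same idea can be wrapped in a reusable function that also covers the other common edit types (insertion, deletion, transposition) and takes a seed so that the generated test queries are reproducible. This is only a sketch: the names `add_typo` and `noisy_phrase` and the particular set of edit operations are illustrative assumptions, not something from the original answer.

```python
import random
import string


def add_typo(word, rng):
    """Apply one random edit (substitute, insert, delete or transpose) to word."""
    if not word:
        return word
    ops = ["substitute", "insert"]
    if len(word) > 1:
        ops += ["delete", "transpose"]
    op = rng.choice(ops)

    if op == "insert":
        i = rng.randrange(len(word) + 1)
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]

    i = rng.randrange(len(word))
    if op == "substitute":
        # Exclude the current letter so the word really changes
        choices = [c for c in string.ascii_lowercase if c != word[i].lower()]
        return word[:i] + rng.choice(choices) + word[i + 1:]
    if op == "delete":
        return word[:i] + word[i + 1:]

    # transpose: swap two neighbouring characters
    # (a no-op when both characters are identical, e.g. "oo")
    j = rng.randrange(len(word) - 1)
    return word[:j] + word[j + 1] + word[j] + word[j + 2:]


def noisy_phrase(phrase, p, seed=None):
    """Give each word a typo with probability p; seed makes the result repeatable."""
    rng = random.Random(seed)
    return " ".join(
        add_typo(w, rng) if rng.random() < p else w
        for w in phrase.split(" ")
    )


print(noisy_phrase("The quick brown fox jumps over the lazy dog", p=0.3, seed=42))
```

Fixing the seed is useful when comparing recall between two versions of the matcher, since both then see exactly the same noisy queries.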

answered Oct 13 '22 00:10

I haven't used this myself, but a quick Google search found https://www.dcs.bbk.ac.uk/~ROGER/corpora.html, which I guess you can use to get frequent misspellings for the words in your text.

You can also generate misspellings yourself using keyboard distance, as explained in the question "Edit distance such as Levenshtein taking into account proximity on keyboard".

There may be other databases/corpora of frequent misspellings besides the one above. I would guess that randomly inserting, deleting, or changing characters up to a total Levenshtein distance of, say, 3 will not give a useful evaluation of your system, since people don't make mistakes at random but exhibit simple, logical patterns in the types of (spelling) mistakes they make.
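
To illustrate the keyboard-distance idea, here is a rough sketch that replaces one letter with a key next to it. The neighbour table is an assumption: it is derived from a simple grid of the three QWERTY letter rows and ignores the real keyboard's row stagger, and the names `ROWS`, `build_neighbors`, `NEIGHBORS` and `keyboard_typo` are made up for this example.

```python
import random

# Letter rows of a QWERTY keyboard, treated as an unstaggered grid (an approximation)
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_neighbors(rows):
    """Map each key to the keys directly beside, above or below it in the grid."""
    neighbors = {}
    for r, row in enumerate(rows):
        for c, ch in enumerate(row):
            adjacent = set()
            for dr in (-1, 0, 1):
                rr = r + dr
                if not 0 <= rr < len(rows):
                    continue
                for dc in (-1, 0, 1):
                    cc = c + dc
                    if (dr, dc) != (0, 0) and 0 <= cc < len(rows[rr]):
                        adjacent.add(rows[rr][cc])
            # Sorted so that a seeded random generator gives repeatable results
            neighbors[ch] = sorted(adjacent)
    return neighbors


NEIGHBORS = build_neighbors(ROWS)


def keyboard_typo(word, rng):
    """Replace one letter of word with a neighbouring key, keeping its case."""
    positions = [i for i, ch in enumerate(word) if ch.lower() in NEIGHBORS]
    if not positions:
        return word  # nothing we know how to mistype
    i = rng.choice(positions)
    replacement = rng.choice(NEIGHBORS[word[i].lower()])
    if word[i].isupper():
        replacement = replacement.upper()
    return word[:i] + replacement + word[i + 1:]


rng = random.Random(0)
print([keyboard_typo(w, rng) for w in "lazy dog".split()])
```

For a more realistic evaluation, the grid could be swapped for a proper layout table, or the generated typos could be mixed with real misspellings taken from a corpus.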

answered Oct 13 '22 00:10

Igor

Tags: python, nlp, fuzzy-search

Philip Bergström

kosnik
