Generate misspelled words (typos)

I have implemented a fuzzy matching algorithm and I would like to evaluate its recall using some sample queries with test data.

Let's say I have a document containing the text:

{"text": "The quick brown fox jumps over the lazy dog"}

I want to see if I can retrieve it by testing queries such as "sox" or "hazy drog" instead of "fox" and "lazy dog".

In other words, I want to add noise to strings to generate misspelled words (typos).

What would be a way of automatically generating words with typos for evaluating fuzzy search?

Philip Bergström asked Jun 28 '18 09:06



2 Answers

I would just create a program to randomly alter letters in your words. I guess you can elaborate for specific requirements of your case, but the general idea would go like this.

Say you have a phrase

phrase = "The quick brown fox jumps over the lazy dog"

Then define a probability for a word to change (say 10%)

p = 0.1

Then loop over the words of your phrase and draw a sample from a uniform distribution for each one. If the sample is at or below your threshold, replace one randomly chosen letter of the word:

import random
import string

new_words = []
# split() without arguments skips empty strings caused by repeated spaces
for word in phrase.split():
    # With probability p, replace one randomly chosen character of the word.
    if random.random() <= p:
        ix = random.randrange(len(word))
        # The new letter may match the old one, leaving the word unchanged.
        new_char = random.choice(string.ascii_letters)
        word = word[:ix] + new_char + word[ix + 1:]
    new_words.append(word)

new_phrase = ' '.join(new_words)

In my case I got the following result:

print(new_phrase)
'The quick brown fWx jumps ovey the lazy dog'
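
The same idea can be wrapped in a function and extended to the other common single-character edits (insertion, deletion, swapping two adjacent letters). This is only a sketch: the names `add_typos` and `_random_edit`, the choice of edit operations, and the `seed` parameter are assumptions made for illustration and are not part of the snippet above.

```python
import random
import string


def add_typos(phrase, p=0.1, seed=None):
    """Return a copy of phrase where each word gets one random edit with probability p."""
    rng = random.Random(seed)  # a fixed seed makes the generated queries reproducible
    noisy = []
    for word in phrase.split():
        if rng.random() < p:
            word = _random_edit(word, rng)
        noisy.append(word)
    return " ".join(noisy)


def _random_edit(word, rng):
    """Apply one substitution, insertion, deletion, or adjacent transposition."""
    ops = ["substitute", "insert"]
    if len(word) > 1:
        ops += ["delete", "transpose"]  # these need at least two characters
    op = rng.choice(ops)
    letter = rng.choice(string.ascii_lowercase)
    if op == "insert":
        i = rng.randrange(len(word) + 1)
        return word[:i] + letter + word[i:]
    i = rng.randrange(len(word))
    if op == "substitute":
        return word[:i] + letter + word[i + 1:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    # transpose: swap character i with its right neighbour (or the left one at the end)
    j = i + 1 if i + 1 < len(word) else i - 1
    chars = list(word)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)
```

Calling `add_typos(phrase, p=0.3, seed=42)` several times returns the same noisy phrase each time, which helps when comparing recall across runs of the fuzzy matcher.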
kosnik answered Oct 13 '22 00:10



I haven't used this myself, but a quick Google search found https://www.dcs.bbk.ac.uk/~ROGER/corpora.html, which I guess you can use to get frequent misspellings for the words in your text.

You can also generate misspellings yourself using keyboard distance, as explained, I guess, in the question "Edit distance such as Levenshtein taking into account proximity on keyboard".

Perhaps there are other databases/corpora of frequent misspellings besides the one above. I would guess that just randomly inserting, deleting, or changing characters up to a total Levenshtein distance of, say, 3 will not be a useful evaluation of your system, since people don't make mistakes at random but follow simple, logical patterns in the kinds of spelling mistakes they make.
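
As a rough illustration of the keyboard-distance idea, the sketch below replaces one letter with a key that sits next to it on a QWERTY layout. The neighbour map is a hand-built approximation (it ignores the physical offset between rows), and the names `QWERTY_ROWS`, `build_neighbours`, and `keyboard_typo` are hypothetical, not taken from the linked question or corpus.

```python
import random

# Letter rows of a QWERTY keyboard; an assumption, adjust for other layouts.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_neighbours(rows):
    """Map each key to the keys at most one column away in the same or an adjacent row."""
    neighbours = {}
    for r, row in enumerate(rows):
        for c, key in enumerate(row):
            near = set()
            for rr in (r - 1, r, r + 1):
                if 0 <= rr < len(rows):
                    for cc in (c - 1, c, c + 1):
                        if 0 <= cc < len(rows[rr]):
                            near.add(rows[rr][cc])
            near.discard(key)  # a key is not its own neighbour
            neighbours[key] = sorted(near)
    return neighbours


NEIGHBOURS = build_neighbours(QWERTY_ROWS)


def keyboard_typo(word, rng=random):
    """Replace one letter of word with a neighbouring key; return word unchanged if it has no letters in the map."""
    positions = [i for i, ch in enumerate(word) if ch.lower() in NEIGHBOURS]
    if not positions:
        return word
    i = rng.choice(positions)
    new_char = rng.choice(NEIGHBOURS[word[i].lower()])
    if word[i].isupper():
        new_char = new_char.upper()  # keep the original capitalisation
    return word[:i] + new_char + word[i + 1:]
```

For example, `keyboard_typo("fox")` can only turn the "f" into one of its listed neighbours such as "d" or "g", never into a distant key like "p", which is closer to the slips people actually make than a uniformly random replacement.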

Igor answered Oct 13 '22 00:10
