I have implemented a fuzzy matching algorithm and I would like to evaluate its recall using some sample queries with test data.
Let's say I have a document containing the text:
{"text": "The quick brown fox jumps over the lazy dog"}
I want to see if I can retrieve it by testing queries such as "sox" or "hazy drog" instead of "fox" and "lazy dog".
In other words, I want to add noise to strings to generate misspelled words (typos).
What would be a way of automatically generating words with typos for evaluating fuzzy search?
Misspelled words can attain a search volume similar to that of their correctly spelled counterparts. RankWatch's Misspelled Text Generator crawls those typos and creates a keyword list that you can use for your website's keyword optimization process.
You can use RankWatch's Misspelled text maker, a free misspelled-words generator. Enter the keywords you want to generate typos for, separating each keyword with a comma, and submit by pressing Enter.
After doing that, the tool creates spelling variations of each keyword you have provided. The misspelled keyword list produced will be broad, and it will include all the alternative spelling mistakes that searchers mostly commit when performing a related search.
I would just create a program to randomly alter letters in your words. I guess you can elaborate for specific requirements of your case, but the general idea would go like this.
Say you have a phrase
phrase = "The quick brown fox jumps over the lazy dog"
Then define a probability for a word to change (say 10%)
p = 0.1
Then loop over the words of your phrase and draw a sample from a uniform distribution for each one. If the sampled value is at or below your threshold, randomly change one letter of the word:
import random
import string

new_phrase = []
words = phrase.split(' ')
for word in words:
    outcome = random.random()
    # the `word` check skips empty strings produced by repeated spaces
    if word and outcome <= p:
        # pick one position and replace the letter there with a random one
        ix = random.choice(range(len(word)))
        new_word = ''.join([word[w] if w != ix else random.choice(string.ascii_letters) for w in range(len(word))])
        new_phrase.append(new_word)
    else:
        new_phrase.append(word)
new_phrase = ' '.join(new_phrase)
In my case I got the following interesting phrase result
print(new_phrase)
The quick brown fWx jumps ovey the lazy dog
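The loop above only substitutes a letter, while queries like "hazy drog" also involve inserted characters. Here is a minimal sketch that picks one of four edit operations (substitute, insert, delete, transpose) per word. The names `add_typo` and `noisy_phrase` are hypothetical, and the seeded `random.Random` is just an assumption to make test runs repeatable:

```python
import random
import string


def add_typo(word, rng):
    """Return `word` with one random edit applied (hypothetical helper)."""
    if len(word) < 2:
        return word  # too short to edit meaningfully
    op = rng.choice(["substitute", "insert", "delete", "transpose"])
    i = rng.randrange(len(word))
    if op == "substitute":
        # choose a replacement that differs from the original character
        choices = [c for c in string.ascii_lowercase if c != word[i].lower()]
        return word[:i] + rng.choice(choices) + word[i + 1:]
    if op == "insert":
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    # transpose: swap the character at j with its right neighbour
    # (a no-op when both characters are identical, e.g. "ee")
    j = min(i, len(word) - 2)
    return word[:j] + word[j + 1] + word[j] + word[j + 2:]


def noisy_phrase(phrase, p=0.1, seed=None):
    """Apply `add_typo` to each word with probability `p`."""
    rng = random.Random(seed)
    return " ".join(
        add_typo(w, rng) if rng.random() < p else w for w in phrase.split()
    )


print(noisy_phrase("The quick brown fox jumps over the lazy dog", p=0.3, seed=42))
```

Raising `p` gives noisier queries, and fixing `seed` lets you rerun the same query set when comparing changes to the matcher.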
I haven't used this myself, but a quick Google search found https://www.dcs.bbk.ac.uk/~ROGER/corpora.html, which I guess you can use to get frequent misspellings for words in your text.
You can also generate misspellings yourself using keyboard distance, as explained, I guess, in: Edit distance such as Levenshtein taking into account proximity on keyboard.
Perhaps there are other databases/corpora of frequent misspellings besides the one referred to above. I would guess that just randomly inserting/deleting/changing characters with a total Levenshtein distance of, say, max 3 will not be a useful evaluation of your system, since people don't make mistakes randomly but exhibit simple, logical patterns in the types of (spelling) mistakes they make.
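The keyboard-distance idea could be sketched roughly as follows: replace one letter with a key that sits next to it. This assumes a QWERTY layout treated as a plain grid (row stagger is ignored, so the neighbour sets are only approximate), and `ROWS`, `build_neighbours` and `keyboard_typo` are hypothetical names, not part of any library:

```python
import random

# Letter rows of a QWERTY keyboard, treated as an unstaggered grid (an approximation).
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_neighbours(rows):
    """Map each letter to the letters in the surrounding grid cells."""
    neighbours = {}
    for r, row in enumerate(rows):
        for c, ch in enumerate(row):
            near = set()
            for dr in (-1, 0, 1):
                rr = r + dr
                if not 0 <= rr < len(rows):
                    continue
                for dc in (-1, 0, 1):
                    cc = c + dc
                    if 0 <= cc < len(rows[rr]) and (dr, dc) != (0, 0):
                        near.add(rows[rr][cc])
            neighbours[ch] = sorted(near)
    return neighbours


NEIGHBOURS = build_neighbours(ROWS)


def keyboard_typo(word, rng):
    """Replace one letter of `word` with an adjacent key, preserving case."""
    positions = [i for i, ch in enumerate(word) if ch.lower() in NEIGHBOURS]
    if not positions:
        return word  # nothing we know how to mistype
    i = rng.choice(positions)
    repl = rng.choice(NEIGHBOURS[word[i].lower()])
    if word[i].isupper():
        repl = repl.upper()
    return word[:i] + repl + word[i + 1:]


rng = random.Random(0)
print([keyboard_typo(w, rng) for w in "lazy dog".split()])
```

Typos generated this way are still synthetic, so a corpus of real misspellings remains the better check on how people actually mistype.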