I am trying to write a spell corrector in Python for a corpus of tweets I have (I am new to Python and NLTK). The tweets are in XML format and are tokenised. I have tried using the enchant.checker SpellChecker but seem to be hitting a bug with it:
>>> text = "this is sme text with a speling mistake."
>>> from enchant.checker import SpellChecker
>>> chkr = SpellChecker("en_US", text)
>>> for err in chkr:
...     err.replace("SPAM")
...
>>> chkr.get_text()
'this is SPAM text with a SPAMSSPSPAM.SSPSPAM'
when it should return "this is some text with a spelling mistake."
I have also written a spell corrector for single words that I am happy with but I am struggling to work out how to parse over the tokenised tweet files to get this to work:
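import enchant
from nltk.metrics.distance import edit_distance

class SpellingReplacer:  # class name assumed; the snippet as posted starts at __init__
    def __init__(self, dict_name='en_GB', max_dist=2):
        # Use the dict_name argument instead of a hard-coded dictionary
        self.spell_dict = enchant.Dict(dict_name)
        self.max_dist = max_dist

    def replace(self, word):
        # Correctly spelled words are returned unchanged
        if self.spell_dict.check(word):
            return word
        suggestions = self.spell_dict.suggest(word)
        # Accept the top suggestion only if it is close enough to the input
        if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
            return suggestions[0]
        else:
            return word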
Can anybody help me at all please?
Thanks
I saw your post and thought I'd do some playing around with it. This is what I got.
I added a few print statements to see what was going on:
from enchant.checker import SpellChecker

text = "this is sme text with a speling mistake."
chkr = SpellChecker("en_US", text)
for err in chkr:
    print(err.word + " at position " + str(err.wordpos)) #<----
    err.replace("SPAM")
t = chkr.get_text()
print("\n" + t) #<----
and this is the result of running the code:
sme at position 8
speling at position 25
ing at position 29
ng at position 30
AMMstake at position 32
ake at position 37
ke at position 38
AMM at position 40
this is SPAM text with a SPAMSSPSPAM.SSPSPAM
As you can see, once the misspelled words are replaced by "SPAM", the spell checker appears to keep checking the modified text rather than the original: later errors such as "AMMstake" include fragments of "SPAM" in err.word.
I tried the original code from http://pythonhosted.org/pyenchant/api/enchant.checker.html, which looks like the example you used for your question, and still got some unexpected results.
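One way to sidestep this kind of offset drift is to collect the (position, word) pairs first and apply the replacements afterwards from right to left, so that earlier offsets stay valid. The sketch below does this without enchant so that it runs on its own. The `apply_replacements` helper and the hard-coded error list are hypothetical; in practice the pairs would come from `err.wordpos` and `err.word` on an unmodified text. Note that the offsets here refer to the original string, so "speling" sits at 24 (the 25 reported above already reflects the one-character growth from replacing "sme" with "SPAM"). Whether this fully avoids the behaviour above in a given pyenchant version is an untested assumption.

```python
def apply_replacements(text, errors, replacement):
    """Replace each (position, word) span in text, working right to left."""
    # Sorting by position in descending order means a replacement never
    # shifts the offsets of spans that still have to be processed.
    for pos, word in sorted(errors, reverse=True):
        # Sanity check: the recorded word must still sit at that offset.
        if text[pos:pos + len(word)] != word:
            raise ValueError(f"{word!r} not found at offset {pos}")
        text = text[:pos] + replacement + text[pos + len(word):]
    return text


text = "this is sme text with a speling mistake."
# Hypothetical error list, with offsets measured in the original text.
errors = [(8, "sme"), (24, "speling")]
result = apply_replacements(text, errors, "SPAM")
print(result)
```

Because the replacements are applied to a plain string after the scan, nothing that has been inserted is ever tokenised again.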
Note: the only thing I added was the print statements:
Original:
>>> text = "This is sme text with a fw speling errors in it."
>>> chkr = SpellChecker("en_US",text)
>>> for err in chkr:
...     err.replace("SPAM")
...
>>> chkr.get_text()
'This is SPAM text with a SPAM SPAM errors in it.'
My Code:
from enchant.checker import SpellChecker

text = "This is sme text with a fw speling errors in it."
chkr = SpellChecker("en_US", text)
for err in chkr:
    print(err.word + " at position " + str(err.wordpos))
    err.replace("SPAM")
t = chkr.get_text()
print("\n" + t)
The output did not match the website:
sme at position 8
fw at position 25
speling at position 30
ing at position 34
ng at position 35
AMMrors at position 37 #<---- seems to add in parts of "SPAM"
This is SPAM text with a SPAM SPAMSSPSPAM in it. #<---- my output ???
Anyway, here's something I came up with that solves some of the problem. Instead of replacing with "SPAM", I use a version of the code you posted for single-word replacement and replace with an actual suggested word. It is important to note that the "suggested" word is wrong 100% of the time in this example. I've run across this issue in the past: how to implement spelling correction without user interaction. The scope of that would be far beyond your question, but I think you're going to need a wider array of NLP techniques to get accurate results.
import enchant
from enchant.checker import SpellChecker
from nltk.metrics.distance import edit_distance


class MySpellChecker:
    def __init__(self, dict_name='en_US', max_dist=2):
        self.spell_dict = enchant.Dict(dict_name)
        self.max_dist = max_dist

    def replace(self, word):
        suggestions = self.spell_dict.suggest(word)
        if suggestions:
            for suggestion in suggestions:
                # Return the first suggestion that is within max_dist edits
                # (the posted version returned suggestions[0] here, which
                # ignored which suggestion had actually passed the test)
                if edit_distance(word, suggestion) <= self.max_dist:
                    return suggestion
        # No suggestion was close enough: keep the word as it is
        return word


if __name__ == '__main__':
    text = "this is sme text with a speling mistake."
    my_spell_checker = MySpellChecker(max_dist=1)
    chkr = SpellChecker("en_US", text)
    for err in chkr:
        print(err.word + " at position " + str(err.wordpos))
        err.replace(my_spell_checker.replace(err.word))
    t = chkr.get_text()
    print("\n" + t)
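For reference, the `edit_distance(word, suggestion) <= self.max_dist` test compares Levenshtein distances, which is what NLTK's `edit_distance` computes by default. The sketch below is a plain-Python version of that metric, written only to show what is being measured; the function name `levenshtein` is made up here and is not part of NLTK or enchant.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    # previous[j] is the distance between the prefix of a processed so far
    # and b[:j]; initially that prefix is empty, so the distance is j.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# Each candidate is printed with its distance from the misspelled word.
for word, candidate in [("sme", "some"), ("sme", "same"), ("speling", "spelling")]:
    print(word, candidate, levenshtein(word, candidate))
```

With `max_dist=1`, only candidates a single edit away from the misspelled word pass the filter. Several different words can be exactly one edit away, which is part of why fully automatic correction is hard to get right.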
The problem with your spellchecker is the line
err.replace("SPAM")
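You want to feed the misspelled word to your correction function and pass the result to err.replace (calling err.replace(err.word) on its own would just put the same word back), i.e.
err.replace(replacer.replace(err.word))
where replacer is an instance of your single-word corrector.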
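As for running a word-level corrector over the tokenised tweet files, one option is to walk the token elements and pass each token's text through `replace`. The sketch below assumes a made-up XML layout (`<tweets>`, `<tweet>`, `<token>`), since the real file structure was not posted, and uses a hypothetical `LookupReplacer` with a fixed correction table in place of the enchant-based class so that it runs on its own. Any object with a `replace(word)` method could be swapped in.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: the real tweet files may use different tag names.
SAMPLE = """
<tweets>
  <tweet id="1">
    <token>this</token><token>is</token><token>sme</token><token>text</token>
  </tweet>
  <tweet id="2">
    <token>a</token><token>speling</token><token>mistake</token>
  </tweet>
</tweets>
"""


class LookupReplacer:
    """Stand-in for an enchant-based corrector: same replace(word)
    interface, but backed by a fixed table of corrections."""

    def __init__(self, corrections):
        self.corrections = corrections

    def replace(self, word):
        # Words without an entry in the table are returned unchanged.
        return self.corrections.get(word, word)


def correct_tweets(xml_text, replacer, token_tag="token"):
    """Rewrite the text of every token element and return the root."""
    root = ET.fromstring(xml_text)
    for token in root.iter(token_tag):
        if token.text:  # skip empty elements
            token.text = replacer.replace(token.text)
    return root


replacer = LookupReplacer({"sme": "some", "speling": "spelling"})
root = correct_tweets(SAMPLE, replacer)
for tweet in root.iter("tweet"):
    print(tweet.get("id"), " ".join(t.text for t in tweet.iter("token")))
```

Working token by token like this avoids the in-place replacement in `SpellChecker` altogether, at the cost of losing any context between neighbouring tokens.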