
Python spell corrector using NLTK

I am trying to write a spell corrector in Python for a corpus of tweets I have (I am new to Python and NLTK). The tweets are in XML format and are tokenised. I have tried using the SpellChecker class from enchant.checker, but I seem to be hitting a bug with it:

>>> text = "this is sme text with a speling mistake."
>>> from enchant.checker import SpellChecker
>>> chkr = SpellChecker("en_US", text)
>>> for err in chkr:
...     err.replace("SPAM")
... 
>>> chkr.get_text()
'this is SPAM text with a SPAMSSPSPAM.SSPSPAM'

when it should return "this is some text with a spelling mistake."

I have also written a spell corrector for single words that I am happy with, but I am struggling to work out how to run it over the tokenised tweet files:

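import enchant
from nltk.metrics.distance import edit_distance

class SpellingReplacer:  # class name assumed; the snippet started at __init__
    def __init__(self, dict_name='en_GB', max_dist=2):
        self.spell_dict = enchant.Dict(dict_name)
        self.max_dist = max_dist

    def replace(self, word):
        # Correctly spelled words are returned unchanged.
        if self.spell_dict.check(word):
            return word

        suggestions = self.spell_dict.suggest(word)

        # Accept the top suggestion only if it is close enough to the input.
        if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
            return suggestions[0]
        else:
            return word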

Can anybody help me at all please?

Thanks

user3361260 asked Mar 20 '23 10:03


2 Answers

I saw your post and thought I'd do some playing around with it. This is what I got.

I added a few print statements to see what was going on:

from enchant.checker import SpellChecker

text = "this is sme text with a speling mistake."

chkr = SpellChecker("en_US", text)
for err in chkr:
    print(err.word + " at position " + str(err.wordpos))  #<----
    err.replace("SPAM")

t = chkr.get_text()
print("\n" + t)  #<----

and this is the result of running the code:

sme at position 8
speling at position 25
ing at position 29
ng at position 30
AMMstake at position 32
ake at position 37
ke at position 38
AMM at position 40

this is SPAM text with a SPAMSSPSPAM.SSPSPAM

As you can see, once the misspelled words are replaced with "SPAM", the spell checker seems to keep checking the modified text rather than the original: fragments of "SPAM" (for example "AMMstake" and "AMM") show up in err.word.
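One way to sidestep this kind of offset drift is to avoid editing the text while it is being scanned: collect the error positions first, then apply the replacements from the end of the string backwards. The sketch below uses only the standard library; apply_replacements is a hypothetical helper, not part of pyenchant, and the offsets are counted by hand against the original, unmodified string.

```python
def apply_replacements(text, replacements):
    """Apply (start, old_word, new_word) replacements to text.

    Offsets refer to the original, unmodified text. Working from the
    last offset to the first keeps the earlier offsets valid even when
    a replacement changes the length of the string.
    """
    for start, old_word, new_word in sorted(replacements, reverse=True):
        end = start + len(old_word)
        # Guard against stale or wrong offsets.
        if text[start:end] != old_word:
            raise ValueError(f"expected {old_word!r} at offset {start}")
        text = text[:start] + new_word + text[end:]
    return text


text = "this is sme text with a speling mistake."
# Offsets in the ORIGINAL string: "sme" starts at 8, "speling" at 24.
fixed = apply_replacements(text, [(8, "sme", "SPAM"), (24, "speling", "SPAM")])
print(fixed)  # this is SPAM text with a SPAM mistake.
```

Because the string is only ever modified to the right of the offsets still waiting to be processed, no offset needs to be adjusted after a replacement.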

I tried the original code from http://pythonhosted.org/pyenchant/api/enchant.checker.html, which looks like the example you used for your question, and still got some unexpected results.

Note: the only thing I added was the print statements:

Original:

>>> text = "This is sme text with a fw speling errors in it."
>>> chkr = SpellChecker("en_US",text)
>>> for err in chkr:
...   err.replace("SPAM")
...
>>> chkr.get_text()
'This is SPAM text with a SPAM SPAM errors in it.'

My Code:

from enchant.checker import SpellChecker

text = "This is sme text with a fw speling errors in it."

chkr = SpellChecker("en_US", text)
for err in chkr:
    print(err.word + " at position " + str(err.wordpos))
    err.replace("SPAM")

t = chkr.get_text()
print("\n" + t)

The output did not match the website:

sme at position 8
fw at position 25
speling at position 30
ing at position 34
ng at position 35
AMMrors at position 37  #<---- seems to add in parts of "SPAM"

This is SPAM text with a SPAM SPAMSSPSPAM in it.  #<---- my output ???

Anyway, here's something I came up with that solves part of the problem. Instead of replacing with "SPAM", I use a version of the code you posted for single-word replacement and replace each error with an actual suggested word. It is important to note here that the "suggested" word is wrong 100% of the time in this example. I've run across this issue in the past: how to implement spelling correction without user interaction. The scope of that is far beyond your question, but I think you're going to need a wider array of NLP techniques to get accurate results.
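For context, the max_dist filter in the single-word corrector compares the misspelled word with a suggestion by edit distance. As far as I know, nltk's edit_distance with default arguments is the plain Levenshtein distance; the function below is a minimal standard-library sketch of that idea (levenshtein is a name I made up for illustration, not nltk's implementation).

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("sme", "some"))          # 1
print(levenshtein("speling", "spelling"))  # 1
```

With max_dist=1, both "some" and "spelling" would pass the filter, since each is one insertion away from the misspelled word.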

import enchant
from enchant.checker import SpellChecker
from nltk.metrics.distance import edit_distance

class MySpellChecker():

    def __init__(self, dict_name='en_US', max_dist=2):
        self.spell_dict = enchant.Dict(dict_name)
        self.max_dist = max_dist

    def replace(self, word):
        suggestions = self.spell_dict.suggest(word)

        if suggestions:
            for suggestion in suggestions:
                # Return the first suggestion that is within max_dist of the word.
                if edit_distance(word, suggestion) <= self.max_dist:
                    return suggestion

        return word


if __name__ == '__main__':
    text = "this is sme text with a speling mistake."

    my_spell_checker = MySpellChecker(max_dist=1)
    chkr = SpellChecker("en_US", text)
    for err in chkr:
        print(err.word + " at position " + str(err.wordpos))
        err.replace(my_spell_checker.replace(err.word))

    t = chkr.get_text()
    print("\n" + t)
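Since your tweets are already tokenised, you could also skip SpellChecker's own tokenizer and apply a single-word replace function to each token in turn. Here is a rough sketch of that loop. The question doesn't show the XML layout, so parsing is left out and the tweets are assumed to be plain lists of tokens; correct_tokens and toy_replace are hypothetical names, and toy_replace is only a stand-in so the example runs without pyenchant (in practice you would pass something like my_spell_checker.replace).

```python
def correct_tokens(tokens, replace):
    """Return a new list with replace() applied to each word token."""
    corrected = []
    for token in tokens:
        # Only purely alphabetic tokens are checked; hashtags, mentions
        # and punctuation are passed through untouched.
        if token.isalpha():
            corrected.append(replace(token))
        else:
            corrected.append(token)
    return corrected


# Hypothetical stand-in for a real single-word corrector.
TOY_FIXES = {"sme": "some", "speling": "spelling"}


def toy_replace(word):
    return TOY_FIXES.get(word, word)


tweets = [
    ["this", "is", "sme", "text", "#nlp"],
    ["a", "speling", "mistake", "@someone", "."],
]
for tweet in tweets:
    print(correct_tokens(tweet, toy_replace))
```

Working token by token means no character offsets are involved at all, so the position problem shown earlier cannot occur.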
Coaden answered Apr 06 '23 01:04


The problem with your spellchecker is the line

err.replace("SPAM")

You want to feed the misspelled word to the function, i.e.

err.replace(err.word)
Ben Olayinka answered Apr 05 '23 23:04