I'm using NLTK to process some text that is extracted from PDF files. I can recover the text mostly intact, but there are lots of instances where spaces between words are not captured, so I get words like "ifI" instead of "if I", or "thatposition" instead of "that position", or "andhe's" instead of "and he's".
My question is this: how can I use NLTK to look for words it does not recognize/has not learned, and see if there are "nearby" word combinations that are much more likely to occur? Is there a more graceful way to implement this kind of check than simply marching through the unrecognized word, one character at a time, splitting it, and seeing if it makes two recognizable words?
I would suggest using pyenchant instead, since it is a more robust solution for this sort of problem. You can download pyenchant here. Here is an example of how to get the result you want once it is installed:
>>> text = "IfI am inthat position, Idon't think I will." # note the lack of spaces
>>> from enchant.checker import SpellChecker
>>> checker = SpellChecker("en_US")
>>> checker.set_text(text)
>>> for error in checker:
...     for suggestion in error.suggest():
...         # accept a suggestion only if it has exactly the same characters
...         # as the error, in the same order, ignoring spaces
...         if error.word.replace(' ', '') == suggestion.replace(' ', ''):
...             error.replace(suggestion)
...             break
...
>>> checker.get_text()
"If I am in that position, I don't think I will." # text is now fixed
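For comparison, if installing pyenchant is not an option, the character-by-character split described in the question can be sketched in plain Python. This is only a minimal sketch under stated assumptions: the function name split_run_together and the tiny VOCAB set are hypothetical, made up for illustration, and in practice the vocabulary would come from a real word list (for example one loaded through NLTK). It also only considers splits into exactly two words.

```python
def split_run_together(token, vocabulary):
    """Return every (left, right) split of token where both halves are known words.

    `vocabulary` is assumed to be a set of lowercase words.
    """
    if token.lower() in vocabulary:
        return []  # already a known word, so there is nothing to split

    candidates = []
    # Try every split point between the first and last character.
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left.lower() in vocabulary and right.lower() in vocabulary:
            candidates.append((left, right))
    return candidates


# Tiny hypothetical vocabulary, for illustration only.
VOCAB = {"if", "i", "in", "that", "position", "and", "he's", "don't"}

for token in ["ifI", "thatposition", "andhe's", "Idon't", "position"]:
    print(token, "->", split_run_together(token, VOCAB))
```

The function returns all valid splits rather than the first one, so the caller can decide how to rank them when a token can be split in more than one way. That ranking is the part a spell checker's suggestion list handles for you.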