Tokenizing unsplit words from OCR using NLTK

Question

I'm using NLTK to process text extracted from PDF files. I can recover the text mostly intact, but in many places the space between two words is not captured, so I get words like "ifI" instead of "if I", "thatposition" instead of "that position", or "andhe's" instead of "and he's".

My question is this: how can I use NLTK to find words it does not recognize (or has not learned) and check whether there are "nearby" word combinations that are much more likely to occur? Is there a more graceful way to implement this check than marching through the unrecognized word one character at a time, splitting it, and seeing whether the two pieces are both recognizable words?

Justin O Barber · Accepted Answer

I would suggest using pyenchant instead, since it is a more robust solution for this sort of problem. It can be installed from PyPI (pip install pyenchant). Here is an example of how you would obtain your results once it is installed:

>>> text = "IfI am inthat position, Idon't think I will."  # note the missing spaces
>>> from enchant.checker import SpellChecker
>>> checker = SpellChecker("en_US")
>>> checker.set_text(text)
>>> for error in checker:
...     for suggestion in error.suggest():
...         # Accept only a suggestion that has exactly the same characters,
...         # in the same order, as the misspelled word once spaces are ignored.
...         if error.word.replace(' ', '') == suggestion.replace(' ', ''):
...             error.replace(suggestion)
...             break
...
>>> checker.get_text()  # the text is now fixed
"If I am in that position, I don't think I will."
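
For comparison, the character-by-character check described in the question can be sketched in plain Python. This is only a minimal illustration, not part of pyenchant or NLTK: the KNOWN_WORDS set is a tiny hypothetical stand-in for a real word list, and the names is_known and split_fused_word are made up for this example. It only handles a word that is two known words fused together.

```python
# Hypothetical stand-in vocabulary; in practice this could be loaded from a
# dictionary file or a corpus word list.
KNOWN_WORDS = {"if", "i", "am", "in", "that", "position", "and", "he's"}


def is_known(word, vocabulary=KNOWN_WORDS):
    """Case-insensitive membership test against the vocabulary."""
    return word.lower() in vocabulary


def split_fused_word(word, vocabulary=KNOWN_WORDS):
    """Return (left, right) for the first split where both halves are known.

    Returns None if the word is already known or if no split position
    yields two known words.
    """
    if is_known(word, vocabulary):
        return None
    # Try every split position from left to right.
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if is_known(left, vocabulary) and is_known(right, vocabulary):
            return left, right
    return None


if __name__ == "__main__":
    for token in ["ifI", "thatposition", "andhe's", "position"]:
        print(token, "->", split_fused_word(token))
```

With this toy vocabulary, "ifI" splits into ("if", "I") and "position" is left alone because it is already a known word. The sketch takes the first split that works, so with a large vocabulary it can pick an unlikely split; ranking candidates, as the spell checker's suggestions do, is what makes the pyenchant approach more robust.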

Tags: python, split, tokenize, ocr, nltk

charlesreid1
