Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What algorithms can group characters into words?

Tags:

I have some text generated by some lousy OCR software.

The output contains mixture of words and space-separated characters, which should have been grouped into words. For example,

Expr e s s i o n Syntax
S u m m a r y o f T e r minology 

should have been

Expression Syntax
Summary of Terminology 

What algorithms can group characters into words?

If I program in Python, C#, Java, C or C++, what libraries provide the implementation of the algorithms?

Thanks.

like image 999
Tim Avatar asked Jul 04 '17 00:07

Tim


1 Answers

Minimal approach:

  1. In your input, remove the space before any single letter words. Mark the final words created as part of this somehow (prefix them with a symbol not in the input, for example).
  2. Get a dictionary of English words, sorted longest to shortest.
  3. For each marked word in your input, find the longest match and break that off as a word. Repeat on the characters left over in the original "word" until there's nothing left over. (In the case where there's no match just leave it alone.)

More sophisticated, overkill approach:

The problem of splitting words without spaces is a real-world problem in languages commonly written without spaces, such as Chinese and Japanese. I'm familiar with Japanese so I'll mainly speak with reference to that.

Typical approaches use a dictionary and a sequence model. The model is trained to learn transition properties between labels - part of speech tagging, combined with the dictionary, is used to figure out the relative likelihood of different potential places to split words. Then the most likely sequence of splits for a whole sentence is solved for using (for example) the Viterbi algorithm.

Creating a system like this is almost certainly overkill if you're just cleaning OCR data, but if you're interested it may be worth looking into.


A sample case where the more sophisticated approach will work and the simple one won't:

  • input: Playforthefunofit
  • simple output: Play forth efunofit (forth is longer than for)
  • sophistiated output: Play for the fun of it (forth efunofit is a low-frequency - that is, unnatural - transition, while for the is not)

You can work around the issue with the simple approach to some extent by adding common short-word sequences to your dictionary as units. For example, add forthe as a dictionary word, and split it in a post processing step.

Hope that helps - good luck!

like image 105
polm23 Avatar answered Oct 11 '22 12:10

polm23