Questions Linux Laravel Mysql Ubuntu Git Menu

HTML CSS JAVASCRIPT SQL PYTHON PHP BOOTSTRAP JAVA JQUERY R React Kotlin

What algorithms can group characters into words?

Tags:

I have some text generated by some lousy OCR software.

The output contains mixture of words and space-separated characters, which should have been grouped into words. For example,

Expr e s s i o n Syntax
S u m m a r y o f T e r minology

should have been

Expression Syntax
Summary of Terminology

What algorithms can group characters into words?

If I program in Python, C#, Java, C or C++, what libraries provide the implementation of the algorithms?

Thanks.

like image

999

asked Jul 04 '17 00:07

Tim

1 Answers

Minimal approach:

In your input, remove the space before any single letter words. Mark the final words created as part of this somehow (prefix them with a symbol not in the input, for example).
Get a dictionary of English words, sorted longest to shortest.
For each marked word in your input, find the longest match and break that off as a word. Repeat on the characters left over in the original "word" until there's nothing left over. (In the case where there's no match just leave it alone.)

More sophisticated, overkill approach:

The problem of splitting words without spaces is a real-world problem in languages commonly written without spaces, such as Chinese and Japanese. I'm familiar with Japanese so I'll mainly speak with reference to that.

Typical approaches use a dictionary and a sequence model. The model is trained to learn transition properties between labels - part of speech tagging, combined with the dictionary, is used to figure out the relative likelihood of different potential places to split words. Then the most likely sequence of splits for a whole sentence is solved for using (for example) the Viterbi algorithm.

Creating a system like this is almost certainly overkill if you're just cleaning OCR data, but if you're interested it may be worth looking into.

A sample case where the more sophisticated approach will work and the simple one won't:

input: Playforthefunofit
simple output: Play forth efunofit (forth is longer than for)
sophistiated output: Play for the fun of it (forth efunofit is a low-frequency - that is, unnatural - transition, while for the is not)

You can work around the issue with the simple approach to some extent by adding common short-word sequences to your dictionary as units. For example, add forthe as a dictionary word, and split it in a post processing step.

Hope that helps - good luck!

like image

105

answered Oct 11 '22 12:10

polm23

Sign in to Comment

Related questions
                            
                                Using a TensorFlow .pb graph as a Keras Model
                            
                                Language code detection of spoken language in Google Speech API [duplicate]
                            
                                Presenting a UIViewController with background transparent in swift 3.0
                            
                                How to use services in custom pipes in angular 2
                            
                                How do I configure MongoRepository to use the new MongoClient API?
                            
                                What do green-shaded boxes in Spark DAG Visualization mean?
                            
                                IE 11 on Windows 7 VS IE 11 on Windows 10
                            
                                How to have only unique options in select picker drop down?
                            
                                type of object while implementing interface
                            
                                Hyperledger fabcar sample fabric showing connect failed ERROR
                            
                                Expected condition failed: waiting for visibility of element located by By.xpath
                            
                                Verbose mode on spies in Mockito

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With