It's part of the OCR process, namely:
How do you segment sentences into words, and then words into characters?
What are candidate algorithms for this task?
As a first pass: segment wherever a sufficiently large run of blank (ink-free) pixel columns separates the text.
Now all you need is a good enough definition of "large".
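Purely to illustrate that first pass (this is not part of the NIST protocol discussed below), here is a minimal Python sketch that projects the ink onto the columns and cuts at runs of blank columns. The function name, the min_gap values, and the assumption that the input is an already-binarized line image are all mine:

import numpy as np

def segment_by_gaps(binary_img, min_gap):
    """Split a binarized text-line image (1 = ink, 0 = background) at every
    run of at least `min_gap` consecutive blank columns.
    Returns (start_col, end_col) pairs, end exclusive."""
    col_has_ink = binary_img.any(axis=0)
    segments, start, gap = [], None, 0
    for x, ink in enumerate(col_has_ink):
        if ink:
            if start is None:
                start = x          # a new segment begins at this column
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:     # the gap is "large": close the segment
                segments.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:          # close a segment that runs to the edge
        segments.append((start, len(col_has_ink) - gap))
    return segments

# A wide threshold tends to separate words, a narrow one characters:
# words = segment_by_gaps(line, min_gap=12)
# chars = segment_by_gaps(line, min_gap=2)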
First, NIST (the National Institute of Standards and Technology) published a protocol known as the NIST Form-Based Handwriting Recognition System about 15 years ago for this exact question, i.e., extracting and preparing text-as-image data for input to machine learning algorithms for OCR. Members of this group at NIST also published a number of papers on this System.
The performance of their classifier was demonstrated with data published alongside the algorithm (the "NIST Handwriting Sample Forms").
Each of the half-dozen or so OCR data sets I have downloaded and used references the data extraction/preparation protocol used by NIST to prepare the data for input to their algorithm. In particular, I am fairly sure this is the methodology relied on to prepare the Boston University Handwritten Digit Database, which is regarded as benchmark reference data for OCR.
So even if the NIST protocol is not a genuine standard, it is at least a proven methodology for preparing text-as-image data for input to an OCR algorithm. I would suggest starting there and using that protocol to prepare your data unless you have a good reason not to.
In sum, the NIST data was prepared by extracting 32 x 32 pixel normalized binary bitmaps directly from a pre-printed form.
Here's an example:
00000000000001100111100000000000
00000000000111111111111111000000
00000000011111111111111111110000
00000000011111111111111111110000
00000000011111111101000001100000
00000000011111110000000000000000
00000000111100000000000000000000
00000001111100000000000000000000
00000001111100011110000000000000
00000001111100011111000000000000
00000001111111111111111000000000
00000001111111111111111000000000
00000001111111111111111110000000
00000001111111111111111100000000
00000001111111100011111110000000
00000001111110000001111110000000
00000001111100000000111110000000
00000001111000000000111110000000
00000000000000000000001111000000
00000000000000000000001111000000
00000000000000000000011110000000
00000000000000000000011110000000
00000000000000000000111110000000
00000000000000000001111100000000
00000000001110000001111100000000
00000000001110000011111100000000
00000000001111101111111000000000
00000000011111111111100000000000
00000000011111111111000000000000
00000000011111111110000000000000
00000000001111111000000000000000
00000000000010000000000000000000
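For context only, here is a minimal Python sketch (using Pillow and NumPy) of how a character image can be reduced to a 32 x 32 normalized binary bitmap like the one above. This is an illustration under my own assumptions, not the actual NIST normalization code, and the file name and threshold are made up:

import numpy as np
from PIL import Image

def to_normalized_bitmap(path, size=32, threshold=128):
    """Crop the ink to its bounding box, rescale to size x size, and
    binarize.  Illustrative only -- not the exact NIST procedure."""
    gray = np.asarray(Image.open(path).convert("L"))
    ink = gray < threshold                       # dark pixels count as ink
    rows = np.where(ink.any(axis=1))[0]
    cols = np.where(ink.any(axis=0))[0]
    crop = ink[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
    resized = Image.fromarray(crop.astype(np.uint8) * 255).resize(
        (size, size), Image.NEAREST)
    return (np.asarray(resized) > 0).astype(np.uint8)

# Printed row by row, the result has the same 0/1 form as the example above:
# for row in to_normalized_bitmap("digit.png"):
#     print("".join(str(v) for v in row))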
I believe the BU data-preparation technique subsumes the NIST technique but adds a few steps at the end, not with higher fidelity in mind but to reduce file size. In particular, the BU group: