I am interested in using OCR to extract bold and italic words from a simple text. For example, if I input a clear image with text like so:
"The quick brown fox jumps over the lazy dog."
I would like to get an output like so: bold("brown", "jumps"), italic("lazy")
I have looked into doing this with OCRopus or Tesseract, but the documentation is poor and I can't tell if it's possible, or how to do it if it is.
Tesseract 3.0.1 (from trunk) has such a function. A new class, ResultIterator, has been added to the API, and it provides the function you are interested in:
WordFontAttributes(bool* is_bold,
bool* is_italic,
bool* is_underlined,
bool* is_monospace,
bool* is_serif,
bool* is_smallcaps,
int* pointsize,
int* font_id).
Actually you can see it yourself from here.
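To make the call above concrete, here is a minimal sketch of walking the recognized words and sorting them by the bold and italic flags. The names StyledWords and CollectStyledWords are made up for this example, and the iterator is a template parameter only so that the snippet compiles without the Tesseract headers. It assumes the iterator offers WordFontAttributes, GetUTF8Text and Next with the same meaning as on ResultIterator, and that GetUTF8Text returns a buffer the caller frees with delete[].

```cpp
#include <string>
#include <vector>

// Hypothetical container for this sketch: words grouped by reported style.
struct StyledWords {
  std::vector<std::string> bold;
  std::vector<std::string> italic;
};

// Visits every word and records the ones flagged bold or italic.
// Iterator is assumed to behave like tesseract::ResultIterator and
// word_level like tesseract::RIL_WORD.
template <typename Iterator, typename Level>
StyledWords CollectStyledWords(Iterator* it, Level word_level) {
  StyledWords out;
  if (it == nullptr) return out;  // nothing was recognized
  do {
    // Flags start as false so a word without font data counts as plain.
    bool bold = false, italic = false, underlined = false;
    bool monospace = false, serif = false, smallcaps = false;
    int pointsize = 0, font_id = 0;
    it->WordFontAttributes(&bold, &italic, &underlined, &monospace,
                           &serif, &smallcaps, &pointsize, &font_id);
    char* text = it->GetUTF8Text(word_level);  // assumed: caller owns buffer
    if (text != nullptr) {
      if (bold) out.bold.push_back(text);
      if (italic) out.italic.push_back(text);
      delete[] text;
    }
  } while (it->Next(word_level));
  return out;
}

// With the real library the call would look roughly like this (untested,
// and "input.png" is a placeholder):
//   tesseract::TessBaseAPI api;
//   api.Init(nullptr, "eng");
//   api.SetImage(pixRead("input.png"));
//   api.Recognize(nullptr);
//   StyledWords words = CollectStyledWords(api.GetIterator(),
//                                          tesseract::RIL_WORD);
```

For the sentence in the question, a correct recognition would leave "brown" and "jumps" in bold and "lazy" in italic. How reliable the flags are depends on the Tesseract version and the image, so treat them as hints rather than guarantees.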
Tesseract 3.0x's XML-based hOCR output format includes character attributes. You may want to try that.
http://code.google.com/p/tesseract-ocr/issues/detail?id=377#c5
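If you go the hOCR route, the attributes have to be pulled back out of the markup. The sketch below assumes that each word sits in a span with class ocrx_word and that bold and italic words are wrapped in <strong> and <em> inside that span. Check this against the output of your own Tesseract build, since the exact markup may differ between versions. HocrStyledWords, StripTags and ParseHocrStyles are names invented for this example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical container for this sketch.
struct HocrStyledWords {
  std::vector<std::string> bold;
  std::vector<std::string> italic;
};

// Removes every <...> tag from a fragment, leaving only its text.
std::string StripTags(const std::string& fragment) {
  static const std::regex tag("<[^>]*>");
  return std::regex_replace(fragment, tag, "");
}

// Scans hOCR text for word spans and records the styled ones.
// Assumption: bold is marked with <strong> and italic with <em>.
HocrStyledWords ParseHocrStyles(const std::string& hocr) {
  // One word span: an opening tag with class ocrx_word, then the shortest
  // run of inner markup up to the next closing </span>.
  static const std::regex word_span(
      "<span[^>]*class=['\"]ocrx_word['\"][^>]*>(.*?)</span>");
  HocrStyledWords out;
  std::sregex_iterator match(hocr.begin(), hocr.end(), word_span);
  const std::sregex_iterator end;
  for (; match != end; ++match) {
    const std::string inner = (*match)[1].str();
    const std::string text = StripTags(inner);
    if (text.empty()) continue;
    if (inner.find("<strong>") != std::string::npos) out.bold.push_back(text);
    if (inner.find("<em>") != std::string::npos) out.italic.push_back(text);
  }
  return out;
}
```

A regular expression is enough for a quick experiment, but it does not decode entities such as &amp;amp; and it will cut a word short if further spans are nested inside the word span. A real HTML or XML parser is the safer choice for anything beyond that.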