I'm using the Google Vision API to extract the text from some pictures, however, I have been trying to improve the accuracy (confidence) of the results with no luck.
every time I change the image from the original I lose accuracy in detecting some characters.
I have isolated the issue to have multiple colors for different words with can be seen that words in red for example have incorrect results more often than the other words.
Example:
some variations on the image from gray scale or b&w
What ideas can I try to make this work better, specifically changing the colors of text to a uniform color or just black on a white background since most algorithms expect that?
some ideas I already tried, also some thresholding.
dimg = ImageOps.grayscale(im)
cimg = ImageOps.invert(dimg)
contrast = ImageEnhance.Contrast(dimg)
eimg = contrast.enhance(1)
sharp = ImageEnhance.Sharpness(dimg)
eimg = sharp.enhance(1)
Inevitably, noise in an input image, non-standard fonts that Tesseract wasn't trained on, or less than ideal image quality will cause Tesseract to make a mistake and incorrectly OCR a piece of text. When that happens, you need to create rules and heuristics that can be used to improve the output OCR quality.
Measuring OCR accuracy is done by taking the output of an OCR run for an image and comparing it to the original version of the same text. You can then either count how many characters were detected correctly (character level accuracy), or count how many words were recognized correctly (word level accuracy).
While Tesseract is known as one of the most accurate free OCR engines available today, it has numerous limitations that dramatically affect its performance; its ability to correctly recognize characters in a scan or image.
Bird’s-Eye View On OCR Accuracy. When it comes to improving OCR accuracy, you basically have two moving parts in the equation. If the quality of the original source image is good, i.e. if the human eyes can see the original source clearly, it will be possible to achieve good OCR results.
For smaller text (font size < 8), it is recommended to have 400–600 DPI Anything above 600 DPI will only increase the image size thereby increasing the processing time but shows no improvement with OCR accuracy. Binarization is the process of converting a colored image (RGB) into a black and white image.
Reading text from image documents using any OCR engine have many issues in order get good accuracy. There is no fixed solution to all the cases but here are a few things which should be considered to improve OCR results. 1) Presence of noise due to poor image quality / unwanted elements/blobs in the background region.
Performing OCR on a document is literally a no-brainer because PDFelement tells you exactly what to do. The moment you open a non-editable PDF file or use the Create PDF to convert an image to PDF, it recognizes this and prompts you to install the OCR plugin and perform OCR. Here’s what you’ll see on your screen:
I can only offer a butcher's solution, potentially a nightmare to maintain.
In my own, very limited scenario, it worked like a charm where several other OCR engines either failed or had unacceptable running times.
My prerequisites:
What I did: - I measured the kerning width of each character. I only had A-Za-z0-9 and a bunch of punctuation characters to worry about. - The program would start at position (0,0), measure the average color to determine the color, then access the whole set of bitmaps generated from characters in all available fonts in that color. Then it would determine which rectangle was closest to the corresponding rectangle on the screen, and advance to the next one.
(Months later, requiring more performances, I added a varying probability matrix to test first the most likely characters).
In the end, the resulting C program was able to read the subtitles out of the video stream with 100% accuracy in real time.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With