At work, I sometimes have to take printed source code and manually type it into a text editor. Do not ask why. Typing it up takes a long time, and debugging the typing errors always takes extra time (oops, missed a "$" sign there).

I decided to try some OCR solutions:

- Microsoft Document Imaging (has built-in OCR)
  - Result: Missed all the leading whitespace, missed all the underscores, and misread many of the punctuation characters.
  - Conclusion: Slower than typing the code in manually.
- Various online OCR web apps
  - Result: Similar to or worse than Microsoft Document Imaging.
  - Conclusion: Slower than typing the code in manually.

I feel like source code should be very easy to OCR, given that the font is sans serif and monospaced. Have any of you found a good OCR solution that works well on source code? Or maybe I just need a better OCR solution (not necessarily one specific to source code)?


Need good OCR for printed source code listing, any ideas?

1 Answer

Google Drive's built-in OCR worked pretty well for me. Convert the scans to a PDF, upload it to Google Drive, and choose "Open with... Google Docs". The result has some odd colors and text sizes, but it still picks up semicolons and similar punctuation.

The original screenshot: https://i.stack.imgur.com/AgffY.png

The Google Docs OCR: https://i.stack.imgur.com/0wCeH.png

Plaintext version:

    #include <stdio.h>
    int main(void) {
    char word[51];
    int contains = -1;
    int i = 0;
    int length = 0;
    scanf("%s", word);
    while (word[length] != "\0") i ++;
    while ((contains == 1 || contains == 2) && word[i] != "\0") {
    if (word[i] == "t" || word[i] == "T") {
    if (i <= length / 2) {
    contains = 1;
    } else contains = 2;
    return 0;

196

answered Sep 27 '22 20:09

FuturrCoder
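
Whichever OCR tool is used, the recognized listing usually still needs a proofreading pass before it compiles, since punctuation is what OCR tends to get wrong. The following is a minimal sketch, not part of the original answer, of one cheap check that could be run on the OCR text first: it reports brackets that have no partner. The function name `find_unbalanced` and the sample input are hypothetical, and the scanner is deliberately simple (it skips quoted literals on a single line but does not understand comments).

```python
# Hypothetical helper: flag unmatched brackets in OCR'd source text.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def find_unbalanced(source: str) -> list[tuple[int, str]]:
    """Return (line_number, bracket) for every bracket without a partner.

    Brackets inside single- or double-quoted literals are ignored.
    Simplification: comments and multi-line strings are not handled.
    """
    problems = []
    stack = []  # (line_number, opening_bracket) still waiting for a close
    for line_number, line in enumerate(source.splitlines(), start=1):
        quote = None      # the quote character we are currently inside
        escaped = False   # True right after a backslash inside a literal
        for ch in line:
            if quote:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == quote:
                    quote = None
            elif ch in "\"'":
                quote = ch
            elif ch in OPENERS:
                stack.append((line_number, ch))
            elif ch in PAIRS:
                if stack and stack[-1][1] == PAIRS[ch]:
                    stack.pop()
                else:
                    problems.append((line_number, ch))
    # Anything left on the stack was opened but never closed.
    problems.extend(stack)
    return sorted(problems)


if __name__ == "__main__":
    # Made-up OCR output with a closing brace missing at the end.
    ocr_text = "int main(void) {\nif (x == 1) {\ny = 2; }\nreturn 0;\n"
    for line_number, bracket in find_unbalanced(ocr_text):
        print(f"line {line_number}: unmatched {bracket!r}")
```

A check like this only narrows down where to look. It cannot detect a missed underscore or a misread "$", so a compile or a side-by-side read against the printout is still needed.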



Tags:

ocr

Trevor Boyd Smith
