I have been trying to do OCR within R (reading data from PDFs that contain scanned images). I have been reading about this at http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/
This is a very good post.
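Effectively there are 3 steps: convert the PDF to PPM images, convert the PPM images to a TIF, and run Tesseract on the TIF to produce a text file.
The code for these 3 steps, as per the linked post: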
lapply(myfiles, function(i){
  # convert pdf to ppm (an image format), just pages 1-10 of the PDF
  # but you can change that easily, just remove or edit the
  # -f 1 -l 10 bit in the line below
  shell(shQuote(paste0("F:/xpdf/bin64/pdftoppm.exe ", i, " -f 1 -l 10 -r 600 ocrbook")))
  # convert ppm to tif ready for tesseract
  shell(shQuote(paste0("F:/ImageMagick-6.9.1-Q16/convert.exe *.ppm ", i, ".tif")))
  # convert tif to text file
  shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
  # delete tif file
  file.remove(paste0(i, ".tif" ))
})
The first two steps work fine (although they take a good amount of time for a 4-page PDF; I will look into scalability later, once I know whether this works at all).
The problem occurs when running the 3rd step, i.e.
shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
I get this error:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
Or Tesseract crashes.
Any workaround or root cause analysis would be appreciated.
Using the tesseract package, I created a sample script which works. It even works for scanned PDFs.
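library(tesseract)
library(pdftools)
# Render each page of the PDF to a TIFF image (returns the image file names)
img_file <- pdftools::pdf_convert("F:/gowtham/A/B/invoice.pdf", format = 'tiff', dpi = 400)
# Extract text from the TIFF image(s); one string per page
text <- ocr(img_file)
# Write the recognised text as plain lines (no quotes or row names)
writeLines(text, "F:/gowtham/A/B/mydata.txt")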
I'm new to R and programming, so correct me if anything is wrong. Hope this helps.
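To apply the same idea to several PDFs, as in the question's lapply over myfiles, a minimal sketch might look like the following. The helper names txt_path_for and ocr_pdf_to_txt are made up for illustration, and the sketch assumes the pdftools and tesseract packages are installed and that pdf_convert() writes its images to the current working directory.

```r
# Hypothetical helper: derive the output text path from the PDF path.
txt_path_for <- function(pdf) {
  paste0(tools::file_path_sans_ext(pdf), ".txt")
}

# Hypothetical helper: OCR every page of one scanned PDF and save the text.
ocr_pdf_to_txt <- function(pdf, dpi = 400) {
  # Render each page to a TIFF file; pdf_convert() returns the image file names
  img_files <- pdftools::pdf_convert(pdf, format = "tiff", dpi = dpi)
  # Remove the intermediate images when the function exits
  on.exit(file.remove(img_files), add = TRUE)
  # ocr() returns one character string per image
  text <- tesseract::ocr(img_files)
  out <- txt_path_for(pdf)
  writeLines(text, out)
  invisible(out)
}

# Usage (myfiles is the vector of PDF paths from the question):
# lapply(myfiles, ocr_pdf_to_txt)
```

Each PDF ends up with a .txt file next to it, and the TIFF files are cleaned up even if the OCR step fails.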
The newly released tesseract package might be worth checking out. It allows you to perform the whole process inside of R without the shell calls.
Following the procedure used in the help documentation of the tesseract package, your function would look something like this:
library(tesseract)
library(pdftools)

lapply(myfiles, function(i){
  # convert one page of the pdf to tiff and perform tesseract OCR on the image
  tiff_file <- paste0(i, ".tiff")
  # render the first page of the PDF as a numeric (0-1) bitmap,
  # which is the form tiff::writeTIFF expects
  bitmap <- pdf_render_page(i, page = 1, dpi = 300, numeric = TRUE)
  tiff::writeTIFF(bitmap, tiff_file)
  # perform OCR on the .tiff file
  out <- ocr(tiff_file)
  # delete tiff file
  file.remove(tiff_file)
  # return the recognised text
  out
})
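If you would rather keep calling the tesseract executable yourself, one thing to try is base R's system2(), quoting each argument separately instead of wrapping the whole command string in shQuote(). This is only a sketch and not a confirmed fix for the error in the question; tesseract_args and run_tesseract are hypothetical helper names, and the executable path is the one from the question.

```r
# Hypothetical helper: build the argument vector for one tesseract call.
# Each file argument is quoted on its own, so paths with spaces survive.
tesseract_args <- function(tif, out_base, lang = "eng") {
  c(shQuote(tif), shQuote(out_base), "-l", lang)
}

# Hypothetical wrapper around system2(); the default exe path is taken
# from the question and will differ on other machines.
run_tesseract <- function(tif, out_base,
                          exe = "F:/Tesseract-OCR/tesseract.exe") {
  status <- system2(exe, args = tesseract_args(tif, out_base))
  # system2() returns the exit status of the command; 0 means success
  if (status != 0) warning("tesseract exited with status ", status)
  invisible(status)
}

# Usage inside the question's loop:
# run_tesseract(paste0(i, ".tif"), i)
```

Checking the exit status also makes it easier to see which file Tesseract failed on, rather than having the whole lapply stop.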