
Batch OCR Program for PDFs [closed]

This has been asked before, but I don't really know if the answers help me. Here is my problem: I have a bunch of (10,000 or so) PDF files. Some were text files that were saved using Adobe's print feature (so their text is perfect and I don't want to risk screwing them up). And some were scanned images (so they don't have any text and I will have to settle for OCR). The files are in the same directory and I can't tell which is which. Ultimately I want to turn them into .txt files and then do string processing on them. So I want the most accurate OCR possible.

It seems like people have recommended:

  1. Adobe PDF (I don't have a licensed copy of this, so ... plus if ABBYY FineReader or something is better, why pay for it if I won't use it)
  2. OCRopus (I can't figure out how to use this thing)
  3. Tesseract (which seems like it was great in 1995, but I'm not sure if there's something more accurate. Plus it doesn't do PDFs natively and I'd have to convert to TIFF. That raises its own problem, as I don't have a licensed copy of Acrobat, so I don't know how I'd convert 10,000 files to TIFF. Plus I don't want 10,000 30 page documents turned into 30,000 individual TIFF images).
  4. wowocr
  5. pdftextstream (that was from 2009)
  6. ABBYY FineReader (apparently it's $$$, but I will spend $600 to get this done if this thing is significantly better, i.e. has more accurate OCR).

Also I am a n00b to programming so if it's going to take like weeks to learn how to do something, I would rather pay the $$$. Thx for input/experiences.

BTW, I'm running Linux Mint 11 64 bit and/or Windows 7 64 bit.

Here are the other threads:

Batch OCRing PDFs that haven't already been OCR'd

Open source OCR

PDF Text Extraction Approach Using OCR

https://superuser.com/questions/107678/batch-ocr-for-many-pdf-files-not-already-ocred

Aquat33nfan asked May 17 '11 04:05



1 Answer

Just to put some of your misconceptions straight...

"I don't have a licensed copy of acrobat so I don't know how I'd convert 10,000 files to tiff."

You can convert PDFs to TIFF with the help of Ghostscript, which is Free (as in liberty) and free (as in beer). It's your choice whether you do it on Linux Mint or on Windows 7. The command line for Linux is:

gs \
 -o input.tif \
 -sDEVICE=tiffg4 \
  input.pdf

"i don't want 10,000 30 page documents turned into 30,000 individual tiff images"

You can easily have "multipage" TIFFs. The command above already creates such a TIFF, of the G4 (fax TIFF) flavor. Should you want single-page TIFFs instead, you can modify the command:

gs \
 -o input_page_%03d.tif \
 -sDEVICE=tiffg4 \
  input.pdf

The %03d part of the output filename will automatically translate into a series of 001, 002, 003 etc.
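The %03d placeholder is a printf-style format (zero-padded to three digits), so the shell's own printf can preview the names. A small sketch, where page_name is a hypothetical helper used only for illustration:

```shell
# Preview the per-page filenames that a %03d pattern expands to.
# page_name is a hypothetical helper, not part of Ghostscript.
page_name() {
    printf 'input_page_%03d.tif\n' "$1"
}

for n in 1 2 3; do
    page_name "$n"
done
```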

Caveats:

  1. The default resolution for the tiffg4 output device is 204x196 dpi. You probably want a better value. To get 720 dpi you should add -r720x720 to the command line.
  2. Also, if your Ghostscript installation uses letter as its default media size, you may want to change it. You can use -gXxY to set width x height in device pixels (not points), so the right value depends on the resolution. At 720 dpi, an A4 page in landscape (842x595 points) is 8420x5950 pixels, so to get ISO A4 output page dimensions in landscape you can add a -g8420x5950 parameter.
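The -g numbers follow from the page size in PostScript points (1/72 inch) scaled by the resolution. A minimal sketch of that arithmetic, assuming A4 rounded to 595x842 points; geometry_flag is a hypothetical helper name:

```shell
# Compute the -g value (device pixels) from a page size in points and a dpi.
# geometry_flag is a hypothetical helper; 1 point = 1/72 inch.
geometry_flag() {
    width_pt=$1
    height_pt=$2
    dpi=$3
    # pixels = points * dpi / 72
    echo "-g$(( width_pt * dpi / 72 ))x$(( height_pt * dpi / 72 ))"
}

# A4 portrait (595x842 pt) at 720 dpi
geometry_flag 595 842 720
```

Swapping the first two arguments gives the landscape value.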

So the full command which controls these two parameters, to produce 720 dpi output on A4 in portrait orientation, would read:

gs \
 -o input.tif \
 -sDEVICE=tiffg4 \
 -r720x720 \
 -g5950x8420 \
  input.pdf
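
For 10,000 files, the single-file command can be wrapped in a loop. The following is only a sketch under some assumptions: gs is on the PATH, every input ends in .pdf, and convert_all and its arguments are hypothetical names chosen here. It leaves out -g so each document keeps its own page size.

```shell
# Batch sketch: convert every PDF in a directory to a multipage G4 TIFF.
# Assumptions (not from the original answer): gs is on PATH, the inputs end
# in ".pdf", and convert_all is a hypothetical helper name.
convert_all() {
    src_dir=$1
    out_dir=$2
    dpi=${3:-720}                      # defaults to the 720 dpi used above

    mkdir -p "$out_dir" || return 1
    for pdf in "$src_dir"/*.pdf; do
        [ -e "$pdf" ] || continue      # the glob matched nothing
        name=$(basename "$pdf" .pdf)
        # -o names the output file and implies -dBATCH -dNOPAUSE
        gs -q -o "$out_dir/$name.tif" \
           -sDEVICE=tiffg4 \
           -r"${dpi}x${dpi}" \
           "$pdf" \
            || echo "conversion failed: $pdf" >&2
    done
}
```

It could then be called as, for example, convert_all ./pdfs ./tiffs, which would write one multipage TIFF per PDF into ./tiffs and report any file that Ghostscript could not convert.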

Kurt Pfeifle answered Sep 23 '22 13:09