What are best parameters to run ImageMagick to convert low quality pdf to images (for OCR)

I have several low-quality PDFs. I would like to run OCR on them (more precisely, Ocropus) to extract their text. To do so, I first use ImageMagick, a command-line tool, to convert these PDFs into JPG or PNG images.

However, ImageMagick produces very low-quality images, and Ocropus hardly recognizes anything. I would like to learn the best parameters for handling low-quality PDFs, so that the OCR gets images of the best possible quality.

I have found this page, but I do not know where to start.

asked Aug 31 '10 20:08

Skarab

2 Answers

You can learn about the detailed settings of ImageMagick's "delegates" (external programs IM uses, such as Ghostscript) by typing

convert -list delegate

(On my system that's a list of 32 different commands.) Now to see which commands are used to convert to PNG, use this:

convert -list delegate | findstr /i png

Ok, this was for Windows. You didn't say which OS you use. [*] If you are on Linux, try this:

convert -list delegate | grep -i png

You'll discover that IM produces PNG only from PS or EPS input. So how does IM get (E)PS from your PDF? Easy:

convert -list delegate | findstr /i PDF
convert -list delegate | grep -i PDF

Ah! It uses Ghostscript to make a PDF => PS conversion, then uses Ghostscript again to make a PS => PNG conversion. Works, but isn't the most efficient way if you know that Ghostscript can do PDF => PNG in one go. And faster. And in much better quality.

About IM's handling of PDF-to-image conversion via the Ghostscript delegate, you should know two things first and foremost:

1. By default, if you don't give an extra parameter, Ghostscript will output images at a resolution of 72 dpi. That's why Karl's answer suggested adding -density 600, which tells Ghostscript to use a resolution of 600 dpi for its image output.
2. IM's detour of calling Ghostscript twice, first for PDF => PS and then for PS => PNG, is a real blunder, because you never gain and hardly ever keep quality in the first step, but very often lose some. Reasons:
   - PDF can handle transparencies, which PostScript cannot.
   - PDF can embed TrueType fonts, which PostScript cannot, and so on.
   (Conversion in the direction PS => PDF is not that critical.)

That's why I'd suggest you convert your PDFs in one go to PNG (or JPEG) using Ghostscript directly. And use the most recent version 8.71 (soon to be released: 9.01) of Ghostscript! Here are example commands:

gswin32c.exe ^
  -sDEVICE=pngalpha ^
  -o output/page_%03d.png ^
  -r600 ^
  d:/path/to/your/input.pdf

(This is the commandline for Windows. On Linux, use gs instead of gswin32c.exe, and \ instead of ^.) This command expects to find an output subdirectory where it will store a separate file for each PDF page. To produce JPEGs of good quality, try

gs \
  -sDEVICE=jpeg \
  -o output/page_%03d.jpeg \
  -r600 \
  -dJPEGQ=95 \
  /path/to/your/input.pdf

(Linux command version). This direct conversion avoids the intermediate PostScript format, which may have lost the TrueType font and transparency information that was in the original PDF file.
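Since the question mentions several PDFs, the direct Ghostscript call above could be wrapped in a small loop. This is only a sketch under a few assumptions: the function name render_pdf, the GS override variable and the output/<name>/ layout are made up for illustration, while the Ghostscript options are the same ones used in the commands above.

```shell
#!/bin/sh
# render_pdf FILE.pdf: write one 600 dpi PNG per page into output/<name>/
# GS is a hypothetical override for the Ghostscript binary (default: gs).
render_pdf() {
    pdf=$1
    outdir="output/$(basename "$pdf" .pdf)"
    mkdir -p "$outdir"
    # Same options as the single-file command: device, resolution, output pattern.
    "${GS:-gs}" -sDEVICE=pngalpha -r600 -o "$outdir/page_%03d.png" "$pdf"
}

# Convert every PDF in the current directory; skip quietly if there are none.
for f in ./*.pdf; do
    [ -f "$f" ] || continue
    render_pdf "$f"
done
```

Running it with GS=echo prints the Ghostscript arguments instead of executing them, which is a cheap way to check the paths before rendering at 600 dpi.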

[*] D'oh! I missed your "linux" tag at first...

answered Oct 10 '22 02:10

Kurt Pfeifle

-density 600 or so should give you what you need.
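As a rough sketch of how that flag is typically used (input.pdf and the output directory are placeholder names, not taken from the question): -density has to come before the input file so that it applies when the pages are rasterized. The helper px is hypothetical and only illustrates the arithmetic. PDF page sizes are measured in points of 1/72 inch, so pixels = points * dpi / 72.

```shell
#!/bin/sh
# px POINTS DPI: pixels along one side of a page rendered at DPI
px() { echo $(( $1 * $2 / 72 )); }

px 612 72     # US Letter width (612 pt) at the 72 dpi default: prints 612
px 612 600    # the same width at 600 dpi: prints 5100

# Render every page at 600 dpi. The guard skips this step when ImageMagick
# or the placeholder input file is missing.
if command -v convert >/dev/null 2>&1 && [ -f input.pdf ]; then
    mkdir -p output
    convert -density 600 input.pdf output/page_%03d.png
fi
```

On ImageMagick 7 the binary is called magick rather than convert.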

answered Oct 10 '22 04:10

Karl Bielefeldt

Tags: linux, image-processing, pdf, imagemagick, ghostscript
