Converting PDF to images automatically

So the state I'm in released a bunch of data in PDF form. To make matters worse, most (all?) of the PDFs appear to be letters typed in Office, printed or faxed, and then scanned (our government at its best, eh?). At first I thought I was crazy, but then I started seeing numerous PDFs that are 'tilted', as if someone didn't place them on the scanner properly. So I figured the next best thing to getting the actual text out of them would be to turn each page into an image.

Obviously this needs to be automated, and I'd prefer to stick with Python if possible. If Ruby or Perl has an implementation that's just too awesome to pass up, I can go that route. I've tried pyPDF for text extraction, but that obviously didn't do me much good. I've tried swftools, but the images I'm getting from that are just shy of completely unusable; the fonts seem to get ruined in the conversion. I also don't really care about the output image format, as long as the images are relatively lightweight and readable.

246

asked Jan 04 '10 20:01

f4nt

1 Answer

If the PDFs are truly scanned images, then you shouldn't convert the PDF to an image; you should extract the image from the PDF. Most likely, all of the data in the PDF is essentially one giant image, wrapped in PDF verbosity to make it readable in Acrobat.

You should try the simple expedient of finding the image in the PDF and copying the bytes out: see "Extracting JPGs from PDFs". The code there is dead simple, and there are probably dozens of reasons it won't work on your PDF files. But if it does, you'll have a quick and painless way to get the image data out of the PDF files.
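To give a rough idea of what "copying the bytes out" can look like, here is a minimal sketch. It is not the code from the linked post. It assumes each scanned page is stored as an unencrypted JPEG (DCTDecode) stream, so the data between `stream` and `endstream` begins with the JPEG start marker `FF D8`. The function names `extract_jpegs` and `save_jpegs` are made up for illustration, and the sketch ignores the stream's `/Length` entry, so it can be fooled by unusual files.

```python
from pathlib import Path

SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def extract_jpegs(pdf_bytes):
    """Return the byte strings of PDF streams that look like JPEG images.

    Naive scan: assumes raw JPEG data sits directly between the
    'stream' and 'endstream' keywords (an assumption, see above).
    """
    images = []
    pos = 0
    while True:
        stream_at = pdf_bytes.find(b"stream", pos)
        if stream_at < 0:
            break
        end_at = pdf_bytes.find(b"endstream", stream_at)
        if end_at < 0:
            break
        # Drop the 'stream' keyword and the line break that follows it.
        body = pdf_bytes[stream_at + len(b"stream"):end_at].lstrip(b"\r\n")
        if body.startswith(SOI):
            last = body.rfind(EOI)
            if last >= 0:
                # Keep everything up to and including the end marker.
                images.append(body[:last + len(EOI)])
        # Resume after 'endstream' so its 'stream' suffix is not re-matched.
        pos = end_at + len(b"endstream")
    return images


def save_jpegs(pdf_path, out_dir):
    """Write every JPEG found in pdf_path to out_dir; return the new paths."""
    pdf_path = Path(pdf_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i, data in enumerate(extract_jpegs(pdf_path.read_bytes())):
        target = out_dir / f"{pdf_path.stem}-{i:03d}.jpg"
        target.write_bytes(data)
        written.append(target)
    return written
```

Calling `save_jpegs("letter.pdf", "pages")` (both names are placeholders) would write one `.jpg` per matching stream. If it returns an empty list, the scans are probably stored in another encoding (for example CCITT fax or Flate-compressed data), and this shortcut won't help.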

180

answered Sep 28 '22 17:09

Ned Batchelder

Tags: python, image, pdf

f4nt

People also ask

1 Answers

Ned Batchelder

Recent Activity

Donate For Us