Check whether a PDF-File is valid with Python

I receive a file via an HTTP upload and need to be sure it's a PDF file. The programming language is Python, but that should not matter.

I thought of the following solutions:

  1. Check if the first bytes of the string are "%PDF". This is not a good check, but it prevents the user from uploading other files accidentally.

  2. Try libmagic (the "file" command in the shell uses it). This does exactly the same check as in (1).

  3. Take a library and try to read the page count out of the file. If the library is able to read a page count, it should be a valid PDF. Problem: I don't know a library for Python which can do this.
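Option (1) can be sketched in a few lines. This is only an illustration: the function name `looks_like_pdf` and the optional trailer check are assumptions added here, not part of the question, and passing this check does not prove the file is a valid PDF.

```python
PDF_MAGIC = b"%PDF-"   # signature at the start of a PDF file
EOF_MARKER = b"%%EOF"  # end-of-file marker expected near the end


def looks_like_pdf(data: bytes, check_trailer: bool = False) -> bool:
    """Cheap signature check on the raw upload bytes."""
    if not data.startswith(PDF_MAGIC):
        return False
    if check_trailer:
        # Assumption: a complete PDF carries %%EOF within its last 1024 bytes.
        return EOF_MARKER in data[-1024:]
    return True
```

The trailer check is off by default because it needs the whole upload in memory, while the header check only needs the first few bytes.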

Does anybody have a suggestion for a library or another trick?

theomega asked Feb 17 '09 22:02


3 Answers

Update 2020

It looks like pdfminer.six is a maintained project (the others, including the one below, seem dead).

ReportLab is another maintained option (I had mistakenly marked it as dead).

Original answer

Since neither pyPdf nor ReportLab appeared to be available anymore, the solution I found (as of 2015) is to use PyPDF2 and catch exceptions (and possibly inspect the result of getDocumentInfo()):

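import PyPDF2

# PyPDF2 1.x API, as in the original answer; later releases renamed
# PdfFileReader to PdfReader and moved the exception to PyPDF2.errors.

# Create a file that is definitely not a PDF.
with open("testfile.txt", "w") as f:
    f.write("hello world!")

try:
    # Parsing fails with PdfReadError if the file is not a readable PDF.
    with open("testfile.txt", "rb") as f:
        PyPDF2.PdfFileReader(f)
except PyPDF2.utils.PdfReadError:
    print("invalid PDF file")
else:
    pass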
WoJ answered Sep 30 '22 14:09


In a project of mine I need to check the MIME type of uploaded files. I simply use the file command like this:

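from subprocess import Popen, PIPE
# `file` is the uploaded file object; only its first 1024 bytes are piped to the command.
proc = Popen(["/usr/bin/file", "-b", "--mime", "-"], stdout=PIPE, stdin=PIPE)
filetype = proc.communicate(file.read(1024))[0].strip()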

You might want to move the actual command into a configuration file, since the command-line options vary among operating systems (e.g. macOS).

If you just need to know whether it's a PDF or not and do not need to process it anyway, I think the file command is a faster solution than a library. Doing it by hand is of course also possible, but the file command gives you more flexibility if you want to check for different types.
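The file command returns a raw string such as "application/pdf; charset=binary", so it still has to be turned into a yes/no answer. A minimal Python 3 sketch of that step follows; the helper names (`mime_from_file_output`, `detect_mime`, `is_pdf_upload`) are hypothetical, and it assumes a `file` binary on the PATH that accepts `-b --mime -`.

```python
import subprocess


def mime_from_file_output(output: bytes) -> str:
    """Extract the bare MIME type from `file -b --mime` output."""
    # Typical output: b"application/pdf; charset=binary\n"
    text = output.decode("ascii", "replace")
    return text.split(";", 1)[0].strip().lower()


def detect_mime(head: bytes, file_cmd: str = "file") -> str:
    """Pipe the first bytes of an upload to the file command."""
    # An argument list is used so that no shell is involved.
    result = subprocess.run(
        [file_cmd, "-b", "--mime", "-"],
        input=head,
        stdout=subprocess.PIPE,
        check=True,
    )
    return mime_from_file_output(result.stdout)


def is_pdf_upload(head: bytes) -> bool:
    """True if the file command classifies the bytes as a PDF."""
    return detect_mime(head) == "application/pdf"
```

Keeping the parsing separate from the subprocess call makes it easy to check for other MIME types later without touching the command invocation.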

MrTopf answered Sep 30 '22 13:09


The two most commonly used PDF libraries for Python are:

  • pyPdf
  • ReportLab

Both are pure Python, so they should be easy to install as well as being cross-platform.

With pyPdf it would probably be as simple as doing:

from pyPdf import PdfFileReader
# Raises an exception if the upload cannot be parsed as a PDF.
doc = PdfFileReader(open("upload.pdf", "rb"))

This should be enough, but doc will now have getDocumentInfo() and getNumPages() methods (also exposed as the documentInfo and numPages properties) if you want to do further checking.

As Carl answered, pdftotext is also a good solution, and would probably be faster on very large documents (especially ones with many cross-references). However, it might be a little slower on small PDFs due to the system overhead of forking a new process.

Van Gale answered Sep 30 '22 12:09