
Fast PDF splitter library

Tags: python, c, pdf, pypdf

pyPdf is a great library for splitting and merging PDF files. I'm using it to split PDF documents into one-page documents. pyPdf is pure Python and spends quite a lot of time in the _sweepIndirectReferences() method of the PdfFileWriter object when saving the extracted page. I need something with better performance. I tried multi-threading, but since most of the time is spent in Python code, the GIL prevented any speed gain (it actually ran slower).
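For readers unfamiliar with the workflow, the per-page split described above might look roughly like the sketch below. It assumes the classic pyPdf API (PdfFileReader, PdfFileWriter, getNumPages, getPage, addPage, write); the function names split_into_pages and page_output_path and the output naming scheme are hypothetical, not taken from the question.

```python
import os


def page_output_path(src_path, page_index, out_dir):
    """Return the output path for a 0-based page index (hypothetical naming scheme)."""
    stem = os.path.splitext(os.path.basename(src_path))[0]
    return os.path.join(out_dir, "%s_%04d.pdf" % (stem, page_index + 1))


def split_into_pages(src_path, out_dir):
    """Write each page of src_path to its own PDF in out_dir; return the paths."""
    # Imported lazily so the path helper above works without pyPdf installed.
    from pyPdf import PdfFileReader, PdfFileWriter

    written = []
    with open(src_path, "rb") as src:
        reader = PdfFileReader(src)
        for index in range(reader.getNumPages()):
            # One fresh writer per page, so each output holds a single page.
            writer = PdfFileWriter()
            writer.addPage(reader.getPage(index))
            out_path = page_output_path(src_path, index, out_dir)
            with open(out_path, "wb") as out:
                # Saving is the step where _sweepIndirectReferences() runs.
                writer.write(out)
            written.append(out_path)
    return written
```

Because every page gets its own writer, the sweep is repeated once per output file, which is consistent with the cost the question describes.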

Is there a library written in C that provides the same functionality? Or does anyone have a good idea for improving performance (other than spawning a new process for each PDF file that I want to split)?

Thank you in advance.

Follow-up: links to a couple of command-line solutions that can sometimes prove faster than pyPdf:

  • http://multivalent.sourceforge.net/Tools/pdf/Split.html
  • http://www.linuxsolutions.fr/how-to-extract-pages-from-a-pdf/

I modified pyPdf's PdfFileWriter class to keep track of how much time has been spent in the _sweepIndirectReferences() method. If it has taken too long (right now I use the magic value of 3 seconds), I fall back to Ghostscript by calling it from Python.
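The time-budget fallback described above could be sketched along these lines. This is not the actual patch: it assumes the primary splitter is a callable that periodically invokes a deadline check (standing in for the instrumented _sweepIndirectReferences()), and the names TimeBudgetExceeded, make_deadline_check, ghostscript_page_command and split_page are hypothetical. The Ghostscript options used (-dFirstPage, -dLastPage, -sDEVICE=pdfwrite, -sOutputFile) are standard ones, and the gs binary is assumed to be on the PATH.

```python
import subprocess
import time


class TimeBudgetExceeded(Exception):
    """Raised when the primary splitter runs past its time budget."""


def make_deadline_check(budget_seconds):
    """Return a callable that raises once budget_seconds have elapsed."""
    deadline = time.monotonic() + budget_seconds

    def check():
        if time.monotonic() > deadline:
            raise TimeBudgetExceeded()

    return check


def ghostscript_page_command(src, dst, page_number):
    """Build a Ghostscript command that writes one page (1-based) of src to dst."""
    return [
        "gs", "-q", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-dFirstPage=%d" % page_number,
        "-dLastPage=%d" % page_number,
        "-sOutputFile=%s" % dst,
        src,
    ]


def split_page(src, dst, page_number, primary, budget_seconds=3.0,
               run=subprocess.check_call):
    """Try primary(src, dst, page_number, check); fall back to Ghostscript.

    Returns "primary" or "ghostscript" to show which path produced dst.
    """
    check = make_deadline_check(budget_seconds)
    try:
        primary(src, dst, page_number, check)
        return "primary"
    except TimeBudgetExceeded:
        # The primary path gave up; Ghostscript overwrites any partial dst.
        run(ghostscript_page_command(src, dst, page_number))
        return "ghostscript"
```

Passing the check function into the primary splitter keeps the timing logic outside the library code, and the run parameter makes it possible to swap in a stub when Ghostscript is not available.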

Thanks for all your answers. (codelogic's Xpdf reference is the one that made me look for a different approach.)

Nathan asked Feb 03 '09 17:02




2 Answers

mbtPdfAsm is a fast, open source command line tool for PDF processing.

Xpdf is also worth mentioning since it's GPL and written in C++. The source code is well modularized and allows for writing command line tools.

codelogic answered Oct 03 '22 17:10


Does it have to be Python? My pure-Perl library CAM::PDF is pretty fast at appending and deleting PDF document pages. It saves the sweeping for the very end, where possible.

Chris Dolan answered Oct 03 '22 19:10