
PyPDF2 split pdf by pages

Tags: python, pypdf2

I want to split a PDF file using PyPDF2.

Every example I find online is too complicated, doesn't work, or fails with the error "AttributeError: 'PdfFileWriter' object has no attribute 'stream'".

Can someone help? I need to separate one PDF with 3 pages into three different files.

This is what I have so far:

import PyPDF2

pdfFileObj = open(r"D:\BPO\act.pdf", 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pdfWriter = PyPDF2.PdfFileWriter()
pdfWriter.addPage(pdfReader.getPage(0))

But I don't know what to do next :(

EDIT#1

I tried splitting in a loop and ran into a problem: PdfFileWriter produces 3 files, the first with one page, the second with two, and the third with three. Where is the mistake in the following code?

act_sub_pages_name = ['p01.pdf', 'p02.pdf', 'p03.pdf']
with open(r"D:\BPO\act.pdf", 'rb') as act_mls:
    reader = PdfFileReader(act_mls)
    writer = PdfFileWriter()
    if reader.numPages == 3:
        counter = 0
        for x in range(3):
            path = '\\'.join(['D:\\BPO\\act sub pages', act_sub_pages_name[counter]])
            counter += 1
            writer.addPage(reader.getPage(x))
            with open(path, 'wb') as outfile: writer.write(outfile)

Sorry for my bad English.

EDIT#2

My solution, based on Paul Rooney's answer:

import os
from PyPDF2 import PdfFileReader, PdfFileWriter

act_pdf_file = 'D:\\BPO\\act.pdf'
act_sub_pages_name = ['p01.pdf', 'p02.pdf', 'p03.pdf']

def pdf_splitter(index, src_file):
    with open(src_file, 'rb') as act_mls:
        reader = PdfFileReader(act_mls)
        writer = PdfFileWriter()
        writer.addPage(reader.getPage(index))
        out_file = os.path.join('D:\\BPO\\act sub pages', act_sub_pages_name[index])
        with open(out_file, 'wb') as out_pdf: writer.write(out_pdf)

for x in range(3): pdf_splitter(x, act_pdf_file)

With a function everything works properly, but it is a little more complicated.

Acamori asked Jul 17 '17 12:07




2 Answers

You can use the write method of the PdfFileWriter to write out to the file.

from PyPDF2 import PdfFileReader, PdfFileWriter

with open("input.pdf", 'rb') as infile:

    reader = PdfFileReader(infile)
    writer = PdfFileWriter()
    writer.addPage(reader.getPage(0))

    with open('output.pdf', 'wb') as outfile:
        writer.write(outfile)

You may want to loop over the pages of the input file, creating a new writer object for each page and adding that single page to it. Then write each one out to an incrementing filename, or use some other scheme for deciding the output filename.
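The loop described above might look like the following sketch. It sticks to the PdfFileReader/PdfFileWriter API used in this thread, which assumes a PyPDF2 release older than 3.0 (later releases renamed these classes to PdfReader and PdfWriter). The function names page_filename and split_pdf and the pNN.pdf naming scheme are illustrative choices, not part of PyPDF2.

```python
import os


def page_filename(index):
    """Return an output name for a zero-based page index, e.g. 0 -> 'p01.pdf'."""
    return 'p{:02d}.pdf'.format(index + 1)


def split_pdf(src_path, out_dir):
    """Write every page of src_path to its own PDF inside out_dir."""
    # Imported here so page_filename stays usable without PyPDF2 installed.
    # Assumption: PyPDF2 < 3.0, matching the API used in this thread.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    written = []
    with open(src_path, 'rb') as infile:
        reader = PdfFileReader(infile)
        for index in range(reader.getNumPages()):
            # A fresh writer per page, so pages do not pile up across files.
            writer = PdfFileWriter()
            writer.addPage(reader.getPage(index))
            out_path = os.path.join(out_dir, page_filename(index))
            # Write while the input file is still open: page data is read lazily.
            with open(out_path, 'wb') as outfile:
                writer.write(outfile)
            written.append(out_path)
    return written
```

Calling split_pdf('input.pdf', 'pages') would then write pages/p01.pdf, pages/p02.pdf and so on, assuming the pages directory already exists.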

Paul Rooney answered Sep 20 '22 22:09


I've used a tool called xpdf for just this sort of task and it works really really well. You can download it here.

It's a command-line utility that you can call from Python. Make sure it's on your PATH so you can call it from the command line.
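One way to check the PATH requirement from Python is shutil.which, which returns the full path of an executable or None if it cannot be found. The helper name require_tool is hypothetical; this is a minimal sketch, not something xpdf provides.

```python
import shutil


def require_tool(name='pdftotext'):
    """Return the full path of an executable on PATH, or raise if it is missing."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError('{} was not found on PATH'.format(name))
    return path
```

Calling require_tool() before launching the subprocess gives a clearer error than a failed shell command would.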

Here's how you can interface with it from Python, using subprocess:

import subprocess

# The trailing "-" tells pdftotext to write the text to stdout instead of a .txt file.
text, _ = subprocess.Popen('pdftotext -fixed 0 -clip D:\\BPO\\act.pdf -',
                           shell=True,
                           stdout=subprocess.PIPE).communicate()

pages = text.decode('latin-1').split('\f')

Pages are separated by form feed characters ('\f'), so you'll get a list with one string per page.
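To illustrate the splitting step without running pdftotext, here is a small sketch on hand-written bytes. It assumes the output ends with a form feed after the last page, in which case split('\f') leaves a trailing empty string that is worth dropping. split_pages is a hypothetical helper name.

```python
def split_pages(raw, encoding='latin-1'):
    """Decode pdftotext-style output and return one string per page."""
    pages = raw.decode(encoding).split('\f')
    # A form feed after the final page leaves an empty last element; drop it.
    if pages and pages[-1].strip() == '':
        pages.pop()
    return pages


# Hand-written stand-in for real pdftotext output (\x0c is the form feed byte).
sample = b'first page\x0csecond page\x0cthird page\x0c'
print(split_pages(sample))  # ['first page', 'second page', 'third page']
```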

cs95 answered Sep 19 '22 22:09