Extract specific pages of PDF and save it with Python

Tags: python, pdf, extract, pypdf2

I have some source PDFs and tried to write code that extracts certain pages from them and saves those pages as new PDF files. I have a list that looks like this:

information = [(filename1,startpage1,endpage1), (filename2, startpage2, endpage2), ...,(filename19,startpage19,endpage19)].

This is my code.

from PyPDF2 import PdfFileReader, PdfFileWriter

reader = PdfFileReader("example.pdf")

for page in range(reader.getNumPages() - 1):
    writer = PdfFileWriter()
    start = information[page][1]
    end = information[page][2]
    while start < end:
        writer.addPage(reader.getPage(start))
        start += 1
        output_filename = "{}_{}_page_{}.pdf".format(
            information[page][0], information[page][1], information[page][2]
        )
    with open(output_filename, "wb") as out:
        writer.write(out)

But the output is weird: some files have nothing inside and some contain just one page. How can I correct this?

asked Jul 28 '18 03:07

2 Answers

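I have fixed the issue. The main problem was the comparison: the loop has to use start <= end so that the end page is included. Since getPage counts pages from zero, the page number also has to be shifted by one (start-1), and the outer loop should run over the entries of information rather than over the pages of the PDF.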

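import os

import PyPDF2

# `information` is the list of (filename, startpage, endpage) tuples from the
# question; page numbers are 1-based and inclusive.
pdfReader = PyPDF2.PdfFileReader("example.pdf")
savepath = "output"  # assumed output directory; the original left it undefined
os.makedirs(savepath, exist_ok=True)

for filename, first, last in information:
    pdf_writer = PyPDF2.PdfFileWriter()
    # getPage() is zero-based: range(first - 1, last) covers pages first..last
    for page_number in range(first - 1, last):
        pdf_writer.addPage(pdfReader.getPage(page_number))
    output_filename = os.path.join(
        savepath, '{}_{}_page_{}.pdf'.format(filename, first, last)
    )
    with open(output_filename, 'wb') as out:
        pdf_writer.write(out)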
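
To make the off-by-one explicit: the page numbers in information are 1-based and inclusive, while the reader indexes pages from zero. Below is a minimal sketch of that conversion. The helper names page_indices and extract_pages are hypothetical, and the second helper assumes the newer pypdf package (PdfReader, PdfWriter, add_page), because PdfFileReader and PdfFileWriter were removed in PyPDF2 3.0.

```python
def page_indices(start, end, num_pages):
    """Convert a 1-based, inclusive page range to zero-based indices."""
    if not 1 <= start <= end <= num_pages:
        raise ValueError(
            "invalid page range {}-{} for a {}-page document".format(
                start, end, num_pages
            )
        )
    return list(range(start - 1, end))


def extract_pages(source, target, start, end):
    """Copy pages start..end (1-based, inclusive) of source into target."""
    # Assumption: pypdf is installed. It is imported here so that
    # page_indices() can be used and tested without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(source)
    writer = PdfWriter()
    for index in page_indices(start, end, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(target, "wb") as out:
        writer.write(out)


# The question's loop (`while start < end` with getPage(start)) visits
# indices start..end-1, which are pages start+1..end: one page too few,
# and nothing at all for a single-page range.
print(page_indices(3, 5, 10))  # [2, 3, 4]
print(page_indices(4, 4, 10))  # [3]
```

With such a helper, a single-page entry like (filename, 4, 4) produces a one-page file instead of an empty one.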

answered Oct 21 '22 04:10

SSS

Here is the full code. I modified SSS' answer to be portable, flexible, and able to process multiple source PDFs concurrently. I couldn't test the performance difference between ThreadPoolExecutor and ProcessPoolExecutor, but I assumed that the extraction is bound by reading and writing the PDFs (I/O) rather than by getPage and addPage, so I used threads.

import concurrent.futures
from multiprocessing import freeze_support
from pathlib import Path
from PyPDF2 import PdfFileReader, PdfFileWriter


def pdf_extract(pdf, segments):
    """
    pdf: str | Path
    segments: list of (start, end) tuples or {'start': int, 'end': int} dicts;
        page numbers are 1-based and inclusive
    """
    with open(pdf, 'rb') as read_stream:
        pdf_reader = PdfFileReader(read_stream)
        for segment in segments:
            pdf_writer = PdfFileWriter()
            # support {'start': 3, 'end': 3} or (start, end)
            try:
                start_page, end_page = segment['start'], segment['end']
            except TypeError:
                start_page, end_page = segment
            for page_num in range(start_page - 1, end_page):
                pdf_writer.addPage(pdf_reader.getPage(page_num))
            p = Path(pdf)
            # with_stem (Python 3.9+) keeps the directory and the .pdf suffix
            output = p.with_stem(f'{p.stem}_pages_{start_page}-{end_page}')
            with open(output, 'wb') as out:
                pdf_writer.write(out)


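def _pdf_extract(pair):
    # executor.map passes a single argument, so unpack the (pdf, segments) pair
    return pdf_extract(*pair)


def pdf_extract_batch(pdfs, workers=20):
    """
    pdfs = {pdf_name: [(1, 1), ...], ...}
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        # Consume the results so exceptions raised in workers are re-raised here
        list(executor.map(_pdf_extract, pdfs.items()))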


if __name__ == '__main__':
    freeze_support()
    pdf_name = r'C:\Users\maste\Documents\long.pdf'
    segments = [(1, 1), {'start': 3, 'end': 5}]
    # Single
    pdf_extract(pdf_name, segments)
    # Batched (Concurrent)
    pdfs = {pdf_name: segments}
    # pdf_extract_batch(pdfs)
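
One thing to watch with executor.map: an exception raised inside a worker is only re-raised when the corresponding result is read, so a batch whose results are never consumed can fail silently. A minimal sketch of collecting per-file outcomes with submit and as_completed follows. The names run_batch and fake_extract are hypothetical, and fake_extract merely stands in for pdf_extract so the example runs without any PDF files.

```python
import concurrent.futures


def fake_extract(pdf, segments):
    """Stand-in for pdf_extract: returns how many segments it 'wrote'."""
    if not segments:
        raise ValueError("no segments given for {}".format(pdf))
    return len(segments)


def run_batch(pdfs, worker, workers=4):
    """Run worker(pdf, segments) per entry; return (results, errors) dicts."""
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        # Map each future back to the PDF it belongs to
        futures = {
            executor.submit(worker, pdf, segments): pdf
            for pdf, segments in pdfs.items()
        }
        for future in concurrent.futures.as_completed(futures):
            pdf = futures[future]
            try:
                results[pdf] = future.result()
            except Exception as exc:  # keep going; record the failure per file
                errors[pdf] = exc
    return results, errors


results, errors = run_batch(
    {"a.pdf": [(1, 1), (3, 5)], "b.pdf": []}, fake_extract
)
print(results)  # {'a.pdf': 2}
print(sorted(errors))  # ['b.pdf']
```

Passing pdf_extract as the worker would give the same per-file error report for real extractions, at the cost of slightly more code than executor.map.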

answered Oct 21 '22 05:10

Elijah
