
Select only first page of PDF pypdf2

I am trying to extract only the first page from each of multiple PDF files and combine those pages into one file. I receive 150 PDF files a day; the first page of each is the invoice, which I need, and the following 3 to 12 pages are just backup, which I do not need. So the input is 150 PDF files of varying length, and the output I want is one PDF file containing only the first page of each of the 150 files.

What I seem to have done instead is merge all the pages EXCEPT the first page (which is the only one I need).

# Get all PDF documents in current directory
import os

pdf_files = []
for filename in os.listdir("."):
    if filename.endswith(".pdf"):
        pdf_files.append(filename)
pdf_files.sort(key=str.lower)

# Take first page from each PDF
from PyPDF2 import PdfFileWriter, PdfFileReader

for filename in pdf_files:
    reader = PdfFileReader(filename)

writer = PdfFileWriter()
for pageNum in range(1, reader.numPages):
    page = reader.getPage(pageNum)
    writer.addPage(page)

with open("CombinedFirstPages.pdf", "wb") as fp:
    writer.write(fp)
mike horan asked Nov 05 '17 19:11




1 Answer

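Your first loop only keeps the reader for the last file, and range(1, reader.numPages) starts at the second page, because page indices are zero-based. Create the writer once, then add getPage(0) from each file inside the same loop. Try this: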

# Get all PDF documents in the target folder and its subfolders
import os

your_target_folder = "."
pdf_files = []
for dirpath, _, filenames in os.walk(your_target_folder):
    for items in filenames:
        file_full_path = os.path.abspath(os.path.join(dirpath, items))
        if file_full_path.lower().endswith(".pdf"):
            pdf_files.append(file_full_path)
pdf_files.sort(key=str.lower)

# Take first page from each PDF
from PyPDF2 import PdfFileReader, PdfFileWriter

writer = PdfFileWriter()

for file_path in pdf_files:
    reader = PdfFileReader(file_path)
    page = reader.getPage(0)
    writer.addPage(page)

with open("CombinedFirstPages.pdf", "wb") as output:
    writer.write(output)
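
For anyone reading this with a newer install: as far as I know, the PdfFileReader / PdfFileWriter / getPage / addPage names were removed in PyPDF2 3.0, and the project now continues under the name pypdf. Below is a minimal sketch of the same idea, assuming the pypdf package is installed. The helper names collect_pdfs and combine_first_pages are my own, not part of any library. It also skips the output file itself, so a second run in the same folder does not pull in the previous result.

```python
import os


def collect_pdfs(folder, output_name):
    """Return sorted paths of the .pdf files directly inside folder.

    The file named output_name is skipped so that an earlier combined
    result is not merged into the new one.
    """
    paths = []
    for name in os.listdir(folder):
        lowered = name.lower()
        if lowered.endswith(".pdf") and lowered != output_name.lower():
            paths.append(os.path.join(folder, name))
    paths.sort(key=str.lower)  # case-insensitive order, as in the question
    return paths


def combine_first_pages(pdf_paths, output_path):
    """Write the first page of every PDF in pdf_paths to output_path."""
    # Assumption: the pypdf package is installed. It is imported here so
    # that collect_pdfs can still be used without it.
    from pypdf import PdfReader, PdfWriter

    writer = PdfWriter()  # one writer shared by all input files
    for path in pdf_paths:
        reader = PdfReader(path)
        if len(reader.pages) > 0:  # skip documents with no pages
            writer.add_page(reader.pages[0])  # index 0 is the first page
    with open(output_path, "wb") as fp:
        writer.write(fp)
```

To use it on the current directory, call combine_first_pages(collect_pdfs(".", "CombinedFirstPages.pdf"), "CombinedFirstPages.pdf"). Unlike the os.walk version above, this sketch does not descend into subfolders.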
DRPK answered Oct 13 '22 01:10