I am trying to extract only the first page from each of multiple PDF files and combine those pages into one file. I receive 150 PDF files a day; the first page of each is the invoice, which I need, and the following 3 to 12 pages are just backup, which I do not need. So the input is 150 PDF files of varying size, and the output I want is one PDF file containing only the first page of each of the 150 files.
What my code seems to do instead is merge all the pages EXCEPT the first page (which is the only one I need).

# Get all PDF documents in current directory
import os

pdf_files = []
for filename in os.listdir("."):
    if filename.endswith(".pdf"):
        pdf_files.append(filename)

pdf_files.sort(key=str.lower)

# Take first page from each PDF
from PyPDF2 import PdfFileWriter, PdfFileReader

for filename in pdf_files:
    reader = PdfFileReader(filename)
    writer = PdfFileWriter()
    for pageNum in range(1, reader.numPages):
        page = reader.getPage(pageNum)
        writer.addPage(page)

with open("CombinedFirstPages.pdf", "wb") as fp:
    writer.write(fp)

There was a brief series of releases of a package called PyPDF3, and then the project was renamed to PyPDF4. All of these projects do pretty much the same thing; the biggest difference between pyPdf and the later packages (PyPDF2 onward) is that the later ones added Python 3 support.
A PDF reader object has a getPage() method, which takes a page number (starting from index 0) as its argument and returns the page object. A page object has an extractText() method that extracts the text from that page. Finally, close the PDF file object once you are done with it.
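Because page numbers are zero-based, a loop that starts counting at 1 never visits the first page. The sketch below illustrates this with a plain list standing in for a document; the `pages` list and its contents are made up for the illustration and are not PyPDF2 objects.

```python
# Hypothetical stand-in for a 4-page document; real page objects
# would come from a PDF reader rather than a list of strings.
pages = ["invoice", "backup 1", "backup 2", "backup 3"]

# A loop that starts at index 1 visits every page except the first.
skipping_first = [pages[i] for i in range(1, len(pages))]

# Index 0 is the first page.
first_only = pages[0]

print(skipping_first)  # ['backup 1', 'backup 2', 'backup 3']
print(first_only)      # invoice
```

The same indexing applies to getPage(): getPage(0) returns the first page, and range(1, numPages) covers everything after it.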
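In your code, range(1, reader.numPages) starts at page index 1, so it skips index 0, which is the first page. Create a single writer before the loop and add only page 0 from each file. Try this: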

# Get all PDF documents in current directory
import os

your_target_folder = "."
pdf_files = []
for dirpath, _, filenames in os.walk(your_target_folder):
    for items in filenames:
        file_full_path = os.path.abspath(os.path.join(dirpath, items))
        if file_full_path.lower().endswith(".pdf"):
            pdf_files.append(file_full_path)

pdf_files.sort(key=str.lower)

# Take first page from each PDF
from PyPDF2 import PdfFileReader, PdfFileWriter

writer = PdfFileWriter()
for file_path in pdf_files:
    reader = PdfFileReader(file_path)
    page = reader.getPage(0)
    writer.addPage(page)

with open("CombinedFirstPages.pdf", "wb") as output:
    writer.write(output)

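
Newer releases of the library (PyPDF2 3.x and its successor, pypdf) replaced PdfFileReader, PdfFileWriter, getPage() and addPage() with PdfReader, PdfWriter, reader.pages and add_page(). If the code above fails on a recent install, the following is a rough sketch of the same idea with the newer names. It assumes the pypdf package is installed; the helper names collect_pdfs and combine_first_pages are made up for this example, and skipping files that have no pages is my own addition rather than something the original code does.

```python
from pathlib import Path


def collect_pdfs(folder):
    """Return every .pdf file under folder (recursively), sorted case-insensitively."""
    paths = [
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    ]
    return sorted(paths, key=lambda p: str(p).lower())


def combine_first_pages(pdf_paths, output_path):
    """Write the first page of each PDF in pdf_paths to output_path.

    Returns the list of paths that were skipped because they had no pages.
    """
    # Imported here (assumption: pypdf is installed) so that collect_pdfs()
    # can still be used on its own without the library.
    from pypdf import PdfReader, PdfWriter

    writer = PdfWriter()  # one writer shared by all input files
    skipped = []
    for path in pdf_paths:
        reader = PdfReader(str(path))
        if len(reader.pages) == 0:
            skipped.append(path)
            continue
        writer.add_page(reader.pages[0])  # index 0 is the first page

    with open(output_path, "wb") as fp:
        writer.write(fp)
    return skipped
```

A call such as combine_first_pages(collect_pdfs("."), "CombinedFirstPages.pdf") would then produce the combined file. Note that if the output file is written into the same folder, a second run will pick it up as an input, so it may be worth filtering it out of the list or writing it somewhere else.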