
Python script to remove blank pages using pyPDF

I am trying to write a couple of Python scripts using pyPdf to split each PDF page into six separate pages, order them correctly (the source is usually printed front and back, so every other page needs its subpages ordered differently), and remove the resulting blank pages at the end of the output document.

I wrote the following script to cut the PDF pages up and reorder them. It cuts each page into two columns and each column into three pages. I am not very experienced with Python, so please excuse anything I'm not doing correctly.

#!/usr/bin/env python
import copy, sys
from pyPdf import PdfFileWriter, PdfFileReader
input = PdfFileReader(sys.stdin)
output = PdfFileWriter()

for i in range(0,input.getNumPages(),2):
    p = input.getPage(i)
    q = copy.copy(p)
    r = copy.copy(p)
    s = copy.copy(p)
    t = copy.copy(p)
    u = copy.copy(p)
    (x, y) = p.mediaBox.lowerLeft
    (w, h) = p.mediaBox.upperRight

    p.mediaBox.lowerLeft = (x, 2 * h / 3)
    p.mediaBox.upperRight = (w / 2, h)

    q.mediaBox.lowerLeft = (w / 2, 2 * h / 3)
    q.mediaBox.upperRight = (w, h)

    r.mediaBox.lowerLeft = (x, h / 3)
    r.mediaBox.upperRight = (w / 2, 2 * h / 3)

    s.mediaBox.lowerLeft = (w / 2, h / 3)
    s.mediaBox.upperRight = (w, 2 * h / 3)

    t.mediaBox.lowerLeft = (x, y)
    t.mediaBox.upperRight = (w / 2, h / 3)

    u.mediaBox.lowerLeft = (w / 2, y)
    u.mediaBox.upperRight = (w, h / 3)

    a = input.getPage(i+1)
    b = copy.copy(a)
    c = copy.copy(a)
    d = copy.copy(a)
    e = copy.copy(a)
    f = copy.copy(a)
    (x, y) = a.mediaBox.lowerLeft
    (w, h) = a.mediaBox.upperRight

    a.mediaBox.lowerLeft = (x, 2 * h / 3)
    a.mediaBox.upperRight = (w / 2, h)

    b.mediaBox.lowerLeft = (w / 2, 2 * h / 3)
    b.mediaBox.upperRight = (w, h)

    c.mediaBox.lowerLeft = (x, h / 3)
    c.mediaBox.upperRight = (w / 2, 2 * h / 3)

    d.mediaBox.lowerLeft = (w / 2, h / 3)
    d.mediaBox.upperRight = (w, 2 * h / 3)

    e.mediaBox.lowerLeft = (x, y)
    e.mediaBox.upperRight = (w / 2, h / 3)

    f.mediaBox.lowerLeft = (w / 2, y)
    f.mediaBox.upperRight = (w, h / 3)

    output.addPage(p)
    output.addPage(b)
    output.addPage(q)
    output.addPage(a)
    output.addPage(r)
    output.addPage(d)
    output.addPage(s)
    output.addPage(c)
    output.addPage(t)
    output.addPage(f)
    output.addPage(u)
    output.addPage(e)

output.write(sys.stdout)
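
The crop arithmetic in the script above (two columns, three rows) can be factored into one small helper. This is only a sketch: `six_boxes` is a hypothetical name, not part of pyPdf, and it takes plain numbers rather than page objects. It measures the midpoint and row height from the lower-left corner, which gives the same values as the script whenever that corner is (0, 0).

```python
def six_boxes(x, y, w, h):
    """Return six (lower_left, upper_right) pairs: two columns by three rows.

    (x, y) is the page's lower-left corner and (w, h) its upper-right
    corner, as in the script above. Boxes are listed top row first,
    left column before right column (the p, q, r, s, t, u order).
    """
    mid_x = x + (w - x) / 2.0   # vertical split between the two columns
    row_h = (h - y) / 3.0       # height of each of the three rows
    boxes = []
    for row in (2, 1, 0):       # top, middle, bottom
        bottom = y + row * row_h
        top = y + (row + 1) * row_h
        boxes.append(((x, bottom), (mid_x, top)))   # left column
        boxes.append(((mid_x, bottom), (w, top)))   # right column
    return boxes
```

Each returned pair could then be assigned to a copied page's mediaBox.lowerLeft and mediaBox.upperRight, which would replace the twelve hand-written assignments per source page.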

Then I use the following script to remove the blank pages.

#!/usr/bin/env python
import sys
from pyPdf import PdfFileWriter, PdfFileReader
input = PdfFileReader(sys.stdin)
output = PdfFileWriter()

# Keep only pages whose extracted text is longer than 10 characters.
for i in range(input.getNumPages()):
    p = input.getPage(i)
    text = p.extractText()
    if len(text) > 10:
        output.addPage(p)

output.write(sys.stdout)

The problem seems to be that while the pages are visibly cropped down, the text draw commands are still there. None of these pages are scanned, so if they are blank, they are really blank. Does anyone have any thoughts on something I could do differently or possibly an entirely different approach to take to remove the blank pages? I would really appreciate any help.

asked Jun 10 '11 17:06 by rpeck1682

1 Answer

PdfFileReader has a method, getPage(pageNumber), that returns a PageObject. PageObject in turn has a method, getContents(), which returns None if the page has no content stream, that is, if it is blank. So, with your PdfFileReader object, loop over range(getNumPages()), test each page with if getPage(i).getContents():, and collect the hits into a list of page numbers to output.
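
A minimal sketch of that loop follows. The function names `non_blank_page_numbers` and `copy_non_blank` are hypothetical helpers, not part of pyPdf. They rely only on the getNumPages, getPage, getContents and addPage methods already mentioned in this thread.

```python
def non_blank_page_numbers(reader):
    """Return the indices of pages whose getContents() is truthy.

    `reader` is assumed to behave like pyPdf's PdfFileReader.
    """
    keep = []
    for i in range(reader.getNumPages()):
        # getContents() returns None when the page has no /Contents entry.
        if reader.getPage(i).getContents():
            keep.append(i)
    return keep


def copy_non_blank(reader, writer):
    """Add every non-blank page of `reader` to `writer` and return `writer`.

    `writer` is assumed to behave like pyPdf's PdfFileWriter.
    """
    for i in non_blank_page_numbers(reader):
        writer.addPage(reader.getPage(i))
    return writer


# Possible usage with pyPdf, mirroring the scripts in the question:
#   import sys
#   from pyPdf import PdfFileWriter, PdfFileReader
#   copy_non_blank(PdfFileReader(sys.stdin), PdfFileWriter()).write(sys.stdout)
```

This check looks only at whether a content stream exists. As the question observes, pages that were merely cropped via mediaBox keep the original draw commands, so they will probably still count as non-blank here.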

answered Sep 28 '22 09:09 by Richard Careaga