
Edit existing PDF's pages in Python

Tags: python, pdf

I have a PDF file from which I removed some pages, and I want to fix the page numbers in the resulting PDF. Is there a way or a library to update the page numbers without converting the PDF to another format? I have tried converting the PDF to text, XML, and JSON and fixing the page numbers there, but when I convert the result back to PDF it looks messy (the style of the original PDF is lost). The problems I have are:

  1. Removing the old page numbers.
  2. Adding new page numbers.

I am using Python on Ubuntu. I have tried ReportLab, PyX, and pyfpdf.

Sina asked Jun 25 '19 18:06




1 Answer

I had a similar problem and, honestly, could not fully solve it; in the end I fetched the corresponding HTML and processed it with BeautifulSoup instead. However, I got closer with pdftotext.exe from poppler (link at the bottom) than with any Python module: it read the PDF file just fine, except that it could not distinguish between text columns. Since pdftotext is not a Python module, I used os.system to run the command string on the .exe file.
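
On the column problem: pdftotext has a -layout option that tries to preserve the physical layout of the page, which may help keep columns apart. A minimal sketch of building such a command as an argument list, assuming pdftotext is on the PATH or its location is passed in; build_pdftotext_command is a hypothetical helper name, not part of poppler:

```python
import shutil


def build_pdftotext_command(input_pdf, output_txt, exe_path=None, keep_layout=True):
    """Return the argument list for a pdftotext call (hypothetical helper)."""
    # Assumption: fall back to whatever "pdftotext" resolves to on the PATH
    exe = exe_path or shutil.which("pdftotext") or "pdftotext"
    command = [exe]
    if keep_layout:
        # -layout asks pdftotext to keep the original physical layout as best it can
        command.append("-layout")
    command += [input_pdf, output_txt]
    return command
```

The resulting list can be handed to subprocess.run(command, check=True). Passing a list instead of one string avoids quoting problems when a path contains spaces.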

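import os

def call_poppler(input_pdf, input_path):
    """
    Call pdftotext (from poppler) to generate a txt file.
    input_path is the path to the pdftotext executable.
    """
    # Plain concatenation: paths that contain spaces will break this command
    command_row = input_path + " " + input_pdf
    os.system(command_row)
    # With no output name given, pdftotext writes <name>.txt next to the pdf
    txt_name = os.path.splitext(input_pdf)[0] + ".txt"
    processed_paper = open_txt(txt_name)
    return processed_paper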

def open_txt(input_txt_name):
    """
    Read the txt file produced by poppler and return
    its lines as a list of stripped strings
    """
    output_file = []
    # Text mode with an explicit encoding; the file is closed automatically
    with open(input_txt_name, encoding="utf-8") as opened_file:
        for row in opened_file:
            output_file.append(row.strip())
    return output_file

This returns the lines of the generated ".txt" file as a Python list, which you can then process as you want and rewrite as a PDF with some module, such as pypdf. Sorry if this is not the answer you wanted, but PDF files are rather hard to handle in Python since they are not text-based files. Do not forget to pass the path of the executable. You can get poppler here: https://poppler.freedesktop.org/
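
For the page numbers specifically, the text that pdftotext writes separates pages with a form feed character ("\f"), so the two steps from the question can at least be done on the extracted text. A rough sketch, assuming each old page number sits alone on the last non-empty line of its page; renumber_pages is a hypothetical helper, and it needs the file contents as one string because the strip() call in open_txt discards the form feeds:

```python
def renumber_pages(text, first_number=1):
    """Replace a trailing page-number line on each page (hypothetical helper)."""
    pages = text.split("\f")
    # pdftotext ends every page with a form feed, so the last piece is usually empty
    if pages and pages[-1].strip() == "":
        pages.pop()
    renumbered = []
    for offset, page in enumerate(pages):
        lines = page.rstrip().split("\n")
        # Assumption: the old page number is a line of digits at the end of the page
        if lines and lines[-1].strip().isdigit():
            lines.pop()
        # Drop blank lines left behind above the old number
        while lines and lines[-1].strip() == "":
            lines.pop()
        # Add a blank separator line and the new page number
        lines.append("")
        lines.append(str(first_number + offset))
        renumbered.append("\n".join(lines))
    return "\f".join(renumbered)
```

This only fixes the extracted text. It does not touch the original PDF, so the layout still has to be rebuilt when the text is written back out as a new PDF.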

Preto answered Oct 26 '22 10:10