I have a PDF file from which I removed some pages. I want to fix the page numbers in the new PDF. Is there any way or library to update the page numbers without converting the PDF to another format? I have tried converting the PDF to text, XML, and JSON and then fixing the page numbers. However, when I convert it back to PDF, it looks messy (it does not keep the style of the original PDF). The problems I have are:
I am using Python on Ubuntu. I have tried ReportLab, PyX, and pyfpdf.
Click on the “Edit PDF” tool in the right pane. Use Acrobat editing tools: Add new text, edit text, or update fonts using selections from the Format list. Add, replace, move, or resize images on the page using selections from the Objects list.
It has an extensible PDF parser that can be used for purposes other than text analysis. PyPDF2 is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files.
Next, we get an object of the PageObject class from the PyPDF2 module. The PDF reader object has a getPage() method, which takes a page number (starting from index 0) as its argument and returns the page object. The page object has an extractText() method that extracts the text from the PDF page. Finally, we close the PDF file object.
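The steps above can be sketched roughly as follows. Note that getPage() and extractText() are the older PyPDF2 names; PyPDF2 3.x and its successor pypdf use reader.pages[i] and extract_text() instead, so this sketch assumes the pypdf package is installed. The function names extract_page_text and read_pdf_page are made up for illustration and are not part of the library.

```python
def extract_page_text(reader, page_index):
    """Return the text of one page (0-based index) from a reader-like object."""
    pages = reader.pages
    if not 0 <= page_index < len(pages):
        raise IndexError(
            f"page index {page_index} out of range (0..{len(pages) - 1})"
        )
    # Fall back to an empty string if the page yields no text
    # (for example, a scanned, image-only page).
    return pages[page_index].extract_text() or ""


def read_pdf_page(pdf_path, page_index=0):
    """Open pdf_path with pypdf and return the text of the requested page."""
    # Assumption: pypdf is installed (pip install pypdf). It is imported
    # lazily so extract_page_text() can be used without it.
    from pypdf import PdfReader

    # The with-block closes the file object, matching the last step above.
    with open(pdf_path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        return extract_page_text(reader, page_index)
```

Calling read_pdf_page("example.pdf", 0) would return the text of the first page, where "example.pdf" is a placeholder file name. Text extraction of this kind only reads the page; it does not change the page numbers printed in the PDF.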
I had a similar problem. Honestly, I could not fully solve it; instead, I fetched the corresponding HTML and processed it with BeautifulSoup. However, I did get closer with a tool outside the Python modules: I used pdftotext.exe from poppler (link at the bottom) to read the PDF file, and it worked just fine, except that it could not distinguish between text columns. As this is not a Python module, I used os.system to run the command string on the .exe file.
    import os


    def call_poppler(input_pdf, input_path):
        """
        Call poppler to generate a txt file
        """
        # input_path is the path to the pdftotext executable
        command_row = input_path + " " + input_pdf
        os.system(command_row)
        # pdftotext writes its output next to the PDF, with a .txt extension
        txt_name = input_pdf[0:-4] + ".txt"
        processed_paper = open_txt(txt_name)
        return processed_paper


    def open_txt(input_txt_name):
        """
        Open and generate a python object out of the
        txt attained with poppler
        """
        # Read raw bytes and decode each line explicitly as UTF-8
        with open(input_txt_name, "rb") as txt_file:
            opened_file = txt_file.readlines()
        output_file = []
        for row in opened_file:
            row = row.decode("utf-8").strip()
            output_file.append(row)
        return output_file
This returns the contents of the generated ".txt" file as a list of lines, which you can then process as you want and rewrite as a PDF with some module, such as pypdf. Sorry if this was not the answer you wanted, but PDF files are rather hard to handle in Python since they are not text-based files. Do not forget to give the path of the executable. You can get poppler here: https://poppler.freedesktop.org/
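As a variation on the approach above, here is a minimal sketch that uses subprocess.run with an argument list instead of os.system, so that paths containing spaces are passed through intact. It assumes a pdftotext executable is on the PATH (or that its full path is passed as executable). The function names build_pdftotext_command and pdf_to_lines are hypothetical. The -layout flag is a pdftotext option that tries to preserve the physical layout of the page; it may or may not help with the column problem mentioned above.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, executable="pdftotext", keep_layout=True):
    """Build the pdftotext argument list; the .txt output sits next to the PDF."""
    pdf_path = Path(pdf_path)
    txt_path = pdf_path.with_suffix(".txt")
    command = [executable]
    if keep_layout:
        # Ask pdftotext to keep the original physical layout of the text.
        command.append("-layout")
    command += [str(pdf_path), str(txt_path)]
    return command, txt_path


def pdf_to_lines(pdf_path, executable="pdftotext"):
    """Run pdftotext on pdf_path and return the stripped lines of the result."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"cannot find executable: {executable}")
    command, txt_path = build_pdftotext_command(pdf_path, executable)
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(command, check=True)
    with open(txt_path, encoding="utf-8") as txt_file:
        return [line.strip() for line in txt_file]
```

For example, pdf_to_lines("paper.pdf") would write paper.txt next to the PDF and return its lines, where "paper.pdf" is a placeholder name.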