
Remove header and footer from pdftotext module in Python

I am using the pdftotext Python package to extract text from a PDF, but I need to remove the headers and footers from the output so that only the body content remains.

I can see two possible ways to solve this:

  1. Applying regular expressions to the extracted text file
  2. Applying some filter while extracting the text from the PDF

The problem is that the headers and footers are not identical from page to page.

For example, the first one or two lines of the header contain the contractor's address, which is consistent, but the third line contains the section and topic that the page covers. Similarly, the footer consists of the project number (not a fixed value either), the subsection number, and some design words followed by a date; these should be consistent within a project but differ between projects. Note also that the PDF for each project can run to 500+ pages, although it will probably be split by section.
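Since the header and footer content varies but their position does not, the first approach could perhaps work on line positions rather than patterns. The following is only a rough sketch: the function name `strip_header_footer` and the default counts (three header lines, one footer line) are assumptions for illustration, not something the pdftotext package provides.

```python
def strip_header_footer(page_text, header_lines=3, footer_lines=1):
    """Drop the first `header_lines` and last `footer_lines` non-blank lines of one page."""
    lines = page_text.splitlines()
    # Positions of lines that contain visible text; blank lines are not counted.
    non_blank = [i for i, line in enumerate(lines) if line.strip()]
    if len(non_blank) <= header_lines + footer_lines:
        return ""  # nothing left on this page besides header and footer
    first_body = non_blank[header_lines]
    last_body = non_blank[len(non_blank) - footer_lines - 1]
    return "\n".join(lines[first_body:last_body + 1])
```

Each page returned by `pdftotext.PDF` is a string, so this could be applied as `[strip_header_footer(page) for page in pdf]` before joining the pages. It breaks down if the number of header or footer lines changes between pages.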

Currently I'm using this code to extract information. Are there any parameters I don't know about which can be used to remove headers and footers?

import pdftotext

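def get_data(pdf_path):
    # Load the PDF; each element of `pdf` is the text of one page.
    with open(pdf_path, "rb") as f:
        pdf = pdftotext.PDF(f)

    print("Pages : ", len(pdf))

    # Write all pages to one text file, separated by blank lines.
    # The `with` blocks close both files, so no explicit close() is needed.
    with open('text-pdftotext.txt', 'w') as k:
        k.write("\n\n".join(pdf))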

get_data('specification_file.pdf')
Raghav Gupta asked Jun 23 '26 00:06


1 Answer

pdftotext is best used as designed, i.e. as a command-line tool run from any shell.

So to drop the page breaks along with the headers and footers, run the command exactly as it was designed to be run:

pdftotext -nopgbrk -margint <number> -marginb <number> filename

With xpdf 4.04, that will give you the body text without the top lines (header) and without the bottom lines (footer).
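If you still want to drive this from Python, one option is to call the command through `subprocess`. This is a minimal sketch, assuming the `pdftotext` found on your PATH is the Xpdf build; the function names and the default margin values (72 and 54, which I believe Xpdf reads as points) are placeholders to adjust for your documents.

```python
import subprocess


def build_xpdf_command(pdf_path, txt_path, top_margin, bottom_margin):
    """Build the argument list for Xpdf's pdftotext with top and bottom margins."""
    return [
        "pdftotext",
        "-nopgbrk",                      # no page break characters in the output
        "-margint", str(top_margin),     # discard text within this distance of the top edge
        "-marginb", str(bottom_margin),  # discard text within this distance of the bottom edge
        pdf_path,
        txt_path,
    ]


def extract_body_text(pdf_path, txt_path, top_margin=72, bottom_margin=54):
    """Run pdftotext; raises CalledProcessError if the tool exits with an error."""
    cmd = build_xpdf_command(pdf_path, txt_path, top_margin, bottom_margin)
    subprocess.run(cmd, check=True)
```

For example, `extract_body_text('specification_file.pdf', 'body.txt')` would write the cropped text to `body.txt`. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.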

If you are using the Poppler variant, you need to set a region of interest with these options:

  -x <int>             : x-coordinate of the crop area top left corner
  -y <int>             : y-coordinate of the crop area top left corner
  -W <int>             : width of crop area in pixels (default is 0)
  -H <int>             : height of crop area in pixels (default is 0)
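As a rough illustration, the crop area can be derived from the page size and the heights of the header and footer bands. The sketch below assumes US Letter pages (612 x 792) at Poppler's default resolution of 72 DPI, where one pixel corresponds to one point; the function name and the band heights are hypothetical and need to be measured on your own PDFs.

```python
import subprocess


def build_poppler_command(pdf_path, txt_path, top, bottom,
                          page_width=612, page_height=792):
    """Build a Poppler pdftotext command that crops away header and footer bands."""
    crop_height = page_height - top - bottom
    if crop_height <= 0:
        raise ValueError("header and footer bands cover the whole page")
    return [
        "pdftotext",
        "-nopgbrk",
        "-x", "0",               # crop area starts at the left edge
        "-y", str(top),          # and just below the header band
        "-W", str(page_width),   # full page width
        "-H", str(crop_height),  # stops just above the footer band
        pdf_path,
        txt_path,
    ]


def extract_cropped_text(pdf_path, txt_path, top=72, bottom=54):
    """Run pdftotext; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(build_poppler_command(pdf_path, txt_path, top, bottom), check=True)
```

The same crop is applied to every page, so this only works when the header and footer occupy the same bands throughout the document.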
