
How to extract text from a two-column PDF with Python? [closed]

Tags:

python

nlp


I have PDFs that are in two-column format. Is there a way to read each PDF in column order (left column first, then right) without cropping each PDF individually?

asked Nov 22 '25 12:11 by alfonso


1 Answer

I found an alternative method: crop each page into two parts, left and right, extract the text from each part, then join the left and right text for every page. You can try this:

# https://github.com/jsvine/pdfplumber

import pdfplumber

# Crop boundaries as fractions of the page size. pdfplumber bounding boxes
# are (x0, top, x1, bottom), measured from the top-left corner of the page.
x0 = 0    # Left edge of the left column.
x1 = 0.5  # Boundary between the columns (assumed to be the page midline).
y0 = 0    # Top of the page.
y1 = 1    # Bottom of the page.

all_content = []
with pdfplumber.open("file_path") as pdf:
    for i, page in enumerate(pdf.pages):
        width = float(page.width)
        height = float(page.height)

        # Crop the left column and extract its text.
        left_bbox = (x0 * width, y0 * height, x1 * width, y1 * height)
        left_text = page.crop(left_bbox).extract_text() or ""

        # Crop the right column and extract its text.
        right_bbox = (x1 * width, y0 * height, width, y1 * height)
        right_text = page.crop(right_bbox).extract_text() or ""

        # Left column first, then right, to follow the reading order.
        page_context = "\n".join([left_text, right_text])
        all_content.append(page_context)
        if i < 2:  # Print the first two merged pages as a sanity check.
            print(page_context)
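
If some documents have more than two columns, the same cropping idea can be wrapped in a small helper. This is only a sketch and not part of the tested code above: column_bboxes and extract_page_text are hypothetical names, and it assumes equal-width columns that span the full page height on a page whose coordinates start at (0, 0). The page argument only needs width, height and crop(), which is what a pdfplumber page provides.

```python
def column_bboxes(width, height, n_columns=2):
    """Return one (x0, top, x1, bottom) box per column, left to right.

    Assumption: columns are equally wide and cover the full page height.
    """
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    col_width = float(width) / n_columns
    return [
        (i * col_width, 0.0, (i + 1) * col_width, float(height))
        for i in range(n_columns)
    ]


def extract_page_text(page, n_columns=2):
    """Extract text column by column and join the columns in reading order.

    `page` is expected to behave like a pdfplumber page: it exposes
    `width`, `height` and `crop(bbox)`, and the cropped page exposes
    `extract_text()`.
    """
    boxes = column_bboxes(page.width, page.height, n_columns)
    # Treat a column with no extractable text as an empty string.
    parts = [page.crop(box).extract_text() or "" for box in boxes]
    return "\n".join(parts)
```

Inside the loop above, the two manual crops could then be replaced by page_context = extract_page_text(page). If the gap between the columns is not at the midline, equal-width boxes will cut through text, so the boxes would have to be adjusted for that layout.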

answered Nov 25 '25 02:11 by fitz


