
Parsing Index page in a PDF text book with Python

I need to extract the text from PDF pages into a CSV file, preserving the indentation as it appears on the page.

Index page from PDF text book:

I need to split the text into a class/subclass hierarchy along with the page numbers. For example, in the image, Application server is the class and Apache Tomcat is the subclass, on page 275.

This is the expected output of the CSV:

I used the Tika parser to parse the PDF, but the indentation is not preserved consistently in the parsed content, so it cannot be relied on to split the text into classes and subclasses.

This is what the parsed text looks like:

Can anyone suggest the right approach for this requirement?

Aryan asked Mar 03 '18 18:03



2 Answers

I have no knowledge of PDF extraction, but it is possible to reconstruct the hierarchy from the parsed text, because each subclass group always starts and ends with an extra newline character.

With the following test text:

app architect . 50
app logic . 357
app server . 275

tomcat . 275
websphere . 275
jboss . 164

architect

acceptance . 303
development path . 304

architecting . 48
architectural activity . 25, 320

The following code:

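import csv
import re
import sys

def gen():
    is_subclass = False   # True while reading a group of subclass lines
    p_class = None        # parent class of the current subclass group

    with open('test.data') as f:
        s = f.read()
    # Keep each line's trailing newlines so a following blank line can be detected.
    lines = re.findall(r'[^\n]+\n*', s)
    for line in lines:
        if ' . ' in line:
            class_name, page_no = (part.strip() for part in line.split(' . ', 1))
        else:
            class_name, page_no = line.strip(), ''

        # A top-level line followed by a blank line is the parent of the next group.
        if line.endswith('\n\n'):
            if not is_subclass:
                p_class = class_name
                is_subclass = True
                continue

        if is_subclass:
            yield (p_class, class_name, page_no)
        else:
            yield (class_name, '', page_no)

        # A blank line after a subclass line closes the group.
        if line.endswith('\n\n'):
            is_subclass = False

writer = csv.writer(sys.stdout)
for row in gen():
    writer.writerow(row)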


Output:

app architect,,50
app logic,,357
app server,tomcat,275
app server,websphere,275
app server,jboss,164
architect,acceptance,303
architect,development path,304
architecting,,48
architectural activity,,"25, 320"
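The same blank-line heuristic can also be sketched as a function that works on a string instead of a file, which may be easier to test. This is only an illustration of the idea: `parse_index` and `SAMPLE` are hypothetical names, and it assumes the parsed text alternates between top-level groups and subclass groups separated by blank lines, with the last line of a top-level group acting as the parent of the group that follows.

```python
import re

def parse_index(text):
    """Return (class, subclass, page) rows from blank-line separated index text."""
    # Groups are runs of lines separated by one or more blank lines.
    groups = [g.splitlines() for g in re.split(r"\n\s*\n", text.strip())]
    rows = []
    parent = None
    for i, lines in enumerate(groups):
        is_sub_group = i % 2 == 1          # assumed order: top level, subclasses, top level, ...
        has_next = i + 1 < len(groups)
        for j, line in enumerate(lines):
            name, _, page = line.partition(" . ")
            name, page = name.strip(), page.strip()
            if is_sub_group:
                rows.append((parent, name, page))
            elif has_next and j == len(lines) - 1:
                # Last top-level line before a blank line: parent of the next
                # group. As in the answer's code, it is not emitted as a row.
                parent = name
            else:
                rows.append((name, "", page))
    return rows

SAMPLE = (
    "app architect . 50\napp logic . 357\napp server . 275\n\n"
    "tomcat . 275\nwebsphere . 275\njboss . 164\n\n"
    "architect\n\n"
    "acceptance . 303\ndevelopment path . 304\n\n"
    "architecting . 48\narchitectural activity . 25, 320\n"
)

for row in parse_index(SAMPLE):
    print(row)
```

The rows can be passed to `csv.writer(...).writerows(...)` as before. If the real parsed text does not alternate this strictly, the heuristic will mislabel groups, so it is worth checking a few pages by hand.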

Hope this helps.

georgexsh answered Oct 26 '22 03:10


So here is the solution:

  1. Install Fitz (PyMuPDF): https://github.com/rk700/PyMuPDF
  2. Run the code below with Python 2.7, in the same folder as your PDF file
  3. Compare the result


import fitz
import json
import re
import csv

class MyClass:
    def __init__(self, text, main_class):
        # Split "Title .... 12" on the run of dots between title and page number.
        # "[.]+" is used instead of "[.]*" so the pattern cannot match an empty string.
        my_arr = re.split("[.]+", text)
        if main_class != my_arr[0].strip():
            main_class = my_arr[0].strip()
        self.main_class = main_class
        self.sub_class = my_arr[0].strip()
        if len(my_arr) > 1:
            self.page = my_arr[1].strip()
        else:
            self.page = ""

def add_line(text, is_recording, main_class):
    # Record every entry after the "Contents" heading, up to the "Glossary" entry.
    if is_recording:
        obj = MyClass(text, main_class)
        if obj.sub_class == "Glossary":
            return False, main_class
        table.append(obj)
        return True, obj.main_class
    elif text == "Contents":
        return True, main_class
    return False, main_class

last_text = ""
is_recording = False
main_class = ""
table = []

doc = fitz.open("TCS_1.pdf")
# Page index 2 is the third page (0-based); getPageText is the old PyMuPDF method name.
page = doc.getPageText(2, output="json")
blocks = json.loads(page)["blocks"]
for block in blocks:
    if "lines" in block:
        for line in block["lines"]:
            # Join the text of the first span of every line in this block.
            line_text = ""
            for span in block["lines"]:
                line_text += span["spans"][0]["text"].encode("utf-8")  # Python 2 str
            # Skip text identical to the previous one.
            if last_text != line_text:
                is_recording, main_class = add_line(line_text, is_recording, main_class)
                last_text = line_text

with open("output.csv", 'w') as f:
    writer = csv.writer(f, delimiter=',', lineterminator='\n')
    for my_class in table:
        writer.writerow([my_class.main_class, my_class.sub_class, my_class.page])
        # print(my_class.main_class, my_class.sub_class, my_class.page)

Here is the CSV output of the file provided:
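As a side note on the line-splitting step in `MyClass`, here is a minimal sketch of separating a contents entry from its page number that also runs on Python 3. It assumes entries use dot leaders of two or more dots, so a single dot inside a title is left alone. `split_index_line` and the sample titles other than "Application server" are hypothetical.

```python
import re

def split_index_line(text):
    """Split 'Title ..... 275' into ('Title', '275'); page is '' if absent."""
    # Assumption: the title and page number are separated by a run of at
    # least two dots, optionally surrounded by whitespace.
    parts = re.split(r"\s*\.{2,}\s*", text.strip(), maxsplit=1)
    title = parts[0]
    page = parts[1] if len(parts) > 1 else ""
    return title, page

print(split_index_line("Application server ..... 275"))
print(split_index_line("Glossary"))
```

Requiring at least two dots is a design choice for this sketch. If the PDF renders leaders as spaced dots (". . . ."), the pattern would need to be adjusted accordingly.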

Jonathan Gagne answered Oct 26 '22 03:10
