Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing Index page in a PDF text book with Python

I have to extract text from PDF pages as it is with the indentation into a CSV file.

Index page from PDF text book:

I should split the text into class and subclass type hierarchy along with the page numbers. For example in the image, Application server is the class and Apache Tomcat is the subclass in the page number 275

This is the expected output of the CSV:

I have used Tika parser to parse the PDF, but the indentation is not maintained properly (not unique) in the parsed content for splitting the text into class and subclasses.

This is how the parsed text looks like:

Can anyone suggest me the right approach for this requirement?

like image 293
Aryan Avatar asked Mar 03 '18 18:03

Aryan


People also ask

How do I extract a specific word from a PDF in Python?

Step 1: Import all libraries. Step 2: Convert PDF file to txt format and read data. Step 3: Use “. findall()” function of regular expressions to extract keywords.


2 Answers

despite I have no knowledge of pdf extraction, but it is possible to reconstruct the hierarchy from "the parsed text", because the "subclass" part always starts and ends with an extra newline character.

with following test text:

app architect . 50
app logic . 357
app server . 275

tomcat . 275
websphere . 275
jboss . 164

architect

acceptance . 303
development path . 304

architecting . 48
architectural activity . 25, 320

following code:

import csv
import sys
import re


def gen():
    is_subclass = False
    p_class = None

    with open('test.data') as f:
        s = f.read()
    lines = re.findall(r'[^\n]+\n+', s)
    for line in lines:
        if ' . ' in line:
            class_name, page_no = map(lambda s: s.strip(), line.split('.'))
        else:
            class_name, page_no = line.strip(), ''

        if line.endswith('\n\n'):
            if not is_subclass:
                p_class = class_name
                is_subclass = True
                continue

        if is_subclass:
            yield (p_class, class_name, page_no)
        else:
            yield (class_name, '', page_no)

        if line.endswith('\n\n'):
            is_subclass = False


writer = csv.writer(sys.stdout)
writer.writerows(gen())

yields:

app architect,,50
app logic,,357
app server,tomcat,275
app server,websphere,275
app server,jboss,164
architect,acceptance,303
architect,development path,304
architecting,,48
architectural activity,,"25, 320"

hope this helps.

like image 187
georgexsh Avatar answered Oct 26 '22 03:10

georgexsh


So here is the solution:

  1. Install Fitz(PyMuPDF) https://github.com/rk700/PyMuPDF
  2. Run the code below in the same folder than your PDF file with Python 2.7
  3. Compare the result

Code:

import fitz
import json
import re
import csv

class MyClass:
    def __init__(self, text, main_class):
        my_arr = re.split("[.]*", text)
        if main_class != my_arr[0].strip():
            main_class = my_arr[0].strip()
        self.main_class = main_class
        self.sub_class = my_arr[0].strip()
        try:
            self.page = my_arr[1].strip()
        except:
            self.page = ""

def add_line(text, is_recording, main_class):
    if(is_recording):
        obj = MyClass(text, main_class)
        if obj.sub_class == "Glossary":
            return False, main_class
        table.append(obj)
        return True, obj.main_class
    elif text == "Contents":
        return True, main_class
    return False, main_class

last_text = ""
is_recording = False
main_class = ""
table = []

doc = fitz.open("TCS_1.pdf")
page = doc.getPageText(2, output="json")
blocks = json.loads(page)["blocks"]
for block in blocks:
    if "lines" in block:
        for line in block["lines"]:
            line_text = ""
            for span in block["lines"]:
                line_text += span["spans"][0]["text"].encode("utf-8")
            if last_text != line_text:
                is_recording, main_class = add_line(line_text, is_recording, main_class)
                last_text = line_text

writer = csv.writer(open("output.csv", 'w'), delimiter=',', lineterminator='\n')
for my_class in table:
    writer.writerow([my_class.main_class, my_class.sub_class, my_class.page])
    # print(my_class.main_class, my_class.sub_class, my_class.page)

Here is the CSV output of the file provided: enter image description here

like image 31
Jonathan Gagne Avatar answered Oct 26 '22 03:10

Jonathan Gagne