Parsing Index page in a PDF text book with Python

Tags: python, natural-language-processing, named-entity-recognition, pdfminer, pdftotext

I need to extract text from PDF pages into a CSV file while preserving the indentation.

[Image: index page from the PDF textbook]

I need to split the text into a class/subclass hierarchy along with the page numbers. For example, in the image, "Application server" is the class and "Apache Tomcat" is a subclass on page 275.

This is the expected CSV output: [image]

I used the Tika parser to parse the PDF, but the indentation is not preserved consistently in the parsed content, so it cannot be used on its own to split the text into classes and subclasses.

This is what the parsed text looks like: [image]

Can anyone suggest the right approach for this requirement?


asked Mar 03 '18 18:03

Aryan

2 Answers

I have no experience with PDF extraction, but the hierarchy can be reconstructed from the parsed text, because each "subclass" block always starts and ends with an extra newline character.

With the following test text:

app architect . 50
app logic . 357
app server . 275

tomcat . 275
websphere . 275
jboss . 164

architect

acceptance . 303
development path . 304

architecting . 48
architectural activity . 25, 320

the following code:

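import csv
import re
import sys


def gen():
    is_subclass = False  # True while inside a block of subclass lines
    p_class = None       # parent class of the current subclass block

    with open('test.data') as f:
        s = f.read()
    # Keep each line together with the newlines that follow it, so that a
    # trailing blank line shows up as '\n\n'. Using '\n*' also keeps a final
    # line that has no newline at all.
    lines = re.findall(r'[^\n]+\n*', s)
    for line in lines:
        if ' . ' in line:
            # Split on the last ' . ' so dots inside the name are preserved.
            class_name, page_no = (part.strip() for part in line.rsplit(' . ', 1))
        else:
            class_name, page_no = line.strip(), ''

        if line.endswith('\n\n'):
            if not is_subclass:
                # A top-level line followed by a blank line opens a block.
                p_class = class_name
                is_subclass = True
                continue

        if is_subclass:
            yield (p_class, class_name, page_no)
        else:
            yield (class_name, '', page_no)

        if line.endswith('\n\n'):
            # A subclass line followed by a blank line closes the block.
            is_subclass = False


writer = csv.writer(sys.stdout)
writer.writerows(gen())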

yields:

app architect,,50
app logic,,357
app server,tomcat,275
app server,websphere,275
app server,jboss,164
architect,acceptance,303
architect,development path,304
architecting,,48
architectural activity,,"25, 320"
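Index lines in real books often use a run of dot leaders rather than a single dot, and an entry name may itself contain a dot. The following is only a minimal sketch of a more tolerant split, assuming that a page list consists of digits, commas, spaces and hyphens. The names `ENTRY` and `split_entry` are hypothetical and are not part of the code above.

```python
import re

# Hypothetical pattern: a name, one or more dots (optionally spaced),
# then a page list that starts with a digit and runs to the end of the line.
ENTRY = re.compile(r'^(?P<name>.*?)\s*(?:\.\s*)+(?P<pages>\d[\d,\s-]*)$')


def split_entry(line):
    """Return (name, pages); pages is '' when the line has no page list."""
    line = line.strip()
    match = ENTRY.match(line)
    if match:
        return match.group('name'), match.group('pages').strip()
    return line, ''


if __name__ == '__main__':
    for sample in ('app server . 275', 'architect', 'Node.js ..... 12, 14'):
        print(split_entry(sample))
```

A helper like this could stand in for the `' . '` check inside `gen()`. The blank-line grouping would stay the same.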

Hope this helps.


answered Oct 26 '22 03:10

georgexsh

So here is the solution:

1. Install Fitz (PyMuPDF): https://github.com/rk700/PyMuPDF
2. Run the code below with Python 2.7, in the same folder as your PDF file.
3. Compare the result.

Code:

import fitz
import json
import re
import csv

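class MyClass:
    def __init__(self, text, main_class):
        # Split on runs of dot leaders: "Entry ..... 12" -> ["Entry ", " 12"].
        # "[.]+" is used because "[.]*" can match the empty string, which
        # Python 3.7+ treats as a split point between every character.
        my_arr = re.split("[.]+", text)
        if main_class != my_arr[0].strip():
            main_class = my_arr[0].strip()
        self.main_class = main_class
        self.sub_class = my_arr[0].strip()
        try:
            self.page = my_arr[1].strip()
        except IndexError:
            # No dot leader found, so the line carries no page number.
            self.page = ""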

def add_line(text, is_recording, main_class):
    if(is_recording):
        obj = MyClass(text, main_class)
        if obj.sub_class == "Glossary":
            return False, main_class
        table.append(obj)
        return True, obj.main_class
    elif text == "Contents":
        return True, main_class
    return False, main_class

last_text = ""
is_recording = False
main_class = ""
table = []

doc = fitz.open("TCS_1.pdf")
page = doc.getPageText(2, output="json")
blocks = json.loads(page)["blocks"]
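for block in blocks:
    lines = block.get("lines")
    if not lines:
        continue  # skip blocks without text lines (for example images)
    # Concatenate the first span of every line in the block.
    block_text = "".join(line["spans"][0]["text"] for line in lines).encode("utf-8")
    # Skip a block whose text repeats the previous one.
    if last_text != block_text:
        is_recording, main_class = add_line(block_text, is_recording, main_class)
        last_text = block_text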

# Use a context manager so the file is flushed and closed when writing ends.
with open("output.csv", "w") as f:
    writer = csv.writer(f, delimiter=',', lineterminator='\n')
    for my_class in table:
        writer.writerow([my_class.main_class, my_class.sub_class, my_class.page])
        # print(my_class.main_class, my_class.sub_class, my_class.page)

Here is the CSV output for the provided file: [image]
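When the extractor reports a position for each line (PyMuPDF's dictionary output, for example, includes a `bbox` per line), the hierarchy could also be derived from the left x-coordinate instead of from blank lines. The sketch below assumes the lines have already been reduced to `(x0, text)` pairs in reading order and that the index is laid out in a single column. The function name `group_by_indent` and the `tolerance` value are hypothetical and would need tuning for a real book.

```python
def group_by_indent(lines, tolerance=2.0):
    """Pair each line with its parent class based on indentation.

    lines: iterable of (x0, text) pairs in reading order, where x0 is the
    left edge of the line. Lines within `tolerance` of the smallest x0 are
    treated as classes; anything further right is a subclass of the most
    recent class.
    """
    lines = list(lines)
    if not lines:
        return []
    margin = min(x0 for x0, _ in lines)
    rows = []
    parent = None  # stays None if the page starts with an indented line
    for x0, text in lines:
        if x0 - margin <= tolerance:
            parent = text
            rows.append((text, ''))
        else:
            rows.append((parent, text))
    return rows


if __name__ == '__main__':
    sample = [
        (72.0, 'Application server'),
        (84.0, 'Apache Tomcat'),
        (84.0, 'WebSphere'),
        (72.3, 'Architect'),
    ]
    for row in group_by_indent(sample):
        print(row)
```

Unlike the blank-line approach, this version also emits a row for each parent class. A multi-column index would need the columns separated first, since the left margin is taken from the whole page.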

answered Oct 26 '22 03:10

Jonathan Gagne
