I have to extract text from PDF pages as it is with the indentation into a CSV file.
Index page from PDF text book:
I should split the text into class and subclass type hierarchy along with the page numbers. For example in the image, Application server is the class and Apache Tomcat is the subclass in the page number 275
This is the expected output of the CSV:
I have used Tika parser to parse the PDF, but the indentation is not maintained properly (not unique) in the parsed content for splitting the text into class and subclasses.
This is how the parsed text looks like:
Can anyone suggest me the right approach for this requirement?
Step 1: Import all libraries. Step 2: Convert PDF file to txt format and read data. Step 3: Use “. findall()” function of regular expressions to extract keywords.
despite I have no knowledge of pdf extraction, but it is possible to reconstruct the hierarchy from "the parsed text", because the "subclass" part always starts and ends with an extra newline character.
with following test text:
app architect . 50
app logic . 357
app server . 275
tomcat . 275
websphere . 275
jboss . 164
architect
acceptance . 303
development path . 304
architecting . 48
architectural activity . 25, 320
following code:
import csv
import sys
import re
def gen():
is_subclass = False
p_class = None
with open('test.data') as f:
s = f.read()
lines = re.findall(r'[^\n]+\n+', s)
for line in lines:
if ' . ' in line:
class_name, page_no = map(lambda s: s.strip(), line.split('.'))
else:
class_name, page_no = line.strip(), ''
if line.endswith('\n\n'):
if not is_subclass:
p_class = class_name
is_subclass = True
continue
if is_subclass:
yield (p_class, class_name, page_no)
else:
yield (class_name, '', page_no)
if line.endswith('\n\n'):
is_subclass = False
writer = csv.writer(sys.stdout)
writer.writerows(gen())
yields:
app architect,,50
app logic,,357
app server,tomcat,275
app server,websphere,275
app server,jboss,164
architect,acceptance,303
architect,development path,304
architecting,,48
architectural activity,,"25, 320"
hope this helps.
So here is the solution:
Code:
import fitz
import json
import re
import csv
class MyClass:
def __init__(self, text, main_class):
my_arr = re.split("[.]*", text)
if main_class != my_arr[0].strip():
main_class = my_arr[0].strip()
self.main_class = main_class
self.sub_class = my_arr[0].strip()
try:
self.page = my_arr[1].strip()
except:
self.page = ""
def add_line(text, is_recording, main_class):
if(is_recording):
obj = MyClass(text, main_class)
if obj.sub_class == "Glossary":
return False, main_class
table.append(obj)
return True, obj.main_class
elif text == "Contents":
return True, main_class
return False, main_class
last_text = ""
is_recording = False
main_class = ""
table = []
doc = fitz.open("TCS_1.pdf")
page = doc.getPageText(2, output="json")
blocks = json.loads(page)["blocks"]
for block in blocks:
if "lines" in block:
for line in block["lines"]:
line_text = ""
for span in block["lines"]:
line_text += span["spans"][0]["text"].encode("utf-8")
if last_text != line_text:
is_recording, main_class = add_line(line_text, is_recording, main_class)
last_text = line_text
writer = csv.writer(open("output.csv", 'w'), delimiter=',', lineterminator='\n')
for my_class in table:
writer.writerow([my_class.main_class, my_class.sub_class, my_class.page])
# print(my_class.main_class, my_class.sub_class, my_class.page)
Here is the CSV output of the file provided:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With