
How to extract text under specific headings from a pdf?

I want to extract text under specific headings from a pdf using python.

For example, I have a PDF with the headings Introduction, Summary, and Contents. I need to extract only the text under the heading 'Summary'.

How can I do this?

sample-image

AlfyFaisy asked Jan 05 '18 05:01





1 Answer

This is exactly what I am working on at my current company: we need to extract the text that sits under a given heading. I use a rule-based system. I read the entire document line by line and use a regex to identify all the numbered headings. Once I have the list of headings, I enter the name of the heading whose text I want. That input is compared against the list of headings, and I use the Universal Sentence Encoder to find the nearest match. After that I display all the content from that heading up to the next heading.
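A minimal sketch of that pipeline might look like the following. It is not the code I run in production, and it makes several assumptions: the PDF text has already been extracted into a plain string by whatever PDF library you prefer, headings look like "2. Summary" (the `HEADING_RE` pattern is a guess you will need to adapt), and the standard library's `difflib` stands in for the Universal Sentence Encoder so the example runs without extra dependencies. The names `split_sections` and `find_section` are hypothetical helpers made up for this illustration.

```python
import difflib
import re

# Assumed heading pattern: a number such as "1" or "2.3", an optional dot,
# then a title starting with a capital letter. Adjust for the real document.
HEADING_RE = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+([A-Z].*)$")


def split_sections(text):
    """Map each numbered heading title to the text that follows it."""
    sections = {}
    current = None
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # A new heading starts a new section.
            current = match.group(2).strip()
            sections[current] = []
        elif current is not None:
            # Any other line belongs to the most recent heading.
            sections[current].append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}


def find_section(text, query):
    """Return (title, body) for the heading closest to `query`, or None."""
    sections = split_sections(text)
    lowered = {title.lower(): title for title in sections}
    # difflib does string-similarity matching here; it is a simple stand-in
    # for an embedding model, not a semantic match.
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=0.6)
    if not hits:
        return None
    title = lowered[hits[0]]
    return title, sections[title]


sample = """1. Introduction
This report covers the project.
2. Summary
Revenue grew during the year.
Costs stayed flat.
3. Contents
Chapter list goes here.
"""

print(find_section(sample, "sumary"))
```

Note that a body line which itself starts with a number followed by a capitalized word would be mistaken for a heading, so on real documents the pattern usually needs extra constraints such as a maximum line length.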

PrafulPrasad answered Sep 28 '22 09:09
