I want to extract the text under specific headings from a PDF using Python.
For example, I have a PDF with the headings Introduction, Summary, and Contents. I need to extract only the text under the heading 'Summary'.
How can I do this?
Open the Comments pane by going to View > Panels > Comments Panel or by clicking the Comments panel icon on the right side of the frame, above the scrollbar. Then click the Filter icon.
This scenario is exactly what I am working on at my current company: we need to extract the text that lies under a heading. I'm personally using a rule-based system, i.e., I read the entire document line by line and use a regex to identify all the numbered headings. Once I have the list of headings, I enter the name of the heading whose text I want. This input is compared against the list of detected headings, and I use Universal Sentence Encoder to find the nearest match. After that, I just display all the content from that heading up to the immediately following heading.
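The approach above can be sketched roughly as follows. This is only a minimal illustration, not my production code, and it makes several assumptions: the PDF text has already been extracted into a plain string (for example with a library such as pypdf), headings are numbered like "1. Introduction" or "3.1 Scope" and sit on their own line, and the nearest-match step uses the standard library's `difflib` as a simple stand-in for Universal Sentence Encoder. The names `HEADING_RE`, `find_headings`, and `extract_section` are made up for this example.

```python
import difflib
import re

# Assumed heading shape: a section number, an optional dot, then a
# capitalized title, e.g. "1 Introduction", "2. Summary", "3.1 Scope".
# Body lines that happen to look like this will be misdetected.
HEADING_RE = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+([A-Z][^\n]{0,80})$")


def find_headings(lines):
    """Return (line_index, title) for each line that looks like a numbered heading."""
    headings = []
    for index, line in enumerate(lines):
        match = HEADING_RE.match(line)
        if match:
            headings.append((index, match.group(2).strip()))
    return headings


def extract_section(text, query, cutoff=0.6):
    """Return the text between the heading closest to `query` and the next heading.

    Returns None when no detected heading is similar enough to `query`.
    """
    lines = text.splitlines()
    headings = find_headings(lines)
    titles = [title.lower() for _, title in headings]

    # Stand-in for the embedding-based match: fuzzy string similarity.
    best = difflib.get_close_matches(query.lower(), titles, n=1, cutoff=cutoff)
    if not best:
        return None

    position = titles.index(best[0])
    start = headings[position][0] + 1
    # Stop at the next heading, or at the end of the document.
    if position + 1 < len(headings):
        end = headings[position + 1][0]
    else:
        end = len(lines)
    return "\n".join(lines[start:end]).strip()
```

Calling `extract_section(document_text, "Summary")` would return whatever lies between the "Summary" heading and the next detected heading. String similarity only tolerates typos and small wording differences; matching by meaning, as with a sentence encoder, would need an embedding model in place of `difflib`.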