How to extract text under specific headings from a pdf?

1 Answer

This is exactly the scenario I am working on at my current company: we need to extract the text that lies under a given heading. I use a rule-based system. I read the entire document line by line and use a regex to identify all the numbered headings. Once I have the headings, I enter the name of the heading whose paragraph I want. This input is matched against the list of extracted headings, and I use the Universal Sentence Encoder to find the nearest match. After that I simply display all the content from that heading up to the immediately following heading.
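A minimal sketch of that pipeline, under a few assumptions of my own: the PDF text has already been extracted into a list of lines (any PDF text extractor will do), headings look like "1 Introduction" or "2.1 Data sources", and the nearest-match step uses the standard library's `difflib` string similarity as a lightweight stand-in for the Universal Sentence Encoder, so it catches typos but not paraphrases. The names `HEADING_RE`, `split_by_headings` and `text_under` are hypothetical, not from the original setup.

```python
import re
from difflib import get_close_matches

# Assumed heading shape: a section number, optional trailing dot, then a
# capitalised title, e.g. "1 Introduction", "2.1 Data sources", "3. Results".
# Body lines that happen to look like this will be misread as headings.
HEADING_RE = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+([A-Z].*\S)\s*$")


def split_by_headings(lines):
    """Map each heading title to the text between it and the next heading."""
    sections = {}
    current = None
    for line in lines:
        match = HEADING_RE.match(line)
        if match:
            # A new heading closes the previous section and opens a new one.
            current = match.group(2)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
        # Lines before the first heading are ignored.
    return {title: "\n".join(body).strip() for title, body in sections.items()}


def text_under(query, sections, cutoff=0.4):
    """Return the text under the heading closest to `query`, or None."""
    # Stand-in for the sentence-encoder step: character-level similarity only.
    lowered = {title.lower(): title for title in sections}
    hits = get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return sections[lowered[hits[0]]] if hits else None
```

Usage would be along the lines of `text_under("methodology", split_by_headings(lines))`. To match on meaning rather than spelling, replace the body of `text_under` with an embedding comparison between the query and each heading.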

answered Sep 28 '22 09:09

PrafulPrasad

Tags:

pdf

python-2.7

text-extraction

document

pdf-extraction

AlfyFaisy
