
How to scrape tables in thousands of PDF files?

I have about 1'500 PDFs, each consisting of a single page and sharing the same structure (see http://files.newsnetz.ch/extern/interactive/downloads/BAG_15m_kzh_2012_de.pdf for an example).

What I am looking for is a way to iterate over all these files (locally, if possible) and extract the actual contents of the table (as CSV, stored into a SQLite DB, whatever).

I would love to do this in Node.js, but couldn't find any suitable libraries for parsing such stuff. Do you know of any?

If not possible in Node.js, I could also code it in Python, if there are better methods available.

Asked Aug 04 '14 by grssnbchr

People also ask

How do I scrape data from multiple PDFs?

In Adobe Acrobat, go to Tools -> Text Recognition -> In This File. Adobe Acrobat should start to OCR the PDF file. If you have multiple PDF files, we can set up an “Action Wizard” to automate the process and OCR all the PDF files.
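Outside of Acrobat, the same batch OCR can be scripted. A minimal sketch using the open-source `ocrmypdf` command-line tool (this assumes `ocrmypdf` is installed and that the source files live in a hypothetical `pdfs/` directory, with output going to `ocr/`):

```python
import subprocess
from pathlib import Path

def ocr_command(src, dest):
    # Build the ocrmypdf invocation; --skip-text leaves pages that
    # already contain a text layer untouched instead of failing.
    return ["ocrmypdf", "--skip-text", str(src), str(dest)]

out_dir = Path("ocr")
out_dir.mkdir(exist_ok=True)
for pdf in sorted(Path("pdfs").glob("*.pdf")):
    subprocess.run(ocr_command(pdf, out_dir / pdf.name), check=True)
```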

Can you scrape data from PDFs?

Docparser is a PDF scraper software that allows you to automatically pull data from recurring PDF documents on scale. Like web-scraping (collecting data by crawling the internet), scraping PDF documents is a powerful method to automatically convert semi-structured text documents into structured data.


1 Answer

I didn't know this before, but less has this magical ability to read PDF files: its input preprocessor (lesspipe) converts them to plain text, typically via pdftotext. I was able to extract the table data from your example PDF with this script:

import subprocess
import re

# less hands the PDF to its input preprocessor, which converts it
# to plain text; check_output returns bytes, so decode them
output = subprocess.check_output(["less", "BAG_15m_kzh_2012_de.pdf"]).decode("utf-8")

# data rows start with a number followed by a period
re_data_prefix = re.compile(r"^[0-9]+[.].*$")
# a field is a run of words separated by single spaces;
# fields themselves are separated by two or more spaces
re_data_fields = re.compile(r"(([^ ]+[ ]?)+)")

for line in output.splitlines():
    if re_data_prefix.match(line):
        print([f[0].strip() for f in re_data_fields.findall(line)])
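To scale this up to all 1'500 files and collect the results into a single CSV, here is a minimal sketch. It assumes `pdftotext` from poppler-utils is installed (`-layout` preserves the column alignment), that all PDFs sit in the current directory, and that table fields are separated by runs of two or more spaces:

```python
import csv
import re
import subprocess
from pathlib import Path

# data rows start with a number followed by a period
re_data_prefix = re.compile(r"^[0-9]+[.]")
# columns in the -layout dump are separated by two or more spaces
re_field_split = re.compile(r" {2,}")

def parse_rows(text):
    """Extract table rows from the plain-text dump of one PDF."""
    rows = []
    for line in text.splitlines():
        if re_data_prefix.match(line):
            rows.append([f.strip() for f in re_field_split.split(line) if f.strip()])
    return rows

def pdf_to_text(path):
    # "-" sends the converted text to stdout instead of a file
    return subprocess.check_output(
        ["pdftotext", "-layout", str(path), "-"]
    ).decode("utf-8")

with open("tables.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for pdf in sorted(Path(".").glob("*.pdf")):
        for row in parse_rows(pdf_to_text(pdf)):
            writer.writerow(row)
```

The same rows could just as easily be inserted into a SQLite table with the stdlib `sqlite3` module instead of written to CSV.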
Answered Sep 20 '22 by Andrew Johnson