I have about 1,500 PDFs, each consisting of a single page and exhibiting the same structure (see http://files.newsnetz.ch/extern/interactive/downloads/BAG_15m_kzh_2012_de.pdf for an example).
What I am looking for is a way to iterate over all these files (locally, if possible) and extract the actual contents of the table (as CSV, stored in a SQLite DB, whatever).
I would love to do this in Node.js, but I couldn't find any suitable libraries for parsing this kind of content. Do you know of any?
If it isn't possible in Node.js, I could also code it in Python, if better methods exist there.
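For reference, the shape of the loop I have in mind looks something like this rough Python sketch (assuming the poppler pdftotext CLI is installed; the pdfs/ directory, output filename, and column-splitting rule are placeholders):

import csv
import glob
import re
import subprocess

with open("tables.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for path in sorted(glob.glob("pdfs/*.pdf")):
        # -layout preserves the table's column alignment; "-" writes to stdout
        text = subprocess.check_output(["pdftotext", "-layout", path, "-"]).decode("utf-8")
        for line in text.splitlines():
            if re.match(r"^[0-9]+[.]", line):  # data rows start "1.", "2.", ...
                writer.writerow(re.split(r" {2,}", line.strip()))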
In Adobe Acrobat, go to Tools -> Text Recognition -> In This File, and Acrobat will OCR the PDF file. If you have multiple PDF files, you can set up an Action Wizard to automate the process and OCR all of them in one batch. (If the PDFs already contain a text layer, as the example linked above appears to, the OCR step is unnecessary.)
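If you would rather script the batch step than click through an Action Wizard, a command-line OCR tool such as ocrmypdf can do the same job. A minimal sketch, assuming ocrmypdf is installed and that pdfs/ and ocr/ are placeholder directory names:

import glob
import os
import subprocess

os.makedirs("ocr", exist_ok=True)
for path in sorted(glob.glob("pdfs/*.pdf")):
    out = os.path.join("ocr", os.path.basename(path))
    # --skip-text leaves pages that already have a text layer untouched
    subprocess.run(["ocrmypdf", "--skip-text", path, out], check=True)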
Docparser is a PDF-scraping service that automatically pulls data from recurring PDF documents at scale. Like web scraping (collecting data by crawling the internet), scraping PDF documents is a powerful way to automatically convert semi-structured text documents into structured data.
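Docparser is driven over a REST API, so uploading all 1,500 files can be scripted. The sketch below, using the requests library, only illustrates the general shape: the endpoint URL, parser ID, and authentication style are assumptions, so check Docparser's API documentation for the real values.

import glob
import requests

API_KEY = "your-api-key"        # placeholder
PARSER_ID = "your-parser-id"    # placeholder

for path in sorted(glob.glob("*.pdf")):
    with open(path, "rb") as f:
        # Endpoint shape is an assumption; verify it against the official docs
        resp = requests.post(
            "https://api.docparser.com/v1/document/upload/" + PARSER_ID,
            auth=(API_KEY, ""),
            files={"file": f},
        )
        resp.raise_for_status()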
I didn't know this before, but less has the magical ability to read PDF files (presumably through its input preprocessor, lesspipe, which shells out to pdftotext on most systems). I was able to extract the table data from your example PDF with this script:
import subprocess
import re

# less returns plain text rather than raw PDF bytes, thanks to its
# input preprocessor (typically lesspipe, which calls pdftotext)
output = subprocess.check_output(["less", "BAG_15m_kzh_2012_de.pdf"]).decode("utf-8")

re_data_prefix = re.compile(r"^[0-9]+[.].*$")   # data rows start with "1.", "2.", ...
re_data_fields = re.compile(r"(([^ ]+[ ]?)+)")  # fields are word groups split on 2+ spaces

for line in output.splitlines():
    if re_data_prefix.match(line):
        print([f[0].strip() for f in re_data_fields.findall(line)])
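To cover the rest of your question, the same approach scales to all 1,500 files and can write straight into SQLite. A sketch reusing the regexes above (the single fields column is a stand-in, since I don't know which table columns you actually need):

import glob
import re
import sqlite3
import subprocess

re_data_prefix = re.compile(r"^[0-9]+[.].*$")
re_data_fields = re.compile(r"(([^ ]+[ ]?)+)")

conn = sqlite3.connect("tables.db")
# Store the source file plus the raw fields; adapt the schema once the
# real column names are known
conn.execute("CREATE TABLE IF NOT EXISTS rows (source TEXT, fields TEXT)")

for path in sorted(glob.glob("*.pdf")):
    text = subprocess.check_output(["less", path]).decode("utf-8")
    for line in text.splitlines():
        if re_data_prefix.match(line):
            fields = [f[0].strip() for f in re_data_fields.findall(line)]
            conn.execute("INSERT INTO rows VALUES (?, ?)", (path, "\t".join(fields)))

conn.commit()
conn.close()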