PDF scraping using R

I have been using the XML package successfully for extracting HTML tables, but I want to extend this to PDFs. From previous questions it does not appear that there is a simple R solution, but I wondered whether there had been any recent developments.

Failing that, is there some way in Python (in which I am a complete novice) to obtain and manipulate PDFs so that I could finish the job off with the R XML package?

asked Oct 27 '11 by pssguy

People also ask

Is it possible to scrape data from PDF?

Docparser is a PDF scraper that allows you to automatically pull data from recurring PDF documents at scale. Like web scraping (collecting data by crawling the internet), scraping PDF documents is a powerful way to automatically convert semi-structured text documents into structured data.

Can you do web scraping with R?

The most commonly used web scraping tool for R is rvest; install it in RStudio with install.packages("rvest"). Knowledge of HTML and CSS is an added advantage, although most data scientists are not deeply familiar with them.
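A minimal sketch of what an rvest table scrape looks like, assuming the rvest package is installed. To keep the example self-contained, the HTML is supplied as an inline string rather than fetched from a live page:

```r
# rvest (via xml2) can parse HTML from a string as well as from a URL,
# so no network access is needed for this illustration.
library(rvest)

html <- '<html><body><table>
  <tr><th>name</th><th>value</th></tr>
  <tr><td>a</td><td>1</td></tr>
  <tr><td>b</td><td>2</td></tr>
</table></body></html>'

doc <- read_html(html)
tab <- html_table(doc)[[1]]  # first (and only) table, as a data frame
print(tab)
```

For a real page you would pass a URL to read_html() instead of a string; everything after that stays the same.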

Can we read PDF file in R?

We need to install and load the pdftools package to do the extraction. To read a PDF as text, use pdf_text(); we can then extract a particular page. In this case the PDF file contains a table.
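A minimal sketch of pdf_text(), assuming the pdftools package is installed. To stay self-contained, it first writes a one-page PDF with base R's pdf() device and then reads the text back:

```r
# Round-trip: create a small PDF, then extract its text with pdftools.
library(pdftools)

path <- tempfile(fileext = ".pdf")
pdf(path)                       # open a PDF graphics device
plot.new()
text(0.5, 0.5, "hello pdftools")
dev.off()

txt <- pdf_text(path)           # one character string per page
cat(txt[1])
```

pdf_text() returns a character vector with one element per page; splitting a page on "\n" with strsplit() gives you lines, which you can then parse as fixed-width columns when the page holds a table.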


1 Answer

Extracting text from PDFs is hard, and nearly always requires lots of care.

I'd start with command-line tools such as pdftotext and see what they spit out. The problem is that PDFs can store text in any order, can use awkward font encodings, and can do things like use ligature characters (the joined-up 'ff' and 'ij' that you see in proper typesetting) to throw you off.

pdftotext is installable on any Linux system (it is part of poppler-utils)...
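Since the asker works in R, one way to try the answer's suggestion is to shell out to pdftotext with system2(). This is a sketch that assumes poppler's pdftotext is installed and on the PATH; it generates its own sample PDF so the call has something to chew on:

```r
# -layout asks pdftotext to preserve the physical layout of the page,
# which is usually what you want when the PDF contains a table.
pdf_path <- tempfile(fileext = ".pdf")
pdf(pdf_path)
plot.new()
text(0.5, 0.5, "col1  col2")
dev.off()

txt_path <- tempfile(fileext = ".txt")
system2("pdftotext", c("-layout", pdf_path, txt_path))
readLines(txt_path)             # plain text, ready for the usual R parsing
```

From there you can read the text file line by line and parse it with regular R string tools, or hand structured fragments back to the XML package workflow the question describes.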

answered Oct 15 '22 by Spacedman