
Reading data from PDF files into R

Is that even possible!?!

I have a bunch of legacy reports that I need to import into a database. However, they're all in PDF format. Are there any R packages that can read PDFs? Or should I leave that to a command-line tool?

The reports were created in Excel and then exported to PDF, so they have a regular structure, but many blank "cells".

asked Feb 07 '12 23:02 by Justin




1 Answer

So... this gets me close even on a fairly complex table.

Download the sample PDF (linked as "bmi pdf") and save it as bmi_tbl.pdf.

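    library(tm)

    # Build a PDF reader that passes -layout to pdftotext, preserving column alignment
    pdf <- readPDF(PdftotextOptions = "-layout")

    # Read the sample file as text
    dat <- pdf(elem = list(uri = 'bmi_tbl.pdf'), language = 'en', id = 'id1')

    # Collapse each run of spaces into a comma, then parse the result as CSV
    dat <- gsub(' +', ',', dat)
    out <- read.csv(textConnection(dat), header = FALSE)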
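The last two steps above (collapsing spaces into commas, then parsing as CSV) can be tried without a PDF. What follows is only a minimal sketch in base R: the `lines` vector is made-up data standing in for what `pdftotext -layout` might produce, and `trimmed` and `csv_lines` are names introduced here for illustration.

```r
# Hypothetical sample lines, standing in for text extracted with -layout
lines <- c(
  "  58    91    96   100",
  "  59    94    99   104",
  "  60    97   102   107"
)

# Strip leading/trailing whitespace so no empty first column is created
trimmed <- trimws(lines)

# Collapse each run of spaces into a single comma
csv_lines <- gsub(" +", ",", trimmed)

# Parse the comma-separated text into a data frame (columns V1..V4)
out <- read.csv(text = csv_lines, header = FALSE)
print(out)
```

One caveat with this approach: a blank cell is just more spaces, so collapsing runs of spaces makes it disappear and shifts the remaining values left. For reports with many blank cells, parsing by fixed column positions (for example with `read.fwf`) may be a better fit.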
answered Sep 19 '22 20:09 by Justin