
How to read pdf file in R

Tags: r, pdf

Can someone tell me how to read a PDF file that includes some tables? I want to extract the data from the tables and save it to a CSV file.

Thanks a lot

許曉雯 asked Jul 26 '16 14:07


2 Answers

I realize this question is older, but I thought a reproducible example might not hurt:

library(pdftools)
# pdf_text() returns a character vector with one element per page
pdftools::pdf_text(pdf = "http://arxiv.org/pdf/1403.2805.pdf")

Offline version:

# Create a small PDF with the base graphics device, then read its text back
pdf(file = "tmp.pdf")
plot(1, main = "mytext")
dev.off()
pdftools::pdf_text(pdf = "tmp.pdf")
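
Since the question asks about tables and CSV output, here is a minimal sketch of one possible next step. It assumes the table columns are separated by two or more spaces in the extracted text, which will not hold for every PDF. The helper name page_to_table and the sample page string are made up for illustration; the string stands in for one element of what pdf_text() returns.

```r
# Stand-in for one page of pdf_text() output (assumption: a real call
# would be something like page <- pdftools::pdf_text("tmp.pdf")[1])
page <- "name      value   unit\nalpha     1.5     kg\nbeta      20      m\n"

# Hypothetical helper: split a page into rows, then split each row into
# cells wherever two or more consecutive spaces occur.
page_to_table <- function(page) {
  rows <- strsplit(page, "\n", fixed = TRUE)[[1]]
  rows <- trimws(rows)
  rows <- rows[nzchar(rows)]              # drop empty lines
  cells <- strsplit(rows, " {2,}")
  n <- length(cells[[1]])                 # first row is treated as the header
  cells <- cells[lengths(cells) == n]     # keep rows that match the header width
  df <- as.data.frame(do.call(rbind, cells[-1]), stringsAsFactors = FALSE)
  names(df) <- cells[[1]]
  df
}

tab <- page_to_table(page)

# Write the result to a CSV file in the session's temporary directory
out <- file.path(tempdir(), "table.csv")
write.csv(tab, out, row.names = FALSE)
```

All columns come back as character, so numeric columns would still need converting. Tables with merged cells or single-space gaps need more careful handling than this.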

I come back to this question from time to time, and even though the current answer is great, I always hope to find reproducible code, so I thought I would add it. It can be removed if not needed.

Tonio Liebrand answered Oct 19 '22 13:10


The University of Virginia has a well-described step-by-step guide, Reading PDF files into R for text mining. I have extracted some of the information below.

Please follow the installation notes described in the link above.

With that done, you’re ready to use readPDF (from the tm package) to create your function to read in PDF files. You can name the function whatever you like, e.g. Rpdf.

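library(tm)
# Note: newer tm versions default to a different engine; you may need
# engine = "xpdf" for the control options to take effect
Rpdf <- readPDF(control = list(text = "-layout"))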

The readPDF function has a control argument which we use to pass options to our PDF extraction engine. This has to be in the form of a list, so we wrap our options in the list function. There are two control parameters for the xpdf engine: info and text. info passes parameters to pdfinfo.exe and text passes parameters to pdftotext.exe. We only pass one parameter setting to pdftotext: “-layout”. This tells pdftotext.exe to maintain (as best as possible) the original physical layout of the text.
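
To make the role of “-layout” more concrete, here is a rough sketch of calling pdftotext directly from R. It is not taken from the linked guide. It assumes a pdftotext binary (from xpdf or poppler) is on your PATH, and it only prints a message if none is found. The file name layout_demo.pdf is made up for illustration.

```r
# Look for the pdftotext binary; Sys.which() returns "" if it is not on PATH
pdftotext <- Sys.which("pdftotext")

# Create a small sample PDF in the temporary directory (illustrative file name)
pdf_file <- file.path(tempdir(), "layout_demo.pdf")
pdf(file = pdf_file)
plot(1, main = "mytext")
dev.off()

# Arguments: the -layout option, the input file, and "-" to write to stdout
args <- c("-layout", shQuote(pdf_file), "-")

if (nzchar(pdftotext)) {
  txt <- system2(pdftotext, args, stdout = TRUE)  # one element per output line
  print(txt)
} else {
  message("pdftotext not found on PATH; skipping the direct call")
}
```

Running the same command without "-layout" and comparing the output is a quick way to see what the option changes for a given file.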

Using the Rpdf function we can proceed to read in the text of the opinions. What we want to do is convert the PDF files to text and store them in a corpus, which is basically a database for text. We can do all that with the following code:

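# Assumption: the PDF files are in the current working directory
files <- list.files(pattern = "pdf$")
opinions <- Corpus(URISource(files), readerControl = list(reader = Rpdf))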

help-info.de answered Oct 19 '22 13:10