 

Recognize PDF table using R


I'm trying to extract data from tables inside some pdf reports.

I've seen some examples using pdftools and similar packages, and I was able to get the text out. However, I only want to extract the tables.

Is there a way to use R to recognize and extract only tables?

RCS asked May 23 '17 17:05



2 Answers

Awesome question, I wondered about the same thing recently, thanks!

I did it with tabulizer ‘0.2.2’, as @hrbrmstr also suggests. If you are using R > 3.5.x, the following solution should work. Install the three packages in this specific order:

    # install.packages("rJava")
    # library(rJava)  # load and attach 'rJava' now
    # install.packages("devtools")
    # devtools::install_github("ropensci/tabulizer", args="--no-multiarch")

Update: After testing the approach again, it looks like install.packages("tabulizer") is now enough on its own; rJava is installed automatically as a dependency.

Now you are ready to extract tables from your PDF reports.

    library(tabulizer)

    ## load report
    l <- "https://sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf"
    m <- extract_tables(l, encoding="UTF-8")[[2]]  ## comes as a character matrix
    ## Note: peep into `?extract_tables` for further specs (page, location etc.)!

    ## use first row as column names (base R `setNames`, not data.table's `setnames`)
    dat <- setNames(type.convert(as.data.frame(m[-1, ]), as.is=TRUE), m[1, ])
    ## example-specific date conversion
    dat$Date <- as.POSIXlt(dat$Date, format="%m/%d/%y")
    dat <- within(dat, Date$year <- ifelse(Date$year > 120, Date$year - 100, Date$year))

    dat  ## voilà
    #    Speed (mph)          Driver                        Car    Engine       Date
    # 1      407.447 Craig Breedlove          Spirit of America    GE J47 1963-08-05
    # 2      413.199       Tom Green           Wingfoot Express    WE J46 1964-10-02
    # 3      434.220      Art Arfons              Green Monster    GE J79 1964-10-05
    # 4      468.719 Craig Breedlove          Spirit of America    GE J79 1964-10-13
    # 5      526.277 Craig Breedlove          Spirit of America    GE J79 1965-10-15
    # 6      536.712      Art Arfons              Green Monster    GE J79 1965-10-27
    # 7      555.127 Craig Breedlove Spirit of America, Sonic 1    GE J79 1965-11-02
    # 8      576.553      Art Arfons              Green Monster    GE J79 1965-11-07
    # 9      600.601 Craig Breedlove Spirit of America, Sonic 1    GE J79 1965-11-15
    # 10     622.407   Gary Gabelich                 Blue Flame    Rocket 1970-10-23
    # 11     633.468   Richard Noble                   Thrust 2 RR RG 146 1983-10-04
    # 12     763.035      Andy Green                 Thrust SSC   RR Spey 1997-10-15
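The date conversion in the code above is easy to miss, so here is a minimal base-R sketch of what it does. On input, `%y` maps two-digit years 00 to 68 to 20xx, so a date such as 08/05/63 is first read as 2063 and has to be shifted back a century. The sample strings in `raw` are assumptions chosen for illustration; the cutoff of 120 (the year 2020, counted from 1900) is the one used in the answer.

```r
# Hypothetical sample of two-digit-year dates, as they appear in the table.
raw <- c("08/05/63", "10/15/97", "10/23/70")

# Parse with a fixed time zone so the result does not depend on the locale.
d <- as.POSIXlt(raw, format = "%m/%d/%y", tz = "UTC")

# POSIXlt stores the year as "years since 1900", so 2063 is stored as 163.
parsed_years <- format(d, "%Y")

# Shift every year that landed after 2020 back by one century.
d$year <- ifelse(d$year > 120, d$year - 100, d$year)

fixed <- as.Date(format(d, "%Y-%m-%d"))
print(fixed)
```

If the reports can contain dates after 2020, the cutoff needs to be adjusted accordingly.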

Hope it works for you.

Limitations: Of course, the table in this example is quite simple; with messier tables you may have to clean up the cells with gsub and similar functions first.
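As a rough idea of what such a cleanup could look like, here is a small base-R sketch. The matrix `m` is a made-up stand-in for the character matrix that extract_tables returns, and the footnote marker, thousands separator, and percent sign are assumed examples of clutter; real reports will need their own patterns.

```r
# Hypothetical character matrix, shaped like the output of extract_tables().
m <- matrix(c("Region", "Sales (USD)", "Share",
              "North*", "1,250",       "41.7 %",
              "South",  "1,750",       "58.3 %"),
            ncol = 3, byrow = TRUE)

body <- m[-1, , drop = FALSE]

# Strip the clutter that would keep type.convert() from seeing numbers.
body[, 1] <- gsub("\\*$", "", body[, 1])               # trailing footnote marker
body[, 2] <- gsub(",", "", body[, 2], fixed = TRUE)    # thousands separator
body[, 3] <- gsub("\\s*%$", "", body[, 3])             # percent sign

# Same idiom as above: first row becomes the column names.
dat <- setNames(type.convert(as.data.frame(body, stringsAsFactors = FALSE),
                             as.is = TRUE),
                m[1, ])
str(dat)
```

Cleaning the matrix before calling type.convert means the numeric columns come out as numbers rather than character strings.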

jay.sf answered Sep 20 '22 19:09


I would love to know the answer to this as well. But from my experience, you need to use regular expressions to get the data into the format you want. See the following as an example:

    library(pdftools)
    dat <- pdftools::pdf_text("https://s3-eu-central-1.amazonaws.com/de-hrzg-khl/kh-ffe/public/artikel-pdfs/Free_PDF/BF_LISTE_20016.pdf")
    dat <- paste0(dat, collapse = " ")
    pattern <- "Berufsfeuerwehr\\s+Straße(.)*02366.39258"
    extract <- regmatches(dat, regexpr(pattern, dat))
    extract <- gsub('\n', "  ", extract)
    strsplit(extract, "\\s{2,}")

From here, you can loop over the resulting pieces to build the table as desired. But as you can see in the link, the PDF does not consist only of a table.
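One possible way to do that last step without an explicit loop, sketched in base R: if every row has the same number of cells, the flat vector of pieces can be folded into a matrix row by row. The string in `extract` is invented for illustration and only mimics the layout of text matched from a PDF; the column count `n_col` is an assumption that has to be checked against the real document.

```r
# Hypothetical stand-in for the text matched by the regular expression.
extract <- paste("Name  Street  Phone",
                 "Station A  Main Street 1  000.0001",
                 "Station B  Side Road 22  000.0002",
                 sep = "\n")

# Same splitting steps as above: newlines become gaps, gaps separate cells.
extract <- gsub("\n", "  ", extract)
tokens  <- strsplit(extract, "\\s{2,}")[[1]]

# Assumption: every row has exactly three cells and the first row is a header.
n_col <- 3
stopifnot(length(tokens) %% n_col == 0)

m   <- matrix(tokens, ncol = n_col, byrow = TRUE)
dat <- setNames(as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE),
                m[1, ])
print(dat)
```

This breaks down as soon as a cell is empty or contains a double space, which is where the regular expressions have to become more specific to the document.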

Charl Francois Marais answered Sep 20 '22 19:09