I'm trying to extract data from tables inside some pdf reports.
I've seen some examples using pdftools and similar packages, and I was successful in getting the text; however, I just want to extract the tables.
Is there a way to use R to recognize and extract only tables?
Thankfully, there's an R package called tabulizer that can help you extract tables or text from PDF files automatically and quickly. Below, I'll introduce you to the tabulizer package.
A different route is text extraction with the tm package. You can name the reader function whatever you like, e.g., Rpdf. The readPDF() function has a control argument that we use to pass options to our PDF extraction engine; this has to be a list, so we wrap our options in list(). There are two control parameters for the xpdf engine: info and text.
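For completeness, here is a minimal sketch of how that reader can be wired up. The file name report.pdf is just a placeholder, and the xpdf engine additionally requires the pdftotext/pdfinfo command-line tools on your system:

library(tm)

## readPDF() returns a reader function; the name Rpdf is arbitrary
Rpdf <- readPDF(engine = "xpdf",
                control = list(info = NULL, text = "-layout"))

## read the PDF into a one-document corpus and pull out its text
corp <- VCorpus(URISource("report.pdf"),
                readerControl = list(reader = Rpdf))
txt  <- content(corp[[1]])
head(txt)

The "-layout" option tells pdftotext to preserve the physical layout of the page, which keeps table columns roughly aligned in the extracted text.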
Awesome question, I wondered about the same thing recently, thanks!
I did it with tabulizer ‘0.2.2’, as @hrbrmstr also suggests. If you are using R > 3.5.x, the following solution works. Install the three packages in this specific order:
# install.packages("rJava") # library(rJava) # load and attach 'rJava' now # install.packages("devtools") # devtools::install_github("ropensci/tabulizer", args="--no-multiarch")
Update: after testing the approach again, it looks like it's now enough to just run install.packages("tabulizer"); rJava will be installed automatically as a dependency.
Now you are ready to extract tables from your PDF reports.
library(tabulizer)
library(data.table)  ## provides setnames()

## load report
l <- "https://sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf"
m <- extract_tables(l, encoding="UTF-8")[[2]]  ## comes as a character matrix
## Note: peep into `?extract_tables` for further specs (page, location etc.)!

## use first row as column names
dat <- setnames(type.convert(as.data.frame(m[-1, ]), as.is=TRUE), m[1, ])

## example-specific date conversion
dat$Date <- as.POSIXlt(dat$Date, format="%m/%d/%y")
dat <- within(dat, Date$year <- ifelse(Date$year > 120, Date$year - 100, Date$year))

dat
## voilà
#    Speed (mph)          Driver                        Car    Engine       Date
# 1      407.447 Craig Breedlove          Spirit of America    GE J47 1963-08-05
# 2      413.199       Tom Green           Wingfoot Express    WE J46 1964-10-02
# 3      434.220      Art Arfons              Green Monster    GE J79 1964-10-05
# 4      468.719 Craig Breedlove          Spirit of America    GE J79 1964-10-13
# 5      526.277 Craig Breedlove          Spirit of America    GE J79 1965-10-15
# 6      536.712      Art Arfons              Green Monster    GE J79 1965-10-27
# 7      555.127 Craig Breedlove Spirit of America, Sonic 1    GE J79 1965-11-02
# 8      576.553      Art Arfons              Green Monster    GE J79 1965-11-07
# 9      600.601 Craig Breedlove Spirit of America, Sonic 1    GE J79 1965-11-15
# 10     622.407   Gary Gabelich                 Blue Flame    Rocket 1970-10-23
# 11     633.468   Richard Noble                   Thrust 2 RR RG 146 1983-10-04
# 12     763.035      Andy Green                 Thrust SSC   RR Spey 1997-10-15
Hope it works for you.
Limitations: of course, the table in this example is quite simple, and with messier reports you may have to clean up the extracted cells with gsub() and similar string operations.
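For instance, a column that arrives as text with thousands separators or stray whitespace could be cleaned along these lines (a hypothetical sketch; the messy values below are made up, not taken from the example PDF):

## hypothetical messy cells as they sometimes come out of extract_tables()
x <- c("1,234.5 ", " 2,000.0", "987.6\n")
x <- gsub(",", "", x)              # drop thousands separators
x <- gsub("[[:space:]]+", "", x)   # drop whitespace and stray newlines
as.numeric(x)
# [1] 1234.5 2000.0  987.6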
I would love to know the answer to this as well. But from my experience, you need regular expressions to get the data into the format you want. See the following as an example:
library(pdftools)

dat <- pdftools::pdf_text("https://s3-eu-central-1.amazonaws.com/de-hrzg-khl/kh-ffe/public/artikel-pdfs/Free_PDF/BF_LISTE_20016.pdf")
dat <- paste0(dat, collapse = " ")

pattern <- "Berufsfeuerwehr\\s+Straße(.)*02366.39258"
extract <- regmatches(dat, regexpr(pattern, dat))
extract <- gsub('\n', " ", extract)
strsplit(extract, "\\s{2,}")
From here, the data can be looped over to build the table as desired. But as you can see from the link, the PDF is not just a table.
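One rough sketch of that looping step, purely illustrative: the number of columns per row and the header names below are assumptions about the layout, not something the PDF guarantees.

## turn the split fields into a data frame, assuming a fixed number of
## columns per row (both n_col and the header names are guesses)
fields <- strsplit(extract, "\\s{2,}")[[1]]
n_col  <- 4
rows   <- split(fields, ceiling(seq_along(fields) / n_col))
tab    <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(tab) <- c("Name", "Strasse", "Ort", "Telefon")
head(tab)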