I'm trying to extract data from tables inside some pdf reports.
I've seen some examples using pdftools and similar packages, and I was successful in getting the text; however, I just want to extract the tables.
Is there a way to use R to recognize and extract only tables?
Thankfully, there's an R package called tabulizer that can help you extract tables or text from PDF files automatically and quickly. Below, I'll introduce you to the tabulizer package.
A different route is text extraction with the tm package. You can name the reader function whatever you like, e.g., Rpdf. The readPDF() function has a control argument that we use to pass options to our PDF extraction engine; this has to be a list, so we wrap our options in list(). There are two control parameters for the xpdf engine: info and text.
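For completeness, here is a minimal sketch of how that reader can be wired up. The file name report.pdf is just a placeholder, and the xpdf engine additionally requires the pdftotext/pdfinfo command-line tools on your system:

library(tm)

## readPDF() returns a reader function; the name Rpdf is arbitrary
Rpdf <- readPDF(engine = "xpdf",
                control = list(info = NULL, text = "-layout"))

## read the PDF into a one-document corpus and pull out its text
corp <- VCorpus(URISource("report.pdf"),
                readerControl = list(reader = Rpdf))
txt  <- content(corp[[1]])
head(txt)

The "-layout" option tells pdftotext to preserve the physical layout of the page, which keeps table columns roughly aligned in the extracted text.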
Awesome question, I wondered about the same thing recently, thanks!
I did it with tabulizer ‘0.2.2’, as @hrbrmstr also suggests. If you are using R > 3.5.x, the following solution works. Install the three packages in this specific order:
# install.packages("rJava") # library(rJava) # load and attach 'rJava' now # install.packages("devtools") # devtools::install_github("ropensci/tabulizer", args="--no-multiarch")
Update: after testing the approach again, it looks like it's now enough to just run install.packages("tabulizer"); rJava will be installed automatically as a dependency.
Now you are ready to extract tables from your PDF reports.
library(tabulizer)
library(data.table)  ## provides setnames()

## load report
l <- "https://sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf"
m <- extract_tables(l, encoding="UTF-8")[[2]]  ## comes as a character matrix
## Note: peep into `?extract_tables` for further specs (page, location etc.)!

## use first row as column names
dat <- setnames(type.convert(as.data.frame(m[-1, ]), as.is=TRUE), m[1, ])

## example-specific date conversion
dat$Date <- as.POSIXlt(dat$Date, format="%m/%d/%y")
dat <- within(dat, Date$year <- ifelse(Date$year > 120, Date$year - 100, Date$year))

dat
## voilà
#    Speed (mph)          Driver                        Car    Engine       Date
# 1      407.447 Craig Breedlove          Spirit of America    GE J47 1963-08-05
# 2      413.199       Tom Green           Wingfoot Express    WE J46 1964-10-02
# 3      434.220      Art Arfons              Green Monster    GE J79 1964-10-05
# 4      468.719 Craig Breedlove          Spirit of America    GE J79 1964-10-13
# 5      526.277 Craig Breedlove          Spirit of America    GE J79 1965-10-15
# 6      536.712      Art Arfons              Green Monster    GE J79 1965-10-27
# 7      555.127 Craig Breedlove Spirit of America, Sonic 1    GE J79 1965-11-02
# 8      576.553      Art Arfons              Green Monster    GE J79 1965-11-07
# 9      600.601 Craig Breedlove Spirit of America, Sonic 1    GE J79 1965-11-15
# 10     622.407   Gary Gabelich                 Blue Flame    Rocket 1970-10-23
# 11     633.468   Richard Noble                   Thrust 2 RR RG 146 1983-10-04
# 12     763.035      Andy Green                 Thrust SSC   RR Spey 1997-10-15
Hope it works for you.
Limitations: of course, the table in this example is quite simple, and with messier reports you may have to clean up the extracted cells with gsub() and similar string operations.
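For instance, a column that arrives as text with thousands separators or stray whitespace could be cleaned along these lines (a hypothetical sketch; the messy values below are made up, not taken from the example PDF):

## hypothetical messy cells as they sometimes come out of extract_tables()
x <- c("1,234.5 ", " 2,000.0", "987.6\n")
x <- gsub(",", "", x)              # drop thousands separators
x <- gsub("[[:space:]]+", "", x)   # drop whitespace and stray newlines
as.numeric(x)
# [1] 1234.5 2000.0  987.6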
I would love to know the answer to this as well. But from my experience, you need regular expressions to get the data into the format you want. See the following as an example:
library(pdftools)

dat <- pdftools::pdf_text("https://s3-eu-central-1.amazonaws.com/de-hrzg-khl/kh-ffe/public/artikel-pdfs/Free_PDF/BF_LISTE_20016.pdf")
dat <- paste0(dat, collapse = " ")

pattern <- "Berufsfeuerwehr\\s+Straße(.)*02366.39258"
extract <- regmatches(dat, regexpr(pattern, dat))
extract <- gsub('\n', " ", extract)
strsplit(extract, "\\s{2,}")
From here, the data can be looped over to build the table as desired. But as you can see from the link, the PDF is not just a table.
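One rough sketch of that looping step, purely illustrative: the number of columns per row and the header names below are assumptions about the layout, not something the PDF guarantees.

## turn the split fields into a data frame, assuming a fixed number of
## columns per row (both n_col and the header names are guesses)
fields <- strsplit(extract, "\\s{2,}")[[1]]
n_col  <- 4
rows   <- split(fields, ceiling(seq_along(fields) / n_col))
tab    <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(tab) <- c("Name", "Strasse", "Ort", "Telefon")
head(tab)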