Is that even possible? I have a bunch of legacy reports that I need to import into a database, but they are all in PDF format. Are there any R packages that can read PDFs, or should I leave that to a command-line tool? The reports were made in Excel and then converted to PDF, so they have a regular structure but many blank "cells".


Reading data from PDF files into R

1 Answer

So... this gets me close even on a fairly complex table.

Download a sample PDF (linked in the original answer as "bmi pdf") and save it as bmi_tbl.pdf:


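    library(tm)

    # Build a reader that calls pdftotext with -layout to keep the column alignment.
    # Assumes tm >= 0.6 and xpdf's pdftotext on the PATH; older tm versions
    # used readPDF(PdftotextOptions = "-layout") instead.
    read_pdf <- readPDF(engine = "xpdf", control = list(text = "-layout"))
    doc <- read_pdf(elem = list(uri = "bmi_tbl.pdf"), language = "en", id = "id1")

    # Turn each run of spaces into a comma, then parse the lines as CSV
    dat <- gsub(" +", ",", content(doc))
    out <- read.csv(textConnection(dat), header = FALSE)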

168

answered Sep 19 '22 20:09

Justin
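One caveat with the answer above, given the "many blank cells" in the question: collapsing every run of spaces into a single comma also collapses a blank cell, so the values to its right shift one column left. The following is a minimal sketch of an alternative, not part of the original answer. It assumes the `-layout` text is vertically aligned and treats any character position that is blank in every line as a column gap. `parse_layout` and the sample lines are hypothetical names and data made up for illustration.

```r
# Split aligned text lines into columns, keeping blank cells as NA.
parse_layout <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]                 # drop empty lines
  width <- max(nchar(lines))
  padded <- paste0(lines, strrep(" ", width - nchar(lines)))  # pad to equal width
  chars <- do.call(rbind, strsplit(padded, ""))         # one row per line, one column per character
  is_gap <- colSums(chars != " ") == 0                  # positions blank in every line
  runs <- rle(!is_gap)                                  # consecutive non-gap positions form a column
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  starts <- starts[runs$values]
  ends <- ends[runs$values]
  cells <- mapply(function(s, e) trimws(substring(padded, s, e)), starts, ends)
  cells[cells == ""] <- NA                              # blank cells become NA
  as.data.frame(matrix(cells, nrow = length(padded)), stringsAsFactors = FALSE)
}

# Made-up sample standing in for `pdftotext -layout` output
sample_lines <- sprintf("%-8s%-8s%s",
                        c("name", "anna", "bob", "carol"),
                        c("height", "160", "", "172"),
                        c("weight", "55", "80", ""))
tbl <- parse_layout(sample_lines)
```

With the real file, the input would be the character vector of lines from the reader (for example `content(doc)`) instead of `sample_lines`. This will merge columns if a header or cell spans a gap, so check the result against the PDF before loading it into the database.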


Tags: linux, r, pdf, pdf-scraping, scrape
