
Extracting data from a specific position of a PDF?

Tags: r, pdf
I'm attempting to extract data from a PDF, which can be found at https://www.dol.gov/ui/data.pdf. The data I'm interested in are on page 4 of the PDF: the 3 observations of Initial Claims (NSA), the 3 observations of Insured Unemployment (NSA), and the covered employment figure used for the most recent week (footnote 2).

I've read the PDF into R using pdftools, but the text output it generates is quite messy, which is to be expected given the nature of PDFs. Is there any way I can extract specific data from this text output? I believe the data will always be in the same place in the output, which is helpful.

The output I'm looking at can be seen with the following script:

library(pdftools)

download.file("https://www.dol.gov/ui/data.pdf", "data.pdf", mode="wb")

uidata <- pdf_text("data.pdf")
uidata[4]

I've searched for similar questions and fiddled around with scan() and grep(), but I can't figure out a way to isolate and extract the data I need from the text output. Thanks in advance if anyone stumbles upon this and can point me in the right direction. If not, I'll keep trying to figure this out!

asked Dec 27 '16 at 21:12 by Teeb


1 Answer

With grep and a little regex, you can get everything you need into a usable structure:

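library(magrittr)

# Read the PDF and split page 4 into individual lines
x <- pdftools::pdf_text('https://www.dol.gov/ui/data.pdf')
x2 <- readLines(textConnection(x[4]))
# Each table on the page starts with a 'WEEK ENDING' header line
r <- grep('WEEK ENDING', x2)

# Build one data frame per table: take the lines from each header up to the
# next header (or up to the first 'FOOTNOTE' line for the last table)
l <- lapply(seq_along(r), function(i){
    x2[r[i]:(na.omit(c(r[i + 1], grep('FOOTNOTE', x2)))[1] - 1)] %>% 
        trimws() %>% 
        gsub('\\s{2,}', ';', .) %>%    # runs of 2+ spaces become column separators
        paste(collapse = '\n') %>% 
        read.csv2(text = ., dec = '.')
    })

# Footnote 2: strip a leading '2' and every non-digit character from the
# line(s) containing '2.', leaving only the covered employment figure
from_footnote <- as.numeric(gsub('^2|\\D', '', x2[grep('2\\.', x2)]))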

l[[1]][3,]
#>                      WEEK.ENDING December.17 December.10  Change
#> Initial Claims (NSA)     315,613     305,333     +10,280 352,534
#>                      December.3
#> Initial Claims (NSA)    319,641

from_footnote
#> [1] 138322138

You'll still need to parse the numbers, but at least it's usable.
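
As a minimal sketch of that last step, the comma-formatted strings can be converted by stripping the thousands separators before calling as.numeric(). The helper name parse_claims_number is hypothetical, and the sample values are copied from the output printed above rather than read from the PDF.

```r
# Hypothetical helper: convert strings such as "315,613" or "+10,280" to numeric.
parse_claims_number <- function(x) {
  # as.character() also covers factor columns; trimws() drops stray padding.
  # Only the thousands separators are removed: as.numeric() accepts a
  # leading "+" or "-" sign on its own.
  as.numeric(gsub(",", "", trimws(as.character(x)), fixed = TRUE))
}

# Sample values copied from the printed row above (not read from the PDF)
raw <- c("315,613", "305,333", "+10,280", "352,534", "319,641")
parsed <- parse_claims_number(raw)
parsed
```

The same helper could be applied column by column to a data frame such as l[[1]], for instance with lapply(). Cells that are not numbers would become NA with a warning, so it is worth checking the result against the printed table.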

answered Oct 01 '22 at 11:10 by alistaire