I'm attempting to extract data from a PDF located at https://www.dol.gov/ui/data.pdf. The data I'm interested in are on page 4 of the PDF: the 3 observations of Initial Claims (NSA), the 3 observations of Insured Unemployment (NSA), and the covered employment figure for the most recent week (footnote 2).
I've read the PDF into R using pdftools, but the text output it generates is quite ugly (to be expected, given the nature of PDFs). Is there any way I can extract specific data from this text output? I believe the data will always be in the same place in the output, which is helpful.
The output I'm looking at can be seen with the following script:
library(pdftools)
download.file("https://www.dol.gov/ui/data.pdf", "data.pdf", mode="wb")
uidata <- pdf_text("data.pdf")
uidata[4]
I've searched for people with similar questions and fiddled around with scan() and grep(), but I can't figure out a way to isolate and extract the data I need from the text output. Thanks in advance if anyone stumbles upon this and can point me in the right direction; if not, I'll keep trying to figure it out!
With grep and a little regex, you can get everything you need into a usable structure:
library(magrittr)

# Read the PDF straight from the URL and split page 4 into lines
x <- pdftools::pdf_text('https://www.dol.gov/ui/data.pdf')
x2 <- readLines(textConnection(x[4]))

# Each table on the page starts with a "WEEK ENDING" header row
r <- grep('WEEK ENDING', x2)

# For each table, take the lines up to the next table header (or the
# footnotes), collapse runs of 2+ spaces into a delimiter, and parse
l <- lapply(seq_along(r), function(i){
    x2[r[i]:(na.omit(c(r[i + 1], grep('FOOTNOTE', x2)))[1] - 1)] %>%
        trimws() %>%
        gsub('\\s{2,}', ';', .) %>%
        paste(collapse = '\n') %>%
        read.csv2(text = ., dec = '.')
})

# Footnote 2 holds the covered employment figure: drop the leading "2"
# (the footnote marker) and every other non-digit, then convert
from_footnote <- as.numeric(gsub('^2|\\D', '', x2[grep('2\\.', x2)]))
l[[1]][3,]
#>                       WEEK.ENDING December.17 December.10  Change
#> Initial Claims (NSA)      315,613     305,333     +10,280 352,534
#>                       December.3
#> Initial Claims (NSA)      319,641
from_footnote
#> [1] 138322138
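For reference, the footnote regex works like this: '^2' strips the leading footnote marker, and '\D' strips every remaining non-digit, leaving only the covered employment figure. The input string below is a made-up stand-in for the actual footnote wording, which isn't reproduced verbatim here:

# Worked example of the footnote regex on a hypothetical footnote line
gsub('^2|\\D', '', '2. ... based on covered employment of 138,322,138 ...')
#> [1] "138322138"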
You'll still need to parse the numbers, but at least it's usable.
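As a minimal sketch of that last step, you could strip the formatting characters and convert each cell to numeric. Note that parse_num() is a helper introduced here for illustration, not part of the answer above, and the column positions should be checked against your own output, since the header/value alignment can shift by a column:

# Strip commas and "+" signs, then convert to numeric; as.character()
# guards against factor columns coming out of read.csv2()
parse_num <- function(x) as.numeric(gsub('[+,]', '', as.character(x)))

# Row 3 of the first table is the Initial Claims (NSA) row printed above
sapply(l[[1]][3, ], parse_num)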