I am trying to create a data.frame with all USA County Unemployment Data from the US Bureau of Labor Statistics:
http://www.bls.gov/lau/#cntyaa
The data has one result per year from 1990 thru 2013.
I had originally planned to use the BLS API, but it looks like they consider each county to be a separate query and the total number of queries would be over their threshold. I now think it might be easier to simply scrape the data from the TXT files they put online, but I am having trouble parsing the data with R. Here is an example page with the data:
http://www.bls.gov/lau/laucnty90.txt # 90 = 1990
I originally tried parsing the file with the rvest
package. But because the data is all in one <p>
tag I think that there is not enough HTML structure to the table for that tool.
I then tried download.file
and read.table
. But again, the data seems to not be in the right format for those tools - extra lines at the top and bottom, and the "separating character" is just a space, which confuses R when county names contain spaces.
At the end of the day I just need a data.frame with 3 pieces of data from this file: state FIPS code, county FIPS code and unemployment rate.
I'm now thinking that the easiest way to create this data.frame might be to download the excel files, delete the columns I don't need, remove the extra text at the top and bottom, export to CSV, and then read it into R.
I can certainly do this for each of the 14 years. But I lose some reproducbility with this - other people won't be able to easily verify that I didn't make a mistake in the import process.
Does anyone see an easier way to create this data.frame?
That's just a fixed width data file. I don't have time to give you perfect code, but tweaking this should give you what you need:
url = 'http://www.bls.gov/lau/laucnty90.txt'
w = c(17, 8, 6, 50, 12, 13, 12, 11)
dat = read.fwf(url, w, skip=3)
Need a "ruler" to figure out which columns to split at:
cat(">",paste0(rep(c(1:9,"+"),14),collapse=""))
cat(">",paste0(sprintf("%08s0/",1:14),collapse=""))
> 123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+
cat(">",paste0(sprintf("%08s0/",1:14),collapse=""))
> 000000010/000000020/000000030/000000040/000000050/000000060/000000070/000000080/000000090/000000100/000000110/000000120/000000130/000000140/
# and paste in the first line of data
> CN0100100000000 01 001 Autauga County, AL 1990 16,875 15,853 1,022 6.1
That lets you figure out where to put the splits andtaking the differnse and shifting by one value gets the widths. Skip first six rows and then process the data read in as character to avoid the hassles of factors. Remove the commas before coercing to numeric.
> dat = read.fwf(url,
widths=diff(c(0,16,21,29,80,86,100,115,125,132)+1),
skip=6,colClasses="character")
> str(dat)
'data.frame': 3219 obs. of 9 variables:
$ V1: chr "CN0100100000000 " "CN0100300000000 " "CN0100500000000 " "CN0100700000000 " ...
$ V2: chr " 01 " " 01 " " 01 " " 01 " ...
$ V3: chr " 001 " " 003 " " 005 " " 007 " ...
$ V4: chr " Autauga County, AL " " Baldwin County, AL " " Barbour County, AL " " Bibb County, AL " ...
$ V5: chr " 1990 " " 1990 " " 1990 " " 1990 " ...
$ V6: chr " 16,875 " " 46,773 " " 11,458 " " 7,408 " ...
$ V7: chr " 15,853 " " 44,492 " " 10,619 " " 6,776 " ...
$ V8: chr " 1,022 " " 2,281 " " 839 " " 632 " ...
$ V9: chr " 6.1" " 4.9" " 7.3" " 8.5" ...
dat[6:8] <- lapply( dat[6:8],
function(col) as.numeric( gsub("[,]", "", col)) )
> str(dat)
'data.frame': 3219 obs. of 9 variables:
$ V1: chr "CN0100100000000 " "CN0100300000000 " "CN0100500000000 " "CN0100700000000 " ...
$ V2: chr " 01 " " 01 " " 01 " " 01 " ...
$ V3: chr " 001 " " 003 " " 005 " " 007 " ...
$ V4: chr " Autauga County, AL " " Baldwin County, AL " " Barbour County, AL " " Bibb County, AL " ...
$ V5: chr " 1990 " " 1990 " " 1990 " " 1990 " ...
$ V6: num 16875 46773 11458 7408 19130 ...
$ V7: num 15853 44492 10619 6776 18001 ...
$ V8: num 1022 2281 839 632 1129 ...
$ V9: chr " 6.1" " 4.9" " 7.3" " 8.5" ...
dat[[9]] <- as.numeric( dat[[9]])
This could probably be improved by using some "NULL"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With