Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creating a data.frame with all USA County Unemployment Data

Tags:

r

rvest

I am trying to create a data.frame with all USA County Unemployment Data from the US Bureau of Labor Statistics:

http://www.bls.gov/lau/#cntyaa

The data has one result per year from 1990 thru 2013.

I had originally planned to use the BLS API, but it looks like they consider each county to be a separate query and the total number of queries would be over their threshold. I now think it might be easier to simply scrape the data from the TXT files they put online, but I am having trouble parsing the data with R. Here is an example page with the data:

http://www.bls.gov/lau/laucnty90.txt # 90 = 1990

I originally tried parsing the file with the rvest package. But because the data is all in one <p> tag I think that there is not enough HTML structure to the table for that tool.

I then tried download.file and read.table. But again, the data seems to not be in the right format for those tools - extra lines at the top and bottom, and the "separating character" is just a space, which confuses R when county names contain spaces.

At the end of the day I just need a data.frame with 3 pieces of data from this file: state FIPS code, county FIPS code and unemployment rate.

I'm now thinking that the easiest way to create this data.frame might be to download the excel files, delete the columns I don't need, remove the extra text at the top and bottom, export to CSV, and then read it into R.

I can certainly do this for each of the 14 years. But I lose some reproducbility with this - other people won't be able to easily verify that I didn't make a mistake in the import process.

Does anyone see an easier way to create this data.frame?

like image 798
Ari Avatar asked Sep 11 '25 19:09

Ari


2 Answers

That's just a fixed width data file. I don't have time to give you perfect code, but tweaking this should give you what you need:

url = 'http://www.bls.gov/lau/laucnty90.txt'
w = c(17, 8, 6, 50, 12, 13, 12, 11) 
dat = read.fwf(url, w, skip=3)
like image 67
Vincent Avatar answered Sep 13 '25 11:09

Vincent


Need a "ruler" to figure out which columns to split at:

cat(">",paste0(rep(c(1:9,"+"),14),collapse=""))
cat(">",paste0(sprintf("%08s0/",1:14),collapse=""))
> 123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+123456789+
cat(">",paste0(sprintf("%08s0/",1:14),collapse=""))
> 000000010/000000020/000000030/000000040/000000050/000000060/000000070/000000080/000000090/000000100/000000110/000000120/000000130/000000140/
# and paste in the first line of data
> CN0100100000000   01     001   Autauga County, AL                                1990        16,875       15,853      1,022      6.1

That lets you figure out where to put the splits andtaking the differnse and shifting by one value gets the widths. Skip first six rows and then process the data read in as character to avoid the hassles of factors. Remove the commas before coercing to numeric.

> dat = read.fwf(url, 
                 widths=diff(c(0,16,21,29,80,86,100,115,125,132)+1), 
                 skip=6,colClasses="character")
> str(dat)
'data.frame':   3219 obs. of  9 variables:
 $ V1: chr  "CN0100100000000 " "CN0100300000000 " "CN0100500000000 " "CN0100700000000 " ...
 $ V2: chr  "  01 " "  01 " "  01 " "  01 " ...
 $ V3: chr  "    001 " "    003 " "    005 " "    007 " ...
 $ V4: chr  "  Autauga County, AL                               " "  Baldwin County, AL                               " "  Barbour County, AL                               " "  Bibb County, AL                                  " ...
 $ V5: chr  " 1990 " " 1990 " " 1990 " " 1990 " ...
 $ V6: chr  "       16,875 " "       46,773 " "       11,458 " "        7,408 " ...
 $ V7: chr  "      15,853   " "      44,492   " "      10,619   " "       6,776   " ...
 $ V8: chr  "   1,022  " "   2,281  " "     839  " "     632  " ...
 $ V9: chr  "    6.1" "    4.9" "    7.3" "    8.5" ...

dat[6:8] <- lapply( dat[6:8], 
                    function(col) as.numeric( gsub("[,]", "", col)) )

> str(dat)
'data.frame':   3219 obs. of  9 variables:
 $ V1: chr  "CN0100100000000 " "CN0100300000000 " "CN0100500000000 " "CN0100700000000 " ...
 $ V2: chr  "  01 " "  01 " "  01 " "  01 " ...
 $ V3: chr  "    001 " "    003 " "    005 " "    007 " ...
 $ V4: chr  "  Autauga County, AL                               " "  Baldwin County, AL                               " "  Barbour County, AL                               " "  Bibb County, AL                                  " ...
 $ V5: chr  " 1990 " " 1990 " " 1990 " " 1990 " ...
 $ V6: num  16875 46773 11458 7408 19130 ...
 $ V7: num  15853 44492 10619 6776 18001 ...
 $ V8: num  1022 2281 839 632 1129 ...
 $ V9: chr  "    6.1" "    4.9" "    7.3" "    8.5" ...

dat[[9]] <- as.numeric( dat[[9]])

This could probably be improved by using some "NULL"

like image 41
IRTFM Avatar answered Sep 13 '25 11:09

IRTFM