
Read SAS sas7bdat data into R

Tags:

r

r-faq

People also ask

Can R read sas7bdat files?

In R, there are a couple of packages that can read SAS files into data frames. This post uses the R packages haven and sas7bdat, as well as the RStudio GUI, to load a SAS file.

How do I read .sas7bdat files?

To read a SAS dataset within SAS itself, just use a SET statement. You could first use a LIBNAME statement to define a libref that points to the folder containing the SAS dataset, and then use that libref in your code. libname learn '/home/XXX/learn/'; data Sales; set learn.

Why are SAS files so large?

A typical SAS dataset is made up of observations and variables. Calling a SAS dataset "large" implies that it contains many observations and variables, which increases its overall size.


sas7bdat worked fine for all but one of the files I was looking at (specifically, this one). When I reported the error to the sas7bdat developer, Matthew Shotwell, he pointed me in the direction of Hadley's haven package, which also has a read_sas function.

This method is superior for two reasons:

1) It didn't have any trouble reading the above-linked file.
2) It is much (I'm talking much) faster than read.sas7bdat.

Here's a quick benchmark (on this file, which is smaller than the others) for evidence:

library(microbenchmark)
library(sas7bdat)
library(haven)
microbenchmark(times=10L,
               read.sas7bdat("psu97ai.sas7bdat"),
               read_sas("psu97ai.sas7bdat"))

Unit: milliseconds
                              expr        min         lq       mean     median         uq        max neval cld
 read.sas7bdat("psu97ai.sas7bdat") 66696.2955 67587.7061 71939.7025 68331.9600 77225.1979 82836.8152    10   b
      read_sas("psu97ai.sas7bdat")   397.9955   402.2627   410.4015   408.5038   418.1059   425.2762    10  a 

That's right--haven::read_sas takes (on average) about 99.4% less time than sas7bdat::read.sas7bdat.
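
As a quick check of that percentage, the mean and median timings from the table above can be plugged into a few lines of R. The variable names are just illustrative; the numbers are copied from the benchmark output.

```r
# Timings in milliseconds, copied from the benchmark table above
mean_sas7bdat   <- 71939.7025
mean_haven      <- 410.4015
median_sas7bdat <- 68331.9600
median_haven    <- 408.5038

# Fraction of time saved by read_sas relative to read.sas7bdat
saving_mean   <- 1 - mean_haven / mean_sas7bdat
saving_median <- 1 - median_haven / median_sas7bdat

# Equivalent speed-up factor based on the means
speedup_mean <- mean_sas7bdat / mean_haven

round(100 * saving_mean, 1)    # about 99.4 (percent)
round(100 * saving_median, 1)  # about 99.4 (percent)
round(speedup_mean)            # about 175 (times faster)
```

Both the mean and the median give roughly the same saving, so the result does not hinge on the slow outlier runs of read.sas7bdat.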

minor update

I previously wasn't able to figure out whether the two methods produced the same data (i.e., that both have equal levels of fidelity with respect to reading the data), but have finally done so:

library(data.table)

# Keep as data.tables
sas7bdat <- setDT(read.sas7bdat("psu97ai.sas7bdat"))
haven <- setDT(read_sas("psu97ai.sas7bdat"))

# read.sas7bdat prefers strings as factors,
#   and as of now has no stringsAsFactors argument
#   with which to prevent this
idj_factor <- names(which(sapply(sas7bdat, is.factor)))

# Reset all factor columns as characters
sas7bdat[ , (idj_factor) := lapply(.SD, as.character), .SDcols = idj_factor]

# Check equality of the tables
all.equal(sas7bdat, haven, check.attributes = FALSE)
# [1] TRUE
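
Since the original file may not be at hand, here is a minimal base-R sketch of the same comparison on toy data. The two data frames are hypothetical stand-ins for the two imports (one with strings as factors, one with plain character columns), not output from either package.

```r
# Toy stand-ins for the two imports (hypothetical data, not the real file):
# one reader returns strings as factors, the other as character vectors
from_sas7bdat <- data.frame(id = c("A1", "B2", "C3"),
                            n  = c(10, 20, 30),
                            stringsAsFactors = TRUE)
from_haven    <- data.frame(id = c("A1", "B2", "C3"),
                            n  = c(10, 20, 30),
                            stringsAsFactors = FALSE)

# Find the factor columns in the table that actually contains factors
is_fct <- vapply(from_sas7bdat, is.factor, logical(1))

# Convert only those columns to character
from_sas7bdat[is_fct] <- lapply(from_sas7bdat[is_fct], as.character)

# Compare the values while ignoring attributes
same <- isTRUE(all.equal(from_sas7bdat, from_haven, check.attributes = FALSE))
same
```

The key point is that the factor check has to run on the table that holds the factors; otherwise no columns are converted and the comparison fails on type alone.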

However, note that read.sas7bdat has kept a massive list of attributes for the file, presumably a holdover from SAS:

str(sas7bdat)
# ...
# - attr(*, "column.info")=List of 70
#   ..$ :List of 12
#   .. ..$ name  : chr "NCESSCH"
#   .. ..$ offset: int 200
#   .. ..$ length: int 12
#   .. ..$ type  : chr "character"
#   .. ..$ format: chr "$"
#   .. ..$ fhdr  : int 0
#   .. ..$ foff  : int 76
#   .. ..$ flen  : int 1
#   .. ..$ label : chr "UNIQUE SCHOOL ID (NCES ASSIGNED)"
#   .. ..$ lhdr  : int 0
#   .. ..$ loff  : int 44
#   .. ..$ llen  : int 32
# ...

So, if by any chance you need these attributes (I know some people are particularly keen on the labels, for instance), perhaps read.sas7bdat is the option for you after all.
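
If the labels are what you're after, they can be pulled out of that attribute with a short loop. The sketch below uses a mock object that imitates the "column.info" structure shown in the str() output; the second column and its label are invented for illustration, and a real file may carry more fields per column.

```r
# Mock object imitating the "column.info" attribute shown above.
# Only the fields used here are filled in; the ENROLL entry is hypothetical.
mock <- data.frame(NCESSCH = "010000100100", ENROLL = 250)
attr(mock, "column.info") <- list(
  list(name = "NCESSCH", type = "character",
       label = "UNIQUE SCHOOL ID (NCES ASSIGNED)"),
  list(name = "ENROLL", type = "numeric",
       label = "HYPOTHETICAL ENROLLMENT COUNT")
)

# Extract the label of each column, falling back to NA if one is absent
info   <- attr(mock, "column.info")
labels <- vapply(info, function(col) {
  if (is.null(col$label)) NA_character_ else col$label
}, character(1))

# Name the result by column so labels can be looked up directly
names(labels) <- vapply(info, function(col) col$name, character(1))

labels[["NCESSCH"]]
```

With a real import, the same code should work by replacing mock with the object returned by read.sas7bdat, assuming the attribute has the layout shown above.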


As of January 18, 2018, the haven R package can load SAS and Stata datasets into the R environment. In R, simply run:

library(haven)
data <- read_sas("C:/temp/mysasdataset.sas7bdat")
View(data)
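
For large files it can help to look at just the first few rows before loading everything. The sketch below wraps read_sas in a small helper; preview_sas is a hypothetical name, and it assumes a haven version recent enough to support the n_max argument.

```r
# Hypothetical helper: read only the first rows of a SAS file, if it exists
preview_sas <- function(path, n = 10) {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(NULL)
  }
  # n_max limits how many rows read_sas() loads
  haven::read_sas(path, n_max = n)
}

# Path taken from the example above; adjust it to your own file
preview <- preview_sas("C:/temp/mysasdataset.sas7bdat", n = 5)
```

The helper returns NULL with a message when the file is missing, so a mistyped path shows up immediately.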

You can also load the data manually within RStudio. In the Environment pane, choose

Import Dataset > From SAS...

Select the file location and click "Import".


Problem

The problem appears to be that the files you're trying to use are poorly formatted. Specifically, blank cells are not coded as missing (R uses NA) but are simply left empty. When R tries to load the tab-delimited file, this creates problems because R thinks some rows have the wrong number of columns.

Workaround using SAS files

I've found a workaround by loading the SAS file using the sas7bdat package and then recoding blank cells ("") as NA:

install.packages("sas7bdat")
library(sas7bdat)
download.file("http://nces.ed.gov/ccd/Data/zip/ag121a_supp_sas.zip",
              destfile = "sas.zip")
unzip("sas.zip")
sas <- read.sas7bdat(file = "ag121a_supp.sas7bdat", debug = FALSE)
sas[sas == ""] <- NA

There are two issues with this method to be aware of, though:

  1. It's slow (see comments)
  2. The sas7bdat package is considered experimental by its author at the time of writing. It therefore might not load all SAS files, and I would thoroughly check the ones it does load for inconsistencies before use.
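
The recoding step from the workaround can be tried on a small example without downloading anything. The data frame below is a hypothetical stand-in for an imported SAS table, with empty strings where values are missing.

```r
# Toy data frame standing in for an imported SAS table (hypothetical values)
toy <- data.frame(id    = c("a", "", "c"),
                  state = c("", "NY", ""),
                  score = c(1, 2, 3),
                  stringsAsFactors = FALSE)

# Same recoding as in the workaround: every empty string becomes NA
toy[toy == ""] <- NA

# Count the missing values per column
colSums(is.na(toy))
```

Checking the per-column NA counts before and after is a cheap way to confirm that the recoding touched only the cells you expected.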

Non-R solution

It's not exactly canonical, but you could also download the tab-delimited files, open them in LibreOffice Calc (Microsoft Excel seems to screw things up), and use find and replace to replace every "" with NA.