At my company, we are thinking of gradually phasing out SPSS in choice of R. During the transition though we'll still be having the data coming in SPSS data file format (.sav).
I'm having issues importing this SPSS datafile into R. When I import an SPSS file into R, I want to retain both the values and value labels for the variables. The read.spss()
function from foreign
package gives me option to retain either values OR value labels of a variable but not both.
AFAIK, R does allow factor variables to have values (levels) and value labels (level labels). I was just wondering if it's possible to somehow modify the read.spss()
function to incorporate this.
Alternatively, I came across spss.system.file()
function from memisc
package which supposedly allows this to happen, but it asks for a separate syntax file (codes.file), which is not necessarily available to me always.
Here's a sample data file.
I'd appreciate any help resolving this issue.
Thanks.
I do not know how to read in SPSS metadata; I usually read .csv files and add metadata back, or write a small one-off PERL script to do the job. What I wanted to mention is that a recently published R package, Rz, may assist you with bringing SPSS data into R. I have had a quick look at it and seems useful.
There is a solution to read SPSS data file in R by ODBC driver.
1) There is a IBM SPSS Statistics Data File Driver. I could not find the download link. I got it from my SPSS provider. The Standalone Driver is all you need. You do not need SPSS to install or use the driver.
2) Create a DSN for the SPSS data driver.
3) Using RODBC
package you can read in R any SPSS data file. It will be possible to get value labels for each variable as separate tables. Then it is possible to use the labels in R in any way as you wish.
Here is a working example on Windows (I do not have SPSS on my computer now) to read in R your example data file. I have not testted this on Linux. It probably works also on Linux, because there is a SPSS data driver also for Linux.
require(RODBC)
# Create connection
# Change the DSN name and CP_CONNECT_STRING according to your setting
con <- odbcDriverConnect("DSN=spss_ehsis;SDSN=SAVDB;HST=C:\\Program Files\\IBM\\SPSS\\StatisticsDataFileDriver\\20\\Standalone\\cfg\\oadm.ini;PRT=StatisticsSAVDriverStandalone;CP_CONNECT_STRING=C:\\temp\\data_expt.sav")
# List of tables
Tables <- sqlTables(con)
Tables
# List of table names to extract
table.names <- Tables$TABLE_NAME[Tables$TABLE_SCHEM != "SYSTEM"]
# Function to query a table by name
sqlQuery.tab.name <- function(table) {
sqlQuery(con, paste0("SELECT * FROM [", table, "]"))
}
# Retrieve all tables
Data <- lapply(table.names, sqlQuery.tab.name)
# See the data
lapply(Data, head)
# Close connection
close(con)
For example we can that value labels are defined for two variables:
[[5]]
VAR00002 VAR00002_label
1 1 Male
2 2 Female
[[6]]
VAR00003 VAR00003_label
1 2 Student
2 3 Employed
3 4 Unemployed
Additional information
Here is a function that allows to read SPSS data after the connection has been made to the SPSS data file. The function allows to specify the list of variables to be selected. If value.labels=T
the selected variables with value labels in SPSS data file are converted to the R factors with labels attached.
I have to say I am not satisfied with the performance of this solution. It work good for small data files. The RAM limit is reached quite often for large SPSS data files (even the subset of variables is selected).
get.spss <- function(channel, variables = NULL, value.labels = F) {
VarNames <- sqlQuery(channel = channel,
query = "SELECT VarName FROM [Variables]", as.is = T)$VarName
if (is.null(variables)) variables <- VarNames else {
if (any(!variables %in% VarNames)) stop("Wrong variable names")
}
if (value.labels) {
ValueLabelTableName <- sqlQuery(channel = channel,
query = "SELECT VarName FROM [Variables]
WHERE ValueLabelTableName is not null",
as.is = T)$VarName
ValueLabelTableName <- intersect(variables, ValueLabelTableName)
}
variables <- paste(variables, collapse = ", ")
data <- sqlQuery(channel = channel,
query = paste("SELECT", variables, "FROM [Cases]"),
as.is = T)
if (value.labels) {
for (var in ValueLabelTableName) {
VL <- sqlQuery(channel = channel,
query = paste0("SELECT * FROM [VLVAR", var,"]"),
as.is = T)
data[, var] <- factor(data[, var], levels = VL[, 1], labels = VL[, 2])
}
}
return(data)
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With