Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Number of Unique Obs by Variable in a Data Table

I have read in a large data file into R using the following command

data <- as.data.set(spss.system.file(paste(path, file, sep = '/')))

The data set contains columns which should not belong, and contain only blanks. This issue has to do with R creating new variables based on the variable labels attached to the SPSS file (Source).

Unfortunately, I have not been able to determine the options necessary to resolve the problem. I have tried all of: foreign::read.spss, memisc:spss.system.file, and Hemisc::spss.get, with no luck.

Instead, I would like to read in the entire data set (with ghost columns) and remove unnecessary variables manually. Since the ghost columns contain only blank spaces, I would like to remove any variables from my data.table where the number of unique observations is equal to one.

My data are large, so they are stored in data.table format. I would like to determine an easy way to check the number of unique observations in each column, and drop columns which contain only one unique observation.

require(data.table)

### Create a data.table
dt <- data.table(a = 1:10,
                 b = letters[1:10],
                 c = rep(1, times = 10))

### Create a comparable data.frame
df <- data.frame(dt)

### Expected result
unique(dt$a)

### Expected result
length(unique(dt$a))

However, I wish to calculate the number of obs for a large data file, so referencing each column by name is not desired. I am not a fan of eval(parse()).

### I want to determine the number of unique obs in
  # each variable, for a large list of vars
lapply(names(df), function(x) {
    length(unique(df[, x]))
})

### Unexpected result
length(unique(dt[, 'a', with = F]))  # Returns 1

It seems to me the problem is that

dt[, 'a', with = F]

returns an object of class "data.table". It makes sense that the length of this object is 1, since it is a data.table containing 1 variable. We know that data.frames are really just lists of variables, and so in this case the length of the list is just 1.

Here's pseudo code for how I would remedy the solution, using the data.frame way:

for (x in names(data)) {
  unique.obs <- length(unique(data[, x]))
  if (unique.obs == 1) {
    data[, x] <- NULL
  }
}

Any insight as to how I may more efficiently ask for the number of unique observations by column in a data.table would be much appreciated. Alternatively, if you can recommend how to drop observations if there is only one unique observation within a data.table would be even better.

like image 806
Andreas Avatar asked Apr 16 '13 02:04

Andreas


2 Answers

Update: uniqueN

As of version 1.9.6, there is a built in (optimized) version of this solution, the uniqueN function. Now this is as simple as:

dt[ , lapply(.SD, uniqueN)]

If you want to find the number of unique values in each column, something like

 dt[, lapply(.SD, function(x) length(unique(x)))]
##     a  b c
## 1: 10 10 1

To get your function to work you need to use with=FALSE within [.data.table, or simply use [[ instead (read fortune(312) as well...)

lapply(names(df) function(x) length(unique(dt[, x, with = FALSE])))

or

 lapply(names(df) function(x) length(unique(dt[[x]])))

will work

In one step

dt[,names(dt) := lapply(.SD, function(x) if(length(unique(x)) ==1) {return(NULL)} else{return(x)})]


 # or to avoid calling `.SD` 

dt[, Filter(names(dt), f = function(x) length(unique(dt[[x]]))==1) := NULL]
like image 128
mnel Avatar answered Sep 19 '22 01:09

mnel


The approaches in the other answers are good. Another way to add to the mix, just for fun :

for (i in names(DT)) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]

or if there may be duplicate column names :

for (i in ncol(DT):1) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]

NB: (i) on the LHS of := is a trick to use the value of i rather than a column named "i".

like image 32
Matt Dowle Avatar answered Sep 22 '22 01:09

Matt Dowle