What is the R equivalent of SQL "SELECT * FROM table GROUP BY c1, c2"?

I want to reduce my data frame (EDIT: in a CPU-efficient way) to rows with unique values of the pair c3, c4, while keeping all columns. In other words, I want to transform my data frame

> df <- data.frame(c1=seq(7), c2=seq(4, 10), c3=c("A", "B", "B", "C", "B", "A", "A"), c4=c(1, 2, 3, 3, 2, 2, 1))
  c1 c2 c3 c4
1  1  4  A  1
2  2  5  B  2
3  3  6  B  3
4  4  7  C  3
5  5  8  B  2
6  6  9  A  2
7  7 10  A  1

to the data frame

  c1 c2 c3 c4
1  1  4  A  1
2  2  5  B  2
3  3  6  B  3
4  4  7  C  3
6  6  9  A  2

where the values of c1 and c2 could be any value which occurs for a unique pair of c3, c4. Also the order of the resulting data frame is not of importance.

EDIT: My data frame has around 250 000 rows and 12 columns and should be grouped by 2 columns, so I need a CPU-efficient solution.

Working but unsatisfactory alternative

I solved this problem with

> library(sqldf)
> sqldf("Select * from df Group By c3, c4")

but in order to speed up and parallelize my program I have to eliminate the calls to sqldf.

EDIT: Currently the sqldf solution clocks in at about 3.5 seconds, which I consider decent. The problem is that I cannot run several queries in parallel, so I am searching for an alternative approach.

Not working attempts

duplicated()

> df[duplicated(df, by=c("c3", "c4")),]
[1] c1 c2 c3 c4
<0 rows> (or 0-length row.names)

selects only fully duplicated rows; it does not treat rows as duplicates when just the columns c3 and c4 match.
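That said, duplicated() does solve the problem when it is applied only to the key columns and negated. A minimal base-R sketch:

```r
df <- data.frame(c1=seq(7), c2=seq(4, 10),
                 c3=c("A", "B", "B", "C", "B", "A", "A"),
                 c4=c(1, 2, 3, 3, 2, 2, 1))

# keep the first row for each unique (c3, c4) pair; all columns are retained
df[!duplicated(df[c("c3", "c4")]), ]
#   c1 c2 c3 c4
# 1  1  4  A  1
# 2  2  5  B  2
# 3  3  6  B  3
# 4  4  7  C  3
# 6  6  9  A  2
```

Subsetting df to the two key columns first is what makes duplicated() compare only c3 and c4.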

aggregate()

> aggregate(df, by=list(df$c3, df$c4))
Error in match.fun(FUN) : argument "FUN" is missing, with no default

aggregate() requires a function FUN to be applied to all rows sharing the same values of c3 and c4.
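With an explicit FUN, aggregate() can be coaxed into doing this, e.g. by taking the first element of each non-key column per group via the formula interface (a sketch; likely not the fastest option for 250 000 rows):

```r
# one row per (c3, c4) pair; c1 and c2 take the first value in each group
aggregate(cbind(c1, c2) ~ c3 + c4, data = df, FUN = function(x) x[1])
```

Note that every non-key column has to be listed in the cbind() on the left-hand side of the formula.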

data.table's by

> library(data.table)
> dt <- data.table(df)
> dt[, list(c1, c2), by=list(c3, c4)]
    c3 c4 c1 c2
1:  A  1  1  4
2:  A  1  7 10
3:  B  2  2  5
4:  B  2  5  8
5:  B  3  3  6
6:  C  3  4  7
7:  A  2  6  9

does not kick out the rows which have non-unique values of c3 and c4, whereas

> dt[ ,length(c1), by=list(c3, c4)]
   c3 c4 V1
1:  A  1  2
2:  B  2  2
3:  B  3  1
4:  C  3  1
5:  A  2  1

discards the values of c1 and c2, collapsing each group to the single value produced by the passed function length.
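For completeness, reasonably recent data.table versions (an assumption about the installed version) also expose this directly: unique() on a data.table accepts a by argument and keeps the first row per key, with all columns intact:

```r
library(data.table)
dt <- data.table(df)
unique(dt, by = c("c3", "c4"))   # first row per (c3, c4) pair, all columns kept
```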

Willi Müller asked Nov 28 '14

1 Answer

Here is a data.table solution.

library(data.table)
setkey(setDT(df),c3,c4)   # convert df to a data.table and set the keys.
df[,.SD[1],by=list(c3,c4)]
#    c3 c4 c1 c2
# 1:  A  1  1  4
# 2:  A  2  6  9
# 3:  B  2  2  5
# 4:  B  3  3  6
# 5:  C  3  4  7

The SQL you propose seems to extract the first row having a given combination of (c3,c4) - I assume that's what you want.
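If you are open to packages beyond data.table (an alternative, not part of the original answer), dplyr expresses the same "first row per group" idea quite compactly:

```r
library(dplyr)

# one row per (c3, c4) combination, keeping all columns
df %>% group_by(c3, c4) %>% slice(1) %>% ungroup()
```

slice(1) on a grouped data frame returns the first row of each group, which matches the GROUP BY semantics assumed above.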


EDIT: Response to OP's comments.

The result you cite seems really odd. The benchmarks below, on a dataset with 12 columns and 2.5e5 rows, show that the data.table solution runs in about 25 milliseconds without setting keys, and in about 6 milliseconds with keys set.

set.seed(1)  # for reproducible example
df <- data.frame(c3=sample(LETTERS[1:10],2.5e5,replace=TRUE),
                 c4=sample(1:10,2.5e5,replace=TRUE),
                 matrix(sample(1:10,2.5e6,replace=TRUE),ncol=10))
library(data.table)
DT.1 <- as.data.table(df)
DT.2 <- as.data.table(df)
setkey(DT.2,c3,c4)
f.nokeys <- function() DT.1[,.SD[1],by=list(c3,c4)]
f.keys   <- function() DT.2[,.SD[1],by=list(c3,c4)]
library(microbenchmark)
microbenchmark(f.nokeys(),f.keys(),times=10)
# Unit: milliseconds
#        expr      min        lq    median        uq       max neval
#  f.nokeys() 23.73651 24.193129 24.609179 25.747767 26.181288    10
#    f.keys()  5.93546  6.207299  6.395041  6.733803  6.900224    10

In what ways is your dataset different from this one?

jlhoward answered Nov 11 '22