What is the R equivalent of SQL "SELECT * FROM table GROUP BY c1, c2"?

I want to reduce my data frame (EDIT: in a CPU-efficient way) to rows with unique values of the pair c3, c4, while keeping all columns. In other words, I want to transform my data frame

> df <- data.frame(c1=seq(7), c2=seq(4, 10), c3=c("A", "B", "B", "C", "B", "A", "A"), c4=c(1, 2, 3, 3, 2, 2, 1))
  c1 c2 c3 c4
1  1  4  A  1
2  2  5  B  2
3  3  6  B  3
4  4  7  C  3
5  5  8  B  2
6  6  9  A  2
7  7 10  A  1

to the data frame

  c1 c2 c3 c4
1  1  4  A  1
2  2  5  B  2
3  3  6  B  3
4  4  7  C  3
6  6  9  A  2

where the values of c1 and c2 could be any value which occurs for a unique pair of c3, c4. Also the order of the resulting data frame is not of importance.

EDIT: My data frame has around 250 000 rows and 12 columns and should be grouped by 2 columns, so I need a CPU-efficient solution.

Working but unsatisfactory alternative

I solved this problem with

> library(sqldf)
> sqldf("Select * from df Group By c3, c4")

but in order to speed up and parallelize my program I have to eliminate the calls to sqldf.

EDIT: Currently the sqldf solution clocks in at about 3.5 seconds, which I consider decent. The problem is that I cannot run several queries in parallel, so I am searching for an alternative approach.

Not working attempts

duplicated()

> df[duplicated(df, by=c("c3", "c4")),]
[1] c1 c2 c3 c4
<0 rows> (or 0-length row.names)

selects only fully duplicated rows; it does not treat rows as duplicates when just the columns c3 and c4 match.
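That said, duplicated() does solve the problem when it is applied only to the key columns and negated. A minimal base-R sketch:

```r
df <- data.frame(c1=seq(7), c2=seq(4, 10),
                 c3=c("A", "B", "B", "C", "B", "A", "A"),
                 c4=c(1, 2, 3, 3, 2, 2, 1))

# keep the first row for each unique (c3, c4) pair; all columns are retained
df[!duplicated(df[c("c3", "c4")]), ]
#   c1 c2 c3 c4
# 1  1  4  A  1
# 2  2  5  B  2
# 3  3  6  B  3
# 4  4  7  C  3
# 6  6  9  A  2
```

Subsetting df to the two key columns first is what makes duplicated() compare only c3 and c4.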

aggregate()

> aggregate(df, by=list(df$c3, df$c4))
Error in match.fun(FUN) : argument "FUN" is missing, with no default

aggregate() requires a function FUN to be applied to all rows sharing the same values of c3 and c4.
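With an explicit FUN, aggregate() can be coaxed into doing this, e.g. by taking the first element of each non-key column per group via the formula interface (a sketch; likely not the fastest option for 250 000 rows):

```r
# one row per (c3, c4) pair; c1 and c2 take the first value in each group
aggregate(cbind(c1, c2) ~ c3 + c4, data = df, FUN = function(x) x[1])
```

Note that every non-key column has to be listed in the cbind() on the left-hand side of the formula.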

data.table's by

> library(data.table)
> dt <- data.table(df)
> dt[, list(c1, c2), by=list(c3, c4)]
    c3 c4 c1 c2
1:  A  1  1  4
2:  A  1  7 10
3:  B  2  2  5
4:  B  2  5  8
5:  B  3  3  6
6:  C  3  4  7
7:  A  2  6  9

does not kick out the rows which have non-unique values of c3 and c4, whereas

> dt[ ,length(c1), by=list(c3, c4)]
   c3 c4 V1
1:  A  1  2
2:  B  2  2
3:  B  3  1
4:  C  3  1
5:  A  2  1

discards the values of c1 and c2, collapsing each group to the single value produced by the passed function length.
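For completeness, reasonably recent data.table versions (an assumption about the installed version) also expose this directly: unique() on a data.table accepts a by argument and keeps the first row per key, with all columns intact:

```r
library(data.table)
dt <- data.table(df)
unique(dt, by = c("c3", "c4"))   # first row per (c3, c4) pair, all columns kept
```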

Willi Müller asked Nov 28 '14

1 Answer

Here is a data.table solution.

library(data.table)
setkey(setDT(df),c3,c4)   # convert df to a data.table and set the keys.
df[,.SD[1],by=list(c3,c4)]
#    c3 c4 c1 c2
# 1:  A  1  1  4
# 2:  A  2  6  9
# 3:  B  2  2  5
# 4:  B  3  3  6
# 5:  C  3  4  7

The SQL you propose seems to extract the first row having a given combination of (c3,c4) - I assume that's what you want.
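If you are open to packages beyond data.table (an alternative, not part of the original answer), dplyr expresses the same "first row per group" idea quite compactly:

```r
library(dplyr)

# one row per (c3, c4) combination, keeping all columns
df %>% group_by(c3, c4) %>% slice(1) %>% ungroup()
```

slice(1) on a grouped data frame returns the first row of each group, which matches the GROUP BY semantics assumed above.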


EDIT: Response to OP's comments.

The result you cite seems really odd. The benchmarks below, on a dataset with 12 columns and 2.5e5 rows, show that the data.table solution runs in about 25 milliseconds without setting keys, and in about 6 milliseconds with keys set.

set.seed(1)  # for reproducible example
df <- data.frame(c3=sample(LETTERS[1:10],2.5e5,replace=TRUE),
                 c4=sample(1:10,2.5e5,replace=TRUE),
                 matrix(sample(1:10,2.5e6,replace=TRUE),ncol=10))
library(data.table)
DT.1 <- as.data.table(df)
DT.2 <- as.data.table(df)
setkey(DT.2,c3,c4)
f.nokeys <- function() DT.1[,.SD[1],by=list(c3,c4)]
f.keys   <- function() DT.2[,.SD[1],by=list(c3,c4)]
library(microbenchmark)
microbenchmark(f.nokeys(),f.keys(),times=10)
# Unit: milliseconds
#        expr      min        lq    median        uq       max neval
#  f.nokeys() 23.73651 24.193129 24.609179 25.747767 26.181288    10
#    f.keys()  5.93546  6.207299  6.395041  6.733803  6.900224    10

In what ways is your dataset different from this one?

jlhoward answered Nov 11 '22