
data.frame Group By column [duplicate]

Tags: r, aggregate

I have a data frame DF.

Say DF is:

  A B
1 1 2
2 1 3
3 2 3
4 3 5
5 3 6 

Now I want to combine the rows that share the same value in column A and get the sum of column B for each group.

For example:

  A B
1 1 5
2 2 3
3 3 11

I am currently doing this with an SQL query through the sqldf function, but for some reason it is very slow. Is there a more convenient way to do this? I could also do it manually with a for loop, but that is slow too. My SQL query is "Select A,Count(B) from DF group by A".

In general, whenever I use for loops instead of vectorized operations, performance is extremely slow, even for simple procedures.

nikosdi asked Sep 14 '13 08:09




2 Answers

This is a common question. In base R, the function you're looking for is aggregate. Assuming your data.frame is called "mydf", you can use the following.

> aggregate(B ~ A, mydf, sum)
  A  B
1 1  5
2 2  3
3 3 11
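
To try this without an existing data frame, here is a minimal self-contained sketch. It rebuilds the example data from the question under the name "mydf" and, as an aside that is not part of the original answer, shows two other base R functions (tapply and rowsum) that can compute the same grouped sums. The variable names res, by_group and sums are only illustrative.

```r
# Rebuild the example data from the question.
mydf <- data.frame(A = c(1, 1, 2, 3, 3), B = c(2, 3, 3, 5, 6))

# Formula interface: sum B within each value of A; returns a data.frame.
res <- aggregate(B ~ A, data = mydf, FUN = sum)

# tapply() returns a named vector (names are the values of A).
by_group <- tapply(mydf$B, mydf$A, sum)

# rowsum() is specialized for grouped sums; it returns a
# one-column matrix whose row names are the values of A.
sums <- rowsum(mydf$B, group = mydf$A)

print(res)
```

All three give the same totals; aggregate is the only one of them that returns a data.frame with A as a regular column.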

I would also recommend looking into the "data.table" package.

> library(data.table)
> DT <- data.table(mydf)
> DT[, sum(B), by = A]
   A V1
1: 1  5
2: 2  3
3: 3 11
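
The result column above is auto-named V1. If you would rather keep the name B, one possible variant is to name the summary inside .(), as in this small sketch. It assumes the data.table package is installed and rebuilds the example data so it runs on its own; the name res is only illustrative.

```r
library(data.table)

# Rebuild the example data from the question and convert it.
mydf <- data.frame(A = c(1, 1, 2, 3, 3), B = c(2, 3, 3, 5, 6))
DT <- as.data.table(mydf)

# .(B = sum(B)) names the aggregated column B instead of V1.
res <- DT[, .(B = sum(B)), by = A]

print(res)
```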
A5C1D2H2I1M1N2O1R2T1 answered Oct 21 '22 22:10


Using dplyr:

library(dplyr)
df <- data.frame(A = c(1, 1, 2, 3, 3), B = c(2, 3, 3, 5, 6))
df %>% group_by(A) %>% summarise(B = sum(B))

## Source: local data frame [3 x 2]
## 
##   A  B
## 1 1  5
## 2 2  3
## 3 3 11

With sqldf:

library(sqldf)
sqldf('SELECT A, SUM(B) AS B FROM df GROUP BY A')
mpalanco answered Oct 22 '22 00:10