I have a data frame DF.
Say DF is:
A B
1 1 2
2 1 3
3 2 3
4 3 5
5 3 6
Now I want to combine together the rows by the column A and to have the sum of the column B.
For example:
A B
1 1 5
2 2 3
3 3 11
I am doing this currently using an SQL query with the sqldf function. But for some reason it is very slow. Is there any more convenient way to do that? I could do it manually too using a for loop but it is again slow. My SQL query is " Select A,Count(B) from DF group by A".
In general whenever I don't use vectorized operations and I use for loops the performance is extremely slow even for single procedures.
You can use DataFrame. duplicated () without any arguments to drop columns with the same values on all columns.
Find Duplicate Rows based on all columns To find & select the duplicate all rows based on all columns call the Daraframe. duplicate() without any subset argument. It will return a Boolean series with True at the place of each duplicated rows except their first occurrence (default value of keep argument is 'first').
Grouping by Multiple ColumnsYou can do this by passing a list of column names to groupby instead of a single string value.
This is a common question. In base, the option you're looking for is aggregate
. Assuming your data.frame
is called "mydf", you can use the following.
> aggregate(B ~ A, mydf, sum)
A B
1 1 5
2 2 3
3 3 11
I would also recommend looking into the "data.table" package.
> library(data.table)
> DT <- data.table(mydf)
> DT[, sum(B), by = A]
A V1
1: 1 5
2: 2 3
3: 3 11
Using dplyr
:
require(dplyr)
df <- data.frame(A = c(1, 1, 2, 3, 3), B = c(2, 3, 3, 5, 6))
df %>% group_by(A) %>% summarise(B = sum(B))
## Source: local data frame [3 x 2]
##
## A B
## 1 1 5
## 2 2 3
## 3 3 11
With sqldf
:
library(sqldf)
sqldf('SELECT A, SUM(B) AS B FROM df GROUP BY A')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With