Suppose I have a data frame such like:
df<-data.frame(A.1=1:5,B.1=2:6,C.1=3:7,
D.2=4:8,E.2=5:9,F.2=6:10)
df
A.1 B.1 C.1 D.2 E.2 F.2
1 1 2 3 4 5 6
2 2 3 4 5 6 7
3 3 4 5 6 7 8
4 4 5 6 7 8 9
5 5 6 7 8 9 10
What I wanted is to rowSums()
by a group vector which is the column names of df
without Letters (e.g. c(1,1,1,2,2,2)
) and the output would be:
1 2
[1,] 6 15
[2,] 9 18
[3,] 12 21
[4,] 15 24
[5,] 18 27
My real data set has more than 110K cols
from 18 groups and would find an elegant and easy way to realize it.
rowsum
can do this:
t(rowsum(t(df), c(1,1,1,2,2,2)))
1 2
[1,] 6 15
[2,] 9 18
[3,] 12 21
[4,] 15 24
[5,] 18 27
(For whatever reason, there is no colsum
.)
Because of the way data.frame
s are structured internally, row-wise operations are generally much slower than column-wise operations. Given your comment about how large this data.frame
actually is, I would probably use data.table
to convert it to long, isolate the group as its own variable, and perform a group-wise sum.
df <- data.frame(
A.1 = 1:5,
B.1 = 6:10,
C.2 = 11:15,
D.2 = 16:20
)
First, convert the data.frame
to data.table
using setDT
.
library(data.table)
setDT(df)
Then, add a row_number
column (:=
creates a new column; .N
is a special variable containing the number of rows in the table).
df[, row_number := 1:.N]
Convert this to a "long" data.table
, using row_number
as the unique ID column.
df_long <- melt(df, id.vars = "row_number")
df_long
#> row_number variable value
#> 1: 1 A.1 1
#> 2: 2 A.1 2
#> 3: 3 A.1 3
#> 4: 4 A.1 4
#> 5: 5 A.1 5
#> 6: 1 B.1 6
#> 7: 2 B.1 7
#> 8: 3 B.1 8
#> 9: 4 B.1 9
#> 10: 5 B.1 10
#> 11: 1 C.2 11
#> 12: 2 C.2 12
#> 13: 3 C.2 13
#> 14: 4 C.2 14
#> 15: 5 C.2 15
#> 16: 1 D.2 16
#> 17: 2 D.2 17
#> 18: 3 D.2 18
#> 19: 4 D.2 19
#> 20: 5 D.2 20
Create a new group
column that is everything after the "." in the new "variable" column (^.*?\\.
is everything from the beginning of the string to the first "."; gsub(pattern, "", variable)
removes pattern
from variable
).
df_long[, group := as.integer(gsub("^.*?\\.", "", variable))]
df_long
#> row_number variable value group
#> 1: 1 A.1 1 1
#> 2: 2 A.1 2 1
#> 3: 3 A.1 3 1
#> 4: 4 A.1 4 1
#> 5: 5 A.1 5 1
#> 6: 1 B.1 6 1
#> 7: 2 B.1 7 1
#> 8: 3 B.1 8 1
#> 9: 4 B.1 9 1
#> 10: 5 B.1 10 1
#> 11: 1 C.2 11 2
#> 12: 2 C.2 12 2
#> 13: 3 C.2 13 2
#> 14: 4 C.2 14 2
#> 15: 5 C.2 15 2
#> 16: 1 D.2 16 2
#> 17: 2 D.2 17 2
#> 18: 3 D.2 18 2
#> 19: 4 D.2 19 2
#> 20: 5 D.2 20 2
Finally, do sum(value)
grouping by row_number
and group
.
These sorts of operations are blisteringly fast in data.table
, significantly faster and more efficient than base R.
df_long[, sum(value), by = list(row_number, group)]
#> row_number group V1
#> 1: 1 1 7
#> 2: 2 1 9
#> 3: 3 1 11
#> 4: 4 1 13
#> 5: 5 1 15
#> 6: 1 2 27
#> 7: 2 2 29
#> 8: 3 2 31
#> 9: 4 2 33
#> 10: 5 2 35
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With