R - count all combinations

Tags: r, count, combinations

I would like to count all combinations in a data.frame.

The data look like this

   9 10 11 12
1  1  1  1  1
2  0  0  0  0
3  0  0  0  0
4  1  1  1  1
5  1  1  1  1
6  0  0  0  0
7  1  0  0  1
8  1  0  0  1
9  1  1  1  1
10 1  1  1  1

The output I want is simply

comb    n
1 1 1 1 5
0 0 0 0 3
1 0 0 1 2

Do you know of a simple function to do that?

Thanks

dt = structure(list(`9` = c(1, 0, 0, 1, 1, 0, 1, 1, 1, 1), `10` = c(1, 
0, 0, 1, 1, 0, 0, 0, 1, 1), `11` = c(1, 0, 0, 1, 1, 0, 0, 0, 
1, 1), `12` = c(1, 0, 0, 1, 1, 0, 1, 1, 1, 1)), .Names = c("9", 
"10", "11", "12"), class = "data.frame", row.names = c(NA, -10L
))

asked Dec 16 '15 12:12

giac

2 Answers

We can use either data.table or dplyr; both are very efficient. With data.table, we convert the 'data.frame' to a 'data.table' by reference (setDT(dt)), group by all the columns of 'dt' (names(dt)), and take the number of rows in each group (.N) as the 'Count'.

library(data.table)
# setDT() converts dt in place; .N is the number of rows in each group
setDT(dt)[, list(Count = .N), by = names(dt)]
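
As a minimal sketch of the same grouping, assuming you would rather not modify dt in place: as.data.table() makes a copy, and the result is then ordered by descending count. The names counts and n are my own choices for illustration, not part of the answer above.

```r
library(data.table)

# The question's data, rebuilt so the example is self-contained
dt <- data.frame(
  `9`  = c(1, 0, 0, 1, 1, 0, 1, 1, 1, 1),
  `10` = c(1, 0, 0, 1, 1, 0, 0, 0, 1, 1),
  `11` = c(1, 0, 0, 1, 1, 0, 0, 0, 1, 1),
  `12` = c(1, 0, 0, 1, 1, 0, 1, 1, 1, 1),
  check.names = FALSE
)

# as.data.table() copies, so dt itself stays a plain data.frame
counts <- as.data.table(dt)[, .(n = .N), by = names(dt)]

# Most frequent combination first
counts <- counts[order(-n)]
counts
```

Here .() is data.table's shorthand for list(), so the grouping is identical to the one-liner above; only the copy and the sort differ.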

Or we can take a similar approach with dplyr.

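library(dplyr)
# make.names() turns "9", "10", ... into syntactic names ("X9", "X10", ...)
names(dt) <- make.names(names(dt))
# group_by_() is deprecated in current dplyr releases
dt %>%
  group_by_(.dots = names(dt)) %>%
  summarise(count = n())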

Benchmarks

In case somebody wants to see some metrics (and to back up my earlier claim that these are efficient):

set.seed(24)
df1 <- as.data.frame(matrix(sample(0:1, 1e6*6, replace=TRUE), ncol=6))

akrunDT <- function() {
  as.data.table(df1)[, list(Count = .N), names(df1)]
}

akrunDplyr <- function() {
  df1 %>%
    group_by_(.dots = names(df1)) %>%
    summarise(count = n())
}

cathG <- function() {
  aggregate(cbind(n = 1:nrow(df1)) ~ ., df1, length)
}

docendoD <- function() {
  as.data.frame(table(comb = do.call(paste, df1)))
}

deena <- function() {
  table(apply(df1, 1, paste, collapse = ","))
}

Here are the microbenchmark results

library(microbenchmark)
microbenchmark(akrunDT(), akrunDplyr(), cathG(), docendoD(), deena(),
  unit = 'relative', times = 20L)
# Unit: relative
#          expr       min        lq      mean   median        uq        max neval  cld
#     akrunDT()  1.000000  1.000000  1.000000  1.00000  1.000000  1.0000000    20 a
#  akrunDplyr()  1.512354  1.523357  1.307724  1.45907  1.365928  0.7539773    20 a
#       cathG() 43.893946 43.592062 37.008677 42.10787 38.556726 17.9834245    20   c
#    docendoD() 18.778534 19.843255 16.560827 18.85707 17.296812  8.2688541    20  b
#       deena() 90.391417 89.449547 74.607662 85.16295 77.316143 34.6962954    20    d
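
For reference, here is a rough sketch of what the paste-and-table approach (the docendoD() function in the benchmark) does on the question's small data set, using base R only. The column name n and the sort step are my additions; the layout is close to the "comb n" output the question asks for.

```r
# The question's data, rebuilt so the example is self-contained
dt <- data.frame(
  `9`  = c(1, 0, 0, 1, 1, 0, 1, 1, 1, 1),
  `10` = c(1, 0, 0, 1, 1, 0, 0, 0, 1, 1),
  `11` = c(1, 0, 0, 1, 1, 0, 0, 0, 1, 1),
  `12` = c(1, 0, 0, 1, 1, 0, 1, 1, 1, 1),
  check.names = FALSE
)

# Paste each row into a single string key such as "1 0 0 1"
comb <- do.call(paste, dt)

# table() tallies identical keys; as.data.frame() gives one row per key
res <- as.data.frame(table(comb = comb),
                     responseName = "n",
                     stringsAsFactors = FALSE)

# Most frequent combination first
res <- res[order(-res$n), ]
res
```

This is convenient for small inputs, but building a string per row is part of why it is slower than the grouped approaches in the timings above.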

answered Oct 30 '22 01:10

akrun

The dplyr solution above could have been done more easily with group_by_all()...

dt %>% group_by_all %>% count

...which, as I understand it, has been superseded by the across() method. Add a bit of sorting and you get:

dt %>% group_by(across()) %>% count %>% arrange(desc(n))

> dt %>% group_by(across()) %>% count %>% arrange(desc(n))
# A tibble: 3 x 5
# Groups:   9, 10, 11, 12 [3]
    `9`  `10`  `11`  `12`     n
  <dbl> <dbl> <dbl> <dbl> <int>
1     1     1     1     1     5
2     0     0     0     0     3
3     1     0     0     1     2

You could then cast the result to a matrix if you wished.
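
A minimal sketch of that cast, assuming dplyr 1.0 or later. I have written across(everything()) rather than a bare across(), since I believe recent dplyr releases warn when across() is called without a column selection; the names res and m are just for illustration.

```r
library(dplyr)

# The question's data, rebuilt so the example is self-contained
dt <- data.frame(
  `9`  = c(1, 0, 0, 1, 1, 0, 1, 1, 1, 1),
  `10` = c(1, 0, 0, 1, 1, 0, 0, 0, 1, 1),
  `11` = c(1, 0, 0, 1, 1, 0, 0, 0, 1, 1),
  `12` = c(1, 0, 0, 1, 1, 0, 1, 1, 1, 1),
  check.names = FALSE
)

# Count each distinct row, most frequent first
res <- dt %>%
  group_by(across(everything())) %>%
  count() %>%
  arrange(desc(n))

# Drop the grouping before converting; all columns are numeric,
# so the result is a numeric matrix with the original column names plus n
m <- as.matrix(ungroup(res))
m
```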

answered Oct 30 '22 00:10

Mike Dolan Fliss
