dplyr::count() multiple columns

Tags: r, dplyr

I have the following data set:

dat = structure(list(C86_1981 = c("Outer London", "Buckinghamshire", 
NA, "Ross and Cromarty", "Cornwall and Isles of Scilly", NA, 
"Kirkcaldy", "Devon", "Kent", "Renfrew"), C96_1981 = c("Outer London", 
"Buckinghamshire", NA, "Ross and Cromarty", "Not known/missing", 
NA, "Kirkcaldy", NA, NA, NA), C00_1981 = c("Outer London", "Inner London", 
"Lancashire", "Ross and Cromarty", NA, "Humberside", "Kirkcaldy", 
NA, NA, NA), C04_1981 = c("Kent", NA, NA, "Ross and Cromarty", 
NA, "Humberside", "Not known/missing", NA, NA, "Renfrew"), C08_1981 = c("Kent", 
"Oxfordshire", NA, "Ross and Cromarty", "Cornwall and Isles of Scilly", 
"Humberside", "Dunfermline", NA, NA, "Renfrew"), C12_1981 = c("Kent", 
NA, NA, "Ross and Cromarty", "Cornwall and Isles of Scilly", 
"Humberside", "Dunfermline", NA, NA, "Renfrew")), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("C86_1981", 
"C96_1981", "C00_1981", "C04_1981", "C08_1981", "C12_1981"))

I want to dplyr::count() each column. Expected output:

# A tibble: 10 x 3
                       C86_1981 dat86_n dat96_n ...
                          <chr>   <int>   <int>
 1              Buckinghamshire       1       1
 2 Cornwall and Isles of Scilly       1      NA
 3                        Devon       1      NA
 4                         Kent       1      NA
 5                    Kirkcaldy       1       1
 6                 Outer London       1       1
 7                      Renfrew       1      NA
 8            Ross and Cromarty       1       1
 9                         <NA>       2       5
10            Not known/missing      NA       1

Currently I'm doing this manually then dplyr::full_join()ing the result:

library("tidyverse")

dat86_n = dat %>%
  count(C86_1981) %>%
  rename(dat86_n = n)
dat96_n = dat %>%
  count(C96_1981) %>%
  rename(dat96_n = n)
# ...

dat_counts = dat86_n %>%
  full_join(dat96_n, by = c("C86_1981" = "C96_1981"))
  # ...

Which works, but is not exactly robust if any of my data changes later. I had hoped to do this programmatically.

I've tried a loop:

lapply(dat, count)
# Error in UseMethod("groups") : 
# no applicable method for 'groups' applied to an object of class "character"

(purrr::map() gives the same error). I think this error is because count() expects a tbl and a variable as separate arguments, so I tried that too:

lapply(dat, function(x) {
  count(dat, x)
})
# Error in grouped_df_impl(data, unname(vars), drop) : 
# Column `x` is unknown

Again, purrr::map() gives the same error. I've also tried variants of summarise_all():

dat %>% 
  summarise_all(count)
  # Error in summarise_impl(.data, dots) : 
  # Evaluation error: no applicable method for 'groups' applied to an object of class "character".

I feel like I'm missing something obvious and the solution should be straightforward. dplyr solutions particularly welcome as this is what I tend to use most.

asked Sep 21 '17 08:09

Phil

3 Answers

If you also use the tidyr package, the following code will do the trick:

dat %>%
  tidyr::gather(name, city) %>%
  dplyr::group_by(name, city) %>%
  dplyr::count() %>%
  dplyr::ungroup() %>%
  tidyr::spread(name, n)

Result:

# A tibble: 15 x 7
                           city C00_1981 C04_1981 C08_1981 C12_1981 C86_1981 C96_1981
 *                        <chr>    <int>    <int>    <int>    <int>    <int>    <int>
 1              Buckinghamshire       NA       NA       NA       NA        1        1
 2 Cornwall and Isles of Scilly       NA       NA        1        1        1       NA
 3                        Devon       NA       NA       NA       NA        1       NA
 4                  Dunfermline       NA       NA        1        1       NA       NA
 5                   Humberside        1        1        1        1       NA       NA
 6                 Inner London        1       NA       NA       NA       NA       NA
 7                         Kent       NA        1        1        1        1       NA
 8                    Kirkcaldy        1       NA       NA       NA        1        1
 9                   Lancashire        1       NA       NA       NA       NA       NA
10            Not known/missing       NA        1       NA       NA       NA        1
11                 Outer London        1       NA       NA       NA        1        1
12                  Oxfordshire       NA       NA        1       NA       NA       NA
13                      Renfrew       NA        1        1        1        1       NA
14            Ross and Cromarty        1        1        1        1        1        1
15                         <NA>        4        5        3        4        2        5
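
In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). Below is a minimal sketch of the same reshape, count, reshape idea with the newer verbs. It assumes tidyr >= 1.0.0 and runs on a small made-up stand-in (toy and toy_counts are hypothetical names, not the question's data):

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in with the same shape as `dat`: character columns with NAs
toy <- tibble(
  C86_1981 = c("Kent", "Devon", NA),
  C96_1981 = c("Kent", NA, NA)
)

toy_counts <- toy %>%
  # Stack all columns into (name, city) pairs; NA values are kept by default
  pivot_longer(everything(), names_to = "name", values_to = "city") %>%
  # Count each (column, value) combination; the result is ungrouped
  count(name, city) %>%
  # One count column per original column; absent combinations become NA
  pivot_wider(names_from = name, values_from = n)

toy_counts
```

Replacing toy with dat should give one count column per original column, in the same spirit as the result above.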

answered Oct 21 '22 07:10

You-leee

@You-leee just beat me to it ;)

Using the tidyverse:

library(tidyverse)

df <-
  dat %>%
  gather(year, county) %>%
  group_by(year, county) %>%
  summarise(no = n()) %>%
  spread(year, no)

# A tibble: 15 x 7
                         county C00_1981 C04_1981 C08_1981 C12_1981 C86_1981 C96_1981
 *                        <chr>    <int>    <int>    <int>    <int>    <int>    <int>
 1              Buckinghamshire       NA       NA       NA       NA        1        1
 2 Cornwall and Isles of Scilly       NA       NA        1        1        1       NA
 3                        Devon       NA       NA       NA       NA        1       NA
 4                  Dunfermline       NA       NA        1        1       NA       NA
 5                   Humberside        1        1        1        1       NA       NA  
 6                 Inner London        1       NA       NA       NA       NA       NA
 7                         Kent       NA        1        1        1        1       NA
 8                    Kirkcaldy        1       NA       NA       NA        1        1
 9                   Lancashire        1       NA       NA       NA       NA       NA
10            Not known/missing       NA        1       NA       NA       NA        1
11                 Outer London        1       NA       NA       NA        1        1
12                  Oxfordshire       NA       NA        1       NA       NA       NA
13                      Renfrew       NA        1        1        1        1       NA
14            Ross and Cromarty        1        1        1        1        1        1
15                         <NA>        4        5        3        4        2        5
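
If zeros are easier to work with than NA for values that never occur in a column, spread() accepts a fill argument. This is a sketch of that variation on a made-up stand-in (toy and toy_wide are hypothetical names); it uses count() as a shorthand for group_by() plus summarise(n()), which also returns an ungrouped result:

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in with the same shape as `dat`
toy <- tibble(
  C86_1981 = c("Kent", "Devon", NA),
  C96_1981 = c("Kent", NA, NA)
)

toy_wide <- toy %>%
  gather(year, county) %>%
  # count() groups, tallies into column `n`, and ungroups in one step
  count(year, county) %>%
  # fill = 0L writes 0 where a county never appears in a given column
  spread(year, n, fill = 0L)

toy_wide
```

Note that this differs from the expected output in the question, which shows NA for absent values; drop the fill argument to keep that behaviour.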

answered Oct 21 '22 08:10

sorearm

The previous answers using gather + count + spread work well, but they become slow on very large datasets (either large groups or many variables). Here is an alternative using map + count + join; on a large data set (one million rows, five columns) it seems to be about twice as fast:

library(tidyverse)
N <-  1000000
df <- tibble(x1=sample(letters, N, replace = TRUE),
             x2=sample(letters, N, replace = TRUE),
             x3=sample(letters, N, replace = TRUE),
             x4=sample(letters, N, replace = TRUE),
             x5=sample(letters, N, replace = TRUE))


res1 <- map(c("x1", "x2", "x3", "x4", "x5"), function(x) {
    select_at(df, x) %>%
      count(!!rlang::sym(x)) %>%
      rename(value = !!rlang::sym(x),
             !!rlang::sym(x) := n)
  }) %>%
  reduce(full_join, by = "value")

res2 <- df %>% 
  tidyr::gather(variable, value) %>% 
  dplyr::group_by(variable, value) %>%
  dplyr::count() %>% dplyr::ungroup()%>%
  tidyr::spread(variable, n)

all.equal(res1, res2)
#> [1] TRUE

library(microbenchmark)
microbenchmark(s1=map(c("x1", "x2", "x3", "x4", "x5"), function(x) select_at(df, x) %>%  count(!!rlang::sym(x)) %>% 
                     rename(value=!!rlang::sym(x),
                            !!rlang::sym(x):=n)) %>% 
                 reduce(full_join, by = "value"),
               s2= df %>% 
                 tidyr::gather(variable, value) %>% 
                 dplyr::group_by(variable, value) %>%
                 dplyr::count() %>% dplyr::ungroup()%>%
                 tidyr::spread(variable, n),
               times = 50, check = "equal")
#> Unit: milliseconds
#>  expr      min       lq     mean   median       uq      max neval
#>    s1 214.9027 220.2292 241.8811 229.0913 242.2507 368.5147    50
#>    s2 412.8934 447.5347 515.2612 528.0221 561.7649 692.5999    50

Created on 2020-05-19 by the reprex package (v0.3.0)
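
The column names above are hard-coded. As a sketch of how the same map + count + join idea might be wrapped so that it follows whatever columns the data has, the helper below (count_each_column is a hypothetical name, not part of dplyr) iterates over names(data). It assumes dplyr >= 0.8.0 for the name argument of count() and the .data pronoun, and it assumes all columns share a type so the join key is compatible:

```r
library(dplyr)
library(purrr)

# Hypothetical helper: one count column per input column, joined on the value
count_each_column <- function(data) {
  names(data) %>%
    map(function(col) {
      # Count the column under a common key name `value`,
      # and store the counts in a column named after the original column
      count(data, value = .data[[col]], name = col)
    }) %>%
    # NA keys match each other by default in dplyr joins
    reduce(full_join, by = "value")
}

# Made-up stand-in with the same shape as the question's `dat`
toy <- tibble(
  C86_1981 = c("Kent", "Devon", NA),
  C96_1981 = c("Kent", NA, NA)
)

toy_joined <- count_each_column(toy)
toy_joined
```

With the question's data this would be called as count_each_column(dat), so added or renamed columns are picked up without editing the code.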

answered Oct 21 '22 09:10

Matifou
