I have the following data set:
dat = structure(list(C86_1981 = c("Outer London", "Buckinghamshire",
NA, "Ross and Cromarty", "Cornwall and Isles of Scilly", NA,
"Kirkcaldy", "Devon", "Kent", "Renfrew"), C96_1981 = c("Outer London",
"Buckinghamshire", NA, "Ross and Cromarty", "Not known/missing",
NA, "Kirkcaldy", NA, NA, NA), C00_1981 = c("Outer London", "Inner London",
"Lancashire", "Ross and Cromarty", NA, "Humberside", "Kirkcaldy",
NA, NA, NA), C04_1981 = c("Kent", NA, NA, "Ross and Cromarty",
NA, "Humberside", "Not known/missing", NA, NA, "Renfrew"), C08_1981 = c("Kent",
"Oxfordshire", NA, "Ross and Cromarty", "Cornwall and Isles of Scilly",
"Humberside", "Dunfermline", NA, NA, "Renfrew"), C12_1981 = c("Kent",
NA, NA, "Ross and Cromarty", "Cornwall and Isles of Scilly",
"Humberside", "Dunfermline", NA, NA, "Renfrew")), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("C86_1981",
"C96_1981", "C00_1981", "C04_1981", "C08_1981", "C12_1981"))
I want to dplyr::count()
each column. Expected output:
# A tibble: 10 x 3
C86_1981 dat86_n dat96_n ...
<chr> <int> <int>
1 Buckinghamshire 1 1
2 Cornwall and Isles of Scilly 1 NA
3 Devon 1 NA
4 Kent 1 NA
5 Kirkcaldy 1 1
6 Outer London 1 1
7 Renfrew 1 NA
8 Ross and Cromarty 1 1
9 <NA> 2 5
10 Not known/missing NA 1
Currently I'm doing this manually then dplyr::full_join()
ing the result:
library("tidyverse")
dat86_n = dat %>%
count(C86_1981) %>%
rename(dat86_n = n)
dat96_n = dat %>%
count(C96_1981) %>%
rename(dat96_n = n)
# ...
dat_counts = dat86_n %>%
full_join(dat96_n, by = c("C86_1981" = "C96_1981"))
# ...
Which works, but is not exactly robust if any of my data changes later. I had hoped to do this programmatically.
I've tried a loop:
lapply(dat, count)
# Error in UseMethod("groups") :
# no applicable method for 'groups' applied to an object of class "character"
(purrr::map()
gives the same error). I think this error is because count()
expects a tbl
and a variable as separate arguments, so I tried that too:
lapply(dat, function(x) {
count(dat, x)
})
# Error in grouped_df_impl(data, unname(vars), drop) :
# Column `x` is unknown
Again, purrr::map()
gives the same error. I've also tried variants of summarise_all()
:
dat %>%
summarise_all(count)
# Error in summarise_impl(.data, dots) :
# Evaluation error: no applicable method for 'groups' applied to an object of class "character".
I feel like I'm missing something obvious and the solution should be straightforward. dplyr
solutions particularly welcome as this is what I tend to use most.
count() lets you quickly count the unique values of one or more variables: df %>% count(a, b) is roughly equivalent to df %>% group_by(a, b) %>% summarise(n = n()) .
The ncol() function in R programming That is, ncol() function returns the total number of columns present in the object.
count() function in dplyr can be used to count observations by multiple groups.
Using also the tidyr package, the following code will do the trick:
dat %>% tidyr::gather(name, city) %>% dplyr::group_by(name, city) %>% dplyr::count() %>% dplyr::ungroup %>% tidyr::spread(name, n)
Result:
# A tibble: 15 x 7
city C00_1981 C04_1981 C08_1981 C12_1981 C86_1981 C96_1981
* <chr> <int> <int> <int> <int> <int> <int>
1 Buckinghamshire NA NA NA NA 1 1
2 Cornwall and Isles of Scilly NA NA 1 1 1 NA
3 Devon NA NA NA NA 1 NA
4 Dunfermline NA NA 1 1 NA NA
5 Humberside 1 1 1 1 NA NA
6 Inner London 1 NA NA NA NA NA
7 Kent NA 1 1 1 1 NA
8 Kirkcaldy 1 NA NA NA 1 1
9 Lancashire 1 NA NA NA NA NA
10 Not known/missing NA 1 NA NA NA 1
11 Outer London 1 NA NA NA 1 1
12 Oxfordshire NA NA 1 NA NA NA
13 Renfrew NA 1 1 1 1 NA
14 Ross and Cromarty 1 1 1 1 1 1
15 <NA> 4 5 3 4 2 5
@You-leee just beat me to it ;)
Using the tidyverse;
library(tidyverse)
df <-
dat %>%
gather (year, county) %>%
group_by(year, county) %>%
summarise(no = n()) %>%
spread (year, no)
# A tibble: 15 x 7
county C00_1981 C04_1981 C08_1981 C12_1981 C86_1981 C96_1981
* <chr> <int> <int> <int> <int> <int> <int>
1 Buckinghamshire NA NA NA NA 1 1
2 Cornwall and Isles of Scilly NA NA 1 1 1 NA
3 Devon NA NA NA NA 1 NA
4 Dunfermline NA NA 1 1 NA NA
5 Humberside 1 1 1 1 NA NA
6 Inner London 1 NA NA NA NA NA
7 Kent NA 1 1 1 1 NA
8 Kirkcaldy 1 NA NA NA 1 1
9 Lancashire 1 NA NA NA NA NA
10 Not known/missing NA 1 NA NA NA 1
11 Outer London 1 NA NA NA 1 1
12 Oxfordshire NA NA 1 NA NA NA
13 Renfrew NA 1 1 1 1 NA
14 Ross and Cromarty 1 1 1 1 1 1
15 <NA> 4 5 3 4 2 5
The previous answera with gather +count+spread
work well, yet not for very large datasets (either large groups or many variables). Here is an alternative, using map-count + join
, on a very large data, it seems to be 2 times faster:
library(tidyverse)
N <- 1000000
df <- tibble(x1=sample(letters, N, replace = TRUE),
x2=sample(letters, N, replace = TRUE),
x3=sample(letters, N, replace = TRUE),
x4=sample(letters, N, replace = TRUE),
x5=sample(letters, N, replace = TRUE))
res1 <- map(c("x1", "x2", "x3", "x4", "x5"), function(x) select_at(df, x) %>% count(!!rlang::sym(x)) %>%
rename(value=!!rlang::sym(x),
!!rlang::sym(x):=n)) %>%
reduce(full_join, by = "value")
res2 <- df %>%
tidyr::gather(variable, value) %>%
dplyr::group_by(variable, value) %>%
dplyr::count() %>% dplyr::ungroup()%>%
tidyr::spread(variable, n)
all.equal(res1, res2)
#> [1] TRUE
library(microbenchmark)
microbenchmark(s1=map(c("x1", "x2", "x3", "x4", "x5"), function(x) select_at(df, x) %>% count(!!rlang::sym(x)) %>%
rename(value=!!rlang::sym(x),
!!rlang::sym(x):=n)) %>%
reduce(full_join, by = "value"),
s2= df %>%
tidyr::gather(variable, value) %>%
dplyr::group_by(variable, value) %>%
dplyr::count() %>% dplyr::ungroup()%>%
tidyr::spread(variable, n),
times = 50, check = "equal")
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> s1 214.9027 220.2292 241.8811 229.0913 242.2507 368.5147 50
#> s2 412.8934 447.5347 515.2612 528.0221 561.7649 692.5999 50
Created on 2020-05-19 by the reprex package (v0.3.0)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With