I have a very large (too big to open in Excel) biological dataset that looks something like this
year <- c(1990, 1980, 1985, 1980, 1990, 1990, 1980, 1985, 1985,1990,
1980, 1985, 1980, 1990, 1990, 1980, 1985, 1985,
1990, 1980, 1985, 1980, 1990, 1990, 1980, 1985, 1985)
species <- c('A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'A','A', 'A',
'B', 'B', 'B', 'C', 'C', 'C', 'A', 'A', 'A', 'B', 'B', 'B',
'C', 'C', 'C', 'A')
region <- c(1, 1, 1, 3, 2, 3, 3, 2, 1, 1, 3, 3, 3, 2, 2, 1, 1, 1,1, 3, 3,
3, 2, 2, 1, 1, 1)
df <- data.frame(year, species, region)
df
year species region
1 1990 A 1
2 1980 A 1
3 1985 B 1
4 1980 B 3
5 1990 B 2
6 1990 C 3
7 1980 C 3
8 1985 C 2
9 1985 A 1
10 1990 A 1
11 1980 A 3
12 1985 B 3
13 1980 B 3
14 1990 B 2
15 1990 C 2
16 1980 C 1
17 1985 C 1
18 1985 A 1
19 1990 A 1
20 1980 A 3
21 1985 B 3
22 1980 B 3
23 1990 B 2
24 1990 C 2
25 1980 C 1
26 1985 C 1
27 1985 A 1
What I am looking to do is figure out how many of each species (A, B, or C) exist in each region (1, 2, or 3) in each of the three years I have (1980, 1985, or 1990).
I'm looking to end up with a dataset that looks something along the lines of this,
region A_1980 B_1980 C_1980 A_1985 B_1985 C_1985 A_1990 B_1990 C_1990
1 1 0 0 0 0 0 0 0 0 0
2 2 1 1 1 1 1 1 1 1 1
3 3 2 2 2 2 2 2 2 2 2
such that each row represents a region, and each column represents the count of each species, in a particular year. I've tried to do this using the spread
function in conjunction with the group_by
dplyr function, but I couldn't get it to do anything close to what I want.
Does anyone have any suggestions?
Something like this?
library(dplyr)
df2 <- df %>%
mutate(sp_year = paste(species, year, sep = "_")) %>%
group_by(region) %>%
count(sp_year) %>%
spread(sp_year,n)
df2
Which gives this:
# A tibble: 3 x 10
# Groups: region [3]
region A_1980 A_1985 A_1990 B_1980 B_1985 B_1990 C_1980 C_1985 C_1990
<dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 1 3 3 NA 1 NA 2 2 NA
2 2 NA NA NA NA NA 3 NA 1 2
3 3 2 NA NA 3 2 NA 1 NA 1
Similar to wl1234's answer but more concise. We can use unite
to combine columns. We can also use count
without group_by
the variable. Finally, we can set fill = 0
in the spread
function to replace NA
with 0.
library(tidyverse)
df2 <- df %>%
unite(sp_year, species, year, sep = "_") %>%
count(sp_year, region) %>%
spread(sp_year, n, fill = 0)
df2
# # A tibble: 3 x 10
# region A_1980 A_1985 A_1990 B_1980 B_1985 B_1990 C_1980 C_1985 C_1990
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 3 3 0 1 0 2 2 0
# 2 2 0 0 0 0 0 3 0 1 2
# 3 3 2 0 0 3 2 0 1 0 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With