I want to merge the information from 18 different databases providing information on the status of many species in different locations of the world. However, I only want to merge the rows where establishmentMeans for a specific taxonID and locationIDare both introduced and uncertain. For these rows, I would then like to concatenate the values in the remaining columns of the data frame: degreeOfEstablishment, pathway, and source.
I have been trying different solutions, for I end up either missing some values or adding values where they do not exist.
Here is an example of the data:
df1 <- data.frame(
taxonID = c(1, 1, 1, 1, 2, 2, 2),
locationID = c(1, 1, 1, 2, 3, 3, 4),
establishmentMeans = c("introduced", "uncertain", "vagrant", "uncertain", "introduced", "uncertain", "introduced"),
degreeOfEstablishment = c("established", "reproducing", NA, NA, "invasive", "failing", NA),
pathway = c("releasedForUse", "otherEscape", NA, NA, "unaided", "unaided", NA),
source = c("x", "y", "y", "x", "x", "x", "y"),
stringsAsFactors = FALSE
)
df2 <- data.frame(
taxonID = c(1, 1, 2, 2),
locationID = c(1, 2, 3, 5),
establishmentMeans = c("native", "native", "native", "native"),
source = c("z", "z", "z", "z"),
stringsAsFactors = FALSE
)
My code:
# merge data
dat <- merge(df1, df2, by = c("locationID", "taxonID","establishmentMeans"), all = TRUE)
# merge rows where a taxon is reported as introduced and uncertain in the same location
dat2 <- dat |>
group_by(locationID, taxonID) |>
mutate(
establishmentMeans = if ("introduced" %in% establishmentMeans & "uncertain" %in% establishmentMeans) {
"introduced; uncertain"
} else {
establishmentMeans
}
)
# merge the remaining information corresponding to the status of a species when introduced and uncertain
dat3 <- dat2 |>
group_by(locationID, taxonID, establishmentMeans) |>
mutate(
across(
c(starts_with("degreeOfEstablishment"), starts_with("pathway"), starts_with("source")),
~ paste(unique(na.omit(.)), collapse = "; "),
.names = "{.col}"
)
)|>
ungroup()
And how the output should look like
out <- data.frame(
taxonID = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
locationID = c(1, 1, 1, 2, 2, 3, 3, 4, 5),
establishmentMeans = c("introduced; uncertain", "native", "vagrant", "uncertain", "native", "introduced; uncertain", "native", "introduced", "native"),
degreeOfEstablishment = c("established; reproducing", NA, NA, NA, NA, "invasive; failing", NA, NA, NA),
pathway = c("releasedForUse; otherEscape", NA, NA, NA, NA, "unaided", NA, NA, NA),
source = c("x; y", "z", "y", "z", "x", "x", "z", "y", "z"),
stringsAsFactors = FALSE
)
The first step when merging the rows for a species reported as introduced and uncertain is already causing issues, as some values go missing.
Another issue is that I am explicitly saying that I want to concatenate the values in the columns degreeOfEstablishment, pathway, and source. I should be concatenating all remaining values from the columns from both these rows, I am just not sure how to do this. Maybe using mutate(across(everything))? There are some of the original databases to merge which have columns that other databases do not have, so I am not sure if this would cause issues at some point.
EDIT: fixed expected output, keeping the row with taxonID=1, locationID=2, establishmentMeans="uncertain"
Here's a base R solution, first using ave to flag the observations, next split and aggregate respective observations, and, finally, rbind and sorting.
tmp <- merge(df1, df2, all=TRUE) |>
transform(foo=ave(establishmentMeans, taxonID, locationID, FUN=\(x) {
sum(grepl('introduced|uncertain', x))
})) |>
split(~foo == 2 & grepl('introduced|uncertain', establishmentMeans))
tmp[['TRUE']] <- aggregate(. ~ taxonID + locationID, tmp[['TRUE']], \(x) {
toString(unique(x))
})
do.call('rbind', tmp) |>
sort_by(~list(taxonID, locationID, establishmentMeans)) |>
subset(select=-foo)
# taxonID locationID establishmentMeans source degreeOfEstablishment pathway
# TRUE.1 1 1 introduced, uncertain x, y established, reproducing releasedForUse, otherEscape
# FALSE.2 1 1 native z <NA> <NA>
# FALSE.4 1 1 vagrant y <NA> <NA>
# FALSE.5 1 2 native z <NA> <NA>
# FALSE.6 1 2 uncertain x <NA> <NA>
# TRUE.2 2 3 introduced, uncertain x invasive, failing unaided
# FALSE.8 2 3 native z <NA> <NA>
# FALSE.10 2 4 introduced y <NA> <NA>
# FALSE.11 2 5 native z <NA> <NA>
If you insist on the semicolon you can do paste(unique(x), collapse='; ') instead of the toString(unique(x)) part.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With