Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Concatenate values depending on variable value [closed]

Tags:

dataframe

r

I want to merge the information from 18 different databases providing information on the status of many species in different locations of the world. However, I only want to merge the rows where establishmentMeans for a specific taxonID and locationIDare both introduced and uncertain. For these rows, I would then like to concatenate the values in the remaining columns of the data frame: degreeOfEstablishment, pathway, and source.

I have been trying different solutions, for I end up either missing some values or adding values where they do not exist.

Here is an example of the data:

df1 <- data.frame(
  taxonID = c(1, 1, 1, 1, 2, 2, 2),
  locationID = c(1, 1, 1, 2, 3, 3, 4),
  establishmentMeans = c("introduced", "uncertain", "vagrant", "uncertain", "introduced", "uncertain", "introduced"),
  degreeOfEstablishment = c("established", "reproducing", NA, NA, "invasive", "failing", NA),
  pathway = c("releasedForUse", "otherEscape", NA, NA, "unaided", "unaided", NA),
  source = c("x", "y", "y", "x", "x", "x", "y"),
  stringsAsFactors = FALSE 
)
df2 <- data.frame(
  taxonID = c(1, 1, 2, 2),
  locationID = c(1, 2, 3, 5),
  establishmentMeans = c("native", "native", "native", "native"),
  source = c("z", "z", "z", "z"),
  stringsAsFactors = FALSE
)

My code:

# merge data 
dat <- merge(df1, df2, by = c("locationID", "taxonID","establishmentMeans"), all = TRUE)

# merge rows where a taxon is reported as introduced and uncertain in the same location
dat2 <- dat |> 
  group_by(locationID, taxonID) |> 
  mutate(
    establishmentMeans = if ("introduced" %in% establishmentMeans & "uncertain" %in% establishmentMeans) {
      "introduced; uncertain"
    } else {
      establishmentMeans
    }
  )

# merge the remaining information corresponding to the status of a species when introduced and uncertain
dat3 <- dat2 |> 
  group_by(locationID, taxonID, establishmentMeans) |> 
  mutate(
    across(
      c(starts_with("degreeOfEstablishment"), starts_with("pathway"), starts_with("source")), 
      ~ paste(unique(na.omit(.)), collapse = "; "),
      .names = "{.col}"
    )
  )|> 
  ungroup()

And how the output should look like

out <- data.frame(
  taxonID = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  locationID = c(1, 1, 1, 2, 2, 3, 3, 4, 5),
  establishmentMeans = c("introduced; uncertain", "native", "vagrant", "uncertain", "native", "introduced; uncertain", "native", "introduced", "native"),
  degreeOfEstablishment = c("established; reproducing", NA, NA, NA, NA, "invasive; failing", NA, NA, NA),
  pathway = c("releasedForUse; otherEscape", NA, NA, NA, NA, "unaided", NA, NA, NA),
  source = c("x; y", "z", "y", "z", "x", "x", "z", "y", "z"),
  stringsAsFactors = FALSE
)

The first step when merging the rows for a species reported as introduced and uncertain is already causing issues, as some values go missing. Another issue is that I am explicitly saying that I want to concatenate the values in the columns degreeOfEstablishment, pathway, and source. I should be concatenating all remaining values from the columns from both these rows, I am just not sure how to do this. Maybe using mutate(across(everything))? There are some of the original databases to merge which have columns that other databases do not have, so I am not sure if this would cause issues at some point.

EDIT: fixed expected output, keeping the row with taxonID=1, locationID=2, establishmentMeans="uncertain"

like image 763
msug Avatar asked May 16 '26 21:05

msug


1 Answers

Here's a base R solution, first using ave to flag the observations, next split and aggregate respective observations, and, finally, rbind and sorting.

tmp <- merge(df1, df2, all=TRUE) |> 
  transform(foo=ave(establishmentMeans, taxonID, locationID, FUN=\(x) {
    sum(grepl('introduced|uncertain', x))
  })) |> 
  split(~foo == 2 & grepl('introduced|uncertain', establishmentMeans))
tmp[['TRUE']] <- aggregate(. ~ taxonID + locationID, tmp[['TRUE']], \(x) {
  toString(unique(x))
})

do.call('rbind', tmp) |>
  sort_by(~list(taxonID, locationID, establishmentMeans)) |>
  subset(select=-foo)
#          taxonID locationID    establishmentMeans source    degreeOfEstablishment                     pathway
# TRUE.1         1          1 introduced, uncertain   x, y established, reproducing releasedForUse, otherEscape
# FALSE.2        1          1                native      z                     <NA>                        <NA>
# FALSE.4        1          1               vagrant      y                     <NA>                        <NA>
# FALSE.5        1          2                native      z                     <NA>                        <NA>
# FALSE.6        1          2             uncertain      x                     <NA>                        <NA>
# TRUE.2         2          3 introduced, uncertain      x        invasive, failing                     unaided
# FALSE.8        2          3                native      z                     <NA>                        <NA>
# FALSE.10       2          4            introduced      y                     <NA>                        <NA>
# FALSE.11       2          5                native      z                     <NA>                        <NA>

If you insist on the semicolon you can do paste(unique(x), collapse='; ') instead of the toString(unique(x)) part.

like image 161
jay.sf Avatar answered May 20 '26 02:05

jay.sf



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!