Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

convert duplicates into NA

Tags:

r

duplicates

na

I would like to replace duplicated values per site to NA and to keep the first repeated value that appears from left to right.

For example, on site "Alz-Ettelbruck" the value "7" repeats in columns 4 and 5.That means the column 5 of only that site should return NA. The number 12 repeats in all columns on site "Our-Gemund/Vianden" so I would like to keep the 12 in column 2 but the rest should be replaced to NA.

For that I have used the duplicated function but it returns "NULL".

To reproduce the issue, I have used the following data frame and I have indicated at the end the desired output.

Any help would be truly appreciated. Thank you in advance.


df <- data.frame(stringsAsFactors = FALSE,
                 check.names = FALSE,
                        Site = c("Att-Bissen","Alz-Ettelbruck","Our-Gemund/Vianden",
                                 "Syre Felsmuhle/Mertert","Ernz Blanche-Larochette"),
  `2001-12-01 to 2021-12-01` = c(12, 1, 12, 1, 8),
  `1991-12-01.to 2021-12-01` = c(5, 4, 12, 6, 14),
  `1981-12-01 to 2021-12-01` = c(12, 7, 12, 20, 14),
  `1971-12-01 to 2021-12-01` = c(19, 7, 12, 13, 14))


# Replace repeated values with NA per row
data <- for (i in 1:nrow(df)) {
  df[i, -1][duplicated(df[i, -1])] <- NA
}

The following is what I would like the script to return:

enter image description here

like image 446
Drop Avatar asked Sep 03 '25 17:09

Drop


2 Answers

(Edit: the first version of both base-R and dplyr+tidyr code used duplicated, which would falsely-remove the 12 in row 1 column 4. It has been edited to fix that to not use duplicated.)

base R

A reduction comparing column to updated-column.

df[,-1] <- Reduce(
  function(prev, this) replace(this, is.na(prev) | this == prev, this[NA][1]),
  df[,-1], accumulate = TRUE)
df
#                      Site 2001-12-01 to 2021-12-01 1991-12-01.to 2021-12-01 1981-12-01 to 2021-12-01 1971-12-01 to 2021-12-01
# 1              Att-Bissen                       12                        5                       12                       19
# 2          Alz-Ettelbruck                        1                        4                        7                       NA
# 3      Our-Gemund/Vianden                       12                       NA                       NA                       NA
# 4  Syre Felsmuhle/Mertert                        1                        6                       20                       13
# 5 Ernz Blanche-Larochette                        8                       14                       NA                       NA

I hard-coded df[,-1] in both places, it could easily also be df[,2:5], it just needs to be the same in both places (LHS of <- and within the Reduce).

dplyr+tidyr

This loses some efficiency because it double pivots.

library(dplyr)
library(tidyr) # pivot_*
df %>%
  pivot_longer(cols = -Site) %>%
  arrange(Site, desc(name)) %>%
  mutate(.by = "Site", value = if_else(value == lag(value, default=-1L), value[NA], value)) %>%
  pivot_wider(id_cols = Site) %>%
  slice(match(Site, df$Site)) %>%
  select(match(names(.), names(df)))
# # A tibble: 5 × 5
#   Site                    `2001-12-01 to 2021-12-01` `1991-12-01.to 2021-12-01` `1981-12-01 to 2021-12-01` `1971-12-01 to 2021-12-01`
#   <chr>                                        <dbl>                      <dbl>                      <dbl>                      <dbl>
# 1 Att-Bissen                                      12                          5                         12                         19
# 2 Alz-Ettelbruck                                   1                          4                          7                         NA
# 3 Syre Felsmuhle/Mertert                           1                          6                         20                         13
# 4 Ernz Blanche-Larochette                          8                         14                         NA                         NA
# 5 Our-Gemund/Vianden                              12                         NA                         NA                         NA

One side-effect of the pivoting is that the order of rows and columns is not guaranteed to be restored, so I added the mostly-aesthetic slice(.) %>% select(.) to the end to marry up with your input data. (It is not at all required.)


Data

df <- structure(list(Site = c("Att-Bissen", "Alz-Ettelbruck", "Our-Gemund/Vianden", "Syre Felsmuhle/Mertert", "Ernz Blanche-Larochette"), "2001-12-01 to 2021-12-01" = c(12, 1, 12, 1, 8), "1991-12-01.to 2021-12-01" = c(5, 4, 12, 6, 14), "1981-12-01 to 2021-12-01" = c(12, 7, 12, 20, 14), "1971-12-01 to 2021-12-01" = c(19, 7, 12, 13, 14)), class = "data.frame", row.names = c(NA, -5L))
like image 126
r2evans Avatar answered Sep 05 '25 07:09

r2evans


You were close, df[i, -1] gives a list, so you want to unlist it.

> df2 <- df  ## copy to avoid overwriting
> for (i in 1:nrow(df2)) {
+   df2[i, -1][duplicated(unlist(df2[i, -1]))] <- NA
+ }
> df2
                     Site 2001-12-01 to 2021-12-01 1991-12-01.to 2021-12-01 1981-12-01 to 2021-12-01 1971-12-01 to 2021-12-01
1              Att-Bissen                       12                        5                       NA                       19
2          Alz-Ettelbruck                        1                        4                        7                       NA
3      Our-Gemund/Vianden                       12                       NA                       NA                       NA
4  Syre Felsmuhle/Mertert                        1                        6                       20                       13
5 Ernz Blanche-Larochette                        8                       14                       NA                       NA

However, if you consider only subsequent identical values as dupes as shown in your desired output, consider using diff.

> df2 <- df  ## copy to avoid overwriting
> for (i in 1:nrow(df2)) {
+   df2[i, -1][c(Inf, diff(unlist(df[i, -1]))) == 0] <- NA
+ }
> df2
                     Site 2001-12-01 to 2021-12-01 1991-12-01.to 2021-12-01 1981-12-01 to 2021-12-01 1971-12-01 to 2021-12-01
1              Att-Bissen                       12                        5                       12                       19
2          Alz-Ettelbruck                        1                        4                        7                       NA
3      Our-Gemund/Vianden                       12                       NA                       NA                       NA
4  Syre Felsmuhle/Mertert                        1                        6                       20                       13
5 Ernz Blanche-Larochette                        8                       14                       NA                       NA
like image 40
jay.sf Avatar answered Sep 05 '25 07:09

jay.sf