I have four ID columns in my table (called MGIT): Ext_ID_1 through _4. Sometimes the same number is present in different rows in them and I need to filter these rows for further analysis, example:
Starting state:
Ext_ID Ext_ID_4
1111 2222
3333 4444
5555 1111
6666 7777
8888 9999
9999 1010
Desired filtered outcome:
Ext_ID Ext_ID_4
1111 2222
5555 1111
8888 9999
9999 1010
I would only like to compare them two at a time for clearer management.
Through previous questions in StackOverflow I have found the following code:
Dupl = MGIT[,c('Ext_ID','Ext_ID_4')]
Result <- MGIT[duplicated(Dupl) | duplicated(Dupl, fromLast=TRUE),]
But it's giving back duplicated only values inside the columns (actually, only inside Ext_ID, but I believe ID_4 doesn't have duplicated values within itself).
I am a beginner at programming and know nothing about the language.
I assume you want this to work for all your four columns, and not just the two in your example.
Here is a solution that will work no matter how many "Ext_ID" columns you have:
library(dplyr)
# Recreate the data from your example
df <- tibble::tribble(
~Ext_ID, ~Ext_ID_4,
1111, 2222,
3333, 4444,
5555, 1111,
6666, 7777,
8888, 9999,
9999, 1010
)
# The actual code you need - just replace `df` with the name of your table
df %>%
filter(
if_any(
starts_with("Ext_ID"),
~ .x %>% purrr::map_lgl(~sum(df == .x) > 1)
)
)
# The output:
#> # A tibble: 4 x 2
#> Ext_ID Ext_ID_4
#> <dbl> <dbl>
#> 1 1111 2222
#> 2 5555 1111
#> 3 8888 9999
#> 4 9999 1010
Explaination:
~ .x %>% purrr::map_lgl(~sum(df == .x) > 1) checks whether each value in a column shows up more than once in the dataframe.starts_with("Ext_ID") makes sure that this is done for all columns starting with "Ext_ID"if_any() makes filter() keep rows where there are at least 1 duplicate value in the df for any of the columns.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With