As a follow-up to this question: Remove duplicated rows using dplyr, I have the following question:
How do you randomly remove duplicated rows using dplyr (among other packages)?
My command now is:
data.uniques <- distinct(data, KEYVARIABLE, .keep_all = TRUE)
But it returns the first occurrence of each KEYVARIABLE. I want that behaviour to be random, so that any one of the 1 to n occurrences of that KEYVARIABLE can be the one that is kept.
For instance:
KEYVARIABLE BMI
1 24.2
2 25.3
2 23.2
3 18.9
4 19
4 20.1
5 23.0
Currently my command returns:
KEYVARIABLE BMI
1 24.2
2 25.3
3 18.9
4 19
5 23.0
I want it to randomly return one of the n duplicated rows, for instance:
KEYVARIABLE BMI
1 24.2
2 23.2
3 18.9
4 19
5 23.0
One option would be to group by 'KEYVARIABLE', sample the sequence of rows within each group to pick one row, and subset the dataset:
library(data.table)
# Group by KEYVARIABLE and keep one randomly sampled row of .SD per group
setDT(df1)[, .SD[sample(.N)[1]], KEYVARIABLE]
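A minimal reproducible sketch of this, rebuilding the question's example data as df1 (which of the duplicated rows is kept will vary between runs):
library(data.table)

# Example data copied from the question's table
df1 <- data.frame(
  KEYVARIABLE = c(1, 2, 2, 3, 4, 4, 5),
  BMI         = c(24.2, 25.3, 23.2, 18.9, 19, 20.1, 23.0)
)

# Same call as above; which duplicated row survives changes from run to run
setDT(df1)[, .SD[sample(.N)[1]], KEYVARIABLE]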
Or using dplyr
library(dplyr)
df1 %>%
  group_by(KEYVARIABLE) %>%
  sample_n(1)  # draw one random row from each KEYVARIABLE group
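As a side note, sample_n() has been superseded since dplyr 1.0.0; slice_sample() is its current equivalent, so the same idea can be sketched as:
library(dplyr)

df1 %>%
  group_by(KEYVARIABLE) %>%
  slice_sample(n = 1) %>%  # one random row per KEYVARIABLE group
  ungroup()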
Just shuffle the rows before selecting the first occurrence (using distinct).
library(dplyr)
# Shuffle the rows, then keep the (now random) first occurrence per KEYVARIABLE
distinct(df[sample(1:nrow(df)), ], KEYVARIABLE, .keep_all = TRUE)
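If the result needs to be reproducible (the same rows kept on every run), set a seed before sampling; a minimal sketch, assuming df holds the example data above:
set.seed(123)  # arbitrary seed (an assumption); makes the random shuffle reproducible
distinct(df[sample(1:nrow(df)), ], KEYVARIABLE, .keep_all = TRUE)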