I have a database in which some of the observations have an identifier ident, and some not. I want to create a new database in which I have dropped the observations which are duplicates of my ident variable, but to keep the observations where ident is missing.
If I simply do a proc sort nodupkey
proc sort nodupkey data=have;
by ident;
run;
Then it also eliminates the missing values. Is there a simple way to do that (that is not break the dataset, proc sort nodupkey one partn, then assemble it again)
You have a couple of options when removing duplicates.
First off, dupout=<dataset> on the proc sort will send all of your duplicates to another dataset, and if you want to then do something with them you can. But this is a back-end version of your 'break the dataset', just probably faster as it only breaks the smaller part.
Simpler is to do the dedup yourself.
proc sort data=have;
by ident;
run;
data want;
set have;
by ident;
if (first.ident) or missing(ident);
run;
That keeps the first record for each ident, plus any record with ident missing.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With