I have a large dataset in SAS which I know is almost sorted; I know the first and second levels are sorted, but the third level is not. Furthermore, the first and second levels contain a large number of distinct values and so it is even less desirable to sort the first two columns again when I know it is already in the correct order. An example of the data is shown below:
ID Label Frequency 1 Jon 20 1 John 5 2 Mathieu 2 2 Mathhew 7 2 Matt 5 3 Nat 1 3 Natalie 4
Using the "presorted" option on a proc sort seems to only check if the data is sorted on every key, otherwise it does a full sort of the data. Is there any way to tell SAS that the first two columns are already sorted?
If you've previously sorted the dataset by the first 2 variables, then regardless of the sortedby
information on the dataset, SAS will take less CPU time to sort it *. This is a natural property of most decent sorting algorithms - it's much less work to sort something that's already nearly sorted.
* As long as you don't use the force
option in the proc sort
statement, which forces it to do redundant sorting.
Here's a little test I ran:
option fullstimer;
/*Make sure we have plenty of rows with the same 1 + 2 values, so that sorting by 1 + 2 doesn't imply that the dataset is already sorted by 1 + 2 + 3*/
data test;
do _n_ = 1 to 10000000;
var1 = round(rand('uniform'),0.0001);
var2 = round(rand('uniform'),0.0001);
var3 = round(rand('uniform'),0.0001);
output;
end;
run;
/*Sort by all 3 vars at once*/
proc sort data = test out = sort_all;
by var1 var2 var3;
run;
/*Create a baseline dataset already sorted by 2/3 vars*/
/*N.B. proc sort adds sortedby information to the output dataset*/
proc sort data = test out = baseline;
by var1 var2;
run;
/*Sort baseline by all 3 vars*/
proc sort data = baseline out = sort_3a;
by var1 var2 var3;
run;
/*Remove sort information from baseline dataset (leaving the order of observations unchanged)*/
proc datasets lib = work nolist nodetails;
modify baseline (sortedby = _NULL_);
run;
quit;
/*Sort baseline dataset again*/
proc sort data = baseline out = sort_3b;
by var1 var2 var3;
run;
The relevant results I got were as follows:
The relevant metric from the log output is the amount of user CPU time.
Of course, if the almost-sorted dataset is very large and contains lots of other variables, you may wish to avoid the sort due to the write overhead when replacing it. Another approach you could take would be to create a composite index - this would allow you to do things involving by group processing, for example.
/*Alternative option - index the 2/3 sorted dataset on all 3 vars rather than sorting it*/
proc datasets lib = work nolist nodetails;
/*Replace the sort information*/
modify baseline(sortedby = var1 var2);
run;
/*Create composite index*/
modify baseline;
index create index1 = (var1 var2 var3);
run;
quit;
Creating an index requires a read of the whole dataset, as does the sort, but only a fraction of the work involved in writing it out again, and might be faster than a 2/3 to 3/3 sort in some situations.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With