Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sorting an almost sorted dataset in SAS

I have a large dataset in SAS which I know is almost sorted; I know the first and second levels are sorted, but the third level is not. Furthermore, the first and second levels contain a large number of distinct values and so it is even less desirable to sort the first two columns again when I know it is already in the correct order. An example of the data is shown below:

ID  Label    Frequency
1   Jon      20
1   John     5
2   Mathieu  2
2   Mathhew  7
2   Matt     5
3   Nat      1
3   Natalie  4

Using the "presorted" option on a proc sort seems to only check if the data is sorted on every key, otherwise it does a full sort of the data. Is there any way to tell SAS that the first two columns are already sorted?

like image 580
Nick Avatar asked Nov 25 '14 16:11

Nick


1 Answers

If you've previously sorted the dataset by the first 2 variables, then regardless of the sortedby information on the dataset, SAS will take less CPU time to sort it *. This is a natural property of most decent sorting algorithms - it's much less work to sort something that's already nearly sorted.

* As long as you don't use the force option in the proc sort statement, which forces it to do redundant sorting.

Here's a little test I ran:

option fullstimer;
/*Make sure we have plenty of rows with the same 1 + 2 values, so that sorting by 1 + 2 doesn't imply that the dataset is already sorted by 1 + 2 + 3*/
data test;
    do _n_ = 1 to 10000000;
        var1 = round(rand('uniform'),0.0001);
        var2 = round(rand('uniform'),0.0001);
        var3 = round(rand('uniform'),0.0001);
        output;
    end;
run;

/*Sort by all 3 vars at once*/
proc sort data = test out = sort_all;
    by var1 var2 var3;
run;

/*Create a baseline dataset already sorted by 2/3 vars*/
/*N.B. proc sort adds sortedby information to the output dataset*/
proc sort data = test out = baseline;
    by var1 var2;
run;

/*Sort baseline by all 3 vars*/
proc sort data = baseline out = sort_3a;
    by var1 var2 var3;
run;

/*Remove sort information from baseline dataset (leaving the order of observations unchanged)*/
proc datasets lib = work nolist nodetails;
    modify baseline (sortedby = _NULL_);
    run;
quit;

/*Sort baseline dataset again*/
proc sort data = baseline out = sort_3b;
    by var1 var2 var3;
run;

The relevant results I got were as follows:

  • SAS took 8 seconds to sort the original completely unsorted dataset by all 3 variables.
  • SAS took 4 seconds to sort by 3/3 starting from the baseline dataset already sorted by 2/3 variables.
  • SAS took 4 seconds to sort by 3/3 starting from the same baseline dataset after removing the sort information from it.

The relevant metric from the log output is the amount of user CPU time.

Of course, if the almost-sorted dataset is very large and contains lots of other variables, you may wish to avoid the sort due to the write overhead when replacing it. Another approach you could take would be to create a composite index - this would allow you to do things involving by group processing, for example.

/*Alternative option - index the 2/3 sorted dataset on all 3 vars rather than sorting it*/
proc datasets lib = work nolist nodetails;
    /*Replace the sort information*/
    modify baseline(sortedby = var1 var2);
    run;
    /*Create composite index*/
    modify baseline;
    index create index1 = (var1 var2 var3);
    run;
quit;

Creating an index requires a read of the whole dataset, as does the sort, but only a fraction of the work involved in writing it out again, and might be faster than a 2/3 to 3/3 sort in some situations.

like image 164
user667489 Avatar answered Nov 22 '22 23:11

user667489