The question is really straight forward, which one is faster?
Considering we are using data step with two datasets in the set statement and the datasets have the same variables in them.
From What Ive heard and read, if we subset them using the same condition, say date = "10jan2014"d, the result will be exactly the same using the statement above or the following data set option in the two datasets (where=(date="10jan2014"d)). Because the where is executed before anything comes to the PDV.
Is that right?
For better understanding of the question, I created the following code:
Lets say we have these two datasets.
data people1;
format birth date9.;
input name $ birth :date9.;
datalines;
John 18jan1980
Mary 20feb1980
;
run;
data people2;
format birth date9.;
input name $ birth :date9.;
datalines;
Peter 18mar1980
Judas 18jan1980
;
run;
I want to create a new dataset with those two, but only for people with birth date equals to 18jan1980. Now I can use the where statement or the where data set option.
Using the where (data set option):
data everybody1;
set people1 (where=(birth="18jan1980"d))
people2 (where=(birth="18jan1980"d));
run;
Using the where statement:
data everybody2;
set people1
people2;
where birth="18jan1980"d;
run;
The new datasets now have the exact same output. But whith one is faster?
Thanks.
They are identical in terms of performance. And yes, the where will only allow matching results to be loaded to the PDV.
According to the documentation (http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000202951.htm):
The WHERE statement selects observations before they are brought into the program data vector, making it a more efficient programming technique.
The above quote is talking about where
vs if
statements but it does confirm that where
clauses don't bring the data into the PDV.
EDIT : Thanks to Joe's link, I'm going to provide a quote a better excerpt from the docs (http://support.sas.com/documentation/cdl/en/lrcon/67885/HTML/default/viewer.htm#p04fy20d8il3nfn1ssywe4f25k27.htm):
You can use a WHERE expression in both a DATA step and SAS procedures, as well as in a windowing environment, SCL programs, and as a data set option. A WHERE expression tests the condition before an observation is read into the PDV. If the condition is true, the observation is read into the PDV and processed. If the condition is false, the observation is not read into the PDV, and processing continues with the next observation. This can yield substantial savings when observations contain many variables or very long character variables (up to 32K bytes).
The difference between the where
statement and the where
data set option is how they take effect: the former affects all input datasets that do not have a where
dataset option, and the latter affects only the dataset that it is an option on (and overrides the where
statement).
See the documentation for the where data set option and the where statement for more information, as well as the Where Expression Processing concept. In particular, the latter does not differentiate between the two kinds of where
expression in terms of performance; it only specifies the difference in terms of how they take effect.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With