I have a short question: if we create a SAS dataset, say Sample.sas7bdat, that already exists, will the code take more time to execute (because it has to overwrite the existing dataset) than it would if the dataset did not already exist?
data sample;
.....
.....
run;
I did some research on the internet but could not find a satisfactory answer. To me it seems the code should take a little extra time, though I'm not sure how much impact it would have on a 10 GB dataset.
To improve the performance of a SAS job, reduce the number of times SAS accesses disk or tape devices. You can reduce the number of data accesses by processing more data each time a device is accessed, for example by tuning the BUFNO=, BUFSIZE=, CATCACHE=, and COMPRESS= system options.
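A minimal sketch of that tuning, assuming a large input dataset named work.big_source; the option values are illustrative assumptions, not recommended defaults:

/* Process more data per I/O by enlarging and multiplying the
   buffers, and compress the output dataset. The values below
   are assumptions to tune against your own workload. */
options bufno=10 bufsize=64k compress=yes;

data work.big_copy;
    set work.big_source;   /* hypothetical large input dataset */
run;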
1) Read only the data that is needed from external data files.
2) Minimize the number of times a large dataset is read by subsetting it in a single DATA step.
3) Use the KEEP= or DROP= dataset options to retain only the desired variables.
4) Use WHERE statements to subset data (a sketch combining 3 and 4 follows this list).
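A minimal sketch combining points 3) and 4); the dataset and variable names are assumptions for illustration:

/* Keep only the two variables of interest and subset with WHERE,
   so unneeded columns and rows are never processed downstream. */
data work.subset;
    set work.big_source(keep=id amount);
    where amount > 100;
run;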
In SAS, you can use either the MERGE statement or the UPDATE statement in a DATA step to update the values of observations in a master dataset. Both statements must be used with a BY statement, which names the key variable(s); both input datasets must first be sorted by that key, for example with the SORT procedure.
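A minimal sketch of the UPDATE approach, with hypothetical dataset and variable names:

/* Sort both datasets by the key, then apply transactions to the
   master; non-missing values in work.trans overwrite the master. */
proc sort data=work.master;
    by id;
run;

proc sort data=work.trans;
    by id;
run;

data work.master;
    update work.master work.trans;
    by id;
run;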
You could test this yourself fairly easily, with a few caveats that I'll note as we go.
Here's an example of my test. This is a 100-million-row dataset with two 8-byte numerics, so 1.6 GB.
First, the results. I see a difference of a few seconds. Why? SAS performs a few extra operations when replacing a dataset:
Write the new dataset to a temporary file
Delete the old dataset
Rename the temporary file to the new dataset name
On some OSs this seems to be faster than others; I've found desktop Windows to be fairly slow about this, compared to Unix or even Windows Server, which is pretty quick. I'm guessing Windows is more careful about deleting than simply changing a file-system pointer, but I don't really know. It's certainly not copying the whole file over from the utility directory (the elapsed time is nowhere near enough for that). I also suspect write caching is still giving the new datasets a bit of a boost, particularly since the time for all datasets grows as I write. The difference is probably only about a second or so; the comparison between _REP in iteration 2 and _NEW in iteration 3 seems the most reasonable to me.
Iteration 1 _NEW=7.26999998099927 _REP=12.9079999922978
Iteration 2 _NEW=10.0119998454974 _REP=11.0789999961998
Iteration 3 _NEW=10.1360001564025 _REP=15.3819999695042
Iteration 4 _NEW=14.7720000743938 _REP=17.4649999142056
Iteration 5 _NEW=16.2560000418961 _REP=19.2009999752044
Notice that the first iteration's _NEW time is far faster than the rest, and that overall time increases as you go (as write caching is less and less able to keep up). I suspect that if you let it continue (or used a still larger file, which I don't have time for right now) you might see even more consistent times. I'm also not sure what happens with write caching when a write-cached file is deleted; it's possible SAS has to wait for the cache to flush to disk before performing the delete, or something similar. You could verify that by waiting 30 seconds between the _NEW and _REP steps (see the sketch after the code below).
The code:
%macro test_me(iter=1);
    %do _i=1 %to &iter.;
        %let start = %sysfunc(time());

        /* First DATA step: creates a brand-new dataset */
        data test&_i.;
            do x = 1 to 1e8;
                y = x**2;
                output;
            end;
        run;

        %let mid = %sysfunc(time());

        /* Second DATA step: replaces the dataset just created */
        data test&_i.;
            do x = 1 to 1e8;
                y = x**2;
                output;
            end;
        run;

        %let end = %sysfunc(time());

        %let _new = %sysevalf(&mid.-&start.);   /* time to create new dataset */
        %let _rep = %sysevalf(&end.-&mid.);     /* time to replace it         */
        %put Iteration &_i. &=_new. &=_rep.;
    %end;

    /* Clean up all WORK datasets created by the test */
    proc datasets nolist kill;
    quit;
%mend test_me;

options nosource nonotes nomprint nosymbolgen;
%test_me(iter=5);
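If you want to check the write-caching guess from above, a minimal sketch of the variation is to pause between the two DATA steps inside the macro. The second argument to SLEEP sets the unit to seconds explicitly, since the default unit varies by host:

/* Insert between the _NEW step and the _REP step inside the macro */
%let rc = %sysfunc(sleep(30, 1));   /* wait 30 seconds so the cache can flush */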