
Would overwriting the existing SAS dataset take more time?

Tags:

sas

A short question: if we create a SAS dataset, say Sample.sas7bdat, that already exists, will the code take longer to execute (because it has to overwrite the existing dataset) than it would if the dataset were not already there?

data sample;
.....
.....
run;

I did some research on the internet but could not find a satisfactory answer. It seems to me the code should take a little extra time, but I am not sure how much impact that would have on a 10 GB dataset.

in_user asked Dec 08 '14 10:12



1 Answer

You could test this yourself fairly easily. A few caveats:

  • Make sure the dataset is large enough that the difference is not lost in random CPU activity. 100+ MB is usually a good target.
  • Make sure you run the test multiple times, the more the better, with no time in between if possible. A single test is never sufficient and will tend to show the first dataset as faster, because it benefits from write caching (the OS reports the write as done when it has only been queued in memory).

Here's an example of my test. This is a 100 million row dataset with two 8-byte numerics, so about 1.6 GB (1e8 rows × 16 bytes).

First, the results. I see a difference of a few seconds. Why? SAS performs a few operations when replacing a dataset:

  • Write the dataset to a temporary file
  • Delete the old dataset
  • Rename the temporary dataset to the new dataset
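
The sequence above can be imitated by hand, which makes the three steps easier to see. This is only a sketch of the pattern, not SAS's internal implementation; the names sample and _tmp_sample are made up for illustration.

```sas
/* Stand-in for a dataset that already exists (hypothetical name). */
data work.sample;
  x = 0;
run;

/* Step 1: write the new data under a temporary name. */
data work._tmp_sample;
  do x = 1 to 1e6;
    y = x**2;
    output;
  end;
run;

/* Steps 2 and 3: delete the old dataset, then rename the temporary one. */
proc datasets lib=work nolist;
  delete sample;
  change _tmp_sample = sample;
quit;
```

Writing to a temporary name first means the old dataset is still intact until the new one has been written completely.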

Some OSs seem to do this faster than others; I've found Windows desktop to be fairly slow at it compared with Unix or even Windows Server, which is pretty quick. I'm guessing Windows does more work on a delete than simply changing a file system pointer, but I don't really know. It's certainly not copying the whole file over from the utility directory (the extra time is not nearly enough for that). I also suspect write caching is still giving the new datasets a bit of a boost, particularly as the time for all datasets grows as I write. The real difference is probably only about a second: comparing _REP in iteration 2 (11.08 s) with _NEW in iteration 3 (10.14 s) seems the most reasonable to me.

Iteration 1 _NEW=7.26999998099927 _REP=12.9079999922978
Iteration 2 _NEW=10.0119998454974 _REP=11.0789999961998
Iteration 3 _NEW=10.1360001564025 _REP=15.3819999695042
Iteration 4 _NEW=14.7720000743938 _REP=17.4649999142056
Iteration 5 _NEW=16.2560000418961 _REP=19.2009999752044

Notice that the first _NEW iteration is far faster than the others, and that overall time increases as you go (as write caching is less and less able to keep up). I suspect that if you let it run longer (or use a still larger file, which I don't have time for right now) you would see more consistent times. I'm also not sure what happens when a file that is still in the write cache is deleted; it's possible the delete has to wait for the cached writes to reach disk, or something similar. You could run a test that waits 30 seconds between _NEW and _REP to check that.
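
A minimal sketch of that delayed variant follows. The macro name test_wait, its wait= parameter and the resume variable are assumptions added for illustration. Whether 30 seconds is long enough for queued writes to reach disk depends on the OS and the storage, so treat the pause length as something to experiment with.

```sas
%macro test_wait(iter=1, wait=30);
  %local _i start mid resume end rc;
  %do _i = 1 %to &iter.;
    %let start = %sysfunc(time());
    data test&_i.;                          /* new dataset */
      do x = 1 to 1e8;
        y = x**2;
        output;
      end;
    run;
    %let mid = %sysfunc(time());

    /* Pause so queued writes have a chance to reach disk. */
    %let rc = %sysfunc(sleep(&wait., 1));   /* unit 1 = seconds */

    %let resume = %sysfunc(time());
    data test&_i.;                          /* replaces the dataset above */
      do x = 1 to 1e8;
        y = x**2;
        output;
      end;
    run;
    %let end = %sysfunc(time());

    /* The pause itself is excluded from both timings. */
    %put Iteration &_i. _NEW=%sysevalf(&mid. - &start.) _REP=%sysevalf(&end. - &resume.);
  %end;

  /* Caution: KILL deletes every member of WORK, not only test1-testN. */
  proc datasets lib=work nolist kill;
  quit;
%mend test_wait;

%test_wait(iter=5, wait=30);
```

If the gap between _NEW and _REP shrinks with the pause in place, that would suggest part of the measured difference comes from pending cached writes rather than from the delete and rename.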

The code:

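%macro test_me(iter=1);
  %local _i start mid end _new _rep;
  %do _i = 1 %to &iter.;
    /* time() is seconds since midnight, so do not run this across midnight */
    %let start = %sysfunc(time());

    /* First write: test&_i. does not exist yet */
    data test&_i.;
      do x = 1 to 1e8;
        y = x**2;
        output;
      end;
    run;
    %let mid = %sysfunc(time());

    /* Second write: same name, so SAS replaces the existing dataset */
    data test&_i.;
      do x = 1 to 1e8;
        y = x**2;
        output;
      end;
    run;
    %let end = %sysfunc(time());

    %let _new = %sysevalf(&mid. - &start.);
    %let _rep = %sysevalf(&end. - &mid.);

    %put Iteration &_i. &=_new. &=_rep.;
  %end;

  /* Clean up. Caution: KILL deletes every member of WORK */
  proc datasets lib=work nolist kill;
  quit;
%mend test_me;

/* Keep the log quiet so only the %put lines show */
options nosource nonotes nomprint nosymbolgen;

%test_me(iter=5);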
Joe answered Oct 07 '22 12:10