Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SAS: Hidden retain statement inside set statement?

Tags:

sas

Consider the following example:

/* Create two not too interesting datasets: */
Data ones (keep = A);
Do i = 1 to 3;
A = 1;
output;
End;
run;

Data numbers;
Do B = 1 to 5;
output;
End;
Run;

/* The interesting step: */
Data together;
Set ones numbers;
if B = 2 then A = 2;
run;

So dataset ones contains one variable A with 3 observations, all ones and dataset numbers contains one variable (B) with 5 observations: the numbers 1 to 5. I expect the resulting dataset together to have two columns (A and B) and the A column to read (vertically) 1, 1, 1, . , 2, . , . , .

However, when executing the code I find that column A reads 1, 1, 1, . , 2, 2, 2 , 2

Apparently the 2 created in the fifth observation is retained all the way down for no apparent reason. What is going on here?

(For the sake of completeness: when I split the last data step into two as below:

Data together;
set ones numbers;
run;
Data together;
set together;
if B = 2 then A = 2;
run;

it does do what I expect.)

like image 827
Vincent Avatar asked Oct 14 '13 14:10

Vincent


People also ask

How do you write a retain statement in SAS?

Example 1: Basic Usageretain month1-month5 1 year 0 a b c 'XYZ'; The values of MONTH1 through MONTH5 are set initially to 1; YEAR is set to 0; variables A, B, and C are each set to the character value XYZ . This RETAIN statement assigns the initial value 1 to the variable MONTH1 only: retain month1-month5 (1);

What is the difference between retain and keep in SAS?

The RETAIN statement causes SAS to hold the value of a variable from one iteration of the DATA step to the next iteration. The KEEP statement does not affect the value of variables but only specifies which variables to include in any output data sets.

How do I keep only certain variables in SAS?

If you place the KEEP= option on the SET statement, SAS keeps the specified variables when it reads the input data set. On the other hand, if you place the KEEP= option on the DATA statement, SAS keeps the specified variables when it writes to the output data set.

What is _N_ 1 in SAS?

_N_ is an automatic SAS counter that can be used to find out how many times the DATA step has looped. The purpose for it in your example is to only create and load the hash table once, at the start of the first loop through the step.


Video Answer


1 Answers

Yes, any variable that is defined in a SET, MERGE, or UPDATE statement is automatically retained (not set to missing at the top of the data step loop). You can effectively ignore that with

output;
call missing(of <list of variables to clear out>);
run;

at the end of your data step.

This is how MERGE works for many-to-one merges, by the way, and the reason that many-to-many merges don't usually work the way you want them to.


The difference between the 'together' and the 'separate' cases is that in the separate case, you have two data sets with different variables. If you are running this in interactive mode, ie SAS Program Editor or Enhanced Editor (not EG or batch mode), you can use the data step debugger to see this a little more clearly. You would see the following:

At the end of the last row of the ones dataset:

i A B
3 1 .

Notice B exists, but is missing. Then it goes back to the top of the data step loop. All three variables are left alone since they're all from the data sets. Then it attempts to read from ones one more time, which generates:

i A B
. . .

Then it realizes it cannot read from ones, and starts to read from numbers. At the end of the first row of the numbers dataset:

i A B
. . 1

Then it goes to the top, again changes nothing; then it reads in a 2 for B.

i A B
. . 2

Then it sets A to 2, per your program:

i A B
. 2 2

Then it returns to the start of the data step loop again.

i A B
. 2 2

Then it reads in B=3:

i A B
. 2 3

Then it continues looping, for B=4, 5.

Now, compare that to the single dataset. It will be nearly the same (with a small difference at the switch between datasets that does not yield a different result). Now we go to the step where A=2 B=2:

i A B
. 2 2

Now when the data step reads in the next row, it has all three variables on it. So it yields:

i A B
. . 3

Since it read in A=. from the row, it is setting it to missing. In the one-data-step version, it didn't have a value for A to read in, so it didn't replace the 2 with missing.

like image 140
Joe Avatar answered Oct 17 '22 05:10

Joe