Here is a sample code that was derived from actual application. There are two datasets - "aa" for a query and "bb" for subquery. Column "m" from datasets "aa" matches column "y" from datasets "bb". Also, there is "yy" column on "aa" table has a value of 30. Column "m" from datasets "aa" contains value "30" in one of its rows, and column "y" from datasets "bb" does not. First proc sql uses values from "y" column of "bb" table to subset table "aa" based on matching values in column "m". It is a correct query and produces results as expected. Second proc sql block has column "y" intentionally misspelled as "yy" in subquery in a row that stars with where statement. Otherwise the whole proc sql block is the same as the first one. Given that there is no column "yy" on dataset bb, I would expect an error message to appear and the whole query to fail. However, it does return one row without failing or error messages. Closer look would suggest that it actually uses "yy" column from table "aa" (see tree in the log output). I do not think this is a correct behavior. If you would have some comments or explanations, I would greatly appreciate it. Otherwise, I maybe should report it to SAS as a bug. Thank you!
Here is the code:
options
msglevel = I
;
data aa;
do i=1 to 20;
m=i*5;
yy=30;
output;
end;
run;
data bb;
do i=10 to 20;
y=i*5;
output;
end;
run;
option DEBUG=JUNK ;
/*Correct sql command*/
proc sql _method
_tree
;
create table cc as
select *
from aa
where m in (select y from bb)
;quit;
/*Incorrect sql command - column "yy" in not on "bb" table"*/
proc sql _method
_tree;
create table dd as
select *
from aa
where m in (select yy from bb)
;quit;
Here is log with sql tree:
119 options
120 msglevel = I
121 ;
122 data aa;
123 do i=1 to 20;
124 m=i*5;
125 yy=30;
126 output;
127 end;
128 run;
NOTE: The data set WORK.AA has 20 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
129
130 data bb;
131 do i=10 to 20;
132 y=i*5;
133 output;
134 end;
135 run;
NOTE: The data set WORK.BB has 11 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
136 option DEBUG=JUNK ;
137
138 /*Correct sql command*/
139 proc sql _method
140 _tree
141 ;
142 create table cc as
143 select *
144 from aa
145 where m in (select y from bb)
146 ;
NOTE: SQL execution methods chosen are:
sqxcrta
sqxfil
sqxsrc( WORK.AA )
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxsrc( WORK.BB )
Tree as planned.
/-SYM-V-(aa.i:1 flag=0001)
/-OBJ----|
| |--SYM-V-(aa.m:2 flag=0001)
| \-SYM-V-(aa.yy:3 flag=0001)
/-FIL----|
| | /-SYM-V-(aa.i:1 flag=0001)
| | /-OBJ----|
| | | |--SYM-V-(aa.m:2 flag=0001)
| | | \-SYM-V-(aa.yy:3 flag=0001)
| |--SRC----|
| | \-TABL[WORK].aa opt=''
| | /-SYM-V-(aa.m:2)
| \-IN-----|
| | /-SYM-V-(bb.y:2 flag=0001)
| | /-OBJ----|
| | /-SRC----|
| | | \-TABL[WORK].bb opt=''
| \-SUBC---|
--SSEL---|
NOTE: Table WORK.CC created, with 11 rows and 3 columns.
146! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
147
148
149 /*Incorrect sql command - column "yy" in not on "bb" table"*/
150 proc sql _method
151 _tree;
152 create table dd as
153 select *
154 from aa
155 where m in (select yy from bb)
156 ;
NOTE: SQL execution methods chosen are:
sqxcrta
sqxfil
sqxsrc( WORK.AA )
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxreps
sqxsrc( WORK.BB )
Tree as planned.
/-SYM-V-(aa.i:1 flag=0001)
/-OBJ----|
| |--SYM-V-(aa.m:2 flag=0001)
| \-SYM-V-(aa.yy:3 flag=0001)
/-FIL----|
| | /-SYM-V-(aa.i:1 flag=0001)
| | /-OBJ----|
| | | |--SYM-V-(aa.m:2 flag=0001)
| | | \-SYM-V-(aa.yy:3 flag=0001)
| |--SRC----|
| | \-TABL[WORK].aa opt=''
| | /-SYM-V-(aa.m:2)
| \-IN-----|
| | /-SYM-A-(#TEMA001:1 flag=0035)
| | /-OBJ----|
| | /-REPS---|
| | | |--empty-
| | | |--empty-
| | | | /-OBJ----|
| | | |--SRC----|
| | | | \-TABL[WORK].bb opt=''
| | | |--empty-
| | | |--empty-
| | | | /-SYM-A-(#TEMA001:1 flag=
0035)
| | | | /-ASGN---|
| | | | | \-SUBP(1)
| | | \-OBJE---|
| \-SUBC---|
| \-SYM-V-(aa.yy:3)
--SSEL---|
NOTE: Table WORK.DD created, with 1 rows and 3 columns.
156! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
Here are datasets:
aa:
i m yy
1 5 30
2 10 30
3 15 30
4 20 30
5 25 30
6 30 30
7 35 30
8 40 30
9 45 30
10 50 30
11 55 30
12 60 30
13 65 30
14 70 30
15 75 30
16 80 30
17 85 30
18 90 30
19 95 30
20 100 30
bb:
i y
10 50
11 55
12 60
13 65
14 70
15 75
16 80
17 85
18 90
19 95
20 100
I agree, this looks pretty weird and may well be a bug. I was able to reproduce this from the code you provided in SAS 9.4 and in SAS 9.1.3, which would make it at least ~12 years old.
In particular, I'm interested in this bit of the output you got from the _method option when creating the DD table but not when creating the CC table:
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxreps <--- What is this doing?
sqxsrc( WORK.BB )
Similarly, the corresponding section from the _tree output is highly obscure:
| | /-SYM-A-(#TEMA001:1 flag=0035)
| | /-OBJ----|
| | /-REPS---|
| | | |--empty-
| | | |--empty-
| | | | /-OBJ----|
| | | |--SRC----|
| | | | \-TABL[WORK].bb opt=''
| | | |--empty-
| | | |--empty-
| | | | /-SYM-A-(#TEMA001:1 flag= 0035)
| | | | /-ASGN---|
| | | | | \-SUBP(1)
| | | \-OBJE---|
| \-SUBC---|
| \-SYM-V-(aa.yy:3)
I have never seen sqxreps or reps in the respective bits of output before. Neither of them is listed in any of the papers I was able to find based on a brief bit of googling (in fact, this question is currently the only hit on Google for sas + sqxreps):
http://support.sas.com/resources/papers/proceedings10/139-2010.pdf http://www2.sas.com/proceedings/sugi30/101-30.pdf
Quoting the first of these:
Codes Description
sqxcrta Create table as Select
Sqxslct Select
sqxjsl Step loop join (Cartesian)
sqxjm Merge join
sqxjndx Index join
sqxjhsh Hash join
sqxsort Sort
sqxsrc Source rows from table
sqxfil Filter rows
sqxsumg Summary stats with GROUP BY
sqxsumn Summary stats with no GROUP BY
Based on a bit of quick testing, this seems to happen regardless of the variable and tables names used, provided that the variable name from AA is repeated multiple times in the subquery referencing table BB. It also happens if you have a variable named e.g. YYY in AA but one named YY in BB, or more generally whenever you have a variable in BB whose name is initially the same as the name of the corresponding variable in AA but then continues for one or more characters.
From this, I'm guessing at some point in the SQL parser, someone used a like operator rather than checking for equality of variable names, and somehow as a result this syntax is triggering an undocumented or incomplete 'feature' in proc sql.
An example of the more general case:
options
msglevel = I
;
data aa;
do i=1 to 20;
m=i*5;
myvar_plus_suffix=30;
output;
end;
run;
data bb;
do i=10 to 20;
myvar=i*5;
output;
end;
run;
option DEBUG=JUNK ;
/*Incorrect sql command - column "yy" in not on "bb" table"*/
proc sql _method
_tree;
create table dd as
select *
from aa
where m in (select myvar_plus_suffix from bb)
;quit;
Here is a response from SAS support. What you are seeing is related to column scoping in PROC SQL.
PROC SQL supports Corellated Subqueries. A Correlated Subquery references a column in the "outer" table which can then be compared to columns in the "inner" table. PROC SQL does not require that a fully qualified column name is used. As a result, if it sees a column in the subquery that does not exist in the inner table (the table referenced in the subquery), it looks for that column in the "outer" table and uses the value if it finds one.
If a fully qualified column name is used, the error you are expecting will occur such as the following:
proc sql;
create table dd as
select *
from aa as outer
where outer.m in (select inner.yyy from bb as inner);
quit;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With