 

Writing Efficient Queries in SAS Using Proc sql with Teradata

Tags: sql, sas, teradata

EDIT: Here is a more complete set of code that shows exactly what's going on per the answer below.

libname output '/data/files/jeff';
%let DateStart = '01Jan2013'd;
%let DateEnd = '01Jun2013'd;
proc sql;
CREATE TABLE output.id AS (
  SELECT DISTINCT id
  FROM mydb.sale_volume AS sv
  WHERE sv.category IN ('a', 'b', 'c') AND
    sv.trans_date BETWEEN &DateStart AND &DateEnd
);
CREATE TABLE output.sums AS (
  SELECT sv.id, SUM(sv.sales) AS total
  FROM mydb.sale_volume AS sv
  INNER JOIN output.id AS ids
    ON ids.id = sv.id
  WHERE sv.trans_date BETWEEN &DateStart AND &DateEnd
  GROUP BY sv.id
);
quit;

The goal is to simply query the table for some id's based on category membership. Then I sum these members' activity across all categories.

The above approach is far slower than:

  1. Running the first query to get the subset
  2. Running a second query that sums every ID
  3. Running a third query that inner joins the two result sets (sketched below).
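
For reference, here is a minimal sketch of that three-step version, reusing the libname and macro variables from the edit above (output.all_sums is a made-up name for the intermediate table):

proc sql;
/* Step 1: distinct ids in the chosen categories */
CREATE TABLE output.id AS (
  SELECT DISTINCT id
  FROM mydb.sale_volume
  WHERE category IN ('a', 'b', 'c')
    AND trans_date BETWEEN &DateStart AND &DateEnd
);
/* Step 2: sum sales for every id in the window, across all categories */
CREATE TABLE output.all_sums AS (
  SELECT id, SUM(sales) AS total
  FROM mydb.sale_volume
  WHERE trans_date BETWEEN &DateStart AND &DateEnd
  GROUP BY id
);
/* Step 3: keep only the sums for ids flagged in step 1 */
CREATE TABLE output.sums AS (
  SELECT s.id, s.total
  FROM output.all_sums AS s
  INNER JOIN output.id AS ids
    ON ids.id = s.id
);
quit;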

If I'm understanding correctly, it may be more efficient to make sure all of my code is passed through to Teradata rather than cross-loading data back and forth between the database and SAS.


After posting a question yesterday, a member suggested I might benefit from asking a separate question on performance that was more specific to my situation.

I'm using SAS Enterprise Guide to write some programs/data queries. I don't have permissions to modify the underlying data, which is stored in 'Teradata'.

My basic problem is writing efficient SQL queries in this environment. For example, I query a large table (with tens of millions of records) for a small subset of ID's. Then, I use this subset to query the larger table again:

proc sql;
CREATE TABLE subset AS (
  SELECT id
  FROM bigTable
  WHERE someValue = x AND
    date BETWEEN a AND b
);
quit;

This works in a matter of seconds and returns 90k ID's. Next, I want to query this set of ID's against the big table, and problems ensue. I'm wanting to sum values over time for the ID's:

proc sql;
CREATE TABLE subset_data AS (
  SELECT
    bigTable.id,
    SUM(bigTable.value) AS total
  FROM bigTable
  INNER JOIN subset
    ON subset.id = bigTable.id
  WHERE bigTable.date BETWEEN a AND b
  GROUP BY bigTable.id
);
quit;

For whatever reason, this takes a really long time. The difference is that the first query filters on 'someValue', while the second looks at all activity, regardless of what's in 'someValue'. For example, I could flag every customer who orders a pizza. Then I would look at every purchase for all customers who ordered pizza.

I'm not overly familiar with SAS so I'm looking for any advice on how to do this more efficiently or speed things up. I'm open to any thoughts or suggestions and please let me know if I can offer more detail. I guess I'm just surprised the second query takes so long to process.

Asked Jul 10 '13 by Jeffrey Kramer


2 Answers

The most critical thing to understand when using SAS to access data in Teradata (or any other external database, for that matter) is that the SAS software prepares SQL and submits it to the database. The idea is to try to relieve you (the user) of all the database-specific details. SAS does this using a concept called "implicit pass-through", which just means that SAS translates SAS code into DBMS code. Among the many things that occur is data type conversion: SAS has only two data types, numeric and character.
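
If you want to see exactly what SAS submits under implicit pass-through, the SASTRACE system option writes the generated DBMS SQL to the log. A sketch (the libname, server, and table names here are placeholders):

/* Write the SQL that SAS hands to Teradata into the SAS log */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

libname tdata teradata user=userid password=password server=tdserver;

proc sql;
   create table work.check as
   select id, sum(value) as total
   from tdata.bigTable
   where date between '01JAN2013'd and '01JUN2013'd
   group by id;
quit;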

SAS deals with translating things for you, but it can be confusing. For example, I've seen "lazy" database tables defined with VARCHAR(400) columns whose values never come close to that length (like a column for a person's name). In the database this isn't much of a problem, but since SAS does not have a VARCHAR data type, it creates a variable 400 characters wide for every row. Even with data set compression, this can make the resulting SAS dataset unnecessarily large.
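
One way to blunt that effect is the DBSASTYPE= dataset option, which overrides the default type mapping when SAS reads the table. A sketch, with invented table and column names:

/* Map a lazy VARCHAR(400) column to a 50-character SAS variable
   instead of letting it become 400 characters wide per row */
data work.names;
   set tdata.customers(dbsastype=(customer_name='CHAR(50)'));
run;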

The alternative is to use "explicit pass-through", where you write native queries using the actual syntax of the DBMS in question. These queries execute entirely on the DBMS and return results back to SAS (which still does the data type conversion for you). For example, here is a pass-through query that joins two tables and creates a SAS dataset as a result:

proc sql;
   connect to teradata (user=userid password=password mode=teradata);
   create table mydata as
   select * from connection to teradata (
      select a.customer_id
           , a.customer_name
           , b.last_payment_date
           , b.last_payment_amt
      from base.customers a
      join base.invoices b
      on a.customer_id=b.customer_id
      where b.bill_month = date '2013-07-01'
        and b.paid_flag = 'N'
      );
quit;

Notice that everything inside the pair of parentheses is native Teradata SQL and that the join operation itself is running inside the database.

The example code you have shown in your question is NOT a complete, working example of a SAS/Teradata program. To better assist, you need to show the real program, including any library references. For example, suppose your real program looks like this:

proc sql;
   CREATE TABLE subset_data AS
   SELECT bigTable.id,
          SUM(bigTable.value) AS total
   FROM   TDATA.bigTable bigTable
   JOIN   TDATA.subset subset
   ON     subset.id = bigTable.id
   WHERE  bigTable.date BETWEEN a AND b
   GROUP BY bigTable.id
   ;

That would indicate a previously assigned LIBNAME statement through which SAS was connecting to Teradata. The syntax of that WHERE clause would be very relevant to whether SAS is even able to pass the complete query to Teradata. (Your example doesn't show what "a" and "b" refer to.) It is very possible that the only way SAS can perform the join is to drag both tables back into a local work session and perform the join on your SAS server.
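
To illustrate the point (hypothetical fragments, not from the question): a plain comparison against a database column can usually be passed through, while a SAS-only function in the WHERE clause generally cannot, forcing SAS to pull the rows back and filter locally. Reusing the tdata libname from the tracing sketch above:

proc sql;
   /* Usually passed through: SAS translates the date literal for Teradata */
   create table work.fast as
   select id, sum(value) as total
   from tdata.bigTable
   where date between '01JAN2013'd and '01JUN2013'd
   group by id;

   /* Usually NOT passed through: PUT is a SAS-only function, so SAS may
      fetch the rows and apply the format locally before filtering */
   create table work.slow as
   select id, sum(value) as total
   from tdata.bigTable
   where put(date, yymmn6.) = '201301'
   group by id;
quit;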

One thing I can strongly suggest is that you try to convince your Teradata administrators to allow you to create "driver" tables in some utility database. The idea is that you would create a relatively small table inside Teradata containing the ID's you want to extract, then use that table to perform explicit joins. I'm sure you would need a bit more formal database training to do that (like how to define a proper index and how to "collect statistics"), but with that knowledge and ability, your work will just fly.
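
Combining the driver-table idea with explicit pass-through, the whole job might look something like this sketch (utildb is a hypothetical utility database; the table and column names come from the question's edit):

proc sql;
   connect to teradata (user=userid password=password mode=teradata);

   /* Build a small driver table of ids entirely inside Teradata */
   execute (
      create multiset table utildb.id_list (id integer)
      primary index (id)
   ) by teradata;

   execute (
      insert into utildb.id_list
      select distinct id
      from mydb.sale_volume
      where category in ('a','b','c')
        and trans_date between date '2013-01-01' and date '2013-06-01'
   ) by teradata;

   /* Help the optimizer choose a good join plan */
   execute (collect statistics on utildb.id_list column (id)) by teradata;

   /* The join runs inside Teradata; only the summed rows come back to SAS */
   create table output.sums as
   select * from connection to teradata (
      select sv.id, sum(sv.sales) as total
      from mydb.sale_volume sv
      join utildb.id_list ids on ids.id = sv.id
      where sv.trans_date between date '2013-01-01' and date '2013-06-01'
      group by sv.id
   );

   disconnect from teradata;
quit;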

I could go on and on but I'll stop here. I use SAS with Teradata extensively every day against what I'm told is one of the largest Teradata environments on the planet. I enjoy programming in both.

Answered Sep 18 '22 by BellevueBob


You imply an assumption that the 90k records in your first query are all unique ids. Is that definite?

I ask because the implication of your second query is that they're not unique: one id can have multiple rows over time, with different somevalues.

If the ids are not unique in the first dataset, you need to GROUP BY id or use DISTINCT in the first query.
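
For example, a minimal fix to the first query (using the names from the question):

proc sql;
CREATE TABLE subset AS (
  SELECT DISTINCT id
  FROM bigTable
  WHERE someValue = x AND
    date BETWEEN a AND b
);
quit;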

Imagine that the 90k rows consist of 30k unique ids, so an average of 3 rows per id.

Then imagine each of those 30k unique ids actually has 9 records in your time window, including rows where somevalue <> x.

The join will then return 3 x 9 = 27 rows per id.

And as those two numbers grow, the number of records in your second query grows multiplicatively.


Alternative Query

If that's not the problem, an alternative query (which is not ideal, but possible) would be...

SELECT
  bigTable.id,
  SUM(bigTable.value) AS total
FROM
  bigTable
WHERE
  bigTable.date BETWEEN a AND b
GROUP BY
  bigTable.id
HAVING
  /* keep only ids that had at least one row with somevalue = x in the
     window; bigTable is scanned once and no join is needed */
  MAX(CASE WHEN bigTable.somevalue = x THEN 1 ELSE 0 END) = 1
Answered Sep 22 '22 by MatBailie