I have a non-indexed table with 2 billion rows in a read-only SAS SPD server (bigtable). I have another table with 12 million rows in my workspace (SAS_GRID), containing a single column of unique ids (idlist). Both tables are updated constantly. I want to filter bigtable based on idlist, something like:
create table filtered_bigtable as
select t1.* from bigtable t1 inner join idlist t2
on t1.id = t2.id;
What's the best strategy and code (sql or sas) for doing it quickly?
Edit: tried Robert's suggestion using a hash table and the query ran in only 40 minutes (10 times faster). Thanks for all the answers!
Hash tables of course!
Hash tables can be used as a very fast method of joining two tables. The smaller table is read into memory, and the location in RAM of any given row from the small table can be identified by running the key (in this case id) through a very fast hashing algorithm. This circumvents the need for an index, provided you have enough memory to hold the keys from the smaller table. With only 12 million rows you should be fine.
Once the keys from the smaller table are read into memory, the DATA step simply iterates over the larger table, runs the hashing algorithm against each id in the large table to see if it finds a hit against the values from the small table registered in RAM, and, if it does, outputs the row.
The only overhead is loading the small table into memory (very fast), and the hashing of each key in the big table (very fast). The memory lookup time may as well be considered instant.
It's incredibly efficient because it's only reading each table once from disk. Using an index effectively results in reading the smaller table many times (or at least the index of the smaller table).
data filtered_bigtable;
    set bigtable;

    /* Load the id list into an in-memory hash table on the first iteration only */
    if _n_ eq 1 then do;
        declare hash ht(dataset:'idlist');
        ht.definekey('id');
        ht.definedone();
    end;

    /* find() returns 0 when the current id exists in the hash table */
    if ht.find() eq 0 then output;
run;
Hash tables can be used for all kinds of programmatic goodness in SAS, so be sure to read up on them. They offer a lot more than just joins.
Also, be sure to keep just the columns you require from the larger table, as this can reduce the amount of time spent reading in the rows from the larger table.
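For instance, a KEEP= dataset option on the SET statement limits which columns are read; here col1 and col2 are placeholder names standing in for whichever columns you actually need:

```sas
data filtered_bigtable;
    /* KEEP= reduces I/O by reading only the listed columns;
       col1 and col2 are hypothetical - substitute your own */
    set bigtable(keep=id col1 col2);

    if _n_ eq 1 then do;
        declare hash ht(dataset:'idlist');
        ht.definekey('id');
        ht.definedone();
    end;

    if ht.find() eq 0 then output;
run;
```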