I have a non-indexed table with 2 billion rows in a read-only SAS SPD server (bigtable). I have another table with 12 million rows in my workspace (SAS_GRID), containing a single column of unique ids (idlist). Both tables are updated constantly. I want to filter bigtable based on idlist, something like:
create table filtered_bigtable as
select t1.* from bigtable t1 inner join idlist t2
on t1.id = t2.id;
What's the best strategy and code (sql or sas) for doing it quickly?
Edit: tried Robert's suggestion using a hash table and the query ran in only 40 minutes (10 times faster). Thanks for all the answers!
Hash tables of course!
Hash tables can be used as a very fast method of joining two tables. The smaller table is read into memory, and the location in RAM of any given row from the small table can be identified by running the key (in this case id) through a very fast hashing algorithm. This circumvents the need for an index, provided you have enough memory to hold the keys from the smaller table. With only 12 million rows you should be fine.
Once the keys from the smaller table are read into memory, the DATA step simply iterates over the larger table, runs the hashing algorithm against each id in the large table to see if it finds a hit against the values from the small table registered in RAM, and, if it does, outputs the row.
The only overhead is loading the small table into memory (very fast), and the hashing of each key in the big table (very fast). The memory lookup time may as well be considered instant.
It's incredibly efficient because it's only reading each table once from disk. Using an index effectively results in reading the smaller table many times (or at least the index of the smaller table).
data filtered_bigtable;
    set bigtable;

    /* Load the id list into an in-memory hash table on the first iteration only */
    if _n_ eq 1 then do;
        declare hash ht(dataset:'idlist');
        ht.definekey('id');
        ht.definedone();
    end;

    /* find() returns 0 when the current id exists in the hash table */
    if ht.find() eq 0 then output;
run;
Hash tables can be used for all kinds of programmatic goodness in SAS, so be sure to read up on them. They offer a lot more than just joins.
Also, be sure to keep just the columns you require from the larger table, as this can reduce the amount of time spent reading in the rows from the larger table.
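For instance, a KEEP= dataset option on the SET statement limits which columns are read; here col1 and col2 are placeholder names standing in for whichever columns you actually need:

```sas
data filtered_bigtable;
    /* KEEP= reduces I/O by reading only the listed columns;
       col1 and col2 are hypothetical - substitute your own */
    set bigtable(keep=id col1 col2);

    if _n_ eq 1 then do;
        declare hash ht(dataset:'idlist');
        ht.definekey('id');
        ht.definedone();
    end;

    if ht.find() eq 0 then output;
run;
```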