Which statistics is calculated faster in SAS, proc summary?

Tags:

I need a theoretical answer.

Imagine that you have a table with 1.5 billion rows (the table is created as column-based with DB2-Blu).

You are using SAS and you will do some statistics by using Proc Summary like min/max/mean values, standard deviation value and percentile-10, percentile-90 through your peer-groups.

For instance, you have 30.000 peer-groups and you have 50.000 values in each peer group (Total 1.5 billions values).

The other case you have 3 million peer-groups and also you have 50 values in each peer-group. So you have total 1.5 billion values again.

Would it go faster if you have less peer groups but more values in each peer-group? Or would it go faster with more peer-groups but less less values in each peer-group.

I could test the first case (30.000 peer-groups and 50.000 values per peer group) and it took around 16 mins. But I can't test for the second case.

Can you write an approximate prognose for run-time in case when I have 3 million peer-groups and also 50 values in each peer-group?

One more dimension for the question. Would it be faster to do those statistics if I use Proc SQL instead?

Example code is below:

proc summary data = table_blu missing chartype;
   class var1 var2; /* Var1 and var2 are toghether peer-group  */
   var values;

   output out = stattable(rename = (_type_ = type) drop = _freq_)
   n=n min=min max=max mean=mean std=std q1=q1 q3=q3 p10=p10 p90=p90 p95=p95 
;
run;

904

asked Apr 20 '17 12:04

user3714330

Video Answer

1 Answers

So there are a number of things to think about here.

The first point and quite possibly the largest in terms of performance is getting the data from DB2 into SAS. (I'm assuming this is not an in database instance of SAS -- correct me if it is). That's a big table and moving it across the wire takes time. Because of that, if you can calculate all these statistics inside DB2 with an SQL statement, that will probably be your fastest option.

So assuming you've downloaded the table to the SAS server:

A table sorted by the CLASS variables will be MUCH faster to process than an unsorted table. If SAS knows the table is sorted, it doesn't have to scan the table for records to go into a group, it can do block reads instead of random IO.

If the table is not sorted, then the larger the number of groups, then more table scans that have to occur.

The point is, the speed of getting data from the HD to the CPU will be paramount in an unsorted process.

From there, you get into a memory and cpu issue. PROC SUMMARY is multithreaded and SAS will read N groups at a time. If group size can fit into the memory allocated for that thread, you won't have an issue. If the group size is too large, then SAS will have to page.

I scaled down the problem to a 15M row example:

%let grps=3000;
%let pergrp=5000;

UNSORTED:

NOTE: There were 15000000 observations read from the data set
      WORK.TEST.
NOTE: The data set WORK.SUMMARY has 3001 observations and 9
      variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           20.88 seconds
      cpu time            31.71 seconds

SORTED:

NOTE: There were 15000000 observations read from the data set
      WORK.TEST.
NOTE: The data set WORK.SUMMARY has 3001 observations and 9
      variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           5.44 seconds
      cpu time            11.26 seconds

=============================

%let grps=300000;
%let pergrp=50;

UNSORTED:

NOTE: There were 15000000 observations read from the data set
      WORK.TEST.
NOTE: The data set WORK.SUMMARY has 300001 observations and 9
      variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           19.26 seconds
      cpu time            41.35 seconds

SORTED:

NOTE: There were 15000000 observations read from the data set
      WORK.TEST.
NOTE: The data set WORK.SUMMARY has 300001 observations and 9
      variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           5.43 seconds
      cpu time            10.09 seconds

I ran these a few times and the run times were similar. Sorted times are about equal and way faster.

The more groups / less per group was faster unsorted, but look at the total CPU usage, it is higher. My laptop has an extremely fast SSD so IO was probably not the limiting factor -- the HD was able to keep up with the multi-core CPU's demands. On a system with a slower HD, the total run times could be different.

In the end, it depends too much on how the data is structured and the specifics of your server and DB.

answered Sep 22 '22 18:09

DomPazz

Related questions
                            
                                Creating Surrogate Key table with other unique constraints
                            
                                postgres query with IN is very slow
                            
                                what's the difference between is not null and <>' '
                            
                                Get substring between second and fourth slash
                            
                                HQL left join with condition
                            
                                Oracle REGEXP_LIKE doesn't work as expected
                            
                                Strange translation of jOOQ query for array contains function
                            
                                SQL select items that make datetime range between flag toggle
                            
                                MySQL Performance - string vs integer
                            
                                SQL: How to change a row order position
                            
                                How to add xmlns in the root in XML in SQL Server 2014
                            
                                Oracle SQL - SELECT query locks index & blocks DML sessions
                            
                                What is the difference between NOT condition and NOT() in Oracle and MS SQL Server
                            
                                Can't commit changes to table with Datagrip
                            
                                Making a CHECK CONSTRAINT with OR in postgres SQL
                            
                                How to store datetime with millisecond precision in SQL database
                            
                                PostgreSQL - Start A Transaction block IN Function
                            
                                how to change column value in spark sql
                            
                                How to get history of table structure change in SQL Server
                            
                                How is SQL Injection Possible When Using Bind Variables?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Which statistics is calculated faster in SAS, proc summary?

Tags:

sql

statistics

db2

sas

proc-sql

user3714330

People also ask

Video Answer

1 Answers

DomPazz

Recent Activity

Donate For Us