I need a theoretical answer.
Imagine that you have a table with 1.5 billion rows (the table is created as column-based with DB2-Blu).
You are using SAS and you will do some statistics by using Proc Summary
like min/max/mean values, standard deviation value and percentile-10, percentile-90 through your peer-groups.
For instance, you have 30.000 peer-groups and you have 50.000 values in each peer group (Total 1.5 billions values).
The other case you have 3 million peer-groups and also you have 50 values in each peer-group. So you have total 1.5 billion values again.
Would it go faster if you have less peer groups but more values in each peer-group? Or would it go faster with more peer-groups but less less values in each peer-group.
I could test the first case (30.000 peer-groups and 50.000 values per peer group) and it took around 16 mins. But I can't test for the second case.
Can you write an approximate prognose for run-time in case when I have 3 million peer-groups and also 50 values in each peer-group?
One more dimension for the question. Would it be faster to do those statistics if I use Proc SQL
instead?
Example code is below:
proc summary data = table_blu missing chartype;
class var1 var2; /* Var1 and var2 are toghether peer-group */
var values;
output out = stattable(rename = (_type_ = type) drop = _freq_)
n=n min=min max=max mean=mean std=std q1=q1 q3=q3 p10=p10 p90=p90 p95=p95
;
run;
The default statistics of Proc Summary are N, MIN, MAX, MEAN and STD. The data set contains 4 variables. The variable height is simply the value of the computed statistic. Besides that, SAS creates three new variables.
The key difference between PROC MEANS and PROC SUMMARY is that the default action of PROC MEANS is to place the analyses it performs in to your Output Window and in PROC SUMMARY the default is to create an output data set.
By default, PROC MEANS generates N, Mean, Standard Deviation, Minimum and Maximum statistics.
Computes descriptive statistics for variables across all observations or within groups of observations. See: For full syntax details, see PROC MEANS Statement.
Summary or Descriptive statistics in SAS is obtained using multiple ways like PROC Means and PROC Univariate. Summary statistic of all columns in SAS. Descriptive or Summary statistics of single column in SAS.
Summary or Descriptive statistics of a column by Groups in SAS : PROC MEANS Summary or Descriptive statistics of a column (MPG) by Groups (Luxury) in SAS using PROC MEANS. CLASS statement is used to define groups. 1
When we specify NWAY, Proc Summary limits the output statistics to the observations with the highest _TYPE_ value. This means, that SAS outputs only the observations where all class variables (if any) contribute to the statistic. Consequently, no overall statistics appear in the output.
Apply PROC PRINT to the output SAS data set, which may result in a LOT of output if you have several CLASS variables.
So there are a number of things to think about here.
The first point and quite possibly the largest in terms of performance is getting the data from DB2 into SAS. (I'm assuming this is not an in database instance of SAS -- correct me if it is). That's a big table and moving it across the wire takes time. Because of that, if you can calculate all these statistics inside DB2 with an SQL statement, that will probably be your fastest option.
So assuming you've downloaded the table to the SAS server:
A table sorted by the CLASS
variables will be MUCH faster to process than an unsorted table. If SAS knows the table is sorted, it doesn't have to scan the table for records to go into a group, it can do block reads instead of random IO.
If the table is not sorted, then the larger the number of groups, then more table scans that have to occur.
The point is, the speed of getting data from the HD to the CPU will be paramount in an unsorted process.
From there, you get into a memory and cpu issue. PROC SUMMARY is multithreaded and SAS will read N groups at a time. If group size can fit into the memory allocated for that thread, you won't have an issue. If the group size is too large, then SAS will have to page.
I scaled down the problem to a 15M row example:
%let grps=3000;
%let pergrp=5000;
UNSORTED:
NOTE: There were 15000000 observations read from the data set
WORK.TEST.
NOTE: The data set WORK.SUMMARY has 3001 observations and 9
variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 20.88 seconds
cpu time 31.71 seconds
SORTED:
NOTE: There were 15000000 observations read from the data set
WORK.TEST.
NOTE: The data set WORK.SUMMARY has 3001 observations and 9
variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 5.44 seconds
cpu time 11.26 seconds
=============================
%let grps=300000;
%let pergrp=50;
UNSORTED:
NOTE: There were 15000000 observations read from the data set
WORK.TEST.
NOTE: The data set WORK.SUMMARY has 300001 observations and 9
variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 19.26 seconds
cpu time 41.35 seconds
SORTED:
NOTE: There were 15000000 observations read from the data set
WORK.TEST.
NOTE: The data set WORK.SUMMARY has 300001 observations and 9
variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 5.43 seconds
cpu time 10.09 seconds
I ran these a few times and the run times were similar. Sorted times are about equal and way faster.
The more groups / less per group was faster unsorted, but look at the total CPU usage, it is higher. My laptop has an extremely fast SSD so IO was probably not the limiting factor -- the HD was able to keep up with the multi-core CPU's demands. On a system with a slower HD, the total run times could be different.
In the end, it depends too much on how the data is structured and the specifics of your server and DB.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With