I'm curious about SAS's use of memory, sorting, and why it seems to be so inefficient.
I have a quad-core Xeon with 8GB of RAM and a 3GB dataset. Why, at any given time during a standard PROC SORT, is a mere 120MB of RAM in use, with only 15-20% CPU utilization? Something horribly inefficient seems to be going on with the procedure.
Since I have the memory available, I'd expect it to load the entire dataset and then saturate every available CPU cycle. But only 15%? That's a stunning waste of resources, and it bothers me. It seems to be constantly going back and forth to the disk, which is painfully slow.
Is there some magical setting that says "SAS, you can use everything to go faster" that I'm missing?
64bit OS running 64bit SAS, btw.
One way to limit the amount of virtual memory that SAS can use is to specify a value for the MEMSIZE= system option when you invoke SAS. Under OS/390, MEMSIZE= defaults to 0, which means SAS can use memory up to the maximum amount available.
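As a sketch: MEMSIZE= (and SORTSIZE=) can be set at invocation, either on the command line or in the configuration file. The 6G/4G values below are assumptions sized for an 8GB machine, not values from SAS documentation:

```sas
/* In the SAS configuration file (e.g. sasv9.cfg), or as
   command-line options at invocation: sas -memsize 6G -sortsize 4G */
-MEMSIZE 6G
-SORTSIZE 4G
```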
The sort routine that SAS uses can be based either on the number of observations in a data set or on the size of the data set. When the SORTPGM system option is set to BEST, SAS uses the first available and pertinent sorting algorithm, in this order of precedence:

1. the host sort utility
2. the SAS sort utility
However, for larger datasets PROC SQL can be faster than PROC SORT. Also, sorting character data seems to be easier (read: faster) than sorting numeric data.
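For comparison, here is a minimal sketch of the two equivalent approaches (dataset and variable names are hypothetical):

```sas
/* Ordering with PROC SORT */
proc sort data=work.big out=work.big_sorted;
   by acct_id txn_date;
run;

/* The same ordering with PROC SQL */
proc sql;
   create table work.big_sorted as
   select *
   from work.big
   order by acct_id, txn_date;
quit;
```

Which one wins depends on the data and the host, so it's worth timing both on your own dataset.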
TAGSORT option: do not specify TAGSORT if you want SAS to use multiple threads to sort. When you specify the TAGSORT option, only the sort keys (that is, the variables specified in the BY statement) and the observation number for each observation are stored in the temporary files.
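TAGSORT goes on the PROC SORT statement; this sketch uses hypothetical names. It trades threaded sorting for a much smaller temporary-file footprint, so it mainly helps when disk space, not time, is the constraint:

```sas
/* TAGSORT keeps only the BY keys plus observation numbers in the
   utility files, but it disables threaded sorting. */
proc sort data=work.big out=work.big_sorted tagsort;
   by acct_id;
run;
```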
You might check your MEMSIZE and SORTSIZE settings. More discussion about sort performance is here.
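One way to check what is currently in effect is PROC OPTIONS, which writes the values to the log. SORTSIZE can also be raised in open code, while MEMSIZE is fixed at invocation; the 3G value below is an assumption sized to the 3GB dataset in the question:

```sas
/* Write the current settings to the log */
proc options option=(memsize sortsize);
run;

/* SORTSIZE can be changed mid-session; MEMSIZE cannot */
options sortsize=3G;
```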
The thing with sort is that it's generally not the sorting itself that takes the time; it's reading the data set in and writing it out again. Sorting is, comparatively, quick. So with a 3GB data set, significant time is spent just waiting for the disk to supply the data. SAS can overlap sorting parts of the data with reading more of it in, but the job is still likely to be I/O bound.

That said, MEMSIZE and SORTSIZE will at least let you make maximum use of your available memory. You want to ensure that SAS reads the entire data set in, sorts it in one go, and then writes it out again. With less memory, or if MEMSIZE/SORTSIZE are not suitably configured, SAS will sort the data set in chunks and then have to merge those chunks.

You really want to avoid a multi-pass sort if at all possible, as it roughly doubles the time: SAS has to go through the whole data set sorting chunks, then go through all the data again merging those chunks. I think the SAS log gives hints as to whether it is multi-pass sorting or not.
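To get more of those hints in the log, you can turn up logging before the sort. FULLSTIMER and MSGLEVEL=I are standard SAS system options, though the exact note text varies by host:

```sas
/* FULLSTIMER adds detailed resource usage (memory, I/O, CPU) to the
   log; MSGLEVEL=I adds informational notes, including which sort
   utility was used. */
options fullstimer msglevel=i;

proc sort data=work.big out=work.big_sorted;
   by acct_id;
run;
```

Comparing the reported memory usage against SORTSIZE gives a rough idea of whether the sort fit in one pass.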
In general, that's not how SAS works. SAS keeps your data on your disk drives and only reads a small portion of it at a time. To me, that's the advantage of SAS: I use SAS for data that can't fit in RAM.
You might be interested in Stata, R, or another package that keeps your data in RAM. It's pretty easy to move back & forth between the programs, even for the same project.