
How do I stop SAS from adding an extra empty byte to every string variable when I use PROC EXPORT?

Tags:

sas

stata

When I export a dataset to Stata format using PROC EXPORT, SAS 9.4 automatically adds an extra (empty) byte to every observation of every string variable. For example, in this data set:

data test1;
    input cust_id   $ 1
          month       3-8
          category  $ 10-12 
          status    $ 14-14
;
datalines;
A 200003 ABC C
A 200004 DEF C
A 200006 XYZ 3
B 199910 ASD X
B 199912 ASD C
;
run;

proc export data = test1
    file = "test1.dta"
    dbms = stata replace;
run;

the variables cust_id, category, and status should be str1, str3, and str1 in the final Stata file, and thus take up 1 byte, 3 bytes, and 1 byte, respectively, for every observation. However, SAS automatically adds an extra empty byte to each observation, which expands their data types to str2, str4, and str2 in the exported Stata file.

This is extremely problematic because that's an extra byte added to every observation of every string variable. For large datasets (I have some with ~530 million observations and numerous string variables), this can add several gigabytes to the exported file.
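
As a rough back-of-the-envelope check on that figure, here is a minimal sketch of the arithmetic. The count of 10 string variables is a hypothetical stand-in for "numerous"; only the ~530 million observations comes from my actual data. With those numbers the padding alone comes to roughly 5 GB.

```sas
data _null_;
   n_obs       = 530e6;                /* ~530 million observations */
   n_str_vars  = 10;                   /* hypothetical number of string variables */
   extra_bytes = n_obs * n_str_vars;   /* one padding byte per string value */
   extra_gib   = extra_bytes / 1024**3;
   put extra_gib= 8.2;                 /* write the overhead in GiB to the log */
run;
```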

Once the file is loaded into Stata, the compress command in Stata can automatically remove these empty bytes and shrink the file, but for large datasets, PROC EXPORT adds so many extra bytes to the file that I don't always have enough memory to load the dataset into Stata in the first place.

Is there a way to stop SAS from padding the string variables in the first place? When I export a file with a one-character string variable (for example), I want that variable stored as a one-character string variable in the output file.

Michael A asked Jun 10 '15 20:06


1 Answer

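This is how you can do it using existing functions. The data step below writes each observation to a fixed-width text file in which every variable takes exactly its format width: CALL VNEXT walks the variable names, VVALUEX returns each formatted value, and VFORMATWX returns its format width.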

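filename FT41F001 temp;                   /* temporary output file */
data _null_;
   file FT41F001;
   set test1;
   put 256*' ' @;                         /* fill the buffer with blanks and hold the line */
   __s=1;                                 /* start column of the next field */
   do while(1);
      length __name $32.;
      call vnext(__name);                 /* next variable name */
      if missing(__name) or __name eq: '__' then leave;   /* stop at the helper variables */
      substr(_FILE_,__s) = vvaluex(__name);   /* place the formatted value at column __s */
      putlog _all_;                       /* debugging output; drop this for large data */
      __s = sum(__s,vformatwx(__name));   /* advance by the format width */
      end;
   _file_ = trim(_file_);
   put;
   format month f6.;
   run;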

To avoid the use of _FILE_:

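data _null_;
   file FT41F001;
   set test1;
   __s=1;
   do while(1);
      length __name $32. __value $128 __w 8;
      call vnext(__name);                 /* next variable name */
      if missing(__name) or __name eq: '__' then leave;   /* stop at the helper variables */
      __value = vvaluex(__name);          /* formatted value */
      __w = vformatwx(__name);            /* format width */
      put __value $varying128. __w @;     /* write exactly __w characters and hold the line */
      end;
   put;                                   /* release the completed line */
   format month f6.;
   run;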
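
To see which columns each variable ends up in, a sketch along the same lines can log the layout instead of writing data. It assumes the same test1 dataset and the same f6. format on month, and it reads only the first observation; the start columns and widths it reports could then feed a fixed-width import on the Stata side.

```sas
data _null_;
   set test1(obs=1);                      /* one observation is enough to get the layout */
   format month f6.;
   length __name $32;
   __s = 1;                               /* start column of the current variable */
   do while(1);
      call vnext(__name);                 /* next variable name */
      if missing(__name) or __name eq: '__' then leave;   /* stop at the helper variables */
      __w = vformatwx(__name);            /* format width of that variable */
      putlog __name= __s= __w=;           /* name, start column, width */
      __s = sum(__s, __w);
   end;
run;
```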
data _null_ answered Sep 27 '22 17:09