
Why is the maximum length of a SAS character field 32,767?

Tags:

sas

According to The Little SAS Book, SAS character data can be up to 2^15 - 1 (32,767) characters in length.

Where does that one character go? In floating-point arithmetic, we usually reserve one bit for the sign of the number. Is something similar happening for SAS character data?

Demetri Pananos asked Nov 24 '25 17:11


1 Answer

I don't have a definite answer, but I have a supposition.

I think the 32,767 limit is not related to how the field itself is stored. SAS stores all of the rows of an uncompressed dataset in identically sized blocks, so there is no need for a per-field length indicator or a null terminator. For example, take the dataset created by the following data step:

data want;
  length name $8;
  input recnum name $ age;
datalines;
01 Johnny 13
02 Nancy 12
03 Rachel 14
04 Madison 12
05 Dennis 15
;;;;
run;

On disk you'd have something like this. (The headers are of course not written out as text this way; they are just packed sequences of bytes.)

<dataset header>
Dataset name: Want
Dataset record size: 24 bytes
... etc. ...
<subheaders>
Name character type length=8
Recnum numeric type length=8
Age numeric type length=8
... etc. ...
<first row of data follows>
4A6F686E6E792020000000010000000D
4E616E6379202020000000020000000C
52616368656C2020000000030000000E
4D616469736F6E20000000040000000C
44656E6E69732020000000050000000F
<end of data>

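The variables run directly into one another, and SAS knows where each one starts and stops from the information in the subheaders. (The hex above is just the output of a PUT statement, and it shows the numerics as 4-byte integers, so each row here is 16 bytes rather than the 24 the header describes. I think in the actual file the numeric variables are stored first, if I remember correctly, but the idea is the same.)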
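To make the fixed-width idea concrete, here is a minimal Python sketch (not SAS, and not the real .sas7bdat layout) that packs the five rows the way the hex illustration shows them. The layout is an assumption taken from that illustration only: an 8-byte space-padded name followed by recnum and age as 4-byte big-endian integers. The names RECORD, pack_row, unpack_row, and read_rows are hypothetical.

```python
import struct

# Assumed layout, mirroring the hex illustration only:
# 8-byte space-padded name, then recnum and age as 4-byte big-endian integers.
# (Real SAS numerics are 8-byte floating-point values.)
RECORD = struct.Struct(">8sII")

rows = [
    ("Johnny", 1, 13),
    ("Nancy", 2, 12),
    ("Rachel", 3, 14),
    ("Madison", 4, 12),
    ("Dennis", 5, 15),
]

def pack_row(name, recnum, age):
    """Pack one row into a fixed-width record; no length prefix, no terminator."""
    return RECORD.pack(name.ljust(8).encode("ascii"), recnum, age)

def unpack_row(buf):
    """Split a record back into fields using only the known field widths."""
    name, recnum, age = RECORD.unpack(buf)
    return name.decode("ascii").rstrip(), recnum, age

def read_rows(data):
    """Walk the byte stream in steps of the record size."""
    return [
        unpack_row(data[i:i + RECORD.size])
        for i in range(0, len(data), RECORD.size)
    ]

data = b"".join(pack_row(*row) for row in rows)

for i in range(0, len(data), RECORD.size):
    print(data[i:i + RECORD.size].hex().upper())
```

Because every record has the same size, the reader needs nothing but the field widths from the header to find where each value begins, which is the point the answer is making.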

Technically, the .sas7bdat format is not publicly documented, but several people have worked out most of how the file format works. Some R programmers have written a specification which, while a bit challenging to read, does give some information.

It indicates that 4 bytes are used to specify the field length, which is more than enough for 32,767 (a signed 4-byte integer goes up to about 2 billion), so this isn't the definitive answer. I suppose the length may originally have been stored in 2 bytes and changed to 4 at some later point in the development of SAS, though .sas7bdat was a totally new file type created relatively recently (version 7, hence sas7bdat; we're on 9 now).
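The arithmetic behind that supposition can be checked with a short Python sketch. It only shows that 32,767 is the largest value a signed 2-byte integer can hold and that a signed 4-byte integer reaches about 2 billion; whether SAS ever used a 2-byte length field is an assumption, not something the specification confirms. The helper names signed_max and fits_signed_16 are hypothetical.

```python
import struct

def signed_max(nbytes):
    """Largest value of a two's-complement signed integer of nbytes bytes."""
    return 2 ** (8 * nbytes - 1) - 1

def fits_signed_16(n):
    """True if n can be stored in a signed 2-byte integer."""
    try:
        struct.pack(">h", n)
        return True
    except struct.error:
        return False

print(signed_max(2))   # 32767
print(signed_max(4))   # 2147483647
print(fits_signed_16(32767), fits_signed_16(32768))   # True False
```

One bit of the 16 goes to the sign, which is where the "minus 1" in 2^15 - 1 comes from.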

Another possibility, and perhaps the more likely one, is that before 1999 the ANSI C standard only required C compilers to support objects of at least 32767 bytes, meaning a conforming compiler didn't have to support arrays larger than 32767 bytes. While many compilers did support much larger arrays and objects, it's possible that SAS worked to the minimum the standard guaranteed in order to avoid issues across different operating systems and hardware. See this discussion of the ANSI C standards for some background. It's also possible that a similar limitation in another language is at fault here, since SAS is written in several different ones. [Credit to FriedEgg for the beginning of this idea (offline).]

Joe answered Nov 27 '25 16:11