Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is the md5 function safe to use for merging datasets?

Tags:

md5

sas

We are about to promote a piece of code which uses the SAS md5() hash function to efficiently track changes in a large dataset.

format md5 $hex32.;
md5=md5(cats(of _all_));

As per the documentation:

The MD5 function converts a string, based on the MD5 algorithm, into a 128-bit hash value. This hash value is referred to as a message digest (digital signature), which is nearly unique for each string that is passed to the function.

At approximately what stage does 'nearly unique' begin to pose a data integrity risk (if at all)?

like image 603
Allan Bowe Avatar asked Dec 20 '25 20:12

Allan Bowe


1 Answers

I have seen an example where the md5 comparison goes wrong. If you have the values "AB" and "CD" in the (two columns of the) first row and "ABC" and "D" in the second row, they got the same md5 value. See this example:

data md5;
  attrib a b length=$3 informat=$3.;
  infile datalines;
  input a b;
  format md5 $hex32.;
  md5=md5(cats(of _all_));
datalines;
AB CD
A BCD
;run;

This is, of course, because the CATS(of _all_) will concatinate and strip the variables (converting numbers to string using the "best" format), without a delimiter. If you use CAT instead , this will not happen because the leading and trailing blanks are not removed. This error is not very far fetched. If you have missing values, then this could occur more often. If, for example, you have a lot of binary values in text variables, some of which are missing, it could occur very often.

One could do this manually, adding a delimiter in between the values. Of course, you would still have the case when you have ("AB!" and "CD") and ("AB" and "!CD") and you use "!" as delimiter...

like image 180
Stig Eide Avatar answered Dec 23 '25 08:12

Stig Eide



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!