Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Instability in DeleteDuplicates and Tally

While preparing an answer to Count how many different values a list takes in Mathematica I came across an instability (for lack of a better term) in both DeleteDuplicates and Tally that I do not understand.

Consider first:

a = {2.2000000000000005, 2.2, 2.1999999999999999};

a // InputForm
DeleteDuplicates@a // InputForm
Union@a // InputForm
Tally@a // InputForm
   {2.2000000000000006`, 2.2, 2.1999999999999997`}
   {2.2000000000000006`, 2.2, 2.1999999999999997`}
   {2.1999999999999997`, 2.2, 2.2000000000000006`}
   {{2.2000000000000006`, 3}}

This behavior is as I expected in each case. Tally compensates for the slight numerical differences and sees each element as being equivalent. Union and DeleteDuplicates see all elements as unique. (This behavior of Tally is not documented to my knowledge, but I have made use of it before.)

Now, consider this complication:

a = {11/5, 2.2000000000000005, 2.2, 2.1999999999999997};

a // InputForm
DeleteDuplicates@a // InputForm
Union@a // InputForm
Tally@a // InputForm
   {11/5, 2.2000000000000006, 2.2, 2.1999999999999997}
   {11/5, 2.2000000000000006, 2.2}
   {2.1999999999999997, 2.2, 11/5, 2.2000000000000006}
   {{11/5, 1}, {2.2000000000000006, 1}, {2.2, 2}}

The output of Union is as anticipated, but the results from both DeleteDuplicates and Tally are surprising.

  • Why does DeleteDuplicates suddenly see 2.1999999999999997 as a duplicate to be eliminated?

  • Why does Tally suddenly see 2.2000000000000006 and 2.2 as distinct, when it did not before?


As a related point, it can be seen that packed arrays affect Tally:

a = {2.2000000000000005, 2.2, 2.1999999999999999};
a // InputForm
Tally@a // InputForm
   {2.2000000000000006, 2.2, 2.1999999999999997}
   {{2.2000000000000006`, 3}}
a = Developer`ToPackedArray@a;
a // InputForm
Tally@a // InputForm
   {2.2000000000000006, 2.2, 2.1999999999999997}
   {{2.2000000000000006`, 1}, {2.2, 2}}
like image 465
Mr.Wizard Avatar asked May 29 '11 09:05

Mr.Wizard


2 Answers

The exhibited behaviour appears to be the result of a the usual woes associated with floating point arithmetic coupled with some questionable behaviour in some of the functions under discussion.

SameQ Is Not An Equivalence Relation

First on the slate: consider that SameQ is not an equivalence relation because it is not transitive:

In[1]:= $a = {11/5, 2.2000000000000005, 2.2, 2.1999999999999997};

In[2]:= SameQ[$a[[2]], $a[[3]]]
Out[2]= True

In[3]:= SameQ[$a[[3]], $a[[4]]]
Out[3]= True

In[4]:= SameQ[$a[[2]], $a[[4]]]
Out[4]= False                     (* !!! *)

So right out the gate, we are faced with erratic behaviour even before turning to the other functions.

This behaviour is due to the documented rule for SameQ that says that two real numbers are treated as "equal" if they "differ in their last binary digit":

In[5]:= {# // InputForm, Short@RealDigits[#, 2][[1, -10;;]]} & /@ $a[[2;;4]] // TableForm
(* showing only the last ten binary digits for each *)
Out[5]//TableForm= 2.2000000000000006  {0,1,1,0,0,1,1,0,1,1}
                   2.2                 {0,1,1,0,0,1,1,0,1,0}
                   2.1999999999999997  {0,1,1,0,0,1,1,0,0,1}

Note that, strictly speaking, $a[[3]] and $a[[4]] differ in the last two binary digits, but the magnitude of the difference is one bit of the lowest order.

DeleteDuplicates Does Not Really Use SameQ

Next, consider that the documentation states that DeleteDuplicates[...] is equivalent to DeleteDuplicates[..., SameQ]. Well, that is strictly true -- but probably not in the sense that you might expect:

In[6]:= DeleteDuplicates[$a] // InputForm
Out[6]//InputForm= {11/5, 2.2000000000000006, 2.2}

In[7]:= DeleteDuplicates[$a, SameQ] // InputForm
Out[7]//InputForm= {11/5, 2.2000000000000006, 2.2}

The same, as documented... but what about this:

In[8]:= DeleteDuplicates[$a, SameQ[#1, #2]&] // InputForm
Out[8]//InputForm= {11/5, 2.2000000000000006, 2.1999999999999997}

It appears that DeleteDuplicates goes through a different branch of logic when the comparison function is manifestly SameQ as opposed to a function whose behaviour is identical to SameQ.

Tally is... Confused

Tally shows similar, but not identical, erratic behaviour:

In[9]:= Tally[$a] // InputForm
Out[9]//InputForm=  {{11/5, 1}, {2.2000000000000006, 1}, {2.2, 2}}

In[10]:= Tally[$a, SameQ] // InputForm
Out[10]//InputForm= {{11/5, 1}, {2.2000000000000006, 1}, {2.2, 2}}

In[11]:= Tally[$a, SameQ[#1, #2]&] // InputForm
Out[11]//InputForm= {{11/5, 1}, {2.2000000000000006, 1}, {2.2000000000000006, 2}}

That last is particularly baffling, since the same number appears twice in the list with different counts.

Equal Suffers Similar Problems

Now, back to the problem of floating point equality. Equal fares a little bit better than SameQ -- but emphasis on "little". Equal looks at the last seven binary digits instead of the last one. That doesn't fix the problem, though... troublesome cases can always be found:

In[12]:= $x1 = 0.19999999999999823;
         $x2 = 0.2;
         $x3 = 0.2000000000000018;

In[15]:= Equal[$x1, $x2]
Out[15]= True

In[16]:= Equal[$x2, $x3]
Out[16]= True

In[17]:= Equal[$x1, $x3]
Out[17]= False             (* Oops *)

The Villain Unmasked

The main culprit in all of this discussion is the floating-point real number format. It is simply not possible to represent arbitrary real numbers in full fildelity using a finite format. This is why Mathematica stresses symbolic form and makes every possible attempt to work with expressions in symbolic form for as long as possible. If one finds numeric forms to be unavoidable, then one must wade into that swamp called numerical analysis to sort out all of the corner cases involving equality and inequality.

Poor SameQ, Equal, DeleteDuplicates, Tally and all of their friends never stood a chance.

like image 92
WReach Avatar answered Nov 14 '22 01:11

WReach


In my opinion, relying on anything for Tally or DeleteDuplicates with default (SameQ-like based) comparison function and numerical values is relying on implementation details, because SameQ does not have a well-defined semantics on numerical values. What you see is what is normally called "undefined behavior" in other languages. What one should be doing to get robust results is to use

DeleteDuplicates[a,Equal]

or

Tally[a,Equal]

and similarly for Union (although I would not use Union since explicit test leads to quadratic complexity for it). OTOH, if your desire is to understand the internal implementation details because you want to make use of them, I can not say much except to warn that this may cause more harm than good, particularly because these implementations are subject to change from version to version - even assuming that you get all their details right for some particular version.

like image 30
Leonid Shifrin Avatar answered Nov 14 '22 01:11

Leonid Shifrin