While preparing an answer to Count how many different values a list takes in Mathematica I came across an instability (for lack of a better term) in both DeleteDuplicates
and Tally
that I do not understand.
Consider first:
a = {2.2000000000000005, 2.2, 2.1999999999999999};
a // InputForm
DeleteDuplicates@a // InputForm
Union@a // InputForm
Tally@a // InputForm
{2.2000000000000006`, 2.2, 2.1999999999999997`}
{2.2000000000000006`, 2.2, 2.1999999999999997`}
{2.1999999999999997`, 2.2, 2.2000000000000006`}
{{2.2000000000000006`, 3}}
This behavior is as I expected in each case. Tally
compensates for the slight numerical differences and sees each element as being equivalent. Union
and DeleteDuplicates
see all elements as unique. (This behavior of Tally
is not documented to my knowledge, but I have made use of it before.)
Now, consider this complication:
a = {11/5, 2.2000000000000005, 2.2, 2.1999999999999997};
a // InputForm
DeleteDuplicates@a // InputForm
Union@a // InputForm
Tally@a // InputForm
{11/5, 2.2000000000000006, 2.2, 2.1999999999999997}
{11/5, 2.2000000000000006, 2.2}
{2.1999999999999997, 2.2, 11/5, 2.2000000000000006}
{{11/5, 1}, {2.2000000000000006, 1}, {2.2, 2}}
The output of Union
is as anticipated, but the results from both DeleteDuplicates
and Tally
are surprising.
Why does DeleteDuplicates
suddenly see 2.1999999999999997
as a duplicate to be eliminated?
Why does Tally
suddenly see 2.2000000000000006
and 2.2
as distinct, when it did not before?
As a related point, it can be seen that packed arrays affect Tally
:
a = {2.2000000000000005, 2.2, 2.1999999999999999};
a // InputForm
Tally@a // InputForm
{2.2000000000000006, 2.2, 2.1999999999999997}
{{2.2000000000000006`, 3}}
a = Developer`ToPackedArray@a;
a // InputForm
Tally@a // InputForm
{2.2000000000000006, 2.2, 2.1999999999999997}
{{2.2000000000000006`, 1}, {2.2, 2}}
The exhibited behaviour appears to be the result of a the usual woes associated with floating point arithmetic coupled with some questionable behaviour in some of the functions under discussion.
SameQ Is Not An Equivalence Relation
First on the slate: consider that SameQ
is not an equivalence relation because it is not transitive:
In[1]:= $a = {11/5, 2.2000000000000005, 2.2, 2.1999999999999997};
In[2]:= SameQ[$a[[2]], $a[[3]]]
Out[2]= True
In[3]:= SameQ[$a[[3]], $a[[4]]]
Out[3]= True
In[4]:= SameQ[$a[[2]], $a[[4]]]
Out[4]= False (* !!! *)
So right out the gate, we are faced with erratic behaviour even before turning to the other functions.
This behaviour is due to the documented rule for SameQ
that says that two real numbers are treated as "equal" if they "differ in their last binary digit":
In[5]:= {# // InputForm, Short@RealDigits[#, 2][[1, -10;;]]} & /@ $a[[2;;4]] // TableForm
(* showing only the last ten binary digits for each *)
Out[5]//TableForm= 2.2000000000000006 {0,1,1,0,0,1,1,0,1,1}
2.2 {0,1,1,0,0,1,1,0,1,0}
2.1999999999999997 {0,1,1,0,0,1,1,0,0,1}
Note that, strictly speaking, $a[[3]]
and $a[[4]]
differ in the last two binary digits, but the magnitude of the difference is one bit of the lowest order.
DeleteDuplicates Does Not Really Use SameQ
Next, consider that the documentation states that DeleteDuplicates[...]
is equivalent to DeleteDuplicates[..., SameQ]
. Well, that is strictly true -- but probably not in the sense that you might expect:
In[6]:= DeleteDuplicates[$a] // InputForm
Out[6]//InputForm= {11/5, 2.2000000000000006, 2.2}
In[7]:= DeleteDuplicates[$a, SameQ] // InputForm
Out[7]//InputForm= {11/5, 2.2000000000000006, 2.2}
The same, as documented... but what about this:
In[8]:= DeleteDuplicates[$a, SameQ[#1, #2]&] // InputForm
Out[8]//InputForm= {11/5, 2.2000000000000006, 2.1999999999999997}
It appears that DeleteDuplicates
goes through a different branch of logic when the comparison function is manifestly SameQ
as opposed to a function whose behaviour is identical to SameQ
.
Tally is... Confused
Tally
shows similar, but not identical, erratic behaviour:
In[9]:= Tally[$a] // InputForm
Out[9]//InputForm= {{11/5, 1}, {2.2000000000000006, 1}, {2.2, 2}}
In[10]:= Tally[$a, SameQ] // InputForm
Out[10]//InputForm= {{11/5, 1}, {2.2000000000000006, 1}, {2.2, 2}}
In[11]:= Tally[$a, SameQ[#1, #2]&] // InputForm
Out[11]//InputForm= {{11/5, 1}, {2.2000000000000006, 1}, {2.2000000000000006, 2}}
That last is particularly baffling, since the same number appears twice in the list with different counts.
Equal Suffers Similar Problems
Now, back to the problem of floating point equality. Equal
fares a little bit better than SameQ
-- but emphasis on "little". Equal
looks at the last seven binary digits instead of the last one. That doesn't fix the problem, though... troublesome cases can always be found:
In[12]:= $x1 = 0.19999999999999823;
$x2 = 0.2;
$x3 = 0.2000000000000018;
In[15]:= Equal[$x1, $x2]
Out[15]= True
In[16]:= Equal[$x2, $x3]
Out[16]= True
In[17]:= Equal[$x1, $x3]
Out[17]= False (* Oops *)
The Villain Unmasked
The main culprit in all of this discussion is the floating-point real number format. It is simply not possible to represent arbitrary real numbers in full fildelity using a finite format. This is why Mathematica stresses symbolic form and makes every possible attempt to work with expressions in symbolic form for as long as possible. If one finds numeric forms to be unavoidable, then one must wade into that swamp called numerical analysis to sort out all of the corner cases involving equality and inequality.
Poor SameQ
, Equal
, DeleteDuplicates
, Tally
and all of their friends never stood a chance.
In my opinion, relying on anything for Tally
or DeleteDuplicates
with default (SameQ
-like based) comparison function and numerical values is relying on implementation details, because SameQ
does not have a well-defined semantics on numerical values. What you see is what is normally called "undefined behavior" in other languages. What one should be doing to get robust results is to use
DeleteDuplicates[a,Equal]
or
Tally[a,Equal]
and similarly for Union
(although I would not use Union
since explicit test leads to quadratic complexity for it). OTOH, if your desire is to understand the internal implementation details because you want to make use of them, I can not say much except to warn that this may cause more harm than good, particularly because these implementations are subject to change from version to version - even assuming that you get all their details right for some particular version.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With