Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Any ideas why R and Python's NumPy scaling of vectors is not matching?

I have the following Python code and output:

>>> import numpy as np
>>> s = [12.40265325, -1.3362417499999921, 6.8768662500000062, 25.673127166666703, 19.733372250000002, 21.649556250000003, 7.1676752500000021, -0.85349583333329804, 23.130314250000012, 20.074925250000007, -0.29701574999999281, 17.078694250000012, 3.3652611666666985, 19.491246250000003, -0.76856974999999039, -1.8838917499999965, -6.8547018333333085, 4.5195052500000088, 5.9882702500000136, -9.5889237499999922, 13.98170916666669, -2.3662137499999929, 12.111165250000013, -6.8334957499999902, -21.379336749999993, 8.4651301666666967, 2.5094612500000082, -0.21386274999998989, 5.1226162500000072, 14.283680166666699, -4.3340977499999909, -2.7831607499999933, 8.2339832500000085, -12.841856749999991, -6.4984398333333075, -6.2773697499999912, -13.638411749999996, -15.90088974999999, -8.2505068333333043, -19.616496749999996, -4.4346607499999919, -10.056376749999991, -13.581729833333299, -8.2284047499999957, -4.5957137499999945, -5.3758427499999968, -12.254779833333302, 11.207287250000007, -12.848971749999997, -14.449801749999992, -17.247984749999993, -17.475253833333305]
>>> np.mean(s)
1.3664283380001927e-14
>>> np.std(s)
12.137473069268983
>>> (s - np.mean(s)) / np.std(s)
array([ 1.02184806, -0.11009225,  0.56658138,  2.1151954 , ...

When I run this in R, the results do not match:

> options(digits=16)
> s = c(12.40265325, -1.3362417499999921, 6.8768662500000062, 25.673127166666703, 19.733372250000002, 21.649556250000003, 7.1676752500000021, -0.85349583333329804, 23.130314250000012, 20.074925250000007, -0.29701574999999281, 17.078694250000012, 3.3652611666666985, 19.491246250000003, -0.76856974999999039, -1.8838917499999965, -6.8547018333333085, 4.5195052500000088, 5.9882702500000136, -9.5889237499999922, 13.98170916666669, -2.3662137499999929, 12.111165250000013, -6.8334957499999902, -21.379336749999993, 8.4651301666666967, 2.5094612500000082, -0.21386274999998989, 5.1226162500000072, 14.283680166666699, -4.3340977499999909, -2.7831607499999933, 8.2339832500000085, -12.841856749999991, -6.4984398333333075, -6.2773697499999912, -13.638411749999996, -15.90088974999999, -8.2505068333333043, -19.616496749999996, -4.4346607499999919, -10.056376749999991, -13.581729833333299, -8.2284047499999957, -4.5957137499999945, -5.3758427499999968, -12.254779833333302, 11.207287250000007, -12.848971749999997, -14.449801749999992, -17.247984749999993, -17.475253833333305)
> mean(s)
[1] 1.243449787580175e-14
> sd(s)
[1] 12.25589024484334
> (s - mean(s)) / sd(s)
 [1]  1.01197489551755737 -0.10902853430514588  2.09475824715945480  0.56110703609584245 ...

I know the differences are pretty minor, but this is a bit of an issue with my application. Also of note, the R results match Stata's results too.

Note: I'm using Python 2.7.2, NumpPy 1.6.1, R 2.15.2 GUI 1.53 Leopard build 64-bit (6335)

like image 530
Dolan Antenucci Avatar asked Oct 04 '13 22:10

Dolan Antenucci


1 Answers

For the std, which is clearly off by some substantial amount, in numpy, std returns sqrt(sum((x-x.mean())**2)) / (n-ddof) where ddof=0 by default. I guess R assumes ddof=1, because:

In [7]: s.std()
Out[7]: 12.137473069268983

In [8]: s.std(ddof=1)
Out[8]: 12.255890244843339

and:

> sd(s)
[1] 12.25589

I can't explain the mean, but since it's basically zero in each case, I would call it precision issues. numpy would report them as being 'close enough' under default tolerances:

In [5]: np.isclose(s.mean(), 1.24345e-14)
Out[5]: True

The other answers discuss this issue better than I can.

like image 179
askewchan Avatar answered Nov 15 '22 21:11

askewchan