Output values differ between R and Python?

Tags:

python

r

debugging

numpy

statistics

Perhaps I am doing something wrong while z-normalizing my array. Can someone take a look at this and suggest what's going on?

In R:

> data <- c(2.02, 2.33, 2.99, 6.85, 9.20, 8.80, 7.50, 6.00, 5.85, 3.85, 4.85, 3.85, 2.22, 1.45, 1.34)
> data.mean <- mean(data)
> data.sd <- sqrt(var(data))
> data.norm <- (data - data.mean) / data.sd
> print(data.norm)
 [1] -0.9796808 -0.8622706 -0.6123005  0.8496459  1.7396910  1.5881940  1.0958286  0.5277147  0.4709033 -0.2865819
[11]  0.0921607 -0.2865819 -0.9039323 -1.1955641 -1.2372258

In Python using numpy:

>>> import string
>>> import numpy as np
>>> from scipy.stats import norm
>>> data = np.array([np.array([2.02, 2.33, 2.99, 6.85, 9.20, 8.80, 7.50, 6.00, 5.85, 3.85, 4.85, 3.85, 2.22, 1.45, 1.34])])
>>> data -= np.split(np.mean(data, axis=1), data.shape[0])
>>> data *= np.split(1.0/data.std(axis=1), data.shape[0])
>>> print(data)
[[-1.01406602 -0.89253491 -0.63379126  0.87946705  1.80075126  1.64393692
   1.13429034  0.54623659  0.48743122 -0.29664045  0.09539539 -0.29664045
  -0.93565885 -1.23752644 -1.28065039]]

Am I using numpy incorrectly?

asked Jun 28 '12 01:06

Legend

2 Answers

The reason you're getting different results has to do with how the standard deviation/variance is calculated. R's var() uses the denominator N-1, while numpy's std() uses the denominator N by default. You can get a numpy result equal to the R result by using data.std(ddof=1) (or data.std(axis=1, ddof=1) for the 2-D array in the question), which tells numpy to use N-1 as the denominator when calculating the variance.
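
As a minimal sketch of the point above (the variable names z_r_style and z_numpy_default are only illustrative), the same data can be normalized with both denominators to see which one matches the R output:

```python
import numpy as np

data = np.array([2.02, 2.33, 2.99, 6.85, 9.20, 8.80, 7.50, 6.00,
                 5.85, 3.85, 4.85, 3.85, 2.22, 1.45, 1.34])

# ddof=1 divides by N-1 (sample standard deviation), as R's var() does.
z_r_style = (data - data.mean()) / data.std(ddof=1)

# The default ddof=0 divides by N (population standard deviation).
z_numpy_default = (data - data.mean()) / data.std()

print(z_r_style[:3])        # should agree with the first values printed by R
print(z_numpy_default[:3])  # should agree with the numpy output in the question
```

With ddof=1 the first values should come out close to -0.9796808, -0.8622706, -0.6123005, which are the numbers R printed.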

answered Oct 03 '22 19:10

BrenBarn

I believe that your NumPy result is correct. I would do the normalization in a simpler way, though:

>>> data = np.array([2.02, 2.33, 2.99, 6.85, 9.20, 8.80, 7.50, 6.00, 5.85, 3.85, 4.85, 3.85, 2.22, 1.45, 1.34])
>>> data -= data.mean()
>>> data /= data.std()
>>> data
array([-1.01406602, -0.89253491, -0.63379126,  0.87946705,  1.80075126,
        1.64393692,  1.13429034,  0.54623659,  0.48743122, -0.29664045,
        0.09539539, -0.29664045, -0.93565885, -1.23752644, -1.28065039])
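
If the data really has to stay two-dimensional, as in the question, one possible alternative to the np.split calls is to let broadcasting handle the row-wise statistics. This is only a sketch, assuming a NumPy version that supports the keepdims argument; the names data_2d, row_mean and row_std are illustrative:

```python
import numpy as np

# Same values as in the question, kept as a 1 x 15 array.
data_2d = np.array([[2.02, 2.33, 2.99, 6.85, 9.20, 8.80, 7.50, 6.00,
                     5.85, 3.85, 4.85, 3.85, 2.22, 1.45, 1.34]])

# keepdims=True keeps the reduced axis with length 1, so the results
# have shape (1, 1) and broadcast against each row of data_2d.
row_mean = data_2d.mean(axis=1, keepdims=True)
row_std = data_2d.std(axis=1, keepdims=True)  # denominator N (numpy default)

z = (data_2d - row_mean) / row_std
print(z)
```

This normalizes each row independently and should give the same numbers as the one-dimensional version above.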

The difference between your two results lies in the scaling. With r holding the R result:

>>> r / data
array([ 0.96609173,  0.96609173,  0.96609173,  0.96609179,  0.96609179,
        0.96609181,  0.9660918 ,  0.96609181,  0.96609179,  0.96609179,
        0.9660918 ,  0.96609179,  0.96609175,  0.96609176,  0.96609177])

Thus, your two results are simply proportional to each other (up to rounding in the printed values). You may therefore want to compare the standard deviations obtained with R and with Python.

PS: Now that I think of it, the variance may not be defined in the same way in NumPy and in R: for N elements, some tools divide by N-1 instead of N when calculating the variance. You may want to check this.

PPS: Here is the reason for the discrepancy: the two tools use different normalization conventions for the variance, and the observed factor is simply sqrt(14/15) = 0.9660917… (because the data has 15 elements). Thus, to obtain the Python result from the R result, you need to divide the R result by this factor.
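
The factor can be checked numerically. The following is a small sketch (the names z_n, z_n_minus_1 and factor are illustrative) that computes both normalizations in NumPy and compares their ratio with sqrt((N-1)/N):

```python
import numpy as np

data = np.array([2.02, 2.33, 2.99, 6.85, 9.20, 8.80, 7.50, 6.00,
                 5.85, 3.85, 4.85, 3.85, 2.22, 1.45, 1.34])
n = data.size  # 15 elements

centered = data - data.mean()
z_n = centered / data.std()                # denominator N (numpy default)
z_n_minus_1 = centered / data.std(ddof=1)  # denominator N-1 (R convention)

# Ratio of the two conventions: std_N / std_(N-1) = sqrt((N-1)/N).
# The 1.0 forces float division under Python 2 as well.
factor = np.sqrt((n - 1.0) / n)

print(factor)
print(np.allclose(z_n_minus_1 / z_n, factor))
print(np.allclose(z_n_minus_1 / factor, z_n))
```

Dividing the N-1 result by the factor recovers the N result, which is the conversion described above.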

answered Oct 03 '22 17:10

Eric O Lebigot
