mean from pandas and numpy differ

Tags: python, floating-point, floating-accuracy, pandas, numpy

I have a MEMS IMU on which I've been collecting data and I'm using pandas to get some statistical data from it. There are 6 32-bit floats collected each cycle. Data rates are fixed for a given collection run. The data rates vary between 100Hz and 1000Hz and the collection times run as long as 72 hours. The data is saved in a flat binary file. I read the data this way:

    import numpy as np
    import pandas as pd

    # six little-endian 32-bit floats per sample; field names must be unique
    dataType = np.dtype([('a','<f4'),('b','<f4'),('c','<f4'),('d','<f4'),('e','<f4'),('f','<f4')])
    df = pd.DataFrame(np.fromfile('FILENAME', dataType))
    df['c'].mean()
    -9.880581855773926
    x = df['c'].values
    x.mean()
    -9.8332081

-9.833 is the correct result. I can create a similar result that someone should be able to repeat this way:

    import numpy as np
    import pandas as pd

    x = np.random.normal(-9.8, .05, size=900000)
    df = pd.DataFrame(x, dtype='float32', columns=['x'])
    df['x'].mean()
    -9.859579086303711
    x.mean()
    -9.8000648778888628

I've repeated this on Linux and Windows, on AMD and Intel processors, in Python 2.7 and 3.5. I'm stumped. What am I doing wrong? And get this:

    x = np.random.normal(-9., .005, size=900000)
    df = pd.DataFrame(x, dtype='float32', columns=['x'])
    df['x'].mean()
    -8.999998092651367
    x.mean()
    -9.0000075889406528

I could accept this difference. It's at the limit of the precision of 32 bit floats.
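For scale, the spacing between adjacent float32 values near 9 can be checked with NumPy. This is only a small sketch; the variable names are made up for illustration.

```python
import numpy as np

# Gap between 9.0 and the next representable float32 value (2**-20).
ulp_at_9 = np.spacing(np.float32(9.0))
print(ulp_at_9)

# How many float32 steps the pandas result above sits away from -9.0.
pandas_result = np.float32(-8.999998092651367)
steps_from_9 = (pandas_result + np.float32(9.0)) / ulp_at_9
print(steps_from_9)
```

The pandas value in this second experiment is only a couple of representable steps away from -9.0, which is why it looks tolerable compared with the first experiment.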

NEVERMIND. I wrote this on Friday and the solution hit me this morning. It is a floating point precision problem exacerbated by the large amount of data. I needed to convert the data into 64 bit float on the creation of the dataframe this way:

df=pd.DataFrame(np.fromfile('FILENAME',dataType),dtype='float64')
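If converting the whole frame to float64 is too expensive for a long collection run, one possible alternative is to keep the float32 storage and ask only the reduction to accumulate in float64. The sketch below uses the dtype argument of NumPy's mean; the synthetic array is an assumption standing in for the recorded column, not the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
x64 = rng.normal(-9.8, 0.05, size=900_000)
x32 = x64.astype(np.float32)   # same storage type as the binary file

# Accumulate in float64 while the stored data stays float32.
mean_f64_acc = x32.mean(dtype=np.float64)
print(mean_f64_acc, x64.mean())
```

With a DataFrame, the same call can be made on the array returned by df['c'].values.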

I'll leave the post should anyone else run into a similar issue.


asked Oct 29 '18 09:10

Rob

1 Answer

Short version:

The results differ because pandas hands the mean operation to bottleneck (if it's installed) instead of relying on numpy. bottleneck is presumably used because it appears to be faster than numpy (at least on my machine), but that speed comes at the cost of precision. The two happen to match for 64-bit input, but differ in 32-bit land (which is the interesting part).

Long version:

It's extremely difficult to tell what's going on just by reading the source code of these modules (they're quite complex, even for simple computations like mean; turns out numerical computing is hard). It's best to use the debugger rather than brain-compiling the code and making the mistakes that come with that. The debugger won't make a mistake in logic; it'll tell you exactly what's going on.

Here's some of my stack trace (values differ slightly since no seed for RNG):

Can reproduce (Windows):

    >>> import numpy as np; import pandas as pd
    >>> x=np.random.normal(-9.,.005,size=900000)
    >>> df=pd.DataFrame(x,dtype='float32',columns=['x'])
    >>> df['x'].mean()
    -9.0
    >>> x.mean()
    -9.0000037501099754
    >>> x.astype(np.float32).mean()
    -9.0000029

Nothing extraordinary going on with numpy's version. It's the pandas version that's a little wacky.

Let's have a look inside df['x'].mean():

    >>> def test_it_2():
    ...   import pdb; pdb.set_trace()
    ...   df['x'].mean()
    >>> test_it_2()
    ... # Some stepping/poking around that isn't important
    (Pdb) l
    2307
    2308            if we have an ndarray as a value, then simply perform the operation,
    2309            otherwise delegate to the object
    2310
    2311            """
    2312 ->         delegate = self._values
    2313            if isinstance(delegate, np.ndarray):
    2314                # Validate that 'axis' is consistent with Series's single axis.
    2315                self._get_axis_number(axis)
    2316                if numeric_only:
    2317                    raise NotImplementedError('Series.{0} does not implement '
    (Pdb) delegate.dtype
    dtype('float32')
    (Pdb) l
    2315                self._get_axis_number(axis)
    2316                if numeric_only:
    2317                    raise NotImplementedError('Series.{0} does not implement '
    2318                                              'numeric_only.'.format(name))
    2319                with np.errstate(all='ignore'):
    2320 ->                 return op(delegate, skipna=skipna, **kwds)
    2321
    2322            return delegate._reduce(op=op, name=name, axis=axis, skipna=skipna,
    2323                                    numeric_only=numeric_only,
    2324                                    filter_type=filter_type, **kwds)

So we found the trouble spot, but now things get kind of weird:

    (Pdb) op
    <function nanmean at 0x000002CD8ACD4488>
    (Pdb) op(delegate)
    -9.0
    (Pdb) delegate_64 = delegate.astype(np.float64)
    (Pdb) op(delegate_64)
    -9.000003749978807
    (Pdb) delegate.mean()
    -9.0000029
    (Pdb) delegate_64.mean()
    -9.0000037499788075
    (Pdb) np.nanmean(delegate, dtype=np.float64)
    -9.0000037499788075
    (Pdb) np.nanmean(delegate, dtype=np.float32)
    -9.0000029

Note that delegate.mean() and np.nanmean output -9.0000029 with type float32, not -9.0 as pandas nanmean does. With a bit of poking around, you can find the source of pandas nanmean in pandas.core.nanops. Interestingly, at first glance it looks like it should match numpy. Let's have a look at pandas nanmean:

    (Pdb) import inspect
    (Pdb) src = inspect.getsource(op).split("\n")
    (Pdb) for line in src: print(line)
    @disallow('M8')
    @bottleneck_switch()
    def nanmean(values, axis=None, skipna=True):
        values, mask, dtype, dtype_max = _get_values(values, skipna, 0)

        dtype_sum = dtype_max
        dtype_count = np.float64
        if is_integer_dtype(dtype) or is_timedelta64_dtype(dtype):
            dtype_sum = np.float64
        elif is_float_dtype(dtype):
            dtype_sum = dtype
            dtype_count = dtype
        count = _get_counts(mask, axis, dtype=dtype_count)
        the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))

        if axis is not None and getattr(the_sum, 'ndim', False):
            the_mean = the_sum / count
            ct_mask = count == 0
            if ct_mask.any():
                the_mean[ct_mask] = np.nan
        else:
            the_mean = the_sum / count if count > 0 else np.nan

        return _wrap_results(the_mean, dtype)

Here's a (short) version of the bottleneck_switch decorator:

    import bottleneck as bn
    ...
    class bottleneck_switch(object):

        def __init__(self, **kwargs):
            self.kwargs = kwargs

        def __call__(self, alt):
            bn_name = alt.__name__

            try:
                bn_func = getattr(bn, bn_name)
            except (AttributeError, NameError):  # pragma: no cover
                bn_func = None
        ...

                    if (_USE_BOTTLENECK and skipna and
                            _bn_ok_dtype(values.dtype, bn_name)):
                        result = bn_func(values, axis=axis, **kwds)

This is called with alt as the pandas nanmean function, so bn_name is 'nanmean', and this is the attr that's grabbed from the bottleneck module:

    (Pdb) l
     93                             result = np.empty(result_shape)
     94                             result.fill(0)
     95                             return result
     96
     97                     if (_USE_BOTTLENECK and skipna and
     98  ->                         _bn_ok_dtype(values.dtype, bn_name)):
     99                         result = bn_func(values, axis=axis, **kwds)
    100
    101                         # prefer to treat inf/-inf as NA, but must compute the fun
    102                         # twice :(
    103                         if _has_infs(result):
    (Pdb) n
    > d:\anaconda3\lib\site-packages\pandas\core\nanops.py(99)f()
    -> result = bn_func(values, axis=axis, **kwds)
    (Pdb) alt
    <function nanmean at 0x000001D2C8C04378>
    (Pdb) alt.__name__
    'nanmean'
    (Pdb) bn_func
    <built-in function nanmean>
    (Pdb) bn_name
    'nanmean'
    (Pdb) bn_func(values, axis=axis, **kwds)
    -9.0
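The name-based lookup that the decorator performs can be sketched in isolation. This is not the pandas implementation: fast_lib, USE_FAST and fast_switch are hypothetical names, with fast_lib standing in for the bottleneck module and USE_FAST for _USE_BOTTLENECK.

```python
import types

# Hypothetical stand-in for the bottleneck module: it only provides nanmean.
fast_lib = types.SimpleNamespace(nanmean=lambda values: "fast path")
USE_FAST = True  # plays the role of _USE_BOTTLENECK


def fast_switch(alt):
    """Prefer the same-named function from fast_lib over `alt`, if it exists."""
    fast_func = getattr(fast_lib, alt.__name__, None)

    def wrapper(values):
        if USE_FAST and fast_func is not None:
            return fast_func(values)
        return alt(values)

    return wrapper


@fast_switch
def nanmean(values):
    return "fallback path"


@fast_switch
def nanmedian(values):  # fast_lib has no nanmedian, so the fallback runs
    return "fallback path"


print(nanmean([1.0, 2.0]))
print(nanmedian([1.0, 2.0]))
```

The point of the sketch is that the decorated function's own body is skipped whenever a same-named fast function is found and the switch is on, which is what the debugger session above shows for nanmean.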

Pretend for a second that the bottleneck_switch() decorator doesn't exist. Stepping through this function manually (without bottleneck) gets you the same result as numpy:

    (Pdb) from pandas.core.nanops import _get_counts
    (Pdb) from pandas.core.nanops import _get_values
    (Pdb) from pandas.core.nanops import _ensure_numeric
    (Pdb) values, mask, dtype, dtype_max = _get_values(delegate, skipna=skipna)
    (Pdb) count = _get_counts(mask, axis=None, dtype=dtype)
    (Pdb) count
    900000.0
    (Pdb) values.sum(axis=None, dtype=dtype) / count
    -9.0000029

That code never gets called, though, if you have bottleneck installed. Instead, the bottleneck_switch() decorator replaces the nanmean function with bottleneck's version. This is where the discrepancy lies (interestingly, the two match in the float64 case, though):

    (Pdb) import bottleneck as bn
    (Pdb) bn.nanmean(delegate)
    -9.0
    (Pdb) bn.nanmean(delegate.astype(np.float64))
    -9.000003749978807

bottleneck is used solely for speed, as far as I can tell. I'm assuming they're taking some type of shortcut with their nanmean function, but I didn't look into it much (see @ead's answer for details on this topic). You can see that it's typically a bit faster than numpy by their benchmarks: https://github.com/kwgoodman/bottleneck. Clearly the price to pay for this speed is precision.
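One plausible explanation, which is an assumption here and not something confirmed above, is that the fast path adds the elements one at a time into a float32 accumulator, while numpy's reduction over a contiguous array groups the additions so rounding errors grow more slowly. The sketch below imitates the one-at-a-time case with a float32 cumsum and compares it with numpy's float32 sum on synthetic data shaped like the question's.

```python
import numpy as np

rng = np.random.default_rng(0)
x32 = rng.normal(-9.8, 0.05, size=900_000).astype(np.float32)

# Reference value: same float32 data, float64 accumulator.
reference = x32.mean(dtype=np.float64)

# Sequential float32 accumulation: cumsum adds one element at a time,
# so its last entry is the naive running total.
naive_mean = np.cumsum(x32, dtype=np.float32)[-1] / np.float32(x32.size)

# numpy's own float32 reduction over the contiguous array.
numpy_mean = x32.sum(dtype=np.float32) / np.float32(x32.size)

print(reference, naive_mean, numpy_mean)
```

Once the running total reaches the millions, the gap between neighbouring float32 values is a sizeable fraction of each -9.8 sample, so every addition gets rounded and the naive mean drifts by several hundredths, similar in size to the error reported in the question.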

Is bottleneck actually faster?

Sure looks like it (at least on my machine).

    In [1]: import numpy as np; import pandas as pd

    In [2]: x=np.random.normal(-9.8,.05,size=900000)

    In [3]: y_32 = x.astype(np.float32)

    In [13]: %timeit np.nanmean(y_32)
    100 loops, best of 3: 5.72 ms per loop

    In [14]: %timeit bn.nanmean(y_32)
    1000 loops, best of 3: 854 µs per loop

It might be nice for pandas to introduce a flag here (one for speed, the other for better precision, default is for speed since that's the current impl). Some users care much more about the accuracy of the computation than the speed at which it happens.
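For what it's worth, pandas exposes an option named compute.use_bottleneck that controls whether the bottleneck path may be used. The sketch below switches it off inside a context manager and computes the mean of a float32 Series; the synthetic data is an assumption, and whether a given pandas version routes mean through bottleneck by default is not something this sketch relies on.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x64 = rng.normal(-9.8, 0.05, size=900_000)
s32 = pd.Series(x64, dtype="float32")

# Disable the bottleneck fast path for this block only.
with pd.option_context("compute.use_bottleneck", False):
    mean_without_bn = s32.mean()

print(mean_without_bn, x64.mean())
```

The same option can be set globally with pd.set_option if the precision matters more than the speed for the whole session.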

HTH.

answered Sep 28 '22 08:09

Matt Messersmith
