Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

binning data in python with scipy/numpy

People also ask

How do you binning data in Python?

Smoothing by bin means : In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. Smoothing by bin median : In this method each bin value is replaced by its bin median value.

Is SciPy a superset of NumPy?

1. SciPy builds on NumPy. All the numerical code resides in SciPy. The SciPy module consists of all the NumPy functions.


It's probably faster and easier to use numpy.digitize():

import numpy
data = numpy.random.random(100)
bins = numpy.linspace(0, 1, 10)
digitized = numpy.digitize(data, bins)
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]

An alternative to this is to use numpy.histogram():

bin_means = (numpy.histogram(data, bins, weights=data)[0] /
             numpy.histogram(data, bins)[0])

Try for yourself which one is faster... :)


The Scipy (>=0.11) function scipy.stats.binned_statistic specifically addresses the above question.

For the same example as in the previous answers, the Scipy solution would be

import numpy as np
from scipy.stats import binned_statistic

data = np.random.rand(100)
bin_means = binned_statistic(data, data, bins=10, range=(0, 1))[0]

Not sure why this thread got necroed; but here is a 2014 approved answer, which should be far faster:

import numpy as np

data = np.random.rand(100)
bins = 10
slices = np.linspace(0, 100, bins+1, True).astype(np.int)
counts = np.diff(slices)

mean = np.add.reduceat(data, slices[:-1]) / counts
print mean

The numpy_indexed package (disclaimer: I am its author) contains functionality to efficiently perform operations of this type:

import numpy_indexed as npi
print(npi.group_by(np.digitize(data, bins)).mean(data))

This is essentially the same solution as the one I posted earlier; but now wrapped in a nice interface, with tests and all :)


I would add, and also to answer the question find mean bin values using histogram2d python that the scipy also have a function specially designed to compute a bidimensional binned statistic for one or more sets of data

import numpy as np
from scipy.stats import binned_statistic_2d

x = np.random.rand(100)
y = np.random.rand(100)
values = np.random.rand(100)
bin_means = binned_statistic_2d(x, y, values, bins=10).statistic

the function scipy.stats.binned_statistic_dd is a generalization of this funcion for higher dimensions datasets