
Running median of y-values over a range of x

Below is a scatter plot I constructed from two numpy arrays.

[Image: scatter plot example]

What I'd like to add to this plot is a running median of y over a range of x. I've photoshopped in an example:

[Image: modified scatter plot]

Specifically, I need the median for data points in bins of 1 unit along the x axis between two values (this range will vary between many plots, but I can manually adjust it). I appreciate any tips that can point me in the right direction.

mjcowley asked Apr 22 '14 11:04


3 Answers

I would use np.digitize to do the bin sorting for you. This way you can easily apply any function and set the range you are interested in.

import numpy as np
import matplotlib.pyplot as plt

N = 2000
total_bins = 10

# Sample data
X = np.random.random(size=N)*10
Y = X**2 + np.random.random(size=N)*X*10

# Bin edges; np.digitize returns k when bins[k-1] <= x < bins[k]
bins = np.linspace(X.min(), X.max(), total_bins)
delta = bins[1]-bins[0]
idx = np.digitize(X, bins)
# k = 0 is always empty here (no x lies below bins[0]), so start at 1
ks = range(1, total_bins)
centers = bins[1:] - delta/2
running_median = [np.median(Y[idx==k]) for k in ks]

plt.scatter(X, Y, color='k', alpha=.2, s=2)
plt.plot(centers, running_median, 'r--', lw=4, alpha=.8)
plt.axis('tight')
plt.show()

As an example of the versatility of the method, let's add error bars given by the standard deviation of each bin:

running_std = [Y[idx==k].std() for k in ks]
plt.errorbar(centers, running_median,
             running_std, fmt='none')
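For the case described in the question (bins of 1 unit between two chosen x values), the same np.digitize idea might be wrapped up as follows. This is only a sketch: `binned_median`, `x_lo`, `x_hi` and `width` are hypothetical names, and it assumes the range is a whole number of bins wide. Points outside the range are ignored and empty bins are left as NaN.

```python
import numpy as np

def binned_median(x, y, x_lo, x_hi, width=1.0):
    """Median of y in bins of `width` along x, covering [x_lo, x_hi)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Bin edges x_lo, x_lo + width, ..., x_hi (assumes a whole number of bins)
    edges = np.arange(x_lo, x_hi + width / 2, width)
    n_bins = len(edges) - 1
    # np.digitize gives k when edges[k-1] <= x < edges[k];
    # 0 and n_bins + 1 mark points outside the range
    idx = np.digitize(x, edges)
    centers = 0.5 * (edges[:-1] + edges[1:])
    medians = np.full(n_bins, np.nan)
    for k in range(1, n_bins + 1):
        in_bin = idx == k
        if in_bin.any():  # leave NaN where a bin has no points
            medians[k - 1] = np.median(y[in_bin])
    return centers, medians
```

The returned `centers` and `medians` can be passed straight to `plt.plot`; matplotlib leaves a gap in the line wherever a median is NaN.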

Hooked answered Nov 17 '22 08:11


This problem can also be tackled efficiently with pandas (the Python Data Analysis Library), which offers native methods for cutting and aggregating data.

Consider this (kudos and +1 to @Hooked, whose example X and Y data I borrowed):

import pandas as pd

# X, Y and bins are the arrays defined in @Hooked's example above
df = pd.DataFrame({'X': X, 'Y': Y})    # build a dataframe from the data

# cut the data following the bins; include_lowest keeps the point at bins[0]
data_cut = pd.cut(df.X, bins, include_lowest=True)
grp = df.groupby(data_cut, observed=True)    # group the rows by bin

ret = grp.median()    # median of X and of Y within each bin

# plotting
plt.scatter(df.X, df.Y, color='k', alpha=.2, s=2)
plt.plot(ret.X, ret.Y, 'r--', lw=4, alpha=.8)
plt.show()

Remark: here the x values of the red curve are the bin-wise medians of x (the bin midpoints could be used instead).
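If the bin midpoints are preferred as x coordinates, one possible variant is sketched below. `median_at_bin_midpoints` is a hypothetical helper name, not a pandas function; it assumes `edges` is an increasing sequence of bin edges and relies on `pd.cut(..., labels=False)` returning the integer bin number for each point.

```python
import numpy as np
import pandas as pd

def median_at_bin_midpoints(x, y, edges):
    """Return (bin midpoints, median of y per bin) for the non-empty bins."""
    edges = np.asarray(edges, dtype=float)
    df = pd.DataFrame({"x": x, "y": y})
    # labels=False yields the integer bin number, NaN for points outside the edges
    df["bin"] = pd.cut(df["x"], edges, labels=False, include_lowest=True)
    med = df.dropna(subset=["bin"]).groupby("bin")["y"].median()
    mids = 0.5 * (edges[:-1] + edges[1:])
    # Keep only the midpoints of bins that actually contain data
    return mids[med.index.to_numpy().astype(int)], med.to_numpy()
```

Note that `pd.cut` uses right-closed intervals by default, so a point exactly on an edge falls in the bin to its left.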


Acorbe answered Nov 17 '22 08:11


You can create a function based on numpy.median() that will calculate the median value given the intervals:

import numpy as np

def medians(x, y, intervals):
    out = []
    for xmin, xmax in intervals:
        mask = (x >= xmin) & (x < xmax)
        out.append(np.median(y[mask]))
    return np.array(out)

Then use this function for the desired intervals:

import matplotlib.pyplot as plt

# x and y are the two numpy arrays behind the scatter plot
intervals = ((18, 19), (19, 20), (20, 21), (21, 22))
centers = [(xmin+xmax)/2. for xmin, xmax in intervals]

plt.plot(centers, medians(x, y, intervals))
plt.show()
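Since the range changes from plot to plot, the interval pairs could be generated rather than typed out. A minimal sketch, assuming the range is a whole number of bins wide (`unit_intervals` is a hypothetical helper name):

```python
def unit_intervals(x_start, x_stop, width=1.0):
    """Consecutive (xmin, xmax) pairs of the given width covering [x_start, x_stop)."""
    # Assumes (x_stop - x_start) is a whole multiple of width
    n = int(round((x_stop - x_start) / width))
    return [(x_start + i * width, x_start + (i + 1) * width) for i in range(n)]

intervals = unit_intervals(18, 22)
centers = [(xmin + xmax) / 2 for xmin, xmax in intervals]
```

The resulting `intervals` list has the same shape as the hand-written tuple above, so it can be passed to `medians` unchanged.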

Saullo G. P. Castro answered Nov 17 '22 09:11