How to calculate the Kolmogorov-Smirnov statistic between two weighted samples

Question

Let's say that we have two samples data1 and data2 with their respective weights weight1 and weight2 and that we want to calculate the Kolmogorov-Smirnov statistic between the two weighted samples.

The way we do that in python follows:

import numpy as np

def ks_w(data1,data2,wei1,wei2):
    ix1=np.argsort(data1)
    ix2=np.argsort(data2)
    wei1=wei1[ix1]
    wei2=wei2[ix2]
    data1=data1[ix1]
    data2=data2[ix2]
    d=0.
    fn1=0.
    fn2=0.
    j1=0
    j2=0
    j1w=0.
    j2w=0.
    while(j1<len(data1))&(j2<len(data2)):
        d1=data1[j1]
        d2=data2[j2]
        w1=wei1[j1]
        w2=wei2[j2]
        if d1<=d2:
            j1+=1
            j1w+=w1
            fn1=(j1w)/sum(wei1)
        if d2<=d1:
            j2+=1
            j2w+=w2
            fn2=(j2w)/sum(wei2)
        if abs(fn2-fn1)>d:
            d=abs(fn2-fn1)
    return d

where we just modify to our purpose the classical two-sample KS statistic as implemented in Press, Flannery, Teukolsky, Vetterling - Numerical Recipes in C - Cambridge University Press - 1992 - pag.626.

Our questions are:

is anybody aware of any other way to do it?
is there any library in python/R/* that performs it?
what about the test? Does it exist or should we use a reshuffling procedure in order to evaluate the statistic?

Luca Jokull · Accepted Answer

This solution is based on the code for scipy.stats.ks_2samp and runs in about 1/10000 the time (notebook):

import numpy as np

def ks_w2(data1, data2, wei1, wei2):
    ix1 = np.argsort(data1)
    ix2 = np.argsort(data2)
    data1 = data1[ix1]
    data2 = data2[ix2]
    wei1 = wei1[ix1]
    wei2 = wei2[ix2]
    data = np.concatenate([data1, data2])
    cwei1 = np.hstack([0, np.cumsum(wei1)/sum(wei1)])
    cwei2 = np.hstack([0, np.cumsum(wei2)/sum(wei2)])
    cdf1we = cwei1[[np.searchsorted(data1, data, side='right')]]
    cdf2we = cwei2[[np.searchsorted(data2, data, side='right')]]
    return np.max(np.abs(cdf1we - cdf2we))

Here's a test of its accuracy and performance:

ds1 = np.random.rand(10000)
ds2 = np.random.randn(40000) + .2
we1 = np.random.rand(10000) + 1.
we2 = np.random.rand(40000) + 1.

ks_w2(ds1, ds2, we1, we2)
# 0.4210415232236593
ks_w(ds1, ds2, we1, we2)
# 0.4210415232236593

%timeit ks_w2(ds1, ds2, we1, we2)
# 100 loops, best of 3: 17.1 ms per loop
%timeit ks_w(ds1, ds2, we1, we2)
# 1 loop, best of 3: 3min 44s per loop

Alberto Stefanelli · Answer

This is a R version of a two-tails weighted KS statistic following the suggestion of Numerical Methods of Statistics by Monohan, pg. 334 in 1E and pg. 358 in 2E.

ks_weighted <- function(vector_1,vector_2,weights_1,weights_2){
    F_vec_1 <- ewcdf(vector_1, weights = weights_1, normalise=FALSE)
    F_vec_2 <- ewcdf(vector_2, weights = weights_2, normalise=FALSE)
    xw <- c(vector_1,vector_2) 
    d <- max(abs(F_vec_1(xw) - F_vec_2(xw)))

    ## P-VALUE with NORMAL SAMPLE 
    # n_vector_1 <- length(vector_1)                                                           
    # n_vector_2<- length(vector_2)        
    # n <- n_vector_1 * n_vector_2/(n_vector_1 + n_vector_2)

    # P-VALUE EFFECTIVE SAMPLE SIZE as suggested by Monahan
    n_vector_1 <- sum(weights_1)^2/sum(weights_1^2)
    n_vector_2 <- sum(weights_2)^2/sum(weights_2^2)
    n <- n_vector_1 * n_vector_2/(n_vector_1 + n_vector_2)


    pkstwo <- function(x, tol = 1e-06) {
                if (is.numeric(x)) 
                    x <- as.double(x)
                else stop("argument 'x' must be numeric")
                p <- rep(0, length(x))
                p[is.na(x)] <- NA
                IND <- which(!is.na(x) & (x > 0))
                if (length(IND)) 
                    p[IND] <- .Call(stats:::C_pKS2, p = x[IND], tol)
                p
            }

    pval <- 1 - pkstwo(sqrt(n) * d)

    out <- c(KS_Stat=d, P_value=pval)
    return(out)
}

Tim Williams · Answer

To add to Luca Jokull's answer, if you want to also return a p-value (similar to the unweighted scipy.stats.ks_2samp function), the suggested ks_w2() function can be modified as follows:

from scipy.stats import distributions

def ks_weighted(data1, data2, wei1, wei2, alternative='two-sided'):
    ix1 = np.argsort(data1)
    ix2 = np.argsort(data2)
    data1 = data1[ix1]
    data2 = data2[ix2]
    wei1 = wei1[ix1]
    wei2 = wei2[ix2]
    data = np.concatenate([data1, data2])
    cwei1 = np.hstack([0, np.cumsum(wei1)/sum(wei1)])
    cwei2 = np.hstack([0, np.cumsum(wei2)/sum(wei2)])
    cdf1we = cwei1[np.searchsorted(data1, data, side='right')]
    cdf2we = cwei2[np.searchsorted(data2, data, side='right')]
    d = np.max(np.abs(cdf1we - cdf2we))
    # calculate p-value
    n1 = data1.shape[0]
    n2 = data2.shape[0]
    m, n = sorted([float(n1), float(n2)], reverse=True)
    en = m * n / (m + n)
    if alternative == 'two-sided':
        prob = distributions.kstwo.sf(d, np.round(en))
    else:
        z = np.sqrt(en) * d
        # Use Hodges' suggested approximation Eqn 5.3
        # Requires m to be the larger of (n1, n2)
        expt = -2 * z**2 - 2 * z * (m + 2*n)/np.sqrt(m*n*(m+n))/3.0
        prob = np.exp(expt)
    return d, prob

This is the asymptotic method that scipy's original unweighted function uses.

How to calculate the Kolmogorov-Smirnov statistic between two weighted samples

Tags:

python

scipy

kolmogorov-smirnov

Luca Jokull

3 Answers

Luca Jokull

Alberto Stefanelli

Tim Williams

Recent Activity

Donate For Us

How to calculate the Kolmogorov-Smirnov statistic between two weighted samples

Tags:

python

scipy

kolmogorov-smirnov

Luca Jokull

3 Answers

Luca Jokull

Alberto Stefanelli

Tim Williams

Related questions

Recent Activity

Donate For Us