 

Is there an efficient Python implementation of Somers' D for ungrouped variables?

Tags:

python

I'm looking for an efficient Python implementation of Somers' D, for which I need to compute the number of concordant, discordant and tied pairs between two random variables X and Y. Two observations (x_i, y_i) and (x_j, y_j) form a concordant pair if their ranks agree, that is, x_i > x_j and y_i > y_j, or x_i < x_j and y_i < y_j. They form a discordant pair if their ranks disagree: x_i > x_j and y_i < y_j, or x_i < x_j and y_i > y_j. A pair is said to be tied in X when x_i = x_j, and tied in Y when y_i = y_j.

Somers' D is then computed as D = (N_C - N_D) / (N_tot - N_T_y), where N_tot = n(n-1)/2 is the total number of pairs and N_T_y is the number of pairs tied in Y. (See: https://en.wikipedia.org/wiki/Somers%27_D.)

I wrote a naive implementation using nested for-loops. Here, S contains my predictions and Y the realized outcomes.

def concordance_computer(Y, S):
    """Naive O(n^2) pair counting: Y holds the realized outcomes, S the predictions."""
    N_C = 0    # concordant pairs
    N_D = 0    # discordant pairs
    N_T_y = 0  # pairs tied in Y
    N_T_x = 0  # pairs tied in X (not needed for D, counted for completeness)

    for i in range(len(S)):
        for j in range(i + 1, len(Y)):
            Y1, X1 = Y[i], S[i]
            Y2, X2 = Y[j], S[j]

            if (Y1 > Y2 and X1 > X2) or (Y1 < Y2 and X1 < X2):
                N_C += 1
            elif (Y1 > Y2 and X1 < X2) or (Y1 < Y2 and X1 > X2):
                N_D += 1
            elif Y1 == Y2:
                N_T_y += 1
            elif X1 == X2:
                N_T_x += 1

    N_tot = len(S) * (len(S) - 1) // 2
    SomersD = (N_C - N_D) / (N_tot - N_T_y)

    return SomersD
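
As a quick sanity check, here is the naive function applied to a small made-up sample (Y_toy and S_toy are purely illustrative), with the pair counts worked out by hand:

# Made-up toy data: four observations, small enough to count all pairs by hand.
Y_toy = [1, 2, 2, 3]
S_toy = [0.2, 0.5, 0.3, 0.4]

# By hand: N_C = 4, N_D = 1, N_T_y = 1, N_tot = 6,
# so D = (4 - 1) / (6 - 1) = 0.6.
print(concordance_computer(Y_toy, S_toy))  # 0.6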

Obviously, this is going to be very slow when (Y, S) has many rows. I stumbled upon the following use of bisect while searching for solutions:

import pandas as pd
from bisect import bisect_left, bisect_right

merge = pd.DataFrame()                # collect outcomes and predictions in one frame
merge['Y'] = Y
merge['S'] = S
zeros2 = merge.loc[merge['Y'] == 0]   # observations with outcome 0
ones2 = merge.loc[merge['Y'] == 1]    # observations with outcome 1

def bin_conc(zeros2, ones2):
    # scores of the zero-outcome observations, sorted for binary search
    zeros2_list = sorted(zeros2['S'].tolist())
    zeros2_length = len(zeros2_list)

    conc = disc = ties = 0
    for i in range(len(ones2)):
        score = ones2['S'].iloc[i]
        # zeros with a strictly smaller score form concordant pairs,
        # zeros with an equal score are ties, the rest are discordant
        cur_conc = bisect_left(zeros2_list, score)
        cur_ties = bisect_right(zeros2_list, score) - cur_conc
        conc += cur_conc
        ties += cur_ties
        disc += zeros2_length - cur_ties - cur_conc

    pairs_tested = zeros2_length * len(ones2)

    return conc, disc, ties, pairs_tested
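
To illustrate how bin_conc might be driven, here is a minimal sketch on made-up binary data (the toy DataFrame and its values are purely illustrative). Because every (zero, one) pair differs in Y, Somers' D for binary Y reduces to (conc - disc) / pairs_tested:

import pandas as pd

# Made-up binary example: Y is a 0/1 outcome, S a score.
toy = pd.DataFrame({'Y': [0, 0, 1, 1], 'S': [0.1, 0.4, 0.3, 0.8]})
toy_zeros = toy.loc[toy['Y'] == 0]
toy_ones = toy.loc[toy['Y'] == 1]

conc, disc, ties, pairs_tested = bin_conc(toy_zeros, toy_ones)
print((conc - disc) / pairs_tested)  # 0.5, the same value concordance_computer gives here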

This is very efficient, but it only works for a binary variable Y. Now my question is: how can I implement concordance_computer efficiently for ungrouped Y?

Asked by Aardappel on Dec 22 '19

1 Answer

I was able to solve this with some help. A friend pointed out that SciPy already contains a very efficient Kendall tau implementation, whose code can be adapted into a very fast Somers' D implementation. See the code below. On my laptop it runs in ~60 ms, which makes it fast enough to use for e.g. bootstrapping confidence intervals with n_boots ~ O(10^3).

import pandas as pd
import numpy as np
import time
from scipy.stats._stats import _kendall_dis

# import data
df_ = pd.read_csv('Data/CR_mockData_EAD.csv')
df = df_[['realized_ead', 'pred_value']][0:100]

def SomersD(x, y):
    # x holds the realized outcomes, y the predictions (see the call at the bottom)

    x = np.asarray(x).ravel()
    y = np.asarray(y).ravel()

    if x.size != y.size:
        raise ValueError("All inputs must be of the same size, "
                         "found x-size %s and y-size %s" % (x.size, y.size))

    def count_rank_tie(ranks):
        # tie counts per distinct rank value; only the first returned value
        # (the number of tied pairs) is used below, the other two are the
        # kendalltau variance terms
        cnt = np.bincount(ranks).astype('int64', copy=False)
        cnt = cnt[cnt > 1]
        return ((cnt * (cnt - 1) // 2).sum(),
                (cnt * (cnt - 1.) * (cnt - 2)).sum(),
                (cnt * (cnt - 1.) * (2 * cnt + 5)).sum())

    size = x.size
    perm = np.argsort(y)  # sort on y and convert y to dense ranks
    x, y = x[perm], y[perm]
    y = np.r_[True, y[1:] != y[:-1]].cumsum(dtype=np.intp)

    # stable sort on x and convert x to dense ranks
    perm = np.argsort(x, kind='mergesort')
    x, y = x[perm], y[perm]
    x = np.r_[True, x[1:] != x[:-1]].cumsum(dtype=np.intp)

    dis = _kendall_dis(x, y)  # discordant pairs

    obs = np.r_[True, (x[1:] != x[:-1]) | (y[1:] != y[:-1]), True]
    cnt = np.diff(np.where(obs)[0]).astype('int64', copy=False)

    ntie = (cnt * (cnt - 1) // 2).sum()  # joint ties
    xtie, x0, x1 = count_rank_tie(x)     # ties in x, stats
    ytie, y0, y1 = count_rank_tie(y)     # ties in y, stats

    tot = (size * (size - 1)) // 2

    # Note that tot = con + dis + (xtie - ntie) + (ytie - ntie) + ntie
    #               = con + dis + xtie + ytie - ntie,
    # so the numerator below equals con - dis. The denominator excludes
    # pairs tied in x (the realized outcomes), matching
    # D = (N_C - N_D) / (N_tot - N_T_y) from the question.
    SD = (tot - xtie - ytie + ntie - 2 * dis) / (tot - xtie)
    return (SD, dis)


start_time = time.time()
SD, dis = SomersD(df.realized_ead, df.pred_value)
print("--- %s seconds ---" % (time.time() - start_time))
Answered by Aardappel on Oct 16 '22