I'm looking for an efficient python implementation of Somers'D, for which I need to compute the number of concordant, discordant and tied pairs between two random variables X and Y. Two pairs (X_i, Y_i), (X_j, Y_j) are concordant if the ranks of both elements agree; that is, x_i > x_j
and y_i > y_j
or x_i < x_j
and y_i < y_j
. Two pairs are called discordant if the ranks of both elements do not agree: x_i > x_j
and y_i < y_j
or x_i < x_j
and y_i > y_j
. Two pairs are said to be tied in X (Y) when x_i = x_j
y_i = y_j
.
Somers'D is then computed as D = (N_C - N_D) / (N_tot - N_Ty).
(See: https://en.wikipedia.org/wiki/Somers%27_D.)
I wrote a naive implementation using nested for-loops. Here, S contains my predictions and Y the realized outcomes.
def concordance_computer(Y, S):
N_C = 0
N_D = 0
N_T_y = 0
N_T_x = 0
for i in range(0, len(S)):
for j in range(i+1, len(Y)):
Y1 = Y[i]
X1 = S[i]
Y2 = Y[j]
X2 = S[j]
if Y1 > Y2 and X1 > X2:
N_C += 1
elif Y1 < Y2 and X1 < X2:
N_C += 1
elif Y1 > Y2 and X1 < X2:
N_D += 1
elif Y1 < Y2 and X1 > X2:
N_D += 1
elif Y1 == Y2:
N_T_y += 1
elif X1 == X2:
N_T_x += 1
N_tot = len(S)*(len(S)-1) / 2
SomersD = (N_C - N_D) / (N_tot - N_T_y)
return SomersD
Obviously, this is gonna be very slow when (Y,S) have a lot of rows. I stumbled upon the use of bisect while searching the net for solutions:
merge['Y'] = Y
merge['S'] = S
zeros2 = merge.loc[merge['Y'] == 0]
ones2 = merge.loc[merge['Y'] == 1]
from bisect import bisect_left, bisect_right
def bin_conc(zeros2, ones2):
zeros2_list = sorted([zeros2.iloc[j, 1] for j in range(len(zeros2))])
zeros2_length = len(zeros2_list)
conc = disc = ties = 0
for i in range(len(ones2)):
cur_conc = bisect_left(zeros2_list, ones2.iloc[i, 1])
cur_ties = bisect_right(zeros2_list, ones2.iloc[i, 1]) - cur_conc
conc += cur_conc
ties += cur_ties
disc += zeros2_length - cur_ties - cur_conc
pairs_tested = zeros2_length * len(ones2.index)
return conc, disc, ties, pairs_tested
This is very efficient, but only works for binary variables Y. Now my question is: how can I implement the concordance_computer
in an efficient way for ungrouped Y?
I was able to solve this with some help. A friend pointed me to the fact that there already exists a kendall tau implementation in scipy, which is very efficient. The code can be adapted to create a very fast Somers' D implementation. See code below. On my laptop, it runs in ~60ms, which makes it fast enough to use for e.g. bootstrapping confidence intervals with n_boots ~ O(10^3).
import pandas as pd
import numpy as np
import time
from scipy.stats._stats import _kendall_dis
# import data
df_ = pd.read_csv('Data/CR_mockData_EAD.csv')
df = df_[['realized_ead', 'pred_value']][0:100]
def SomersD(x, y):
x = np.asarray(x).ravel()
y = np.asarray(y).ravel()
if x.size != y.size:
raise ValueError("All inputs must be of the same size, "
"found x-size %s and y-size %s" % (x.size, y.size))
def count_rank_tie(ranks):
cnt = np.bincount(ranks).astype('int64', copy=False)
cnt = cnt[cnt > 1]
return ((cnt * (cnt - 1) // 2).sum(),
(cnt * (cnt - 1.) * (cnt - 2)).sum(),
(cnt * (cnt - 1.) * (2*cnt + 5)).sum())
size = x.size
perm = np.argsort(y) # sort on y and convert y to dense ranks
x, y = x[perm], y[perm]
y = np.r_[True, y[1:] != y[:-1]].cumsum(dtype=np.intp)
# stable sort on x and convert x to dense ranks
perm = np.argsort(x, kind='mergesort')
x, y = x[perm], y[perm]
x = np.r_[True, x[1:] != x[:-1]].cumsum(dtype=np.intp)
dis = _kendall_dis(x, y) # discordant pairs
obs = np.r_[True, (x[1:] != x[:-1]) | (y[1:] != y[:-1]), True]
cnt = np.diff(np.where(obs)[0]).astype('int64', copy=False)
ntie = (cnt * (cnt - 1) // 2).sum() # joint ties
xtie, x0, x1 = count_rank_tie(x) # ties in x, stats
ytie, y0, y1 = count_rank_tie(y) # ties in y, stats
tot = (size * (size - 1)) // 2
# Note that tot = con + dis + (xtie - ntie) + (ytie - ntie) + ntie
# = con + dis + xtie + ytie - ntie
#con_minus_dis = tot - xtie - ytie + ntie - 2 * dis
SD = (tot - xtie - ytie + ntie - 2 * dis) / (tot - ntie)
return (SD, dis)
start_time = time.time()
SD, dis = SomersD(df.realized_ead, df.pred_value)
print("--- %s seconds ---" % (time.time() - start_time))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With