I want to find a way to make faster the computation of the pairwise accuracy, that is going to compare elements of the same array (in this case it's a panda df column) computing their difference and then comparing the two results obtained. I would have a dataframe df with 3 column (id of the document, Jugment that represent the human evaluation and it's an int object, PR_score that represent the pagerank of that document and it's a float object) and i want to check if they agree on classifing one document better/worst of another one.
For example:
id: id1, id2, id3
Jugment : 1, 0, 0
PR_score: 0.18, 0.5, 0.12
in this case the two scores agree on classifing id1 better than id3, disagree on id1 and id2, and between id2 and id3 there is a human judgment tie, hence my pairwise accuracy is:
agreement = 1
disagreement = 1
pairwise accuracy = agreement/ (agreement +disagreement) =1/2 = 0.5
This was my first solution's code, in which i used the column of the df as array (which help to reduce the computation time):
def pairwise(agree, disagree):
return(agree/(agree+disagree))
def pairwise_computing_array(df):
humanScores = np.array(df['Judgement'])
pagerankScores = np.array(df['PR_Score'])
total = 0
agree = 0
disagree = 0
for i in range(len(df)-1):
for j in range(i+1, len(df)):
total += 1
human = humanScores[i] - humanScores[j] #difference human judg
if human != 0:
pr = pagerankScores[i] - pagerankScores[j]#difference pagerank score
if pr != 0:
if np.sign(human) == np.sign(pr):
agree += 1 #they agree in which of the two is better
else:
disagree +=1 #they do not agree in which of the two is better
else:
continue;
else:
continue;
pairwise_accuracy = pairwise(agree, disagree)
return(agree, disagree, total, pairwise_accuracy)
I have try with a list comprehension in order to get a faster computation, but it's actually slower than the first solution:
def pairwise_computing_list_comprehension(df):
humanScores = np.array(df['Judgement'])
pagerankScores = np.array(judgmentPR['PR_Score'])
sign = [np.sign(pagerankScores[i] - pagerankScores[j]) == np.sign(humanScores[i] - humanScores[j] )
for i in range(len(df)) for j in range(i+1, len(df))
if (np.sign(pagerankScores[i] - pagerankScores[j]) != 0
and np.sign(humanScores[i] - humanScores[j])!=0)]
agreement = sum(sign)
disagreement = len(sign) - agreement
pairwise_accuracy = pairwise(agreement, disagreement)
return(agreement, disagreement, pairwise_accuracy)
I can't run on my entire dataset cause it takes too much time, i would like to have something that can be computed in less than 1 minutes ideally.
The computation over my computer of a small subset of 1000 row reached this performance:
code1: 1.57 s ± 3.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
code2: 3.51 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You have numpy array, so why not just use it? You can offload the work from Python to C-compiled code (often, but not always):
First, resize the vectors into 1xN matrices:
humanScores = np.array(df['Judgement']).resize((1,-1))
pagerankScores = np.array(judgmentPR['PR_Score']).resize((1,-1))
Then find the difference, and we are interested only on the sign:
humanDiff = (humanScores - humanScores.T).clip(-1,1)
pagerankDiff = (pagerankScores - pagerankScores.T).clip(-1,1)
Here I assumed the data are integers, so the clip
function will only produce -1, 0, or 1. Then you can count it:
agree = ((humanDiff != 0) & (pagerankDiff != 0) & (humanDiff == pagerankDiff)).sum()
disagree = ((humanDiff != 0) & (pagerankDiff != 0) & (humanDiff != pagerankDiff)).sum()
But the above count is double-counting as item (i,j) and item (j,i) would be exact oppsite sign in both humanDiff
and pagerankDiff
. You may consider take only the upper-triangular part of the square matrix in the sum:
agree = ((humanDiff != 0) &
(pagerankDiff != 0) &
(np.triu(humanDiff) == np.triu(pagerankDiff))
).sum()
This is the code that works in a reasonable amount of time, obtained thanks to @juanpa.arrivillaga suggestion:
from numba import jit
@jit(nopython = True)
def pairwise_computing(humanScores, pagerankScores):
total = 0
agree = 0
disagree = 0
for i in range(len(humanScores)-1):
for j in range(i+1, len(humanScores)):
total += 1
human = humanScores[i] - humanScores[j] #difference human judg
if human != 0:
pr = pagerankScores[i] - pagerankScores[j]#difference pagerank score
if pr != 0:
if np.sign(human) == np.sign(pr):
agree += 1 #they agree in which of the two is better
else:
disagree +=1 #they do not agree in which of the two is better
else:
continue
else:
continue
pairwise_accuracy = agree/(agree+disagree)
return(agree, disagree, total, pairwise_accuracy)
This are the time perfomance reached for my entire dataset (58k row):
7.98 s ± 2.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It's possible to get rid of the inner for
loop by leveraging broadcasting, since index j
always ranges ahead of index i
by 1 (i.e. we don't look back). But there is a slight problem with computing agreement/disagreement in the following line:
if np.sign(human) == np.sign(pr):
which I don't know how to resolve. So, I just provide the skeleton code here for more tweaking and making it to work, since you know the problem better. Here it goes:
def pairwise_computing_array(df):
humanScores = df['Judgement'].values
pagerankScores = df['PR_Score'].values
total = 0
agree = 0
disagree = 0
for i in range(len(df)-1):
j = i+1
human = humanScores[i] - humanScores[j:] #difference human judg
human_mask = human != 0
if np.sum(human_mask) > 0: # check for at least one positive case
pr = pagerankScores[i] - pagerankScores[j:][human_mask] #difference pagerank score
pr_mask = pr !=0
if np.sum(pr_mask) > 0: # check for at least one positive case
# TODO: issue arises here; how to resolve when (human.shape != pr.shape) ?
# once this `if ... else` block is fixed, it's done
if np.sign(human) == np.sign(pr):
agree += 1 #they agree in which of the two is better
else:
disagree +=1 #they do not agree in which of the two is better
else:
continue
else:
continue
pairwise_accuracy = pairwise(agree, disagree)
return(agree, disagree, total, pairwise_accuracy)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With