Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Faster double iteration over a single array in Python

I want to find a way to make faster the computation of the pairwise accuracy, that is going to compare elements of the same array (in this case it's a panda df column) computing their difference and then comparing the two results obtained. I would have a dataframe df with 3 column (id of the document, Jugment that represent the human evaluation and it's an int object, PR_score that represent the pagerank of that document and it's a float object) and i want to check if they agree on classifing one document better/worst of another one.


For example:

id: id1, id2, id3

Jugment : 1, 0, 0

PR_score: 0.18, 0.5, 0.12

in this case the two scores agree on classifing id1 better than id3, disagree on id1 and id2, and between id2 and id3 there is a human judgment tie, hence my pairwise accuracy is:

agreement = 1

disagreement = 1

pairwise accuracy = agreement/ (agreement +disagreement) =1/2 = 0.5


This was my first solution's code, in which i used the column of the df as array (which help to reduce the computation time):

def pairwise(agree, disagree):
    return(agree/(agree+disagree))

def pairwise_computing_array(df):

    humanScores = np.array(df['Judgement'])  
    pagerankScores =  np.array(df['PR_Score']) 

    total = 0 
    agree = 0
    disagree = 0

    for i in range(len(df)-1):  
        for j in range(i+1, len(df)):
            total += 1
            human = humanScores[i] -  humanScores[j] #difference human judg
            if human != 0:
                pr = pagerankScores[i] -  pagerankScores[j]#difference pagerank score
                if pr != 0:
                    if np.sign(human) == np.sign(pr):  
                        agree += 1 #they agree in which of the two is better
                    else:
                        disagree +=1 #they do not agree in which of the two is better
                else:
                    continue;   
            else:
                continue;

    pairwise_accuracy = pairwise(agree, disagree)

    return(agree, disagree, total,  pairwise_accuracy)


I have try with a list comprehension in order to get a faster computation, but it's actually slower than the first solution:

def pairwise_computing_list_comprehension(df):

    humanScores = np.array(df['Judgement'])  
    pagerankScores =  np.array(judgmentPR['PR_Score']) 

    sign = [np.sign(pagerankScores[i] - pagerankScores[j]) == np.sign(humanScores[i] - humanScores[j] ) 
            for i in range(len(df)) for j in range(i+1, len(df)) 
                if (np.sign(pagerankScores[i] - pagerankScores[j]) != 0 
                    and np.sign(humanScores[i] - humanScores[j])!=0)]

    agreement = sum(sign)
    disagreement = len(sign) -  agreement                             
    pairwise_accuracy = pairwise(agreement, disagreement)

    return(agreement, disagreement, pairwise_accuracy)

I can't run on my entire dataset cause it takes too much time, i would like to have something that can be computed in less than 1 minutes ideally.

The computation over my computer of a small subset of 1000 row reached this performance:

code1: 1.57 s ± 3.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

code2: 3.51 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

like image 277
Roberta Parisi Avatar asked Apr 25 '19 16:04

Roberta Parisi


3 Answers

You have numpy array, so why not just use it? You can offload the work from Python to C-compiled code (often, but not always):

First, resize the vectors into 1xN matrices:

humanScores = np.array(df['Judgement']).resize((1,-1))
pagerankScores =  np.array(judgmentPR['PR_Score']).resize((1,-1))

Then find the difference, and we are interested only on the sign:

humanDiff = (humanScores - humanScores.T).clip(-1,1)
pagerankDiff = (pagerankScores - pagerankScores.T).clip(-1,1)

Here I assumed the data are integers, so the clip function will only produce -1, 0, or 1. Then you can count it:

agree = ((humanDiff != 0) & (pagerankDiff != 0) & (humanDiff == pagerankDiff)).sum()
disagree = ((humanDiff != 0) & (pagerankDiff != 0) & (humanDiff != pagerankDiff)).sum()

But the above count is double-counting as item (i,j) and item (j,i) would be exact oppsite sign in both humanDiff and pagerankDiff. You may consider take only the upper-triangular part of the square matrix in the sum:

agree = ((humanDiff != 0) &
         (pagerankDiff != 0) &
         (np.triu(humanDiff) == np.triu(pagerankDiff))
        ).sum()
like image 70
adrtam Avatar answered Nov 15 '22 11:11

adrtam


This is the code that works in a reasonable amount of time, obtained thanks to @juanpa.arrivillaga suggestion:

from numba import jit

@jit(nopython = True)
def pairwise_computing(humanScores, pagerankScores):

    total = 0 
    agree = 0
    disagree = 0

    for i in range(len(humanScores)-1):  
        for j in range(i+1, len(humanScores)):
            total += 1
            human = humanScores[i] -  humanScores[j] #difference human judg
            if human != 0:
                pr = pagerankScores[i] -  pagerankScores[j]#difference pagerank score
                if pr != 0:
                    if np.sign(human) == np.sign(pr):  
                        agree += 1 #they agree in which of the two is better
                    else:
                        disagree +=1 #they do not agree in which of the two is better
                else:
                    continue   
            else:
                continue
    pairwise_accuracy = agree/(agree+disagree)
    return(agree, disagree, total,  pairwise_accuracy)

This are the time perfomance reached for my entire dataset (58k row):

7.98 s ± 2.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

like image 45
Roberta Parisi Avatar answered Nov 15 '22 09:11

Roberta Parisi


It's possible to get rid of the inner for loop by leveraging broadcasting, since index j always ranges ahead of index i by 1 (i.e. we don't look back). But there is a slight problem with computing agreement/disagreement in the following line:

if np.sign(human) == np.sign(pr):

which I don't know how to resolve. So, I just provide the skeleton code here for more tweaking and making it to work, since you know the problem better. Here it goes:

def pairwise_computing_array(df):

    humanScores = df['Judgement'].values
    pagerankScores = df['PR_Score'].values 

    total = 0 
    agree = 0
    disagree = 0

    for i in range(len(df)-1):
        j = i+1
        human = humanScores[i] -  humanScores[j:]   #difference human judg
        human_mask = human != 0
        if np.sum(human_mask) > 0:  # check for at least one positive case
            pr = pagerankScores[i] -  pagerankScores[j:][human_mask]  #difference pagerank score
            pr_mask = pr !=0
            if np.sum(pr_mask) > 0:  # check for at least one positive case
                # TODO: issue arises here; how to resolve when (human.shape != pr.shape) ?
                # once this `if ... else` block is fixed, it's done
                if np.sign(human) == np.sign(pr):
                    agree += 1   #they agree in which of the two is better
                else:
                    disagree +=1   #they do not agree in which of the two is better
            else:
                continue
        else:
            continue
    pairwise_accuracy = pairwise(agree, disagree)

    return(agree, disagree, total,  pairwise_accuracy)
like image 27
kmario23 Avatar answered Nov 15 '22 09:11

kmario23