
Efficient way to Calculate h-index (impact/productivity of author publication) in pandas DataFrame

I'm very new to pandas, but I've been reading about it and how much faster it is when dealing with big data.

I managed to create a pandas DataFrame that looks something like this:

    0     1
0    1    14
1    2    -1
2    3  1817
3    3    29
4    3    25
5    3     2
6    3     1
7    3    -1
8    4    25
9    4    24
10   4     2
11   4    -1
12   4    -1
13   5    25
14   5     1

Column 0 is the author's id and column 1 is the number of citations this author received for one publication (-1 means zero citations). Each row represents a different publication by that author.

I'm trying to calculate the h-index for each of these authors. The h-index is defined as the largest number h such that the author has h publications that are each cited at least h times. So for the authors above:

author 1 has h-index of 1

author 2 has h-index of 0

author 3 has h-index of 3

author 4 has h-index of 2

author 5 has h-index of 1
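To make the definition concrete, here is a minimal plain-Python sketch (no pandas) that reproduces the values listed above. The function name `h_index` and the `authors` dictionary are just names picked for this illustration; the citation counts are copied from the sample table.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    # Sort citation counts from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position  # the first `position` papers all have >= position citations
        else:
            break  # counts only get smaller from here on
    return h


# Citation counts per author, copied from the sample table (-1 means zero citations).
authors = {
    1: [14],
    2: [-1],
    3: [1817, 29, 25, 2, 1, -1],
    4: [25, 24, 2, -1, -1],
    5: [25, 1],
}
result = {author: h_index(counts) for author, counts in authors.items()}
print(result)  # {1: 1, 2: 0, 3: 3, 4: 2, 5: 1}
```

Because -1 is below every possible position, it behaves the same as zero citations here and needs no special handling.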

This is the way I am currently doing it, which involves a lot of looping:

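current_author = df.iloc[0, 0]  # start with the first author in the frame
hindex = 0

# Assumes rows are grouped by author and sorted by citations, highest first,
# as in the sample above.
for index, row in df.iterrows():
    if row[0] == current_author:
        if row[1] > hindex:
            hindex += 1
    else:
        print("author", current_author, "has h-index:", hindex)
        current_author = row[0]  # take the next id from the data, so gaps in ids are fine
        hindex = 0
        if row[1] > hindex:
            hindex += 1

print("author", current_author, "has h-index:", hindex)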

My actual database has over 3 million authors. If I loop over each one like this, it will take days to calculate. What is the fastest way to tackle this?

Thanks in advance!

BKS asked Apr 16 '15 10:04



1 Answer

I renamed your columns to 'author' and 'citations' here. We can group by author and then apply a lambda through transform. The lambda compares each citation count against the number of publications in the group (x.count()), which yields True or False per publication, and we can then sum these:

In [104]:

df['h-index'] = df.groupby('author')['citations'].transform(lambda x: (x >= x.count()).sum())

df
Out[104]:
    author  citations  h-index
0        1         14        1
1        2         -1        0
2        3       1817        3
3        3         29        3
4        3         25        3
5        3          2        3
6        3          1        3
7        3         -1        3
8        4         25        2
9        4         24        2
10       4          2        2
11       4         -1        2
12       4         -1        2
13       5         25        1
14       5          1        1
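For anyone who wants to reproduce the output above, here is a minimal setup sketch. It assumes the frame starts out with integer column labels 0 and 1 as in the question; the `rows` list is just the sample data typed in by hand, and the renaming step is my reading of what "I renamed your columns" refers to.

```python
import pandas as pd

# Rows copied from the question: (author id, citations); -1 means zero citations.
rows = [
    (1, 14), (2, -1),
    (3, 1817), (3, 29), (3, 25), (3, 2), (3, 1), (3, -1),
    (4, 25), (4, 24), (4, 2), (4, -1), (4, -1),
    (5, 25), (5, 1),
]
df = pd.DataFrame(rows)               # columns are labelled 0 and 1, as in the question
df.columns = ["author", "citations"]  # the renaming described in the answer

# Same expression as in the answer: transform broadcasts the per-group
# scalar back onto every row of that group.
df["h-index"] = df.groupby("author")["citations"].transform(
    lambda x: (x >= x.count()).sum()
)
print(df)
```

Note that this expression only counts publications with at least as many citations as the author's total number of publications, which matches the h-index on this sample but not in general. For instance, an author with citations 5, 5, 5, 1, 1, 1, 1, 1, 1, 1 would get 0 here, while the h-index is 3.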

EDIT: As pointed out by @Julien Spronck, the above doesn't work correctly if, for example, author 4 had citations 3, 3, 3. Normally you cannot access the index within a group, but we can compare each citation value against its rank within the group, which acts as a pseudo-index. With the default tie handling this only works if the citation values are unique, so the code here passes method='first' to give tied values distinct ranks:

In [129]:

df['h-index'] = df.groupby('author')['citations'].transform(lambda x: (x >= x.rank(ascending=False, method='first')).sum())

df
Out[129]:
    author  citations  h-index
0        1         14        1
1        2         -1        0
2        3       1817        3
3        3         29        3
4        3         25        3
5        3          2        3
6        3          1        3
7        3         -1        3
8        4         25        2
9        4         24        2
10       4          2        2
11       4         -1        2
12       4         -1        2
13       5         25        1
14       5          1        1
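Since the question is about speed with millions of authors, here is a sketch of the same rank idea without a Python lambda per group: sort once, number each author's papers with `cumcount`, and count the papers whose citations are at least their position. This is an alternative I'm suggesting, not the code from the answer, and I haven't benchmarked it on data of that size. The function name `h_index_per_author` and author 6 with tied citations are made up for the example.

```python
import pandas as pd


def h_index_per_author(df):
    """Return a Series of h-indices indexed by author.

    Assumes columns named 'author' and 'citations', as in the answer.
    """
    # Sort so that each author's most-cited papers come first.
    ordered = df.sort_values(["author", "citations"], ascending=[True, False])
    # 1-based position of each paper within its author's sorted list.
    position = ordered.groupby("author").cumcount() + 1
    # A paper counts toward h if its citations are at least its position.
    counts = (ordered["citations"] >= position).astype(int)
    return counts.groupby(ordered["author"]).sum()


# Authors 3 and 4 come from the question; author 6 is a hypothetical
# author with tied citation counts (3, 3, 3).
df = pd.DataFrame({
    "author":    [3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 6, 6, 6],
    "citations": [1817, 29, 25, 2, 1, -1, 25, 24, 2, -1, -1, 3, 3, 3],
})
h = h_index_per_author(df)
print(h.to_dict())  # {3: 3, 4: 2, 6: 3}
```

To get a per-row column like in the output above, the result can be mapped back with `df["author"].map(h)`.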
EdChum answered Nov 15 '22 04:11