Efficient way to Calculate h-index (impact/productivity of author publication) in pandas DataFrame

I'm very new to pandas, but I've been reading about it and how much faster it is when dealing with big data.

I managed to create a pandas dataframe that looks something like this:

    0     1
0    1    14
1    2    -1
2    3  1817
3    3    29
4    3    25
5    3     2
6    3     1
7    3    -1
8    4    25
9    4    24
10   4     2
11   4    -1
12   4    -1
13   5    25
14   5     1

Column 0 is the author's id and column 1 is the number of citations that author received for one publication (-1 means zero citations). Each row represents a different publication by an author.
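For anyone who wants to reproduce this, the sample could be rebuilt roughly as follows. This is only a sketch: the names `rows` and `citations` are made up for illustration, and clipping -1 to 0 is an optional step based on the "-1 means zero citations" convention described above.

```python
import pandas as pd

# (author id, citations) pairs copied from the sample table.
rows = [
    (1, 14), (2, -1),
    (3, 1817), (3, 29), (3, 25), (3, 2), (3, 1), (3, -1),
    (4, 25), (4, 24), (4, 2), (4, -1), (4, -1),
    (5, 25), (5, 1),
]

# Keep the integer column labels 0 and 1 used in the question.
df = pd.DataFrame(rows, columns=[0, 1])

# Optional: turn the -1 placeholder into a real zero citation count.
citations = df[1].clip(lower=0)

# Number of publications per author.
print(df.groupby(0).size())
```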

I'm trying to calculate the h-index for each of these authors. The h-index is the largest number h such that the author has h publications that have each been cited at least h times. So for the authors above:

author 1 has h-index of 1

author 2 has h-index of 0

author 3 has h-index of 3

author 4 has h-index of 2

author 5 has h-index of 1
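The values above can be checked against the definition with a small plain-Python sketch. The helper name `h_index` and the `sample` dictionary are illustrative only, and -1 is treated as zero citations as described earlier.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    # Treat the -1 placeholder as zero, then sort from most to least cited.
    ranked = sorted((max(c, 0) for c in citations), reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the paper at this rank still has enough citations
        else:
            break     # every later paper has even fewer citations
    return h


# Citation counts per author, copied from the sample table.
sample = {
    1: [14],
    2: [-1],
    3: [1817, 29, 25, 2, 1, -1],
    4: [25, 24, 2, -1, -1],
    5: [25, 1],
}

for author, cites in sample.items():
    print("author %d has h-index: %d" % (author, h_index(cites)))
```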

This is the way I am currently doing it, which involves a lot of looping:

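# Assumes rows are grouped by author, with citations sorted in descending order per author.
current_author = 1
hindex = 0

for index, row in df.iterrows():
    if row[0] == current_author:
        # A publication counts only if it has more citations than the running h.
        if row[1] > hindex:
            hindex += 1
    else:
        # A new author starts: report the previous one and reset the counter.
        print("author %d has h-index: %d" % (current_author, hindex))
        current_author = row[0]
        hindex = 0
        if row[1] > hindex:
            hindex += 1

print("author %d has h-index: %d" % (current_author, hindex))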

My actual database has over 3 million authors. If I loop over each one, this will take days to calculate. What is the fastest way to tackle this?

Thanks in advance!

asked Apr 16 '15 10:04

BKS

1 Answer

I renamed your columns to 'author' and 'citations' here. We can group by author and then apply a lambda that compares each citation count against the number of publications in the group. Each comparison yields True or False (1 or 0), and we can then sum these:

In [104]:

df['h-index'] = df.groupby('author')['citations'].transform( lambda x: (x >= x.count()).sum() )

df
Out[104]:
    author  citations  h-index
0        1         14        1
1        2         -1        0
2        3       1817        3
3        3         29        3
4        3         25        3
5        3          2        3
6        3          1        3
7        3         -1        3
8        4         25        2
9        4         24        2
10       4          2        2
11       4         -1        2
12       4         -1        2
13       5         25        1
14       5          1        1

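EDIT: As pointed out by @Julien Spronck, the above doesn't work correctly in general (the case raised was author 4 having citations 3,3,3), because each citation count is compared against the author's total number of publications rather than against that publication's position in the ranking. Normally you cannot access the index within a group, but we can compare each citation value against its rank within the group, which acts as a pseudo-index. Using method='first' gives tied citation values distinct ranks; with a tie-sharing method such as the default 'average', the comparison would only be reliable if the citation values were unique: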

In [129]:

df['h-index'] = df.groupby('author')['citations'].transform(lambda x: ( x >= x.rank(ascending=False, method='first') ).sum() )

df
Out[129]:
    author  citations  h-index
0        1         14        1
1        2         -1        0
2        3       1817        3
3        3         29        3
4        3         25        3
5        3          2        3
6        3          1        3
7        3         -1        3
8        4         25        2
9        4         24        2
10       4          2        2
11       4         -1        2
12       4         -1        2
13       5         25        1
14       5          1        1
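With millions of authors, calling a Python lambda once per group may itself become the bottleneck. As a sketch under the same assumed column names ('author' and 'citations'), the rank comparison can also be written with built-in groupby operations only. The variable names `rank`, `counts`, and `h_index` are illustrative, and this version returns one row per author rather than one per publication.

```python
import pandas as pd

# Sample data from the question, with the renamed columns.
df = pd.DataFrame({
    "author":    [1, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5],
    "citations": [14, -1, 1817, 29, 25, 2, 1, -1, 25, 24, 2, -1, -1, 25, 1],
})

# Rank each publication within its author, most cited first.
# method="first" gives tied values distinct ranks.
rank = df.groupby("author")["citations"].rank(ascending=False, method="first")

# A publication counts towards h when its citations are at least its rank.
counts = (df["citations"] >= rank).astype(int)

# Summing the flags per author gives one h-index per author.
h_index = counts.groupby(df["author"]).sum()
print(h_index)
```

If the per-publication column is still wanted, `df["author"].map(h_index)` should broadcast the result back onto the original rows.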

answered Nov 15 '22 04:11

EdChum

Tags: python, pandas, dataframe, python-2.7
