I'm very new to pandas, but I've been reading about it and how much faster it is when dealing with big data.
I managed to create a DataFrame that looks something like this:
    0     1
0   1    14
1   2    -1
2   3  1817
3   3    29
4   3    25
5   3     2
6   3     1
7   3    -1
8   4    25
9   4    24
10  4     2
11  4    -1
12  4    -1
13  5    25
14  5     1
Column 0 is the author's id and column 1 is the number of citations this author had on a publication (-1 means zero citations). Each row represents a different publication for an author.
I'm trying to calculate the h-index for each of these authors. The h-index is defined as the largest number h such that the author has h publications that are each cited at least h times. So for these authors:
author 1 has h-index of 1
author 2 has h-index of 0
author 3 has h-index of 3
author 4 has h-index of 2
author 5 has h-index of 1
This is the way I am currently doing it, which involves a lot of looping:
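# Assumes rows are sorted by author id, then by citations in descending order,
# and that author ids are consecutive integers starting at 1.
current_author = 1
hindex = 0
for index, row in df.iterrows():
    if row[0] == current_author:
        if row[1] > hindex:
            hindex += 1
    else:
        print("author", current_author, "has h-index:", hindex)
        current_author += 1
        hindex = 0
        if row[1] > hindex:
            hindex += 1
print("author", current_author, "has h-index:", hindex)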
My actual database has over 3 million authors. If I loop over each one, this will take days to calculate. What would be the fastest way to tackle this?
Thanks in advance!
To manually calculate your h-index, organize articles in descending order, based on the number of times they have been cited. For example, suppose an author has 8 papers that have been cited 33, 30, 20, 15, 7, 6, 5 and 4 times. The sixth paper has 6 citations, but the seventh has only 5, so the author's h-index is 6.
The h-index is an author-level metric that measures both the productivity and citation impact of the publications of an author. The index is based on both the number of papers published, and the number of citations those papers have received. The index was suggested in 2005 by Jorge E.
If we sort all the citations in decreasing order into sortlist, and index each citation number by 1, 2, 3, ..., then the h-index is the largest index i for which sortlist[i] >= i.
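The sorting rule above can be sketched in plain Python. This is only a minimal illustration: the function name h_index is made up for this example, and clamping negative counts to zero is an assumption based on the question's use of -1 for "no citations".

```python
def h_index(citations):
    """Return the h-index for an iterable of per-paper citation counts."""
    # Assumption: negative placeholders (such as -1 for "no citations") count as zero.
    ordered = sorted((max(c, 0) for c in citations), reverse=True)
    h = 0
    # Walk the ranks 1, 2, 3, ... and stop at the first paper
    # whose citation count is below its rank.
    for rank, count in enumerate(ordered, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


print(h_index([33, 30, 20, 15, 7, 6, 5, 4]))  # 6
```

Because the list is sorted in decreasing order, the loop can stop at the first rank that fails the test; every later paper has a higher rank and no more citations.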
I renamed your columns to 'author' and 'citations' here. We can group by author and then apply a lambda with transform. The lambda compares each citation count against the number of publications in the group (x.count()), which gives True or False for each row, and we can then sum these:
In [104]:
df['h-index'] = df.groupby('author')['citations'].transform(lambda x: (x >= x.count()).sum())
df
Out[104]:
    author  citations  h-index
0        1         14        1
1        2         -1        0
2        3       1817        3
3        3         29        3
4        3         25        3
5        3          2        3
6        3          1        3
7        3         -1        3
8        4         25        2
9        4         24        2
10       4          2        2
11       4         -1        2
12       4         -1        2
13       5         25        1
14       5          1        1
EDIT: As pointed out by @Julien Spronck, the above doesn't work correctly if, for example, author 4 had citations 3, 3, 3. Normally you cannot access the within-group index, but we can compare each citation value against its rank within the group, which acts as a pseudo-index. The default rank method averages tied values, so method='first' is used to give ties distinct consecutive ranks:
In [129]:
df['h-index'] = df.groupby('author')['citations'].transform(lambda x: (x >= x.rank(ascending=False, method='first')).sum())
df
Out[129]:
    author  citations  h-index
0        1         14        1
1        2         -1        0
2        3       1817        3
3        3         29        3
4        3         25        3
5        3          2        3
6        3          1        3
7        3         -1        3
8        4         25        2
9        4         24        2
10       4          2        2
11       4         -1        2
12       4         -1        2
13       5         25        1
14       5          1        1
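With millions of authors, calling a Python lambda once per group may itself become the bottleneck. One possible alternative, sketched below and not benchmarked here, sorts the frame once and uses cumcount as the within-group rank, so that no per-group Python function is needed. The variable names ordered, rank and h_index are just illustrative, and the sample frame is rebuilt from the question's data with the 'author' and 'citations' column names.

```python
import pandas as pd

# Sample data from the question, with the renamed columns.
df = pd.DataFrame({
    "author":    [1, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5],
    "citations": [14, -1, 1817, 29, 25, 2, 1, -1, 25, 24, 2, -1, -1, 25, 1],
})

# Sort once so that each author's papers are in descending citation order.
ordered = df.sort_values(["author", "citations"], ascending=[True, False])

# Position of each paper within its author's group: 1, 2, 3, ...
rank = ordered.groupby("author").cumcount() + 1

# A paper counts toward the h-index when its citations are at least its rank;
# summing the boolean flags per author gives one h-index per author.
h_index = (ordered["citations"] >= rank).groupby(ordered["author"]).sum()

print(h_index)
```

The result is a Series indexed by author id. If the value is wanted on every row, as in the output above, something like df['h-index'] = df['author'].map(h_index) should broadcast it back.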