SQL for computing h-score (h-index)

According to Wikipedia:

A scientist has index h if h of his/her Np papers have at least h citations each, and the other (Np − h) papers have no more than h citations each.

Imagine we have SCIENTISTS, PAPERS, and CITATIONS tables, with a 1-n relation between SCIENTISTS and PAPERS and a 1-n relation between PAPERS and CITATIONS. How can I write a SQL statement that computes the h-score for each scientist in the SCIENTISTS table?

To show some research effort, here is SQL that computes the number of citations for each paper:

SELECT COUNT(CITATIONS.id) AS citations_count
FROM PAPERS LEFT OUTER JOIN CITATIONS ON PAPERS.id = CITATIONS.paper_id
GROUP BY PAPERS.id
ORDER BY citations_count DESC;

1 Answer

What the h-value is doing is counting the citations in two ways. Let's say a scientist has the following citation counts, one per paper:

10
 8
 5
 5
 2
 1

Let's look at the number of papers that have that many or more citations, and the difference between the two:

citations   papers with >= that many   difference
10          1                           9
 8          2                           6
 5          4                           1
 5          4                           1
 2          5                          -3
 1          6                          -5

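The number you want is the largest one for which this difference is still 0 or more. In this case, the number is 4: four papers have 4 or more citations, so the difference there is exactly 0.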
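As a quick check of the arithmetic, here is a minimal Python sketch of the same counting idea. The function name `h_index` is only illustrative and is not part of the original question or answer.

```python
def h_index(citation_counts):
    """Return the largest n such that at least n papers have n or more citations."""
    h = 0
    # Try every candidate n from 1 up to the number of papers.
    for n in range(1, len(citation_counts) + 1):
        papers_with_n_or_more = sum(1 for c in citation_counts if c >= n)
        if papers_with_n_or_more >= n:
            h = n
    return h


print(h_index([10, 8, 5, 5, 2, 1]))  # 4
```

Note that the loop tries candidate values such as 3 and 4 that never appear in the list of citation counts.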

The fact that the answer is 4 makes this harder, because 4 does not appear in the original data. To test every candidate value, you need to generate a numbers table.

The following does this using SQL Server syntax for generating a table with 100 numbers:

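with numbers as (
      -- recursive CTE producing n = 1..100, so h-scores above 100 are not found
      select 1 as n
      union all
      select n+1
      from numbers
      where n < 100
     ),
     numcitations as (
      -- citation count per paper; the outer join keeps uncited papers with a count of 0
      SELECT p.scientistid, p.id, COUNT(c.id) AS citations_count
      FROM PAPERS p LEFT OUTER JOIN
           CITATIONS c
           ON p.id = c.paper_id
      GROUP BY p.scientistid, p.id
     ),
     hcalc as (
      -- for each scientist and each n: how many papers have at least n citations
      select s.scientistid, numbers.n,
             (select count(*)
              from numcitations nc
              where nc.scientistid = s.scientistid and
                    nc.citations_count >= numbers.n
             ) as hval
      from numbers cross join
           -- assumes the key column of SCIENTISTS is named id
           (select id as scientistid from SCIENTISTS) s
     )
-- hval = n can miss the answer (two papers with 1 citation each give hval 2 at n = 1
-- and hval 0 at n = 2), so take the largest n with hval >= n instead
select scientistid, max(n) as hscore
from hcalc
where hval >= n
group by scientistid;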
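To try the numbers-table approach without a SQL Server instance, here is a rough sketch that runs an equivalent query against an in-memory SQLite database from Python. The schema (SCIENTISTS.id, PAPERS.scientistid, CITATIONS.paper_id), the helper `build_demo_db`, and the demo data are all assumptions made for illustration. SQLite spells the recursive CTE as `WITH RECURSIVE`, and this version takes the largest n with hval >= n, falling back to 0.

```python
import sqlite3


def build_demo_db(citations_per_paper):
    """Build an in-memory database from {scientist_id: [citations per paper]}.

    The table and column names are assumed; the original post gives no schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE SCIENTISTS (id INTEGER PRIMARY KEY);
        CREATE TABLE PAPERS (id INTEGER PRIMARY KEY, scientistid INTEGER);
        CREATE TABLE CITATIONS (id INTEGER PRIMARY KEY, paper_id INTEGER);
    """)
    for scientist_id, counts in citations_per_paper.items():
        conn.execute("INSERT INTO SCIENTISTS (id) VALUES (?)", (scientist_id,))
        for count in counts:
            cur = conn.execute(
                "INSERT INTO PAPERS (scientistid) VALUES (?)", (scientist_id,))
            # One CITATIONS row per citation of this paper.
            conn.executemany(
                "INSERT INTO CITATIONS (paper_id) VALUES (?)",
                [(cur.lastrowid,)] * count)
    return conn


NUMBERS_TABLE_QUERY = """
WITH RECURSIVE numbers(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM numbers WHERE n < 100
),
numcitations AS (
    SELECT p.scientistid, p.id, COUNT(c.id) AS citations_count
    FROM PAPERS p
    LEFT OUTER JOIN CITATIONS c ON p.id = c.paper_id
    GROUP BY p.scientistid, p.id
),
hcalc AS (
    SELECT s.scientistid AS scientistid, numbers.n AS n,
           (SELECT COUNT(*)
            FROM numcitations nc
            WHERE nc.scientistid = s.scientistid
              AND nc.citations_count >= numbers.n) AS hval
    FROM numbers
    CROSS JOIN (SELECT id AS scientistid FROM SCIENTISTS) s
)
SELECT scientistid,
       COALESCE(MAX(CASE WHEN hval >= n THEN n END), 0) AS hscore
FROM hcalc
GROUP BY scientistid
ORDER BY scientistid
"""

conn = build_demo_db({1: [10, 8, 5, 5, 2, 1], 2: [1, 1], 3: []})
print(conn.execute(NUMBERS_TABLE_QUERY).fetchall())
```

Scientist 2 is the case where no n satisfies hval = n exactly, and scientist 3 has no papers at all, which is why the sketch uses MAX with a COALESCE to 0.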


There is a way to do this without using the numbers table. Rank each scientist's papers by citation count, highest first. The h-score is then the number of papers whose citation count is greater than or equal to their rank. This is much easier to calculate:

select scientistid, count(*) as hscore
from (SELECT p.scientistid, p.id, COUNT(c.id) AS citations_count,
             -- rank papers within each scientist only; row_number() gives tied papers distinct ranks
             row_number() over (partition by p.scientistid order by count(c.id) desc) as ranking
      FROM PAPERS p LEFT OUTER JOIN
           CITATIONS c
           ON p.id = c.paper_id
      GROUP BY p.scientistid, p.id
     ) t
where ranking <= citations_count
group by scientistid;
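The ranking approach can be checked the same way. The sketch below is only an illustration under assumed names: the schema, the helper `build_demo_db`, and the demo data are hypothetical, and it needs a Python build linked against SQLite 3.25 or later for window functions. It uses `ROW_NUMBER()` so that tied papers still get distinct ranks, and it computes the per-paper counts in a CTE before ranking them.

```python
import sqlite3


def build_demo_db(citations_per_paper):
    """Build an in-memory database from {scientist_id: [citations per paper]}.

    The table and column names are assumed; the original post gives no schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE PAPERS (id INTEGER PRIMARY KEY, scientistid INTEGER);
        CREATE TABLE CITATIONS (id INTEGER PRIMARY KEY, paper_id INTEGER);
    """)
    for scientist_id, counts in citations_per_paper.items():
        for count in counts:
            cur = conn.execute(
                "INSERT INTO PAPERS (scientistid) VALUES (?)", (scientist_id,))
            # One CITATIONS row per citation of this paper.
            conn.executemany(
                "INSERT INTO CITATIONS (paper_id) VALUES (?)",
                [(cur.lastrowid,)] * count)
    return conn


ROW_NUMBER_QUERY = """
WITH numcitations AS (
    SELECT p.scientistid, p.id, COUNT(c.id) AS citations_count
    FROM PAPERS p
    LEFT OUTER JOIN CITATIONS c ON p.id = c.paper_id
    GROUP BY p.scientistid, p.id
),
ranked AS (
    SELECT scientistid, citations_count,
           ROW_NUMBER() OVER (PARTITION BY scientistid
                              ORDER BY citations_count DESC) AS ranking
    FROM numcitations
)
SELECT scientistid, COUNT(*) AS hscore
FROM ranked
WHERE ranking <= citations_count
GROUP BY scientistid
ORDER BY scientistid
"""

conn = build_demo_db({1: [10, 8, 5, 5, 2, 1], 2: [2, 2, 2, 2], 3: [0, 0]})
print(conn.execute(ROW_NUMBER_QUERY).fetchall())
```

Scientist 2 has four papers tied at 2 citations and gets an h-score of 2. Scientist 3, whose papers have no citations, does not appear in the result at all, so an outer join back to SCIENTISTS would be needed to report an h-score of 0.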
