
Answering Subarray H-index Queries Efficiently

I'm working on an interesting question about answering range h-index queries.

FYI, the h-index is the maximum integer h such that the author has published at least h papers that have each been cited at least h times.

You are given N papers, and the i-th paper has been cited p[i] times. There will be Q queries; each gives a range [l, r] and asks for the h-index of an author who only publishes the papers from index l to r inclusive. N, Q, and all p[i] are between 1 and 2*10^5.

Here is an example input:

4 2
5 7 4 1
2 3
3 4

The output should be:

2
1

Here p = [5, 7, 4, 1]. A query on the range [2, 3] should give 2 because both papers in that range have been cited at least twice (7 and 4 times, specifically). A query on [3, 4] should give 1.

Doing a binary search on the subarray of each query is O(QNlogN). The code below is my attempt at solving the problem. Unfortunately, it gives the right answers but exceeds the 3-second time limit. The time limit applies to the execution of the entire program, so it includes the time to read input. I also tried scanf and cin.tie(0); ios::sync_with_stdio(0); to speed up I/O, but that still exceeds the time limit.

#include <iostream>
int p[200005];
int main() {
    using namespace std;
    int n;
    cin >> n;
    int q;
    cin >> q;
    for (int i = 0; i < n; i++) {
        cin >> p[i + 1];
    }
    for (int i = 0; i < q; i++) {
        int l;
        cin >> l;
        int r;
        cin >> r;
        // binary search for the largest h such that at least h papers in [l, r]
        // have been cited at least h times
        int lo = 1, hi = n;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            // count how many papers in [l, r] are cited at least mid times
            int cnt = 0;
            for (int j = l; j <= r; j++) {
                if (p[j] >= mid) {
                    cnt++;
                }
            }
            if (cnt >= mid) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        cout << hi << '\n';
    }
}

My other idea was using a segment tree, but I don't know how to merge two segment tree nodes faster than in linear time. The problem here is still mainly the time complexity: I'm looking for a faster algorithm, and based on the constraints I believe the complexity per query should definitely be better than O(N).

I'm stumped, what is a faster way to solve this within the time limit?

Isabella asked Oct 23 '25

1 Answer

Mo's algorithm can solve this problem offline in O(QlogQ + N√Q). This time complexity is not the best, but the advantage of this approach lies in the simplicity of its implementation.

First, split the array into √Q blocks of size N/√Q. Then sort all the queries in increasing order of the block number of the left endpoint, breaking ties (queries whose left endpoints fall in the same block) by increasing right endpoint.

Instead of merging answers for multiple ranges, we can maintain a current subarray and its answer, which starts out empty. While iterating over the queries, we just need to transform the subarray from the previous query into the subarray for the next query by adding and removing elements from the ends.

In order to efficiently maintain the answer with each addition or deletion of an element, we keep track of how many papers there are with each citation count, as well as how many papers in the current subarray have been cited more times than the current h-index. When adding an element, the h-index either stays the same or increases by 1; it increases exactly when the number of papers cited more than the current h-index exceeds the current h-index. Removing an element is similar: the answer decreases by at most 1. Thus, adding or removing a single element can be done in O(1).

To understand the time complexity, consider how often the left and right indices of the current subarray move. Within a block, the left endpoints of consecutive queries differ by at most the block size, so the left index moves O(Q * N/√Q) = O(N√Q) times in total (transitions between blocks add only O(N) overall, since the blocks are visited in increasing order). For each block, the queries are sorted in increasing order of right index, so the right index moves O(N) times per block, which is O(N * √Q) in total. This shows that O(N√Q) addition and deletion operations are required; since each operation is done in constant time, the time complexity for answering the queries after the O(QlogQ) sort is O(N√Q). With N = Q = 2*10^5, that is roughly 2*10^5 * 450 ≈ 9*10^7 constant-time updates, which fits comfortably within the 3-second limit. Finally, we only need to output the answers in the correct order based on the original index of each query.

#include <iostream>
#include <vector>
#include <algorithm>
#include <utility>
#include <cmath>
int main() {
    struct Query {
        int index, left, right;
    };
    int N, Q;
    std::cin >> N >> Q;
    // just for demonstration; it's often better to use a fixed constant block size
    const int blockSize = std::max(N / std::sqrt(Q), 1.);
    std::vector<int> citations(1); // make it 1-indexed
    for (int i = 0, c; i < N; ++i) {
        std::cin >> c;
        citations.push_back(std::min(c, N));
    }
    std::vector<Query> queries;
    std::vector<int> answers(Q), count(N + 1);
    for (int i = 0, left, right; i < Q; ++i) {
        std::cin >> left >> right;
        queries.emplace_back(i, left, right);
    }
    std::ranges::sort(queries, {}, [&](const auto& query) {
        return std::pair(query.left / blockSize, query.right);
    });
    // greater = number of papers in the current subarray cited strictly more than hIndex
    int hIndex = 0, greater = 0;
    auto addPaper = [&](int index) {
        ++count[citations[index]];
        greater += citations[index] > hIndex;
        // adding one paper can raise the h-index by at most 1
        if (greater > hIndex) {
            ++hIndex;
            greater -= count[hIndex]; // papers cited exactly (new) hIndex times are no longer "greater"
        }
    };
    auto removePaper = [&](int index) {
        --count[citations[index]];
        greater -= citations[index] > hIndex;
        // removing one paper can lower the h-index by at most 1
        if (count[hIndex] + greater < hIndex) {
            greater += count[hIndex];
            --hIndex;
        }
    };
    int currLeft = 1, currRight = 0; // the current subarray starts out empty
    for (const auto& query : queries) {
        // grow/shrink the current subarray [currLeft, currRight] into [query.left, query.right]
        while (currRight < query.right) addPaper(++currRight);
        while (currRight > query.right) removePaper(currRight--);
        while (currLeft > query.left) addPaper(--currLeft);
        while (currLeft < query.left) removePaper(currLeft++);
        answers[query.index] = hIndex;
    }
    for (int answer : answers) std::cout << answer << '\n';
}

Alternatively, this problem can be solved in O((N + Q)logN) with a persistent segment tree (used in a similar way as in my answer to Efficient way to find sum of largest x elements in a subarray). First, replace all citation counts greater than N with N, as the h-index cannot exceed the number of papers. Then build a persistent segment tree over the value range 1 to N in O(NlogN), where each node stores how many papers in a certain prefix of the array have a citation count that falls within its segment. Each query on a subarray [l, r] can then be answered in O(logN) by walking down the segment tree simultaneously from the roots for the prefixes ending at r and l - 1.
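Here is a minimal sketch of how that approach could look. The node layout, the global vectors, and the insert/query helpers below are my own choices for illustration, not the exact code from the linked answer. The query descent relies on the fact that the number of papers cited at least h times is non-increasing in h: at each node we can go right whenever h = mid + 1 is already achievable, and otherwise carry the right half's count along into the left half.

#include <iostream>
#include <vector>
#include <algorithm>
struct Node { int left = 0, right = 0, count = 0; };
std::vector<Node> tree(1); // node 0 doubles as the empty tree and the null child
std::vector<int> roots;    // roots[i] = version of the tree containing papers 1..i
// Return a new version of the value range [lo, hi] with `value` inserted,
// sharing every unchanged node with version `prev`.
int insert(int prev, int lo, int hi, int value) {
    int node = tree.size();
    tree.push_back(tree[prev]);
    ++tree[node].count;
    if (lo < hi) {
        int mid = (lo + hi) / 2;
        if (value <= mid) {
            int child = insert(tree[prev].left, lo, mid, value);
            tree[node].left = child; // assign after the call: the recursion may reallocate `tree`
        } else {
            int child = insert(tree[prev].right, mid + 1, hi, value);
            tree[node].right = child;
        }
    }
    return node;
}
// Largest h such that at least h papers in the subarray described by
// versions a (prefix l - 1) and b (prefix r) are cited at least h times.
// `above` counts papers in the subarray cited more than `hi` times.
int query(int a, int b, int lo, int hi, int above) {
    if (lo == hi)
        return (above + tree[b].count - tree[a].count >= lo) ? lo : lo - 1;
    int mid = (lo + hi) / 2;
    int cntRight = tree[tree[b].right].count - tree[tree[a].right].count;
    if (above + cntRight >= mid + 1) // h = mid + 1 already works, so the answer lies in [mid + 1, hi]
        return query(tree[a].right, tree[b].right, mid + 1, hi, above);
    return query(tree[a].left, tree[b].left, lo, mid, above + cntRight);
}
int main() {
    std::ios::sync_with_stdio(false);
    std::cin.tie(nullptr);
    int N, Q;
    std::cin >> N >> Q;
    tree.reserve(20 * (N + 1)); // each insertion creates O(logN) nodes
    roots.assign(N + 1, 0);
    for (int i = 1, c; i <= N; ++i) {
        std::cin >> c;
        roots[i] = insert(roots[i - 1], 1, N, std::min(c, N)); // cap citations at N as described above
    }
    for (int i = 0, l, r; i < Q; ++i) {
        std::cin >> l >> r;
        std::cout << query(roots[l - 1], roots[r], 1, N, 0) << '\n';
    }
}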

Unmitigated answered Oct 25 '25


