Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Rank the suffixes of a list

ranking an element x in an array/list is just to find out how many elements in the array/list that strictly smaller than x.

So ranking a list is just get ranks of all elements in the list.

For example, rank [51, 38, 29, 51, 63, 38] = [3, 1, 0, 3, 5, 1], i.e., there are 3 elements smaller than 51, etc.

Ranking a list can be done in O(NlogN). Basically, we can sort the list while remembering the original index of each element, and then see for each element, how many before it.


The question here is How to rank the suffixes of a list, in O(NlogN)?

Ranking the suffixes of a list means:

for list [3; 1; 2], rank [[3;1;2]; [1;2]; [2]]

note that elements may not be distinct.


edit

We don't need to print out all elements for all suffixes. You can image that we just need to print out a list/array, where each element is a rank of a suffix.

For example, rank suffix_of_[3;1;2] = rank [[3;1;2]; [1;2]; [2]] = [2;0;1] and you just print out [2;0;1].


edit 2

Let me explain what is all suffixes and what means sorting/ranking all suffixes more clearly here.

Suppose we have an array/list [e1;e2;e3;e4;e5].

Then all suffixes of [e1;e2;e3;e4;e5] are:

[e1;e2;e3;e4;e5]
[e2;e3;e4;e5]
[e3;e4;e5]
[e4;e5]
[e5]

for example, all suffixes of [4;2;3;1;0] are

[4;2;3;1;0]
[2;3;1;0]
[3;1;0]
[1;0]
[0]

Sorting above 5 suffixes implies lexicographic sort. sorting above all suffixes, you get

[0]
[1;0]
[2;3;1;0]
[3;1;0]
[4;2;3;1;0]

by the way, if you can't image how 5 lists/arrays can be sorted among them, just think of sorting strings in lexicographic order.

"0" < "10" < "2310" < "310" < "42310"


It seems sorting all suffixes is actually sorting all elements of the original array.

However, please be careful that all elements may not be distinct, for example

for [4;2;2;1;0], all suffixes are:

[4;2;2;1;0]
[2;2;1;0]
[2;1;0]
[1;0]
[0]

then the order is

[0]
[1;0]
[2;1;0]
[2;2;1;0]
[4;2;2;1;0]

like image 314
Jackson Tale Avatar asked Feb 15 '23 01:02

Jackson Tale


2 Answers

As MBo noted correctly, your problem is that of constructing the suffix array of your input list. The fast and complicated algorithms to do this are actually linear time, but since you only aim for O(n log n), I will try to propose a simpler version that is much easier to implement.

Basic idea and an initial O(n log² n) implementation

Let's take the sequence [4, 2, 2, 1] as an example. Its suffixes are

0: 4 2 2 1
1: 2 2 1
2: 2 1
3: 1

I numbered the suffixes with their starting index in the original sequence. Ultimately we want to sort this set of suffixes lexicographically, and fast. We know we can represent each suffix using its starting index in constant space and we can sort in O(n log n) comparisons using merge sort, heap sort or a similar algorithm. So the question remains, how can we compare two suffixes fast?

Let's say we want to compare the suffixes [2, 2, 1] and [2, 1]. We can pad those with negative infinity values changing the result of the comparison: [2, 2, 1, -∞] and [2, 1, -∞, -∞].

Now the key idea here is the following divide-and-conquer observation: Instead of comparing the sequences character by character until we find a position where the two differ, we can instead split both lists in half and compare the halves lexicographically:

     [a, b, c, d]     < [e, f, g, h] 
 <=> ([a, b], [c, d]) < ([e, f], [g, h])
 <=> [a, b] < [e, f] or ([a, b,] = [e, f] and [c, d] < [g, h])

Essentially we have decomposed the problem of comparing the sequences into two problems of comparing smaller sequences. This leads to the following algorithm:

Step 1: Sort the substrings (contiguous subsequences) of length 1. In our example, the substrings of length 1 are [4], [2], [2], [1]. Every substring can be represented by the starting position in the original list. We sort them by a simple comparison sort and get [1], [2], [2], [4]. We store the result by assigning to every position it's rank in the sorted lists of lists:

position   substring   rank
0          [4]         2
1          [2]         1
2          [2]         1
3          [1]         0

It is important that we assign the same rank to equal substrings!

Step 2: Now we want to sort the substrings of length 2. The are only really 3 such substrings, but we assign one to every position by padding with negative infinity if necessary. The trick here is that we can use our divide-and-conquer idea from above and the ranks assigned in step 1 to do a fast comparison (this isn't really necessary yet but will become important later).

position  substring    halves        ranks from step 1   final rank
0         [4,  2]      ([4], [2])    (2,  1)             3               
1         [2,  2]      ([2], [2])    (1,  1)             2
2         [2,  1]      ([2], [2])    (1,  0)             1
3         [1, -∞]      ([1], [-∞])   (0, -∞)             0

Step 3: You guessed it, now we sort substrings of length 4 (!). These are exactly the suffixes of the list! We can use the divide-and-conquer trick and the results from step 2 this time:

position  substring         halves              ranks from step 2   final rank
0         [4,  2,  2,  1]   ([4, 2], [2,  1])   (3,  1)             3
1         [2,  2,  1, -∞]   ([2, 2], [1, -∞])   (2,  0)             2
2         [2,  1, -∞, -∞]   ([2, 1], [-∞,-∞])   (1, -∞)             1
3         [1, -∞, -∞, -∞]   ([1,-∞], [-∞,-∞])   (0, -∞)             0

We're done! If our initial sequence would have had size 2^k, we would have needed k steps. Or put the other way round, we need log_2 n steps to process a sequence of size n. If its length is not a power of two, we just pad with negative infinity.

For an actual implementation we just need to remember the sequence "final rank" for every step of the algorithm.

An implementation in C++ could look like this (compile with -std=c++11):

#include <algorithm>
#include <iostream>
using namespace std;

int seq[] = {8, 3, 2, 4, 2, 2, 1};
const int n = 7;
const int log2n = 3;       // log2n = ceil(log_2(n))
int Rank[log2n + 1][n];    // Rank[i] will save the final Ranks of step i
tuple<int, int, int> L[n]; // L is a list of tuples. in step i,
                           // this will hold pairs of Ranks from step i - 1
                           // along with the substring index
const int neginf = -1;     // should be smaller than all the numbers in seq
int main() {
  for (int i = 0; i < n; ++i)
    Rank[1][i] = seq[i]; // step 1 is actually simple if you think about it
  for (int step = 2; step <= log2n; ++step) {
    int length = 1 << (step - 1); // length is 2^(step - 1)
    for (int i = 0; i < n; ++i)
      L[i] = make_tuple(
                Rank[step - 1][i],
                (i + length / 2 < n) ? Rank[step - 1][i + length / 2] : neginf,
                i); // we need to know where the tuple came from later
    sort(L, L + n); // lexicographical sort
    for (int i = 0; i < n; ++i) {
      // we save the rank of the index, but we need to be careful to
      // assign equal ranks to equal pairs
      Rank[step][get<2>(L[i])] = (i > 0 && get<0>(L[i]) == get<0>(L[i - 1])
                                        && get<1>(L[i]) == get<1>(L[i - 1]))
                                    ? Rank[step][get<2>(L[i - 1])] 
                                    : i;
    }
  }
  // the suffix array is in L after the last step
  for (int i = 0; i < n; ++i) {
    int start = get<2>(L[i]);
    cout << start << ":";
    for (int j = start; j < n; ++j)
      cout << " " << seq[j];
    cout << endl;
  }
}

Output:

6: 1
5: 2 1
4: 2 2 1
2: 2 4 2 2 1
1: 3 2 4 2 2 1
3: 4 2 2 1
0: 8 3 2 4 2 2 1

The complexity is O(log n * (n + sort)), which is O(n log² n) in this implementation because we use a comparison sort of complexity O(n log n)

A simple O(n log n) algorithm

If we manage to do the sorting parts in O(n) per step, we get a O(n log n) bound. So basically we have to sort a sequence of pairs (x, y), where 0 <= x, y < n. We know that we can sort a sequence of integers in the given range in O(n) time using counting sort. We can intepret our pairs (x, y) as numbers z = n * x + y in base n. We can now see how to use LSD radix sort to sort the pairs. In practice, this means we sort the pairs by increasing y using counting sort, and then use counting sort again to sort by increasing x. Since counting sort is stable, this gives us the lexicographical order of our pairs in 2 * O(n) = O(n). The final complexity is thus O(n log n).

In case you are interested, you can find an O(n log² n) implementation of the approach at my Github repo. The implementation has 27 lines of code. Neat, ain't it?

like image 157
Niklas B. Avatar answered Feb 23 '23 03:02

Niklas B.


This is exactly suffix array construction problem, and wiki page contains links to the linear-complexity algorithms (probably, depending on alphabet)

like image 32
MBo Avatar answered Feb 23 '23 02:02

MBo