 

Is it true that sorting strings is O(n^2logn)? [duplicate]

I read the following:

Sorting takes O(N log N), so how is it O(N^2 log N)? What we miss here is that comparing two strings is not O(1); in the worst case, it takes O(N). So the final complexity is O(N^2 log N).

Is this correct? I had always assumed that sorting was O(N log N), but now I feel a little thrown off because here it has become O(N^2 log N).

If someone could explain why it's O(N^2 log N), that would be great.

Edit: quote taken from here: https://www.hackerrank.com/challenges/string-similarity/topics/suffix-array

asked Dec 16 '15 by jcm



2 Answers

Think of it this way.

When you are sorting numbers, deciding which of two numbers is bigger takes only one comparison.

But when sorting strings, deciding which string is bigger can require comparing all the characters of the two strings (e.g. comparing hello and hella takes 5 character comparisons).

So in that case, every string comparison takes time proportional to the length of the strings. If all the strings have the same length (say l), the complexity becomes O(l * n log n).


Do not confuse n with l. In any time complexity, n stands for the number of inputs; in your case that is the number of strings. The complexity becomes O(n^2 log n) only when the length of the strings is also n.

Otherwise, taking the length of the strings as l, the complexity is O(l * n log n).
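
To make the l factor concrete, here is a minimal Python sketch (my own illustration, not part of the original answer): it sorts n strings of equal length l with a comparison function that counts every character it looks at. The strings share a long common prefix, so each comparison costs close to l characters, which is the worst case described above.

import random
import string
from functools import cmp_to_key

char_comparisons = 0

def compare(a, b):
    # Compare character by character, counting each position examined.
    global char_comparisons
    for x, y in zip(a, b):
        char_comparisons += 1
        if x != y:
            return -1 if x < y else 1
    return len(a) - len(b)

n, l = 1000, 50
prefix = "x" * (l - 1)   # long shared prefix forces near-worst-case comparisons
data = [prefix + random.choice(string.ascii_lowercase) for _ in range(n)]
sorted(data, key=cmp_to_key(compare))
print(f"{char_comparisons} character comparisons for n={n}, l={l}")

The printed count is roughly l times the ~n log n string comparisons the sort performs, which is where O(l * n log n) comes from.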

answered Oct 04 '22 by Haris


You extracted that quote out of context; it's worth putting it back into context to understand what's going on.

The context is that we start with some string S of length N, and wish to sort the set of all possible suffixes of that string.

There are N possible non-empty suffixes, one for each character position, and the average length of a string in that set is (N+1)/2 (since every length from 1 to N occurs exactly once).

If we assume that the expected cost of comparing two strings of average length L is O(L), that the expected number of comparisons needed to sort N objects is O(N log N), and finally that L is (N+1)/2, then the expected time to comparison-sort this particular set of strings is O((N+1)/2 × N × log N), which simplifies to O(N^2 log N).
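
As a rough illustration of that accounting (my sketch, not from the cited article), you can build every suffix of S, shuffle them, and comparison-sort them while charging one unit per character examined. A string made of a single repeated character forces every comparison to scan the whole shorter suffix, which is the worst case the quote is describing.

import random
from functools import cmp_to_key

def count_suffix_sort_chars(s):
    chars_examined = 0

    def compare(a, b):
        nonlocal chars_examined
        for x, y in zip(a, b):
            chars_examined += 1
            if x != y:
                return -1 if x < y else 1
        return len(a) - len(b)

    suffixes = [s[i:] for i in range(len(s))]   # N suffixes, average length (N+1)/2
    random.shuffle(suffixes)                    # avoid handing the sort an already-ordered run
    sorted(suffixes, key=cmp_to_key(compare))
    return chars_examined

print(count_suffix_sort_chars("a" * 2000))      # grows roughly like N^2 log N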

But the expected cost of comparing two strings of length L is not actually O(L), because it is generally not necessary to look at every character position of the two strings in order to compare them. It is sufficient to look at characters until the first mismatch is found. If the strings are randomly distributed in the universe of possible strings, then the expected number of characters which need to be examined is actually O(1) (by a simple probabilistic argument), although the worst-case cost is O(L).
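
Here is a small simulation of that probabilistic argument (my sketch, assuming uniformly random strings over a 26-letter alphabet): the expected number of characters examined before the first mismatch is about 26/25 ≈ 1.04, a constant independent of L.

import random
import string

def chars_examined(a, b):
    # Number of positions looked at before the first mismatch (or end of string).
    count = 0
    for x, y in zip(a, b):
        count += 1
        if x != y:
            break
    return count

L, trials = 100, 100_000
total = 0
for _ in range(trials):
    a = "".join(random.choices(string.ascii_lowercase, k=L))
    b = "".join(random.choices(string.ascii_lowercase, k=L))
    total += chars_examined(a, b)

print(total / trials)   # close to 1/(1 - 1/26) ≈ 1.04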

In the case of quicksort (at least), there is a counter-balance, which is the tendency for comparisons in later phases of the quicksort to be between strings which are closer together. In this case, the number of characters needed to be examined tends to approach log N, which is the expected length of the longest common prefix between two consecutive strings in a sorted uniformly-distributed sample of N strings.

So the actual expected cost of the sort should be O(N log^2 N).
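
To see that neighbour effect numerically (again my illustration, assuming uniformly random strings), sort N random strings and measure the average longest common prefix of consecutive entries; it grows roughly like log N taken to the base of the alphabet size, not like L.

import math
import random
import string

def lcp(a, b):
    # Length of the longest common prefix of a and b.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

N, L, sigma = 50_000, 40, 26
sample = sorted("".join(random.choices(string.ascii_lowercase, k=L)) for _ in range(N))
avg_lcp = sum(lcp(a, b) for a, b in zip(sample, sample[1:])) / (N - 1)
print(f"average neighbour LCP: {avg_lcp:.2f}, log_sigma(N): {math.log(N, sigma):.2f}")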

As a further complication, it is possible to modify the quicksort algorithm to track the LCPs of each partition, which might help restore the expected number of comparisons to something smaller than log N.

(A similar argument can be applied to radix sort, which also eliminates the apparent O(L) term from the expected -- but not worst-case -- time. The cited article proceeds to provide a much better algorithm for the suffix tree problem, which avoids comparison-sorting.)

answered Oct 04 '22 by rici