Here is a description of an algorithm to sort strings using a trie:

The algorithm first inserts all the items into the trie in O(n) time, where n is the total number of characters in the list of words to be sorted. Then it traverses the trie depth-first, visiting children in alphabetical order, and prints out the prefix accumulated so far whenever it reaches a node with the is_end flag set. This requires a complete traversal of the trie, which takes O(m) time, where m is the number of nodes in the trie. Since m is bounded by n, this step is bounded by O(n) as well.
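The description above can be sketched in Python; the function name and the dict-based node representation are my own choices, not part of the question:

```python
def trie_sort(words):
    """Sort a list of lowercase strings by inserting them into a trie
    and reading them back with a depth-first traversal."""
    # Each node is a dict mapping a character to a child node; the "end"
    # key plays the role of the is_end flag, with a count so that
    # duplicate words are preserved.
    root = {}
    for word in words:                      # O(n) over all characters
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["end"] = node.get("end", 0) + 1

    out = []
    def walk(node, prefix):                 # O(m) over all trie nodes
        # Emit the accumulated prefix once per stored copy of the word.
        for _ in range(node.get("end", 0)):
            out.append(prefix)
        # Visiting children in alphabetical order; for a fixed alphabet
        # this sort is O(1) per node.
        for ch in sorted(k for k in node if k != "end"):
            walk(node[ch], prefix + ch)
    walk(root, "")
    return out
```

Calling `trie_sort(["banana", "apple", "app"])` returns the words in lexicographic order.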
The whole algorithm is made up of two sub-routines, each bounded by O(n). If we say, e.g., that the average word contains c characters and that m is the number of words, then cm = n, and, treating c as a constant, the total runtime is bounded by O(n) = O(cm) = O(m). (The reason I switched to m is that it is the traditional measure of the length of a list to be sorted, rather than the total number of characters.)
Therefore, my question is: if this runtime analysis is correct, why isn't this the default method for string sorting, since it would be faster than any O(n log n) sorting algorithm?
The O(n log n) lower bound is for comparison sorts, i.e. sorts in which the elements of the array can only be compared with each other to check whether one should come before or after the other, or whether they are equal. That is a good model for general sorting algorithms because it works for pretty much any type of data you might want to sort: numbers, strings, instances of user-defined classes, and so on. It could even be a data type that can be mapped by a key function to some other data type supporting comparisons, or you could accept a comparator function to do the comparisons with.
Note that O(n log n) here is a lower bound on the number of comparisons, not the running time. If the comparisons take more than O(1) time each, say because you're comparing long strings that have long common prefixes, then the running time would be like O(cn log n) where comparisons are done in O(c) time. Comparing strings of length w takes O(w) time in the worst case, for example.
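To see that the bound counts comparisons rather than wall-clock time, you can instrument a library sort with a counting comparator. This sketch (the counter and comparator are my own scaffolding) counts how many comparisons Python's `sorted` performs on 1000 shuffled keys; the total grows like n log n regardless of how expensive each individual comparison is:

```python
import functools
import random

comparisons = 0

def counting_cmp(a, b):
    """Ordinary three-way string comparison, but tallying each call."""
    global comparisons
    comparisons += 1
    return (a > b) - (a < b)

words = [f"key{i:05d}" for i in range(1000)]
random.seed(42)
random.shuffle(words)

# sorted() only ever learns about the elements through counting_cmp,
# which is exactly the comparison-sort model.
result = sorted(words, key=functools.cmp_to_key(counting_cmp))
```

For n = 1000, the count lands in the vicinity of n log2 n ≈ 10,000, far below the n(n-1)/2 ≈ 500,000 comparisons a quadratic sort would need.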
If you only need a sorting algorithm for a specific type of data, then you may be able to do better, because other operations specific to that data type can be performed on the elements. For example, when sorting integers, you can use the array elements to index another array, giving the counting sort algorithm which runs in O(n + r) time where r is the range of the array elements.
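A minimal counting sort sketch, assuming non-negative integers below a known range r (that restriction is a simplification on my part):

```python
def counting_sort(arr, r):
    """Sort non-negative integers less than r in O(n + r) time by using
    each element directly as an index into a table of counts."""
    counts = [0] * r
    for x in arr:                         # O(n): tally each value
        counts[x] += 1
    out = []
    for value, c in enumerate(counts):    # O(r): emit values in order
        out.extend([value] * c)
    return out
```

No element is ever compared with another; the indexing operation replaces comparisons entirely, which is why the comparison-sort lower bound does not apply.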
If the sorting keys are like strings, in the sense that they are (or can be mapped to) sequences such that comparing keys is equivalent to lexicographically comparing those sequences, then indeed you can use a trie to sort an array containing that data type. Congratulations: you have independently reinvented the radix sort algorithm, which can be implemented using tries. Its running time is O(wn), not O(n), because it takes O(w) time to insert a string of length w into the trie and you have to do that n times.
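The trie is one way to implement it; an equivalent formulation does one stable bucket pass per character position. A minimal LSD radix sort for equal-length lowercase strings (the fixed-length and lowercase restrictions are simplifications of this sketch, not requirements of radix sort in general):

```python
def lsd_radix_sort(words):
    """Sort equal-length lowercase strings in O(wn) time: one stable
    bucketing pass per character position, from last to first."""
    if not words:
        return []
    w = len(words[0])
    for pos in range(w - 1, -1, -1):      # w passes ...
        buckets = [[] for _ in range(26)]
        for word in words:                # ... each touching all n words
            buckets[ord(word[pos]) - ord("a")].append(word)
        # Stable concatenation preserves the order established by
        # earlier (less significant) passes.
        words = [word for bucket in buckets for word in bucket]
    return words
```

Each of the w passes does O(n) work, giving the O(wn) total mentioned above.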
So, if the elements aren't strings, or "string-like" in the above sense, then radix sort simply isn't applicable. If the elements are strings or "string-like", then radix sort works but it takes O(wn) time instead of O(cn log n).
This means radix sort isn't strictly better, and is probably worse when the common prefixes of the strings are short relative to the strings themselves, which is often the case. For random strings, regular string comparison takes O(1) time on average, in which case O(n log n) is asymptotically better than radix sort for strings longer than O(log n).
In practical applications, one should also consider the hidden constants in the asymptotic analysis. Comparison sorts like Timsort have low hidden constants because they access array elements sequentially, which results in fewer cache misses compared to walking a tree whose nodes won't be consecutive in memory.