I have a set of strings; 90% of them are URLs starting with "http://www.". I want to sort them alphabetically.
Currently I use C++ std::sort(), but std::sort is a comparison-based variant of quicksort, and comparing two strings that share a long common prefix is not efficient. However, (I think) a radix sort won't work either, since most strings would land in the same bucket because of the long common prefix.
Is there any better algorithm than normal quick-sort/radix-sort for this problem?
In all cases tested, burstsort is the fastest string sorting method; for the largest data set, it is almost twice as fast as the best of the previous methods, and almost four times as fast as an efficient implementation of quicksort (Bentley & McIlroy 1993).
The longest common prefix of an array of strings is the common prefix of its two most dissimilar strings (the lexicographically smallest and largest). For example, in the array {"apple", "ape", "zebra"}, there is no common prefix, because the two most dissimilar strings, "ape" and "zebra", do not share any starting characters.
I would suspect that the processing time you spend trying to exploit common prefixes on the order of 10 characters per URL doesn't even pay for itself when you consider the average length of URLs.
Just try a completely standard sort first. If that's not fast enough, look at parallelizing or distributing it. It's a straightforward approach that will work.
Common prefixes naturally suggest that a trie data structure could be useful. The idea is to build a trie of all the words, keeping each node's children in a sorted list; since at any given node we only need to order that node's children, a recursive solution reveals itself naturally. See this for more inspiration: http://goanna.cs.rmit.edu.au/~jz/fulltext/acsc03sz.pdf