Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why can’t you use Hash Tables/Dictionaries in Counting Sort algorithm?

When you use the counting sort algorithm you create a list, and use its indices as keys while adding the number of integer occurrences as the values within the list. Why is this not the same as simply creating a dictionary with the keys as the index and the counts as the values? Such as:

hash_table = collections.Counter(numList) 

or

hash_table = {x:numList.count(x) for x in numList} 

Once you have your hash table created you essentially just copy the number of integer occurrences over to another list. Hash Tables/Dictionaries have O(1) lookup times, so why would this not be preferable if your simply referencing the key/value pairs?

I've included the algorithm for Counting Sort below for reference:

def counting_sort(the_list, max_value):
    # List of 0's at indices 0...max_value
    num_counts = [0] * (max_value + 1)

    # Populate num_counts
    for item in the_list:
        num_counts[item] += 1

    # Populate the final sorted list
    sorted_list = []

    # For each item in num_counts
    for item, count in enumerate(num_counts):

        # For the number of times the item occurs
        for _ in xrange(count):

            # Add it to the sorted list
            sorted_list.append(item)

    return sorted_list
like image 988
William Merritt Avatar asked Dec 16 '18 16:12

William Merritt


1 Answers

You certainly can do something like this. The question is whether it’s worthwhile to do so.

Counting sort has a runtime of O(n + U), where n is the number of elements in the array and U is the maximum value. Notice that as U gets larger and larger the runtime of this algorithm starts to degrade noticeably. For example, if U > n and I add one more digit to U (for example, changing it from 1,000,000 to 10,000,000), the runtime can increase by a factor of ten. This means that counting sort starts to become impractical as U gets bigger and bigger, and so you typically run counting sort when U is fairly small. If you’re going to run counting sort with a small value of U, then using a hash table isn’t necessarily worth the overhead. Hashing items costs more CPU cycles than just doing standard array lookups, and for small arrays the potential savings in memory might not be worth the extra time. And if you’re using a very large value of U, you’re better off switching to radix sort, which essentially is lots of smaller passes of counting sort with a very small value of U.

The other issue is that the reassembly step of counting sort has amazing locality of reference - you simply scan over the counts array and the input array in parallel filling in values. If you use a hash table, you lose some of that locality because th elements in the hash table aren’t necessarily stored consecutively.

But these are more implementation arguments than anything else. Fundamentally, counting sort is less about “use an array” and more about “build a frequency histogram.” It just happens to be the case that a regular old array is usually preferable to a hash table when building that histogram.

like image 185
templatetypedef Avatar answered Sep 28 '22 13:09

templatetypedef