Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I sort 1 million numbers, and only print the top 10 in Python?

Tags:

python

I have a file that has 1 million numbers. I need to know how I can sort it efficiently, so that it doesn't stall the computer, and it prints ONLY the top 10.

#!/usr/bin/python3

#Find the 10 largest integers
#Don't store the whole list

import sys

def fOpen(fname):
        try:
                fd = open(fname,"r")
        except:
                print("Couldn't open file.")
                sys.exit(0)
        all = fd.read().splitlines()
        fd.close()
        return all

words = fOpen(sys.argv[1])

big = 0
g = len(words)
count = 10

for i in range(0,g-1):
        pos = i
        for j in range(i+1,g):
                if words[j] > words[pos]:
                        pos = j
                if pos != i:
                        words[i],words[pos] = words[pos],words[i]
                count -= 1
                if count == 0:
                        print(words[0:10])

I know that this is selection sort, I'm not sure what would be the best sort to do.

like image 692
Seth Kania Avatar asked Feb 10 '12 23:02

Seth Kania


People also ask

What is the best way to sort 1 million integers?

Integer Sorting with limited range is achieved efficiently with Bucket Sorting. Bucket sort, counting sort, radix sort, and van Emde Boas tree sorting all work best when the key size is small; for large enough keys, they become slower than comparison sorting algorithm.

How do you sort a list of numbers in Python from least to greatest?

Use the Python List sort() method to sort a list in place. The sort() method sorts the string elements in alphabetical order and sorts the numeric elements from smallest to largest. Use the sort(reverse=True) to reverse the default sort order.

What is the best way to read a list of 1 million elements in Python?

The fastest way to repeatedly lookup data with millions of entries in Python is using dictionaries. Because dictionaries are the built-in mapping type in Python thereby they are highly optimized.

How do you sort numerical order in Python?

Python sorted() Function The sorted() function returns a sorted list of the specified iterable object. You can specify ascending or descending order. Strings are sorted alphabetically, and numbers are sorted numerically. Note: You cannot sort a list that contains BOTH string values AND numeric values.


2 Answers

If you only need the top 10 values, then you'd waste a lot of time sorting every single number.

Just go through the list of numbers and keep track of the top 10 largest values seen so far. Update the top ten as you go through the list, and print them out when you reach the end.

This will mean you only need to make a single pass through the file (ie time complexity of theta(n))

A simpler problem

You can look at your problem as a generalization of finding the maximum value in a list of numbers. If you're given {2,32,33,55,13, ...} and are asked to find the largest value, what would you do? The typical solution is to go through the list, while remembering the largest number encountered so far and comparing it with the next number.

For simplicity, let's assume we're dealing with positive numbers.

Initialize max to 0
0 < 2, so max = 2
2 < 32, so max = 32
32 < 33, so max = 33
33 < 55, so max = 55
55 > 13, so max = 55
...
return max

So you see, we can find the max in a single traversal of the list, as opposed to any kind of comparison sort.

Generalizing

Finding the top 10 values in a list is very similar. The only difference is that we need to keep track of the top 10 instead of just the max (top 1).

The bottom line is that you need some container that holds 10 values. As you're iterating through your giant list of numbers, the only value you care about in your size-10-container is the minimum. That's because this is the number that would be replaced if you've discovered a new number that deserves to be in the top-10-so-far.

Anyway it turns out that the data structure best fit for finding mins quickly is a min heap. But I'm not sure if you've learned about heaps yet, and the overhead of using a heap for 10 elements could possibly outweigh its benefits.

Any container that holds 10 elements and can obtain the min in a reasonable amount of time would be a good start.

like image 158
pepsi Avatar answered Oct 04 '22 12:10

pepsi


The best sort is a partial sort, available in the Python library as heapq.nlargest.

like image 45
Fred Foo Avatar answered Oct 04 '22 10:10

Fred Foo