I have a file that contains 1 million numbers. How can I sort it efficiently, so that it doesn't stall the computer, and print ONLY the top 10?
#!/usr/bin/python3
# Find the 10 largest integers
# Don't store the whole list
import sys

def fOpen(fname):
    try:
        fd = open(fname, "r")
    except OSError:
        print("Couldn't open file.")
        sys.exit(1)
    # Convert to int so comparisons are numeric, not lexicographic
    all = [int(line) for line in fd.read().splitlines()]
    fd.close()
    return all

words = fOpen(sys.argv[1])
g = len(words)
count = 10
for i in range(0, g - 1):
    pos = i
    for j in range(i + 1, g):
        if words[j] > words[pos]:
            pos = j
    if pos != i:
        words[i], words[pos] = words[pos], words[i]
    count -= 1
    if count == 0:
        print(words[0:10])
        break
I know that this is a selection sort; I'm just not sure what the best sort to use here would be.
Integer sorting over a limited range can be done efficiently with bucket sort. Bucket sort, counting sort, radix sort, and van Emde Boas tree sorting all work best when the key size is small; for large enough keys, they become slower than comparison-based sorting algorithms.
Use the Python list sort() method to sort a list in place. The sort() method sorts string elements alphabetically and numeric elements from smallest to largest; use sort(reverse=True) to reverse the default sort order.
The sorted() function returns a sorted list from any iterable. You can specify ascending or descending order; strings are sorted alphabetically and numbers numerically. Note: you cannot sort a list that contains BOTH string values AND numeric values.
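For example, the naive approach with sorted() would look like this (the sample values are placeholders standing in for the file's contents; note that this sorts all n numbers even though only a few are needed):

```python
numbers = [17, 3, 99, 42, 8, 64, 25]  # stand-in for the file's contents
# Sort everything descending, then slice off the top 3
print(sorted(numbers, reverse=True)[:3])  # [99, 64, 42]
```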
If you only need the top 10 values, then you'd waste a lot of time sorting every single number.
Just go through the list of numbers and keep track of the top 10 largest values seen so far. Update the top ten as you go through the list, and print them out when you reach the end.
This means you only need to make a single pass through the file (i.e. Θ(n) time complexity).
A simpler problem
You can look at your problem as a generalization of finding the maximum value in a list of numbers. If you're given {2,32,33,55,13, ...}
and are asked to find the largest value, what would you do? The typical solution is to go through the list, while remembering the largest number encountered so far and comparing it with the next number.
For simplicity, let's assume we're dealing with positive numbers.
Initialize max to 0
0 < 2, so max = 2
2 < 32, so max = 32
32 < 33, so max = 33
33 < 55, so max = 55
55 > 13, so max = 55
...
return max
So you see, we can find the max in a single traversal of the list, as opposed to any kind of comparison sort.
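The walkthrough above can be sketched directly in Python (the function name is illustrative, and positive inputs are assumed as stated):

```python
def find_max(numbers):
    # Assumes positive numbers, as in the walkthrough above
    maximum = 0
    for n in numbers:
        if n > maximum:
            maximum = n  # remember the largest value seen so far
    return maximum

print(find_max([2, 32, 33, 55, 13]))  # 55
```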
Generalizing
Finding the top 10 values in a list is very similar. The only difference is that we need to keep track of the top 10 instead of just the max (top 1).
The bottom line is that you need some container that holds 10 values. As you're iterating through your giant list of numbers, the only value you care about in your size-10-container is the minimum. That's because this is the number that would be replaced if you've discovered a new number that deserves to be in the top-10-so-far.
Anyway it turns out that the data structure best fit for finding mins quickly is a min heap. But I'm not sure if you've learned about heaps yet, and the overhead of using a heap for 10 elements could possibly outweigh its benefits.
Any container that holds 10 elements and can obtain the min in a reasonable amount of time would be a good start.
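As a sketch of the min-heap idea using Python's heapq module (the function name and sample inputs are illustrative), keep a heap of the k largest values seen so far and replace its minimum whenever a bigger number comes along:

```python
import heapq

def top10(numbers, k=10):
    # Min-heap holding the k largest values seen so far
    heap = []
    for n in numbers:
        if len(heap) < k:
            heapq.heappush(heap, n)
        elif n > heap[0]:  # heap[0] is the smallest of the current top k
            heapq.heapreplace(heap, n)
    return sorted(heap, reverse=True)
```

Each of the n numbers costs at most O(log k) heap work, so the whole pass is O(n log k) with only k values in memory.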
The best sort is a partial sort, available in the Python standard library as heapq.nlargest.
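A sketch of how that might look for the original problem (the filename is hypothetical, and the small sample file written here stands in for the real million-number file):

```python
import heapq

# Write a small sample file (stands in for the real million-number file)
with open("numbers.txt", "w") as fd:
    fd.write("\n".join(str(n) for n in [17, 3, 99, 42, 8, 64, 25, 91, 5, 73]))

# nlargest consumes the generator lazily, so the whole file
# is never held in memory at once
with open("numbers.txt") as fd:
    top = heapq.nlargest(10, (int(line) for line in fd))
print(top)
```

heapq.nlargest maintains exactly the kind of small min-heap described above, so it gives the O(n log k) single-pass behavior without writing the heap logic yourself.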