I have a file that contains 1 million numbers. How can I sort it efficiently, so that it doesn't stall the computer, and print ONLY the top 10?
#!/usr/bin/python3
# Find the 10 largest integers
# Don't store the whole list
import sys

def fOpen(fname):
    try:
        fd = open(fname, "r")
    except OSError:
        print("Couldn't open file.")
        sys.exit(1)
    # Convert to int so comparisons are numeric, not lexicographic
    all = [int(line) for line in fd.read().splitlines()]
    fd.close()
    return all

words = fOpen(sys.argv[1])
g = len(words)
count = 10
for i in range(0, g - 1):
    pos = i
    for j in range(i + 1, g):
        if words[j] > words[pos]:
            pos = j
    if pos != i:
        words[i], words[pos] = words[pos], words[i]
    count -= 1
    if count == 0:
        print(words[0:10])
        break
I know that this is a selection sort; I'm just not sure what the best sort to use here would be.
Integer sorting over a limited range can be done efficiently with bucket sort. Bucket sort, counting sort, radix sort, and van Emde Boas tree sorting all work best when the key size is small; for large enough keys, they become slower than comparison-based sorting algorithms.
Use the Python list sort() method to sort a list in place. The sort() method sorts string elements alphabetically and numeric elements from smallest to largest; use sort(reverse=True) to reverse the default sort order.
The sorted() function returns a sorted list from any iterable. You can specify ascending or descending order; strings are sorted alphabetically and numbers numerically. Note: you cannot sort a list that contains BOTH string values AND numeric values.
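For example, the naive approach with sorted() would look like this (the sample values are placeholders standing in for the file's contents; note that this sorts all n numbers even though only a few are needed):

```python
numbers = [17, 3, 99, 42, 8, 64, 25]  # stand-in for the file's contents
# Sort everything descending, then slice off the top 3
print(sorted(numbers, reverse=True)[:3])  # [99, 64, 42]
```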
If you only need the top 10 values, then you'd waste a lot of time sorting every single number.
Just go through the list of numbers and keep track of the top 10 largest values seen so far. Update the top ten as you go through the list, and print them out when you reach the end.
This means you only need to make a single pass through the file (i.e. Θ(n) time complexity).
A simpler problem
You can look at your problem as a generalization of finding the maximum value in a list of numbers. If you're given {2,32,33,55,13, ...}
and are asked to find the largest value, what would you do? The typical solution is to go through the list, while remembering the largest number encountered so far and comparing it with the next number.
For simplicity, let's assume we're dealing with positive numbers.
Initialize max to 0
0 < 2, so max = 2
2 < 32, so max = 32
32 < 33, so max = 33
33 < 55, so max = 55
55 > 13, so max = 55
...
return max
So you see, we can find the max in a single traversal of the list, as opposed to any kind of comparison sort.
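The walkthrough above can be sketched directly in Python (the function name is illustrative, and positive inputs are assumed as stated):

```python
def find_max(numbers):
    # Assumes positive numbers, as in the walkthrough above
    maximum = 0
    for n in numbers:
        if n > maximum:
            maximum = n  # remember the largest value seen so far
    return maximum

print(find_max([2, 32, 33, 55, 13]))  # 55
```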
Generalizing
Finding the top 10 values in a list is very similar. The only difference is that we need to keep track of the top 10 instead of just the max (top 1).
The bottom line is that you need some container that holds 10 values. As you're iterating through your giant list of numbers, the only value you care about in your size-10-container is the minimum. That's because this is the number that would be replaced if you've discovered a new number that deserves to be in the top-10-so-far.
Anyway it turns out that the data structure best fit for finding mins quickly is a min heap. But I'm not sure if you've learned about heaps yet, and the overhead of using a heap for 10 elements could possibly outweigh its benefits.
Any container that holds 10 elements and can obtain the min in a reasonable amount of time would be a good start.
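As a sketch of the min-heap idea using Python's heapq module (the function name and sample inputs are illustrative), keep a heap of the k largest values seen so far and replace its minimum whenever a bigger number comes along:

```python
import heapq

def top10(numbers, k=10):
    # Min-heap holding the k largest values seen so far
    heap = []
    for n in numbers:
        if len(heap) < k:
            heapq.heappush(heap, n)
        elif n > heap[0]:  # heap[0] is the smallest of the current top k
            heapq.heapreplace(heap, n)
    return sorted(heap, reverse=True)
```

Each of the n numbers costs at most O(log k) heap work, so the whole pass is O(n log k) with only k values in memory.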
The best sort is a partial sort, available in the Python standard library as heapq.nlargest.
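A sketch of how that might look for the original problem (the filename is hypothetical, and the small sample file written here stands in for the real million-number file):

```python
import heapq

# Write a small sample file (stands in for the real million-number file)
with open("numbers.txt", "w") as fd:
    fd.write("\n".join(str(n) for n in [17, 3, 99, 42, 8, 64, 25, 91, 5, 73]))

# nlargest consumes the generator lazily, so the whole file
# is never held in memory at once
with open("numbers.txt") as fd:
    top = heapq.nlargest(10, (int(line) for line in fd))
print(top)
```

heapq.nlargest maintains exactly the kind of small min-heap described above, so it gives the O(n log k) single-pass behavior without writing the heap logic yourself.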