Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Retrieve the two highest item from a list containing 100,000 integers

How can retrieve the two highest item from a list containing 100,000 integers without having to sort the entire list first?

like image 659
Joey Avatar asked Apr 29 '10 16:04

Joey


Video Answer


2 Answers

Use heapq.nlargest. This is the most flexible approach in case you ever want to handle more than just the top two elements.

Here's an example.

>>> import heapq >>> import random >>> x = range(100000) >>> random.shuffle(x) >>> heapq.nlargest(2, x) [99999, 99998] 
like image 103
FogleBird Avatar answered Oct 12 '22 21:10

FogleBird


JacobM's answer is absolutely the way to go. However, there are a few things to keep in mind while implementing what he described. Here's a little play-along-at-home tutorial to guide you through the trickier parts of solving this problem.

If this code is meant for production use, please use one of the more efficient/concise answers listed. This answer is targetted at someone new to programming.

The idea

The idea is simple.

  • Keep two variables: largest and second_largest.
  • Go through the list.
    • If an item is greater than largest, assign it to largest.
    • If an item is greater than second_largest, but less than largest, assign it to second_largest.

Getting started

Let's start.

def two_largest(inlist):     """Return the two largest items in the sequence. The sequence must     contain at least two items."""     for item in inlist:         if item > largest:             largest = item         elif largest > item > second_largest:             second_largest = item     # Return the results as a tuple     return largest, second_largest  # If we run this script, it will should find the two largest items and # print those if __name__ == "__main__":     inlist = [3, 2, 1]     print two_largest(inlist) 

Okay, we now have JacobM's answer as a Python function. What happens when we try to run it?

Traceback (most recent call last):   File "twol.py", line 10, in <module>     print two_largest(inlist)   File "twol.py", line 3, in two_largest     if item > largest: UnboundLocalError: local variable 'largest' referenced before assignment 

Apparently, we need to set largest before we start the loop. This probably means we should set second_largest too.

Initializing variables

Let's set largest and second_largest to 0.

def two_largest(inlist):     """Return the two largest items in the sequence. The sequence must     contain at least two items."""     largest = 0 # NEW!     second_largest = 0 # NEW!     for item in inlist:         if item > largest:             largest = item         elif largest > item > second_largest:             second_largest = item     # Return the results as a tuple     return largest, second_largest  # If we run this script, it will should find the two largest items and # print those if __name__ == "__main__":     inlist = [3, 2, 1]     print two_largest(inlist) 

Good. Let's run it.

(3, 2) 

Great! Now let's test with inlist being [1, 2, 3]

    inlist = [1, 2, 3] # CHANGED! 

Let's try it.

(3, 0) 

...Uh oh.

Fixing the logic

The largest value (3) seems correct. The second-largest value is completely wrong though. What's going on?

Let's work through what the function is doing.

  • When we start, largest is 0 and second_largest is also 0.
  • The first item in the list we look at is 1, so largest becomes 1.
  • The next item is 2, so largest becomes 2.

But what about second_largest?

When we assign a new value to largest, the largest value actually becomes second-largest. We need to show that in the code.

def two_largest(inlist):     """Return the two largest items in the sequence. The sequence must     contain at least two items."""     largest = 0     second_largest = 0     for item in inlist:         if item > largest:             second_largest = largest # NEW!             largest = item         elif largest > item > second_largest:             second_largest = item     # Return the results as a tuple     return largest, second_largest  # If we run this script, it will should find the two largest items and # print those if __name__ == "__main__":     inlist = [1, 2, 3]     print two_largest(inlist) 

Let's run it.

(3, 2) 

Fantastic.

Initializing variables, part 2

Now let's try it with a list of negative numbers.

    inlist = [-1, -2, -3] # CHANGED! 

Let's run it.

(0, 0) 

That's not right at all. Where did these zeroes come from?

It turns out that the starting values for largest and second_largest were actually larger than all the items in the list. The first thing you might consider is setting largest and second_largest to the lowest values possible in Python. Unfortunately, Python doesn't have a smallest possible value. That means that, even if you set both of them to -1,000,000,000,000,000,000, you can have a list of values smaller than that.

So what's the best thing to do? Let's try setting largest and second_largest to the first and second items in the list. Then, to avoid double-counting any items in the list, we only look at the part of the list after the second item.

def two_largest(inlist):     """Return the two largest items in the sequence. The sequence must     contain at least two items."""     largest = inlist[0] # CHANGED!     second_largest = inlist[1] # CHANGED!     # Only look at the part of inlist starting with item 2     for item in inlist[2:]: # CHANGED!         if item > largest:             second_largest = largest             largest = item         elif largest > item > second_largest:             second_largest = item     # Return the results as a tuple     return largest, second_largest  # If we run this script, it will should find the two largest items and # print those if __name__ == "__main__":     inlist = [-1, -2, -3]     print two_largest(inlist) 

Let's run it.

(-1, -2) 

Great! Let's try with another list of negative numbers.

    inlist = [-3, -2, -1] # CHANGED! 

Let's run it.

(-1, -3) 

Wait, what?

Initializing variables, part 3

Let's step through our logic again.

  • largest is set to -3
  • second_largest is set to -2

Wait right there. Already, this seems wrong. -2 is larger than -3. Is this what caused the problem? Let's continue.

  • largest is set to -1; second_largest is set to the old value of largest, which is -3

Yes, that looks to be the problem. We need to ensure that largest and second_largest are set correctly.

def two_largest(inlist):     """Return the two largest items in the sequence. The sequence must     contain at least two items."""     if inlist[0] > inlist[1]: # NEW         largest = inlist[0]         second_largest = inlist[1]     else: # NEW         largest = inlist[1] # NEW         second_largest = inlist[0] # NEW     # Only look at the part of inlist starting with item 2     for item in inlist[2:]:         if item > largest:             second_largest = largest             largest = item         elif largest > item > second_largest:             second_largest = item     # Return the results as a tuple     return largest, second_largest  # If we run this script, it will should find the two largest items and # print those if __name__ == "__main__":     inlist = [-3, -2, -1]     print two_largest(inlist) 

Let's run it.

(-1, -2) 

Excellent.

Conclusion

So here's the code, nicely commented and formatted. It's also had all the bugs I could find beaten from it. Enjoy.

However, assuming this really is a homework question, I hope you get some useful experience from seeing an imperfect piece of code slowly improved. I hope some of these techniques will be useful in future programming assignments.


Efficiency

Not very efficient. But for most purposes, it should be okay: on my computer (Core 2 Duo), a list of 100 000 items can be processed in 0.27 seconds (using timeit, averaged over 100 runs).

like image 33
Wesley Avatar answered Oct 12 '22 21:10

Wesley