Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pythonic way to store top 10 results

Tags:

python

I'm working on a python project that runs for a couple hours before it's finished all it's calculations. I would like to hold the top 10 results of the calculation as it progresses.

There's the obvious way of doing:

if calc > highest_calc:
    second_calc = highest_calc
    highest_calc = calc
if calc < highest_calc and calc > second_calc:
    third_calc =  second_calc
    second_calc = calc
if calc < second_calc and calc > third_calc:
    fourth_calc = third_calc
    third_calc = calc
etc. 

But is there a better, more dynamic and pythonic way?

Bonus

For my project, each calcultion has three coresponding names with it: name_a name_b name_c . What I don't want is more then one of the top 10 values to have the same three names. But, if the last calc has the same names, I want to keep the highest of the two. What's the best way to do this?

For example, lets say 2.3 is the value of calc, using MCD SBUX and CAT to calculate calc. But what if I had already made a calc using MCD SBUX and CAT and it made it to the top to? How do I find the value of this calc so I can see if it is less than or greater then the new calc. If it is greater than, remove the old calc with the same and add the new calc. If it is less than, pass new calc. Hopefully that makes sense:

If name_a in top10 and name_b in top10 and name_c in top10:
   if calc > old_calc_with_same_names:
       add_calc = calc, name_a, name_b, name_c
       top10.insert(bisect.bisect(calc, top10[0]), add_calc)
else:
   add to top10

Finished Code

csc = []
top_ports = []
add_sharpe = [sharpe, name_a, weight_a, exchange_a, name_b, weight_b, exchange_b, name_c, weight_c, exchange_c]
    if init__calc == 0:
            csc.append(add_sharpe)
    if init__calc > 1:
        if name_a == prev_name_a and name_b == prev_name_b and name_c == prev_name_c:
            csc.append(add_sharpe)
        if name_a != prev_name_a or name_b != prev_name_b or name_c != prev_name_c:
            if csc:
                hs = max(csc, key=lambda x: x[0])
                if top_ports:
                    ls = min(top_ports, key=lambda x: x[0])
                    if hs[0] > ls[0]:
                        hsi = csc.index(hs)
                        top_ports.append(csc[hsi])
                else:
                    hsi = csc.index(hs)
                    top_ports.append(csc[hsi])
            csc = []
            csc.append(add_sharpe)

Later on in the script...

top_ports = sorted(top_ports, key=itemgetter(0), reverse=True)
print "The highest sharpe is: {0}".format(top_ports[0])
print " ==============================================="
print " ==============================================="
print datetime.now() - startTime
print "Second: {0}".format(top_ports[1])
print "Third: {0}".format(top_ports[2])
print "Fourth: {0}".format(top_ports[3])
print "Fifth: {0}".format(top_ports[4])

etc.

like image 756
thefoxrocks Avatar asked May 17 '15 08:05

thefoxrocks


People also ask

What are the benefits of lists in Python?

The benefit of this practice is eliminating the risk of accidentally interfering with either of these other use cases. Use the Python list * operator to create simple lists and nested lists as well: Sometimes we need to search through a collection. Let’s look at two options: lists and sets.

What is Python's counter from collections?

Python offers a bunch of tools and techniques you can use to approach this problem. However, Python’s Counter from collections provides a clean, efficient, and Pythonic solution. This dictionary subclass provides efficient counting capabilities out of the box.

How to print top n elements from a list in Python?

Method #1 : Using sorted () + lambda. The combination of above functionality can be used to perform this particular task. In this, we just employ sorted function with reverse flag true, and print the top N elements using list slicing. test_list = [ ('Manjeet', 10), ('Akshat', 4), ('Akash', 2), ('Nikhil', 8)] N = 2.

How do you know if a Python project is healthy?

(Once the project is in production, always make sure that you remove the commented code.) Run tests/add tests. Make sure that you have at least 80% test coverage. Use pytest or similar Python packages to track test coverage. Use Pylint /Vulture. Always consider running some type of linter over the code to see how “healthy” it is. Try to look for:


1 Answers

Use the heapq module. Instead of needlessly storing all results, at every step it adds the new result and then efficiently removes the lowest—which may be the one just added—effectively keeping the top 10. Storing all results is not necessarily bad though; it can be valuable to collect statistics, and make it easier to determine what to keep afterwards.

from heapq import heappush, heappushpop

heap = []
for x in [18, 85, 36, 57, 2, 45, 55, 1, 28, 73, 95, 38, 89, 15, 7, 61]:
    calculation_result = x + 1 # Dummy calculation
    if len(heap) < 10:
        heappush(heap, calculation_result)
    else:
        heappushpop(heap, calculation_result)

top10 = sorted(heap, reverse=True) # [96, 90, 86, 74, 62, 58, 56, 46, 39, 37]

Note that this module has more useful functions to only request the highest/lowest value, et cetera. This may help you to add the behavior concerning names.

Actually this construct is so common that it is available as heapq.nlargest. However, to not store all your results after all, you'd have to model the calculator as a generator, which is a bit more advanced.

from heapq import nlargest

def calculate_gen():
    for x in [18, 85, 36, 57, 2, 45, 55, 1, 28, 73, 95, 38, 89, 15, 7, 61]:
        yield x + 1 # Dummy calculation

top10 = nlargest(10, calculate_gen()) # [96, 90, 86, 74, 62, 58, 56, 46, 39, 37]

Bonus

Here is some idea to make the results unique for each combination of associated names.

Using a heap is not going to cut it anymore, because a heap is not good at locating any item that is not the absolute minimum/maximum, and what we are interested in here is some kind of local minimum given the criteria of a name combination.

Instead, you can use a dict to keep the highest value for each name combination. First you need to encode the name combination as an immutable value for it to work as a key, and because the order of the names shouldn't matter, decide on some order and stick with it. I'm going with alphabetical strings to keep it simple.

In the code below, each result is put in the dict at a place that is unique for its name combination—therefore normalization might be needed—as long as there isn't a better result already. Later the top n is compiled from the highest results for each combination.

from heapq import nlargest

calculations = [('ABC', 18), ('CDE', 85), ('BAC', 36), ('CDE', 57),
                ('ECD',  2), ('BAD', 45), ('EFG', 55), ('DCE',  1)]

highest_per_name_combi = dict()

for name_combi, value in calculations:
    normal_name_combi = ''.join(sorted(name_combi)) # Slow solution
    current = highest_per_name_combi.get(normal_name_combi, float('-inf'))
    highest_per_name_combi[normal_name_combi] = max(value, current)

top3 = nlargest(3, highest_per_name_combi.iteritems(), key=lambda x: x[1])

The only problem with this approach might be the amount of memory used. Since with 150 names there can be 551300 (150 choose 3) combinations, you may have to decide to clean up the dict every now and then, which is simple. In the loop, check for the size of the dict and if it exceeds some (still large) number, compose the current top n and create a new, minimal dict from it. Also, some micro optimizations could be applied by reducing the number of lookups/calls, e.g. not using get and/or max.

All of this would be a lot easier if you'd have control over the order in which calculations are performed. If you'd know that the next 1000 calculations are all for the same name combination, you could just find the best of those first before adding it to the overall results.

Also, with a truly massive amount of results, the simplest way may actually be the best. Just write them to a file in a convenient format, sort them there (first by name combination, then reversely by value), take only the first occurrence for each name combination (easy when they are grouped) and sort the result again, just by value.

like image 126
Thijs van Dien Avatar answered Nov 06 '22 02:11

Thijs van Dien