Pythonic way to store top 10 results

Tags:

python

I'm working on a python project that runs for a couple hours before it's finished all it's calculations. I would like to hold the top 10 results of the calculation as it progresses.

There's the obvious way of doing:

if calc > highest_calc:
    second_calc = highest_calc
    highest_calc = calc
if calc < highest_calc and calc > second_calc:
    third_calc =  second_calc
    second_calc = calc
if calc < second_calc and calc > third_calc:
    fourth_calc = third_calc
    third_calc = calc
etc.

But is there a better, more dynamic and pythonic way?

Bonus

For my project, each calcultion has three coresponding names with it: name_a name_b name_c . What I don't want is more then one of the top 10 values to have the same three names. But, if the last calc has the same names, I want to keep the highest of the two. What's the best way to do this?

For example, lets say 2.3 is the value of calc, using MCD SBUX and CAT to calculate calc. But what if I had already made a calc using MCD SBUX and CAT and it made it to the top to? How do I find the value of this calc so I can see if it is less than or greater then the new calc. If it is greater than, remove the old calc with the same and add the new calc. If it is less than, pass new calc. Hopefully that makes sense:

If name_a in top10 and name_b in top10 and name_c in top10:
   if calc > old_calc_with_same_names:
       add_calc = calc, name_a, name_b, name_c
       top10.insert(bisect.bisect(calc, top10[0]), add_calc)
else:
   add to top10

Finished Code

csc = []
top_ports = []
add_sharpe = [sharpe, name_a, weight_a, exchange_a, name_b, weight_b, exchange_b, name_c, weight_c, exchange_c]
    if init__calc == 0:
            csc.append(add_sharpe)
    if init__calc > 1:
        if name_a == prev_name_a and name_b == prev_name_b and name_c == prev_name_c:
            csc.append(add_sharpe)
        if name_a != prev_name_a or name_b != prev_name_b or name_c != prev_name_c:
            if csc:
                hs = max(csc, key=lambda x: x[0])
                if top_ports:
                    ls = min(top_ports, key=lambda x: x[0])
                    if hs[0] > ls[0]:
                        hsi = csc.index(hs)
                        top_ports.append(csc[hsi])
                else:
                    hsi = csc.index(hs)
                    top_ports.append(csc[hsi])
            csc = []
            csc.append(add_sharpe)

Later on in the script...

top_ports = sorted(top_ports, key=itemgetter(0), reverse=True)
print "The highest sharpe is: {0}".format(top_ports[0])
print " ==============================================="
print " ==============================================="
print datetime.now() - startTime
print "Second: {0}".format(top_ports[1])
print "Third: {0}".format(top_ports[2])
print "Fourth: {0}".format(top_ports[3])
print "Fifth: {0}".format(top_ports[4])

etc.

756

asked May 17 '15 08:05

thefoxrocks

1 Answers

Use the heapq module. Instead of needlessly storing all results, at every step it adds the new result and then efficiently removes the lowest—which may be the one just added—effectively keeping the top 10. Storing all results is not necessarily bad though; it can be valuable to collect statistics, and make it easier to determine what to keep afterwards.

from heapq import heappush, heappushpop

heap = []
for x in [18, 85, 36, 57, 2, 45, 55, 1, 28, 73, 95, 38, 89, 15, 7, 61]:
    calculation_result = x + 1 # Dummy calculation
    if len(heap) < 10:
        heappush(heap, calculation_result)
    else:
        heappushpop(heap, calculation_result)

top10 = sorted(heap, reverse=True) # [96, 90, 86, 74, 62, 58, 56, 46, 39, 37]

Note that this module has more useful functions to only request the highest/lowest value, et cetera. This may help you to add the behavior concerning names.

Actually this construct is so common that it is available as heapq.nlargest. However, to not store all your results after all, you'd have to model the calculator as a generator, which is a bit more advanced.

from heapq import nlargest

def calculate_gen():
    for x in [18, 85, 36, 57, 2, 45, 55, 1, 28, 73, 95, 38, 89, 15, 7, 61]:
        yield x + 1 # Dummy calculation

top10 = nlargest(10, calculate_gen()) # [96, 90, 86, 74, 62, 58, 56, 46, 39, 37]

Bonus

Here is some idea to make the results unique for each combination of associated names.

Using a heap is not going to cut it anymore, because a heap is not good at locating any item that is not the absolute minimum/maximum, and what we are interested in here is some kind of local minimum given the criteria of a name combination.

Instead, you can use a dict to keep the highest value for each name combination. First you need to encode the name combination as an immutable value for it to work as a key, and because the order of the names shouldn't matter, decide on some order and stick with it. I'm going with alphabetical strings to keep it simple.

In the code below, each result is put in the dict at a place that is unique for its name combination—therefore normalization might be needed—as long as there isn't a better result already. Later the top n is compiled from the highest results for each combination.

from heapq import nlargest

calculations = [('ABC', 18), ('CDE', 85), ('BAC', 36), ('CDE', 57),
                ('ECD',  2), ('BAD', 45), ('EFG', 55), ('DCE',  1)]

highest_per_name_combi = dict()

for name_combi, value in calculations:
    normal_name_combi = ''.join(sorted(name_combi)) # Slow solution
    current = highest_per_name_combi.get(normal_name_combi, float('-inf'))
    highest_per_name_combi[normal_name_combi] = max(value, current)

top3 = nlargest(3, highest_per_name_combi.iteritems(), key=lambda x: x[1])

The only problem with this approach might be the amount of memory used. Since with 150 names there can be 551300 (150 choose 3) combinations, you may have to decide to clean up the dict every now and then, which is simple. In the loop, check for the size of the dict and if it exceeds some (still large) number, compose the current top n and create a new, minimal dict from it. Also, some micro optimizations could be applied by reducing the number of lookups/calls, e.g. not using get and/or max.

All of this would be a lot easier if you'd have control over the order in which calculations are performed. If you'd know that the next 1000 calculations are all for the same name combination, you could just find the best of those first before adding it to the overall results.

Also, with a truly massive amount of results, the simplest way may actually be the best. Just write them to a file in a convenient format, sort them there (first by name combination, then reversely by value), take only the first occurrence for each name combination (easy when they are grouped) and sort the result again, just by value.

126

answered Nov 06 '22 02:11

Thijs van Dien

Related questions
                            
                                How can I escape the format string?
                            
                                Can we draw digital waveform graph with Pyplot in python or Matlab?
                            
                                Using the same name for variables which are in different functions?
                            
                                Format number number with specific mask regex python
                            
                                My admin.TabularInline class returns exception: object has no attribute 'urls'
                            
                                Dictionary with lists as values - find longest list
                            
                                Pass a function as a variable with one input fixed
                            
                                Should I deploy only the .pyc files on server if I worry about code security?
                            
                                Change Flask-Babel locale outside of request context for scheduled tasks
                            
                                download images with google custom search api
                            
                                How to initialise a 2D array in Python?
                            
                                Python Pandas merge samed name columns in a dataframe
                            
                                SqlAlchemy. TypeError: filter_by() takes exactly 1 argument (2 given)
                            
                                Generator function for prime numbers [duplicate]
                            
                                Generating all DNA kmers with Python
                            
                                Auto increament the invoice number in django backend for new invoice
                            
                                How to append to Python list in a dict without having to initialize the list?
                            
                                Error Installing scikit-learn
                            
                                Get rid of grey background in python matplotlib bar chart
                            
                                Why does running my Python script start taking a screenshot? [closed]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With