Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python cartesian product of n lists with n unknown at coding time

Tags:

python

Question


  • what is the best way to generate a cartesian product of some lists, not knowing in advance how many lists there are?

You can stop reading here if you like.


Background

I don't have money for school so I am trying to teach myself some programming using the Internet whilst working night shifts at a highway tollbooth. I have decided to try to solve some "programming challenge" problems as an exercise.

Programming assignment

Here's the problem I am trying to tackle, property of TopCoder:

http://community.topcoder.com/stat?c=problem_statement&pm=3496

I will not copy and paste the full description to respect their copyright notice but I am assuming I can summarise it, provided I don't use pieces of it verbatim (IANAL though).

Summary

If a "weighted sum" of historical stock prices is the sum of addenda obtained by multiplying a subset of these prices by an equal number of "weighting" factors, provided the latter add up to 1.0 and are chosen from the given set of valid values [-1.0, -0.9, ..., 0.9, 1.0], use this formula on all historical data supplied as an argument to your function, examining 5 prices at a time, predicting the next price and returning the permutation of "weighting factors" that yields the lowest average prediction error. There will be at least 6 stock prices in each run so at least one prediction is guaranteed, final results should be accurate within 1E-9.

Test data

Format:

  • One row for input data, in list format
  • One row for the expected result
  • One empty row as a spacer

Download from:

  • http://paste.ubuntu.com/1229857/

My solution


import itertools

# For a permutation of factors to be used in a weighted sum, it should be chosen
# such than the sum of all factors is 1.
WEIGHTED_SUM_TOTAL = 1.0
FACTORS_CAN_BE_USED_IN_WEIGHTED_SUM = lambda x: sum(x) == WEIGHTED_SUM_TOTAL

# Historical stock price data should be examined using a sliding window of width
# 5 when making predictions about the next price.
N_RECENT_PRICES = 5

# Valid values for weighting factors are: [-1.0, -0.9, ..., 0.9, 1.0]
VALID_WEIGHTS = [x / 10. for x in range(-10, 11)]

# A pre-calculated list of valid weightings to consider. This is the cartesiant
# product of the set of valid weigths considering only the combinations which
# are valid as components of a weighted sum.
CARTESIAN_PRODUCT_FACTORS = [VALID_WEIGHTS] * N_RECENT_PRICES
ALL_PERMUTATIONS_OF_WEIGHTS = itertools.product(*CARTESIAN_PRODUCT_FACTORS)
WEIGHTED_SUM_WEIGHTS = filter(FACTORS_CAN_BE_USED_IN_WEIGHTED_SUM,
                              ALL_PERMUTATIONS_OF_WEIGHTS)

# Generator function to get sliding windows of a given width from a data set
def sliding_windows(data, window_width):

  for i in range(len(data) - window_width):
    yield data[i:i + window_width], data[i + window_width]

def avg_error(data):

  # The supplied data will guarantee at least one iteration
  n_iterations = len(data) - 5

  best_average_error = None

  # Consider each valid weighting (e.g. permutation of weights)
  for weighting in WEIGHTED_SUM_WEIGHTS:

    # Keep track of the prediction errors for this weighting
    errors_for_this_weighting = []

    for historical_data, next_to_predict in sliding_windows(data,
                                                            N_RECENT_PRICES):

      prediction = sum([a * b for a, b in zip(weighting, historical_data)])
      errors_for_this_weighting.append(abs(next_to_predict - prediction))

    average_error = sum(errors_for_this_weighting) / n_iterations

    if average_error == 0: return average_error

    best_average_error = (average_error if not best_average_error else
      min(average_error, best_average_error))

  return best_average_error

def main():
  with open('data.txt') as input_file:
    while True:
        data = eval(input_file.readline())
        expected_result = eval(input_file.readline())
        spacer = input_file.readline()
        if not spacer:
          break
        result = avg_error(data)
        print expected_result, result, (expected_result - result) < 1e-9

if __name__ == '__main__':
    main()

My question

I am not asking for a code review of my solution because this would be the wrong StackExchange forum for that. I would post my solution to "Code Review" in that case.

My question instead is small, precise and unambiguous, fitting this site's format (hopefully).

In my code I generate a cartesian product of lists using itertools. In essence, I do not solve the crux of the problem myself but delegate the solution to a library that does that for me. I think this is the wrong approach to take if I want to learn from doing these exercises. I should be doing the hard part myself, otherwise why do the exercise at all? So I would like to ask you:


  • what is the best way to generate a cartesian product of some lists, not knowing in advance how many lists there are?

That's all I'd like to know, you can critique my code if you like. That's welcome, even if it passes all the tests (there is always a better way of doing things, especially if you are a beginner like me) but for this question to be "just right" for SO, I am focussing on just one aspect of the code, a concrete problem I have and something I am not happy with. Let me tell you more, I'll also share the canonical "what have you tried already"...

Clearly if I knew the number of lists I could just type in some nested for loops, like the top solvers of this exercise did in the competition. I tried writing a function that does this for an unknown number of lists but I was not sure which approach to take. The first approach was to write a recursive function. From list 1, take element 1 and combine it with element 1 of list 2, then element 1 of list 3, etc. I would push onto a stack the elements from each "layer" and pop them upon reaching the desired depth. I would imagine I would not fear a "stack overflow" because the depth reachable would be reasonable. I then struggled to choose a data structure to do this in the most efficient (memory/space) way possible, without passing too many parameters to the recursive calls. Should the data structure exist outside of the calls? Be passed around in the calls? Could I achieve any level of parallelism? How? With so many questions and so few answers I realised I needed to just know more to solve this problem and I could use a nudge in the right direction. You could provide a code snippet and I would study it. Or just explain to me what is the right "Computer Science" way of handling this type of problem. I am sure there is something I am not considering.

Finally, the thing that I did consider in my solution above is that thankfully filter filters a generator so the full cartesian product is never kept in memory (as it would if I did a list(ALL_PERMUTATIONS_OF_WEIGHTS) at any time in the code) so I am occupying space in memory only for those combinations which can actually be used as a weighted sum. A similar caution would be nice if applied to whatever system allows me to generate the cartesian product without using itertools.

like image 720
Robottinosino Avatar asked Sep 30 '12 01:09

Robottinosino


People also ask

How do you find the Cartesian product of a list in Python?

Practical Data Science using Python We have to find Cartesian product of these two lists. As we know if two lists are like (a, b) and (c, d) then the Cartesian product will be {(a, c), (a, d), (b, c), (b, d)}. To do this we shall use itertools library and use the product() function present in this library.

How do you write a Cartesian product in Python?

Introducing The Cartesian Product / Cross Product Of A Set The elements (a,b) are ordered pairs. For example if A = {1,2} and B = {4,5,6} then the cartesian products of A and B is AxB = {(1,4),(1,5),(1,6),(2,4),(2,5),(2,6)}.


1 Answers

Think of how numbers are written (in the decimal system, or in any other system). Include zero's even if you don't need them:

00000
00001
00002
...
00009
00010
00011
00012
...
99998
99999

You can see how this looks like a cartesian product of 5 lists list(range(10)) (in this particular case). You can generate this output very easily by incrementing the "lowest" digit, and when it reaches the last one in the list, setting it to the first element and incrementing the "next highest" digit. You would still need for loops, of course, but a very small number. Use a similar approach when you work with arbitrary number of arbitrary lists.

For example, if you have 3 lists: ['a', 'b', 'c'], ['x', 'y'], ['1', '2'], you'll get:

ax1
ax2
ay1
ay2
bx1
bx2
by1
by2
cy1
cy2
cx1
cx2

Good luck!

EDIT:

If you like, here's a sample code to do this. I don't recursion just to show how simple this is. Recursion, of course, is also a great approach.

def lex_gen(bounds):
    elem = [0] * len(bounds)
    while True:
        yield elem
        i = 0
        while elem[i] == bounds[i] - 1:
            elem[i] = 0
            i += 1
            if i == len(bounds):
                raise StopIteration
        elem[i] += 1

def cart_product(lists):
    bounds = [len(lst) for lst in lists]
    for elem in lex_gen(bounds):
        yield [lists[i][elem[i]] for i in range(len(lists))]


for k in cart_product([['1', '2'], ['x', 'y'], ['a', 'b', 'c']]):
    print(k)
like image 164
max Avatar answered Nov 10 '22 21:11

max