Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

compute bootstrapping algorithm using Map/Reduce

This question was originally a homework assignment I had, but my answer was wrong, and I'm curious what is the best solution for this problem.

The goal is to compute key aspects of the "Recommender System bootstrapping algorithm" using 4 map reduce steps. My problem is with the 3rd step, so I'll bring only its details.


input: records of the form:
1. (population id, item, number of rating users, sum of ratings, sum of ratings squared)
2. (population id, splitter item, likers/dislikers, item, number of rating users, sum of ratings, sum of ratings squared)

The 2nd form is pretty much like the 1st form, but a record for each (splitter,likers/dislikers) - where likers/dislikers is a boolean.

This means (I think) there are 2^|items| records of the seconds form for each record from the 1st form... (many classmates made the wrong (again, I think..) assumption that there are the same amount of 1st and 2nd form records)

Task description:

This step will compute, per splitter movie, the squared error (SE) induced by each movie.

  • Output: records of the form (population id, splitter item, item, squared error on item given a split on the splitter).

Hint:

assume that there exists a string that precedes (in the system’s sort order) any splitter movie id.

This must be done within one mapreduce step!

additional background:
This was learned at the context of "The Netflix Challange"

SE definition: SE definition

EDIT: additional material concerning the problem [some description on the netflix challenge and mathematical information about the problem ] can be found in this link [slides 12-24 especially]

EDIT2: note that since we are using map/reduce, we cannot assume anything about the ORDER records will be processed [in both map and reduce].

like image 268
amit Avatar asked Mar 22 '11 14:03

amit


1 Answers

I am not sure I understand your question.

What you ultimately want is SE(U). After some math details at slides 23 and 24, it is "trivially" computed with \sum_{i} SE(U)_i

You have understood by yourself that the 4th and last sept is a map reduce to get this sum.

The 3rd step is a map reduce to get (LaTeX style)

SE(U)_i = \sum_{u in U_i} (r_{u,i} - r_i)^2

enter image description here

  • The reduce function sums over u in U_i
  • The map function splits the terms to be summed

In Python this might look like:

def map(Ui):
    ''' Ui is the list of user who have rated the film i'''
    for user in Ui:
        results.append((user,(r_{u,i} - r_i)^2))

def reduce(results):
    ''' Returns a final pair (item, SE(U)_i ) '''
    return (item, sum([value for user,value in results]))

Edit: My original answer was incomplete. Let me expain again.

What you ultimately want is SE(U) for every splitter.

Step a prepares some useful data about items. The emitted entries are defined with:

key = (population_id, item)
value =
    number: |U_i|,
    sum_of_ratings: \sum_{u \ in U_i} r_{u,i} 
    sum_of_squared_ratings: \sum_{u \in U_i} r_{u,i} ^2
  • The map function explodes the statistics over the items.
  • The reduce functions computes the sums.

Now, for any given splitter movie M:

U_M = U_{+M} + U_{-M} + U_{M?}

Step b explicitly computes, for each splitter M, the statistics for the small sub-populations M+ and M-.

NB likers/dislikers is not a boolean per se, it is the sub-population identicator '+' or '-'

There are 2 new entries for each splitter item:

key = (population_id, item, M, '+') 
value = 
    number: |U_i(+)|
    sum_of_ratings: \sum_{u \ in U_i(+)} r_{u,i}
    sum_of_squared_ratings: \sum_{u \in U_i(+)} r_{u,i} ^2

Same thing for '-'

Or if you like better the dis/likers notation

key = (population_id, item, M, dis/likers) 
value = 
    number: |U_i(dis/likers)|
    sum_of_ratings: \sum_{u \ in U_i(dis/likers)} r_{u,i}
    sum_of_squared_ratings: \sum_{u \in U_i(dis/likers)} r_{u,i} ^2

cf Middle of slide 24

NB If you consider each film might be a splitter there are 2x |item|^2 items of the second form ; that's because item -> (boolean, item, splitter) -- which is far less than your 2^|item| evaluation taht you haven't explained.

Step c computes, for each splitter M, the estimated SE by each movie, i.e. SE(U_M)_i

Because a sum can be split accross its different members:

U_M = U_{+M} + U_{-M} + U_{M?}

SE(U_M)_i = SE(U_M?)_i + SE(U_+M) + SE(U_-M)

with SE(U_{+M}) explicitly computed with this map function:

def map(key, value):
    '''     
    key = (population_id, item, M, dis/likers) 
    '''
    value = 
        count: 1
        dist: (r_u,i - r_i)^2

    emit key, value

def reduce(key, values):
    ''' 
    This function explicitly computes the SE for dis/likers
    key = (population_id, item, M, dis/likers)
    value= count, dist
    '''
    emit key, sum(count, sum(dist))

Now all we need SE(U_{M?})_i which is a "trivial" computation given in slide 24:

SE(?)_i = \sum_{u \in U_i(?)}{r_{u,i}^2} - (\sum r)^2 / |U_i(?)|

Of course, we are not going to do this big sums, but use the remark just above in the lecture, and the data already computed in step a (that's the conclusion I draw from slide 24 from the last 3 equations)

SE(?)_i = \sum_{u \in U_i}{r_{u,i}^2} - \sum_{u \in U_i('+'/'-')}{r_{u,i}^2} - (...)/ (|U_i| - |U_i('+'/'-'))

So this one is even not a Map/Reduce, it is just a finalize step:

def finalize(key, values):
    for [k in keys if k match key]:
        ''' From all entries get
        # from step a
        key = (population_id, item) value=(nb_ratings, sum_ratings, sum_ratings_squared)
        # from step b
        key = (population_id, item, M, '+') value=(nb_ratings_likers, sum_ratings_likers, sum_ratings_squared_likers)
        key = (population_id, item, M, '-') value=(nb_ratings_dislikers, sum_ratings_dislikers, sum_ratings_squared_dislikers)
        # from step c
        key = (population_id, item, M, '+') value=(se_likers)
        key = (population_id, item, M, '-') value=(se_dislikers)
        '''
        se_other = sum_rating_squared - sum_ratings_squared_likers  - sum_ratins_squared_dislikers - sum_ratings_likers / (nb_ratings -  (nb_ratings_likers)) - sum_ratins_squared_dislikers  - sum_ratings_likers / (nb_ratings -  (nb_ratings_likers))
        emit
            key: (population_id, splitter, item)
            value : se_likers + se_dislikers + se_other

Step d Finally, the last steps computes the SE for U_M. It is simply the sum of previous entries, and a trivual Map/Reduce:

For a splitter M:

SE(U_M) = \sum_i SE(U_M)_i = \sum_i SE(U_M?)_i + \sum_i SE(U_+M) + \sum_i SE(U_-M)
like image 132
rds Avatar answered Sep 28 '22 11:09

rds