This question was originally a homework assignment I had, but my answer was wrong, and I'm curious what is the best solution for this problem.
The goal is to compute key aspects of the "Recommender System bootstrapping algorithm" using 4 map reduce steps. My problem is with the 3rd step, so I'll bring only its details.
input: records of the form:
1. (population id, item, number of rating users, sum of ratings, sum of ratings squared)
2. (population id, splitter item, likers/dislikers, item, number of rating users, sum of ratings, sum of ratings squared)
The 2nd form is pretty much like the 1st form, but a record for each (splitter,likers/dislikers) - where likers/dislikers is a boolean.
This means (I think) there are 2^|items| records of the seconds form for each record from the 1st form... (many classmates made the wrong (again, I think..) assumption that there are the same amount of 1st and 2nd form records)
Task description:
This step will compute, per splitter movie, the squared error (SE) induced by each movie.
Hint:
assume that there exists a string that precedes (in the system’s sort order) any splitter movie id.
This must be done within one mapreduce step!
additional background:
This was learned at the context of "The Netflix Challange"
SE definition:
EDIT: additional material concerning the problem [some description on the netflix challenge and mathematical information about the problem ] can be found in this link [slides 12-24 especially]
EDIT2: note that since we are using map/reduce, we cannot assume anything about the ORDER records will be processed [in both map and reduce].
I am not sure I understand your question.
What you ultimately want is SE(U). After some math details at slides 23 and 24, it is "trivially" computed with \sum_{i} SE(U)_i
You have understood by yourself that the 4th and last sept is a map reduce to get this sum.
The 3rd step is a map reduce to get (LaTeX style)
SE(U)_i = \sum_{u in U_i} (r_{u,i} - r_i)^2
In Python this might look like:
def map(Ui):
''' Ui is the list of user who have rated the film i'''
for user in Ui:
results.append((user,(r_{u,i} - r_i)^2))
def reduce(results):
''' Returns a final pair (item, SE(U)_i ) '''
return (item, sum([value for user,value in results]))
Edit: My original answer was incomplete. Let me expain again.
What you ultimately want is SE(U) for every splitter.
Step a prepares some useful data about items. The emitted entries are defined with:
key = (population_id, item)
value =
number: |U_i|,
sum_of_ratings: \sum_{u \ in U_i} r_{u,i}
sum_of_squared_ratings: \sum_{u \in U_i} r_{u,i} ^2
Now, for any given splitter movie M:
U_M = U_{+M} + U_{-M} + U_{M?}
Step b explicitly computes, for each splitter M, the statistics for the small sub-populations M+ and M-.
NB likers/dislikers is not a boolean per se, it is the sub-population identicator '+' or '-'
There are 2 new entries for each splitter item:
key = (population_id, item, M, '+')
value =
number: |U_i(+)|
sum_of_ratings: \sum_{u \ in U_i(+)} r_{u,i}
sum_of_squared_ratings: \sum_{u \in U_i(+)} r_{u,i} ^2
Same thing for '-'
Or if you like better the dis/likers notation
key = (population_id, item, M, dis/likers)
value =
number: |U_i(dis/likers)|
sum_of_ratings: \sum_{u \ in U_i(dis/likers)} r_{u,i}
sum_of_squared_ratings: \sum_{u \in U_i(dis/likers)} r_{u,i} ^2
cf Middle of slide 24
NB If you consider each film might be a splitter there are 2x |item|^2 items of the second form ; that's because item -> (boolean, item, splitter) -- which is far less than your 2^|item| evaluation taht you haven't explained.
Step c computes, for each splitter M, the estimated SE by each movie, i.e. SE(U_M)_i
Because a sum can be split accross its different members:
U_M = U_{+M} + U_{-M} + U_{M?}
SE(U_M)_i = SE(U_M?)_i + SE(U_+M) + SE(U_-M)
with SE(U_{+M})
explicitly computed with this map function:
def map(key, value):
'''
key = (population_id, item, M, dis/likers)
'''
value =
count: 1
dist: (r_u,i - r_i)^2
emit key, value
def reduce(key, values):
'''
This function explicitly computes the SE for dis/likers
key = (population_id, item, M, dis/likers)
value= count, dist
'''
emit key, sum(count, sum(dist))
Now all we need SE(U_{M?})_i
which is a "trivial" computation given in slide 24:
SE(?)_i = \sum_{u \in U_i(?)}{r_{u,i}^2} - (\sum r)^2 / |U_i(?)|
Of course, we are not going to do this big sums, but use the remark just above in the lecture, and the data already computed in step a (that's the conclusion I draw from slide 24 from the last 3 equations)
SE(?)_i = \sum_{u \in U_i}{r_{u,i}^2} - \sum_{u \in U_i('+'/'-')}{r_{u,i}^2} - (...)/ (|U_i| - |U_i('+'/'-'))
So this one is even not a Map/Reduce, it is just a finalize step:
def finalize(key, values):
for [k in keys if k match key]:
''' From all entries get
# from step a
key = (population_id, item) value=(nb_ratings, sum_ratings, sum_ratings_squared)
# from step b
key = (population_id, item, M, '+') value=(nb_ratings_likers, sum_ratings_likers, sum_ratings_squared_likers)
key = (population_id, item, M, '-') value=(nb_ratings_dislikers, sum_ratings_dislikers, sum_ratings_squared_dislikers)
# from step c
key = (population_id, item, M, '+') value=(se_likers)
key = (population_id, item, M, '-') value=(se_dislikers)
'''
se_other = sum_rating_squared - sum_ratings_squared_likers - sum_ratins_squared_dislikers - sum_ratings_likers / (nb_ratings - (nb_ratings_likers)) - sum_ratins_squared_dislikers - sum_ratings_likers / (nb_ratings - (nb_ratings_likers))
emit
key: (population_id, splitter, item)
value : se_likers + se_dislikers + se_other
Step d Finally, the last steps computes the SE for U_M. It is simply the sum of previous entries, and a trivual Map/Reduce:
For a splitter M:
SE(U_M) = \sum_i SE(U_M)_i = \sum_i SE(U_M?)_i + \sum_i SE(U_+M) + \sum_i SE(U_-M)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With