As part of a product recommendation engine, I'm trying to segment my users based on their product preferences starting with using the k-means clustering algorithm.
My data is a dictionary of the form:
prefs = {
'user_id_1': { 1L: 3.0f, 2L: 1.0f, },
'user_id_2': { 4L: 1.0f, 8L: 1.5f, },
}
where the product ids are the longs, and ratings are floats. the data is sparse. I currently have about 60,000 users, most of whom have only rated a handful of products. The dictionary of values for each user is implemented using a defaultdict(float) to simplify the code.
My implementation of k-means clustering is as follows:
def kcluster(prefs,sim_func=pearson,k=100,max_iterations=100):
from collections import defaultdict
users = prefs.keys()
centroids = [prefs[random.choice(users)] for i in range(k)]
lastmatches = None
for t in range(max_iterations):
print 'Iteration %d' % t
bestmatches = [[] for i in range(k)]
# Find which centroid is closest for each row
for j in users:
row = prefs[j]
bestmatch=(0,0)
for i in range(k):
d = simple_pearson(row,centroids[i])
if d < bestmatch[1]: bestmatch = (i,d)
bestmatches[bestmatch[0]].append(j)
# If the results are the same as last time, this is complete
if bestmatches == lastmatches: break
lastmatches=bestmatches
centroids = [defaultdict(float) for i in range(k)]
# Move the centroids to the average of their members
for i in range(k):
len_best = len(bestmatches[i])
if len_best > 0:
items = set.union(*[set(prefs[u].keys()) for u in bestmatches[i]])
for user_id in bestmatches[i]:
row = prefs[user_id]
for m in items:
if row[m] > 0.0: centroids[i][m]+=(row[m]/len_best)
return bestmatches
As far as I can tell, the algorithm is handling the first part (assigning each user to its nearest centroid) fine.
The problem is when doing the next part, taking the average rating for each product in each cluster and using these average ratings as the centroids for the next pass.
Basically, before it's even managed to do the calculations for the first cluster (i=0), the algorithm bombs out with a MemoryError at this line:
if row[m] > 0.0: centroids[i][m]+=(row[m]/len_best)
Originally the division step was in a seperate loop, but fared no better.
This is the exception I get:
malloc: *** mmap(size=16777216) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Any help would be greatly appreciated.
Thanks to the help recieved here, this is my fixed algorithm. If you spot anything blatantly wrong please add a comment.
First, the simple_pearson implementation
def simple_pearson(v1,v2):
si = [val for val in v1 if val in v2]
n = len(si)
if n==0: return 0.0
sum1 = 0.0
sum2 = 0.0
sum1_sq = 0.0
sum2_sq = 0.0
p_sum = 0.0
for v in si:
sum1+=v1[v]
sum2+=v2[v]
sum1_sq+=pow(v1[v],2)
sum2_sq+=pow(v2[v],2)
p_sum+=(v1[v]*v2[v])
# Calculate Pearson score
num = p_sum-(sum1*sum2/n)
temp = (sum1_sq-pow(sum1,2)/n) * (sum2_sq-pow(sum2,2)/n)
if temp < 0.0:
temp = -temp
den = sqrt(temp)
if den==0: return 1.0
r = num/den
return r
A simple method to turn simple_pearson into a distance value:
def distance(v1,v2):
return 1.0-simple_pearson(v1,v2)
And finally, the k-means clustering implementation:
def kcluster(prefs,k=21,max_iterations=50):
from collections import defaultdict
users = prefs.keys()
centroids = [prefs[u] for u in random.sample(users, k)]
lastmatches = None
for t in range(max_iterations):
print 'Iteration %d' % t
bestmatches = [[] for i in range(k)]
# Find which centroid is closest for each row
for j in users:
row = prefs[j]
bestmatch=(0,2.0)
for i in range(k):
d = distance(row,centroids[i])
if d <= bestmatch[1]: bestmatch = (i,d)
bestmatches[bestmatch[0]].append(j)
# If the results are the same as last time, this is complete
if bestmatches == lastmatches: break
lastmatches=bestmatches
centroids = [defaultdict(float) for i in range(k)]
# Move the centroids to the average of their members
for i in range(k):
len_best = len(bestmatches[i])
if len_best > 0:
for user_id in bestmatches[i]:
row = prefs[user_id]
for m in row:
centroids[i][m]+=row[m]
for key in centroids[i].keys():
centroids[i][key]/=len_best
# We may have made the centroids quite dense which significantly
# slows down subsequent iterations, so we delete values below a
# threshold to speed things up
if centroids[i][key] < 0.001:
del centroids[i][key]
return centroids, bestmatches
Not all these observations are directly relevant to your issues as expressed, but..:
a. why are the key in prefs, as shown, longs? unless you have billions of users, simple ints will be fine and save you a little memory.
b. your code:
centroids = [prefs[random.choice(users)] for i in range(k)]
can give you repeats (two identical centroids), which in turn would not make the K-means algorithm happy. Just use the faster and more solid
centroids = [prefs[u] for random.sample(users, k)]
c. in your code as posted you're calling a function simple_pearson
which you never define anywhere; I assume you mean to call sim_func
, but it's really hard to help on different issues while at the same time having to guess how the code you posted differs from any code that might actually be working
d. one more indication that this posted code may be different from anything that might actually work: you set bestmatch=(0,0)
but then test with if d < bestmatch[1]:
-- how is the test ever going to succeed? is the distance function returning negative values?
e. the point of a defaultdict is that just accessing row[m]
magically adds an item to row
at index m
(with the value obtained by calling the defaultdict's factory, here 0.0). That item will then take up memory forevermore. You absolutely DON'T need this behavior, and therefore your code:
row = prefs[user_id]
for m in items:
if row[m] > 0.0: centroids[i][m]+=(row[m]/len_best)
is wasting huge amount of memory, making prefs
into a dense matrix (mostly full of 0.0 values) from the sparse one it used to be. If you code instead
row = prefs[user_id]
for m in row:
centroids[i][m]+=(row[m]/len_best)
there will be no growth in row
and therefore in prefs
because you're looping over the keys that row
already has.
There may be many other such issues, major like the last one or minor ones -- as an example of the latter,
f. don't divide a bazillion times by len_best
: compute its inverse one outside the loop and multiply by that inverse -- also you don't need to do that multiplication inside the loop, you can do it at the end in a separate since it's the same value that's multiplying every item -- this saves no memory but avoids wantonly wasting CPU time;-). OK, these are two minor issues, I guess, not just one;-).
As I mentioned there may be many others, but with the density of issues already shown by these six (or seven), plus the separate suggestion already advanced by S.Lott (which I think would not fix your main out-of-memory problem, since his code still addressing the row
defaultdict by too many keys it doesn't contain), I think it wouldn't be very productive to keep looking for even more -- maybe start by fixing these ones and if problems persist post a separate question about those...?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With