I need some suggestions for building a good recommendation model using Spark's Collaborative Filtering. There is sample code on the official website, which I paste below:
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
# Load and parse the data
data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(','))\
.map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 10
model = ALS.train(ratings, rank, numIterations)
# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
RMSE = MSE**0.5
print("Root Mean Squared Error = " + str(RMSE))
A good model needs the RMSE to be as small as possible.
Is that because I did not set proper parameters for the
ALS.train
method, such as rank and numIterations? Or is it because my dataset is too small, making the RMSE big?
Could anyone help me figure out what causes the RMSE to be big and how to fix it?
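On the parameter question: a common way to choose rank and numIterations is a simple grid search, training on one split of the data and measuring RMSE on a held-out split. The sweep itself is framework-agnostic; here is a minimal plain-Python sketch, where train_fn and eval_fn are hypothetical stand-ins for a call to ALS.train on a training split and an RMSE computation on a validation split:

```python
def grid_search(train_fn, eval_fn, ranks, iteration_counts):
    """Try every (rank, iterations) pair and keep the one with the lowest error.

    train_fn(rank, iterations) -> model   (stand-in for ALS.train on a training split)
    eval_fn(model)             -> float   (stand-in for RMSE on a validation split)
    """
    best = None
    for rank in ranks:
        for iters in iteration_counts:
            err = eval_fn(train_fn(rank, iters))
            if best is None or err < best[0]:
                best = (err, rank, iters)
    return best  # (lowest error, best rank, best iterations)

# Toy usage: pretend the error falls as rank grows, with a floor.
fake_train = lambda rank, iters: (rank, iters)
fake_eval = lambda model: max(1.0 / model[0], 0.05)
print(grid_search(fake_train, fake_eval, [5, 10, 20], [10, 20]))  # → (0.05, 20, 10)
```

The key point is that the error must be measured on data the model did not train on; otherwise the search will simply favor the largest rank.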
Addition:
Just as @eliasah said, I need to add some detail to narrow the answer set. Let us consider this particular situation:
I want to build a recommendation system to recommend music to my clients. I have their historical ratings for tracks, albums, artists, and genres. Obviously, these four classes form a hierarchy: tracks belong directly to albums, albums belong directly to artists, and artists may belong to several different
genres. Finally, I want to use all of this information to choose some tracks to recommend to clients.
So, what is the best practice to build a good model for this situation and ensure the RMSE is as small as possible for prediction?
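One simple way to fold the hierarchy into a model like ALS, which only sees (user, item, rating) triples, is to blend the ratings a user gave at each level into a single track-level rating. The function and weights below are purely illustrative assumptions, not a tuned scheme:

```python
def blended_track_rating(track_r, album_r, artist_r, genre_r,
                         weights=(0.6, 0.2, 0.15, 0.05)):
    """Collapse the track/album/artist/genre hierarchy into one track rating.

    Each argument is the user's rating at that level, or None if missing;
    the weights are illustrative, not tuned.
    """
    pairs = [(r, w) for r, w in zip((track_r, album_r, artist_r, genre_r), weights)
             if r is not None]
    total_w = sum(w for _, w in pairs)
    return sum(r * w for r, w in pairs) / total_w

# A user rated the album 4 and the genre 5, but never the track itself;
# the blended rating leans toward the album because of its larger weight.
print(blended_track_rating(None, 4.0, None, 5.0))
```

The blended triples can then be fed to ALS.train as usual; renormalizing the weights over the levels that are present keeps the result on the original rating scale.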
Hence, both the numerator and the denominator of the MSE will increase, so you cannot conclude whether the MSE will monotonically increase or decrease as the sample size changes.
As a rule of thumb, RMSE values between 0.2 and 0.5 show that the model can predict the data relatively accurately. In addition, an Adjusted R-squared above 0.75 is a very good value for showing accuracy; in some cases an Adjusted R-squared of 0.4 or more is acceptable as well.
If the RMSE for the test set is much higher than that of the training set, it is likely that you have badly overfit the data, i.e. you have created a model that tests well in-sample but has little predictive value when tested out of sample.
Try playing with other input variables and compare your RMSE values. The smaller the RMSE value, the better the model. Also, compare the RMSE values of your training and testing data. If they are similar, your model is good.
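The train-versus-test comparison is just two RMSE computations over (actual, predicted) pairs. A minimal plain-Python sketch with made-up toy numbers:

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    return sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

# Toy pairs: the model fits the training data closely
# but misses badly on the held-out data.
train_pairs = [(4.0, 3.9), (2.0, 2.2), (5.0, 4.8)]
test_pairs = [(4.0, 3.0), (2.0, 3.5), (5.0, 3.8)]

print(rmse(train_pairs))  # small
print(rmse(test_pairs))   # much larger → a sign of overfitting
```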
As you mentioned above, as rank and numIterations increase, RMSE decreases, given the same dataset. However, as the dataset grows, RMSE increases.
Now, one practice used to decrease RMSE and similar measures is to normalize the values in ratings. In my experience, this works really well when you know the minimum and maximum rating values in advance.
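Assuming a known rating scale of 1 to 5 (the bounds here are an assumption for illustration), min-max normalization and its inverse might look like this in plain Python:

```python
R_MIN, R_MAX = 1.0, 5.0  # assumed known rating bounds

def normalize(r):
    """Map a rating from [R_MIN, R_MAX] to [0, 1] before training."""
    return (r - R_MIN) / (R_MAX - R_MIN)

def denormalize(x):
    """Map a predicted value in [0, 1] back to the original scale."""
    return x * (R_MAX - R_MIN) + R_MIN

print(normalize(3.0))    # → 0.5
print(denormalize(0.5))  # → 3.0
```

You would apply normalize to the rating field before calling ALS.train and denormalize to the model's predictions before reporting them.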
Also, consider using measures other than RMSE. When doing Matrix Factorization, what I found useful is to compute the Frobenius norm of (ratings - predictions) and divide it by the Frobenius norm of the ratings. This gives you the relative error of your predictions with respect to the original ratings.
Here's the Spark code for this method:
from math import sqrt
# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
# Frobenius norm of (ratings - predictions)
abs_frobenius_error = sqrt(ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).sum())
# frobenius error of original ratings
frob_error_orig = sqrt(ratings.map(lambda r: r[2]**2).sum())
# finally, the relative error
rel_error = abs_frobenius_error/frob_error_orig
print("Relative Error = " + str(rel_error))
In this error measure, the closer the error is to zero, the better your model is.
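On a pair of toy vectors the same quantity can be checked by hand without Spark, since the Frobenius norm of a vector is just the square root of the sum of squares:

```python
from math import sqrt

# Toy data standing in for the (rating, prediction) pairs in ratesAndPreds.
ratings = [4.0, 2.0, 5.0]
predictions = [3.8, 2.4, 4.9]

abs_err = sqrt(sum((r - p) ** 2 for r, p in zip(ratings, predictions)))
rel_err = abs_err / sqrt(sum(r ** 2 for r in ratings))
print(rel_err)  # a small value → predictions are close to the originals
```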
I hope this helps.
I did a little research on it; here is my conclusion:
When rank and the number of iterations grow, the RMSE decreases. However, when the size of the dataset grows, the RMSE increases. From the results above, rank changes the RMSE value more significantly.
I know this is not enough to get a good model. I'd welcome more ideas!