How can I reduce RMSE (root mean square error) when using ALS in Spark?

I need some suggestions for building a good recommendation model with Spark's collaborative filtering. There is sample code on the official website, which I have pasted below:

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

# Load and parse the data
data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(','))\
   .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 10
model = ALS.train(ratings, rank, numIterations)

# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
RMSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()**.5
print("Root Mean Squared Error = " + str(RMSE))

A good model needs the RMSE to be as small as possible.

Is the RMSE large because I have not set proper parameters for the ALS.train method, such as rank, numIterations, and so on?

Or is the RMSE large because my dataset is small?

Could anyone help me figure out what causes a large RMSE and how to fix it?

Addition:

As @eliasah said, I need to add some detail to narrow the answer set. Let us consider this particular situation:

Suppose I want to build a recommendation system that recommends music to my clients. I have their rating history for tracks, albums, artists, and genres. These four classes form a hierarchy: tracks belong directly to albums, albums belong directly to artists, and artists may belong to several different genres. Finally, I want to use all of this information to choose some tracks to recommend to clients.

So, what is the best practice for building a good model in this situation while keeping the prediction RMSE as small as possible?

asked Apr 12 '16 13:04

sydridgm

2 Answers

As you mentioned, RMSE decreases as rank and numIterations increase, given the same dataset. However, as the dataset grows, RMSE increases.
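Note that the RMSE in the question is computed on the training data, so a higher rank will tend to lower it even when the model is overfitting. A minimal sketch of comparing settings on a held-out split instead is below. The helper name `pick_best_params` and the callable `validation_rmse` are hypothetical, introduced only for illustration; the commented part assumes the `ratings` RDD from the question and uses `randomSplit` and the `lambda_` argument of `ALS.train`.

```python
import itertools

def pick_best_params(ranks, lambdas, validation_rmse):
    """Return (best_rmse, best_rank, best_lambda) over the grid.

    validation_rmse is a hypothetical callable (rank, lam) -> float that
    trains on the training split and scores on the held-out split.
    """
    results = [
        (validation_rmse(rank, lam), rank, lam)
        for rank, lam in itertools.product(ranks, lambdas)
    ]
    # Tuples compare by their first element first, so min() picks the lowest RMSE.
    return min(results)

# With Spark available, validation_rmse could be wired up roughly like this
# (assumes the `ratings` RDD of Rating objects from the question):
#
#   train, val = ratings.randomSplit([0.8, 0.2], seed=42)
#   val_pairs = val.map(lambda r: (r[0], r[1]))
#   val_truth = val.map(lambda r: ((r[0], r[1]), r[2]))
#
#   def validation_rmse(rank, lam):
#       model = ALS.train(train, rank, iterations=10, lambda_=lam)
#       preds = model.predictAll(val_pairs).map(lambda r: ((r[0], r[1]), r[2]))
#       return val_truth.join(preds).map(
#           lambda kv: (kv[1][0] - kv[1][1]) ** 2).mean() ** 0.5
```

The grid values themselves (which ranks and regularization strengths to try) depend on the dataset; the ones you pass in are your own choice, not a recommendation.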

Now, one practice that decreases RMSE and similar measures is to normalize the rating values. In my experience, this works really well when you know the minimum and maximum rating values in advance.
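One way to do the normalization described above is min-max scaling to [0, 1]. This is only a sketch: the function names are hypothetical, and the 1 to 5 rating scale in the comments is an assumption, not something stated in the question.

```python
def normalize(rating, lo, hi):
    """Map a rating from the known range [lo, hi] to [0, 1]."""
    return (rating - lo) / float(hi - lo)

def denormalize(value, lo, hi):
    """Map a value from [0, 1] back to the original [lo, hi] scale."""
    return value * (hi - lo) + lo

# Applied to the RDD from the question (the 1-5 scale is an assumption):
#   scaled = ratings.map(lambda r: Rating(r[0], r[1], normalize(r[2], 1.0, 5.0)))
#   model = ALS.train(scaled, rank, numIterations)
# Predictions then come back on the [0, 1] scale; denormalize them before
# comparing against the raw ratings.
```

Keep in mind that RMSE measured on the scaled ratings is expressed in scaled units, so it is smaller by a factor of (hi - lo) purely because of the change of scale. To compare fairly with an unscaled model, denormalize the predictions first.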

You should also consider using measures other than RMSE. When doing matrix factorization, I have found it useful to compute the Frobenius norm of (ratings - predictions) and divide it by the Frobenius norm of ratings. This gives the relative error of your predictions with respect to the original ratings.

Here's the code in Spark for this method:

from math import sqrt

# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))

ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)

# Frobenius norm of the prediction errors
abs_frobenius_error = sqrt(ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).sum())

# Frobenius norm of the original ratings
frob_error_orig = sqrt(ratings.map(lambda r: r[2]**2).sum())

# finally, the relative error
rel_error = abs_frobenius_error/frob_error_orig

print("Relative Error = " + str(rel_error))

With this error measure, the closer the error is to zero, the better your model is.
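To see how the two measures relate, here is a small plain-Python sketch that computes both on a handful of (actual, predicted) pairs without Spark. The function name and the sample values are made up for illustration.

```python
from math import sqrt

def rmse_and_relative_error(pairs):
    """pairs: non-empty list of (actual, predicted) ratings."""
    sq_err = sum((a - p) ** 2 for a, p in pairs)
    sq_actual = sum(a ** 2 for a, _ in pairs)
    # RMSE averages the squared error over the number of ratings.
    rmse = sqrt(sq_err / len(pairs))
    # Relative error divides the error norm by the norm of the actual ratings.
    rel_error = sqrt(sq_err) / sqrt(sq_actual)
    return rmse, rel_error

sample = [(3.0, 4.0), (4.0, 4.0)]  # made-up (actual, predicted) values
rmse, rel = rmse_and_relative_error(sample)
```

Both measures share the same numerator (the summed squared error); RMSE divides by the number of ratings, while the relative error divides by the magnitude of the ratings themselves, which makes it independent of the rating scale.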

I hope this helps.

answered Oct 09 '22 08:10

jtitusj

I did a little research on this; here is the conclusion:

As rank and the number of iterations grow, the RMSE decreases. However, as the size of the dataset grows, the RMSE increases. From these results, rank changes the RMSE value more significantly.

I know this is not enough to get a good model. I would welcome more ideas!

answered Oct 09 '22 07:10

sydridgm

Tags:

apache-spark

pyspark

apache-spark-mllib

collaborative-filtering
