I'm very new to Apache Spark and big data in general. I'm using the ALS method to create rating predictions from a matrix of users, items, and ratings. The confusing part is that when I run the script to calculate the predictions, the results are different every time, even though neither the input data nor the requested predictions change. Is this expected behavior, or should the results be identical? Below is the Python code for reference.
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS

sc = SparkContext("local", "CF")

# parse a "user,item,rating" line into a (user, item, rating) tuple
def parseRating(line):
    fields = line.split(',')
    return (int(fields[0]), int(fields[1]), float(fields[2]))

# define input and output files
ratingsFile = 's3n://weburito/data/weburito_ratings.dat'
unratedFile = 's3n://weburito/data/weburito_unrated.dat'
predictionsFile = '/root/weburito/data/weburito_predictions.dat'

# read training set
training = sc.textFile(ratingsFile).map(parseRating).cache()

# get unknown ratings set
unrated = sc.textFile(unratedFile).map(parseRating)

# train the model
model = ALS.train(training, rank=5, iterations=20)

# generate predictions for the (user, item) pairs
predictions = model.predictAll(unrated.map(lambda x: (x[0], x[1]))).collect()
This is expected behaviour. The factor matrices in ALS are initialized randomly (strictly speaking, one of them is initialized randomly, and the other is solved from that initialization in the first step). Because the ALS objective is non-convex in both factor matrices jointly, different random starting points converge to slightly different local solutions, so different runs give slightly different results. If you need repeatable output, fix the random seed, as sketched below.
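Here is a minimal sketch, assuming a Spark version in which pyspark.mllib's ALS.train exposes the optional seed keyword; the tiny in-memory ratings RDD is purely illustrative, standing in for your S3 data. With the seed pinned, the random factor initialization is fixed, so repeated runs should produce identical predictions.

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS

sc = SparkContext("local", "CF-seeded")

# tiny in-memory (user, item, rating) RDD, purely for illustration
training = sc.parallelize([(1, 1, 5.0), (1, 2, 1.0),
                           (2, 1, 1.0), (2, 2, 5.0)])

# pinning seed fixes the random factor initialization, so repeated
# runs of this script yield the same model and the same predictions
model = ALS.train(training, rank=2, iterations=10, seed=42)
print(model.predict(1, 2))

Running this script twice with the same seed should print the same value both times; omit (or vary) the seed and you are back to the run-to-run variation you observed.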