Inconsistent results using ALS in Apache Spark

I'm very new to Apache Spark and big data in general. I'm using the ALS method to create rating predictions based on a matrix of users, items, and ratings. The confusing part is that when I run the script to calculate the predictions, the results are different every time, even though neither the input data nor the requested predictions change. Is this expected behavior, or should the results be identical? Below is the Python code for reference.

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS

sc = SparkContext("local", "CF")

# get ratings from text
def parseRating(line):
  fields = line.split(',')
  return (int(fields[0]), int(fields[1]), float(fields[2]))

# define input and output files
ratingsFile = 's3n://weburito/data/weburito_ratings.dat'
unratedFile = 's3n://weburito/data/weburito_unrated.dat'
predictionsFile = '/root/weburito/data/weburito_predictions.dat'

# read training set
training = sc.textFile(ratingsFile).map(parseRating).cache()

# get unknown ratings set
predictions = sc.textFile(unratedFile).map(parseRating)

# define model
model = ALS.train(training, rank = 5, iterations = 20)

# generate predictions
predictions = model.predictAll(predictions.map(lambda x: (x[0], x[1]))).collect()
asked Oct 19 '22 by Ricky Vesel


1 Answer

This is expected behaviour. The factor matrices in ALS are initialized randomly (strictly speaking, only one of them is; the other is then solved for from that random initialization in the first iteration).

So different runs will give slightly different results.
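If you need reproducible results, you can fix the random seed used for that initialization. A minimal sketch, assuming a Spark version whose pyspark.mllib ALS.train accepts a seed keyword argument (older releases may not expose it); the seed value 42 here is arbitrary:

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS

sc = SparkContext("local", "CF")

# parse "user,item,rating" lines into (int, int, float) tuples
def parseRating(line):
  fields = line.split(',')
  return (int(fields[0]), int(fields[1]), float(fields[2]))

training = sc.textFile('s3n://weburito/data/weburito_ratings.dat').map(parseRating).cache()

# a fixed seed makes the random factor initialization deterministic,
# so repeated runs on the same input produce the same factors
model = ALS.train(training, rank=5, iterations=20, seed=42)

With the seed fixed, repeated runs on the same input should produce identical predictions; without it, the small run-to-run differences you are seeing are normal.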

answered Nov 01 '22 by Nick Pentreath