I'm trying to tune the parameters of an ALS matrix factorization model that uses implicit data. For this, I'm using pyspark.ml.tuning.CrossValidator to run through a parameter grid and select the best model. I believe my problem is in the evaluator, but I can't figure it out.
I can get this to work for an explicit data model with a regression RMSE evaluator, as follows:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import rand
conf = SparkConf() \
.setAppName("MovieLensALS") \
.set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
dfRatings = sqlContext.createDataFrame([(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 2, 5.0)],
["user", "item", "rating"])
dfRatingsTest = sqlContext.createDataFrame([(0, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2)], ["user", "item"])
alsExplicit = ALS()
defaultModel = alsExplicit.fit(dfRatings)
paramMapExplicit = ParamGridBuilder() \
.addGrid(alsExplicit.rank, [8, 12]) \
.addGrid(alsExplicit.maxIter, [10, 15]) \
.addGrid(alsExplicit.regParam, [1.0, 10.0]) \
.build()
evaluatorR = RegressionEvaluator(metricName="rmse", labelCol="rating")
cvExplicit = CrossValidator(estimator=alsExplicit, estimatorParamMaps=paramMapExplicit, evaluator=evaluatorR)
cvModelExplicit = cvExplicit.fit(dfRatings)
predsExplicit = cvModelExplicit.bestModel.transform(dfRatingsTest)
predsExplicit.show()
When I try to do this for implicit data (let's say counts of views rather than ratings), I get an error that I can't quite figure out. Here's the code (very similar to the above):
dfCounts = sqlContext.createDataFrame([(0,0,0), (0,1,12), (0,2,3), (1,0,5), (1,1,9), (1,2,0), (2,0,0), (2,1,11), (2,2,25)],
["user", "item", "rating"])
dfCountsTest = sqlContext.createDataFrame([(0, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2)], ["user", "item"])
alsImplicit = ALS(implicitPrefs=True)
defaultModelImplicit = alsImplicit.fit(dfCounts)
paramMapImplicit = ParamGridBuilder() \
.addGrid(alsImplicit.rank, [8, 12]) \
.addGrid(alsImplicit.maxIter, [10, 15]) \
.addGrid(alsImplicit.regParam, [1.0, 10.0]) \
.addGrid(alsImplicit.alpha, [2.0,3.0]) \
.build()
evaluatorB = BinaryClassificationEvaluator(metricName="areaUnderROC", labelCol="rating")
evaluatorR = RegressionEvaluator(metricName="rmse", labelCol="rating")
cv = CrossValidator(estimator=alsImplicit, estimatorParamMaps=paramMapImplicit, evaluator=evaluatorR)
cvModel = cv.fit(dfCounts)
predsImplicit = cvModel.bestModel.transform(dfCountsTest)
predsImplicit.show()
I tried doing this with an RMSE evaluator and got an error. As I understand it, I should also be able to use the AUC metric from the binary classification evaluator, because the predictions of the implicit matrix factorization are a confidence matrix c_ui for predictions of a binary matrix p_ui, per this paper, which the documentation for pyspark ALS cites.
Using either evaluator gives me an error, and I can't find any fruitful discussion about cross-validating implicit ALS models online. I'm looking through the CrossValidator source code to try to figure out what's wrong, but am having trouble. One of my thoughts is that, after the process converts the implicit data matrix r_ui into the binary preference matrix p_ui and the confidence matrix c_ui, I'm not sure what the predicted c_ui matrix is being compared against during the evaluation stage.
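For reference, the transformation I have in mind is the one from that paper: p_ui is 1 whenever r_ui > 0 (and 0 otherwise), and c_ui = 1 + alpha * r_ui. A quick NumPy sketch on the counts above, just to make the notation concrete (the alpha value here is arbitrary):
import numpy as np
# Raw implicit feedback r_ui (view counts) for the 3 users x 3 items above
r = np.array([[0., 12., 3.],
              [5., 9., 0.],
              [0., 11., 25.]])
# Binary preference matrix: p_ui = 1 if r_ui > 0 else 0
p = (r > 0).astype(float)
# Confidence matrix: c_ui = 1 + alpha * r_ui (alpha is a tuning parameter)
alpha = 2.0
c = 1.0 + alpha * r
print(p)
print(c)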
Here is the error:
Traceback (most recent call last):
File "<ipython-input-16-6c43b997005e>", line 1, in <module>
cvModel = cv.fit(dfCounts)
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\pipeline.py", line 69, in fit
return self._fit(dataset)
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\tuning.py", line 239, in _fit
model = est.fit(train, epm[j])
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\pipeline.py", line 67, in fit
return self.copy(params)._fit(dataset)
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\wrapper.py", line 133, in _fit
java_model = self._fit_java(dataset)
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\ml\wrapper.py", line 130, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "C:\spark-1.6.1-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py", line 813, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:/spark-1.6.1-bin-hadoop2.6/python\pyspark\sql\utils.py", line 45, in deco
return f(*a, **kw)
File "C:\spark-1.6.1-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py", line 308, in get_return_value
format(target_id, ".", name), value)
etc.......
UPDATE
I tried scaling the input so it's in the range of 0 to 1 and using an RMSE evaluator. It seems to work well until I try to insert it into the CrossValidator.
The following code works. I get predictions, and I get an RMSE value from my evaluator.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import FloatType
import pyspark.sql.functions as F
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
conf = SparkConf() \
.setAppName("ALSPractice") \
.set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
# Users 0, 1, 2, 3 - Items 0, 1, 2, 3, 4 - Ratings 0.0-5.0
dfCounts2 = sqlContext.createDataFrame([(0,0,5.0), (0,1,5.0), (0,3,0.0), (0,4,0.0),
(1,0,5.0), (1,2,4.0), (1,3,0.0), (1,4,0.0),
(2,0,0.0), (2,2,0.0), (2,3,5.0), (2,4,5.0),
(3,0,0.0), (3,1,0.0), (3,3,4.0) ],
["user", "item", "rating"])
dfCountsTest2 = sqlContext.createDataFrame([(0,0), (0,1), (0,2), (0,3), (0,4),
(1,0), (1,1), (1,2), (1,3), (1,4),
(2,0), (2,1), (2,2), (2,3), (2,4),
(3,0), (3,1), (3,2), (3,3), (3,4)], ["user", "item"])
# Normalize rating data to [0,1] range based on max rating
colmax = dfCounts2.select(F.max('rating')).collect()[0][0]
normalize = F.udf(lambda x: x / colmax, FloatType())
dfCountsNorm = dfCounts2.withColumn('ratingNorm', normalize(F.col('rating')))
alsImplicit = ALS(implicitPrefs=True, ratingCol="ratingNorm")
defaultModelImplicit = alsImplicit.fit(dfCountsNorm)
preds = defaultModelImplicit.transform(dfCountsTest2)
evaluatorR2 = RegressionEvaluator(metricName="rmse", labelCol="ratingNorm")
evaluatorR2.evaluate(defaultModelImplicit.transform(dfCountsNorm))
What I don't understand is why the following doesn't work. I'm using the same estimator, the same evaluator, and fitting the same data. Why would this work above but not within the CrossValidator:
paramMapImplicit = ParamGridBuilder() \
.addGrid(alsImplicit.rank, [8, 12]) \
.addGrid(alsImplicit.maxIter, [10, 15]) \
.addGrid(alsImplicit.regParam, [1.0, 10.0]) \
.addGrid(alsImplicit.alpha, [2.0,3.0]) \
.build()
cv = CrossValidator(estimator=alsImplicit, estimatorParamMaps=paramMapImplicit, evaluator=evaluatorR2)
cvModel = cv.fit(dfCountsNorm)
Alternating Least Squares (ALS) matrix factorization: ALS attempts to estimate the ratings matrix R as the product of two lower-rank matrices, X and Y, i.e. X * Yt = R. These approximations are typically called 'factor' matrices, and the general approach is iterative. ALS is implemented in Apache Spark ML, runs in a parallel fashion, and is built for large-scale collaborative filtering problems.
Cross-validation: CrossValidator begins by splitting the dataset into a set of folds which are used as separate training and test datasets. E.g., with k=3 folds, CrossValidator will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing.
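As a small aside on the fold mechanics, the number of folds is controlled by CrossValidator's numFolds parameter (3 by default). A sketch reusing the estimator, grid and evaluator names from the question:
from pyspark.ml.tuning import CrossValidator
# With numFolds=3 (the default), each fold trains on 2/3 of the data
# and evaluates on the remaining 1/3
cv = CrossValidator(estimator=alsImplicit,
                    estimatorParamMaps=paramMapImplicit,
                    evaluator=evaluatorR2,
                    numFolds=3)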
Ignoring technical issues, strictly speaking neither evaluator is correct given the input generated by ALS with implicit feedback.
You cannot use RegressionEvaluator because, as you already know, the prediction can be interpreted as a confidence value and is represented as a floating point number in the range [0, 1], while the label column is just an unbounded integer. These values are clearly not comparable.
You cannot use BinaryClassificationEvaluator because, even if the prediction can be interpreted as a probability, the label doesn't represent a binary decision. Moreover, the prediction column has an invalid type and can't be used directly with BinaryClassificationEvaluator.
You can try to convert one of the columns so the input fits the requirements, but this is not really a justified approach from a theoretical perspective and it introduces additional parameters which are hard to tune. You could, for example:
- map the label column to the [0, 1] range and use RMSE (a sketch of this is given after the code below), or
- convert the label column to a binary indicator with a fixed threshold and extend ALS / ALSModel to return the expected column type. Assuming a threshold value of 1, it could be something like this:
from pyspark.ml.recommendation import *
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.sql.functions import udf, col
from pyspark.mllib.linalg import DenseVector, VectorUDT

class BinaryALS(ALS):
    def fit(self, df):
        assert self.getImplicitPrefs()
        model = super(BinaryALS, self).fit(df)
        return ALSBinaryModel(model._java_obj)

class ALSBinaryModel(ALSModel):
    def transform(self, df):
        transformed = super(ALSBinaryModel, self).transform(df)
        # Wrap the scalar confidence as a two-element vector so
        # BinaryClassificationEvaluator can consume it as rawPrediction
        as_vector = udf(lambda x: DenseVector([1 - x, x]), VectorUDT())
        return transformed.withColumn(
            "rawPrediction", as_vector(col("prediction")))

# Add binary label column (any count >= 1 becomes 1.0)
with_binary = dfCounts.withColumn(
    "label_binary", (col("rating") > 0).cast("double"))

als_binary_model = BinaryALS(implicitPrefs=True).fit(with_binary)

evaluatorB = BinaryClassificationEvaluator(
    metricName="areaUnderROC", labelCol="label_binary")
evaluatorB.evaluate(als_binary_model.transform(with_binary))
## 1.0
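For the first option, a minimal sketch could look like the following. It reuses dfCounts and the ratingNorm column name from the question, the grid values are placeholders, and it only makes label and prediction comparable on the same scale; it does not by itself address the CrossValidator error above.
import pyspark.sql.functions as F
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

# Scale the raw counts into [0, 1] so label and prediction share a scale
rmax = float(dfCounts.agg(F.max("rating")).collect()[0][0])
dfScaled = dfCounts.withColumn("ratingNorm", F.col("rating") / rmax)

alsScaled = ALS(implicitPrefs=True, ratingCol="ratingNorm")
gridScaled = ParamGridBuilder() \
    .addGrid(alsScaled.rank, [8, 12]) \
    .addGrid(alsScaled.alpha, [2.0, 3.0]) \
    .build()
evaluatorScaled = RegressionEvaluator(metricName="rmse", labelCol="ratingNorm")

cvScaled = CrossValidator(estimator=alsScaled,
                          estimatorParamMaps=gridScaled,
                          evaluator=evaluatorScaled)
cvModelScaled = cvScaled.fit(dfScaled)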
Generally speaking, material about evaluating recommender systems with implicit feedback is largely missing from textbooks; I suggest you read eliasah's answer about evaluating these kinds of recommenders.
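As an illustration of the ranking-style evaluation that kind of discussion points towards, here is a minimal sketch with pyspark.mllib.evaluation.RankingMetrics. The per-user recommendation and ground-truth lists are made-up values, and sc is the SparkContext from the question:
from pyspark.mllib.evaluation import RankingMetrics
# Per-user (recommended items, relevant items) pairs -- illustrative values only
predictionAndLabels = sc.parallelize([
    ([1, 2, 0], [1, 2]),   # user 0
    ([0, 2, 1], [0, 2]),   # user 1
    ([2, 1, 0], [1, 2]),   # user 2
])
metrics = RankingMetrics(predictionAndLabels)
print(metrics.precisionAt(2))
print(metrics.meanAveragePrecision)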