The input data I have for recommendation looks like:
[(u'97990079', u'18_34', 2),
 (u'585853655', u'11_8', 1),
 (u'1398696913', u'6_20', 1),
 (u'612168869', u'7_16', 1),
 (u'2272846159', u'11_17', 2)]
which follows the format (user_id, item_id, score).
If I understand correctly, ALS in Spark must convert user_id and item_id to integers before training. If so, the only solution I can think of is to use dictionaries and map every user_id and item_id to an integer, like
dictionary for item_id : {'18_34': 1, '18_35':2, ...}
dictionary for user_id : {'97990079':1, '585853655':2, ...}
But I was wondering if there is a more elegant way to do that? Thanks!
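For instance, a plain-Python sketch of that dictionary approach (the ordering of the assigned integers is arbitrary here):

```python
ratings = [(u'97990079', u'18_34', 2), (u'585853655', u'11_8', 1),
           (u'1398696913', u'6_20', 1), (u'612168869', u'7_16', 1),
           (u'2272846159', u'11_17', 2)]

# Build one lookup table per column by enumerating the distinct values.
user_index = {uid: i for i, uid in enumerate(sorted({r[0] for r in ratings}))}
item_index = {iid: i for i, iid in enumerate(sorted({r[1] for r in ratings}))}

# Re-encode every (user_id, item_id, score) triple with integer ids.
encoded = [(user_index[u], item_index[i], s) for u, i, s in ratings]
```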
One way you can handle this is to use ML transformers. First, let's convert your data to a DataFrame:
ratings_df = sqlContext.createDataFrame([
    (u'97990079', u'18_34', 2), (u'585853655', u'11_8', 1),
    (u'1398696913', u'6_20', 1), (u'612168869', u'7_16', 1),
    (u'2272846159', u'11_17', 2)],
    ("user_id", "item_id_str", "rating"))
Next we'll need a StringIndexer:
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="item_id_str", outputCol="item_id")
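For intuition, StringIndexer's default ordering (stringOrderType="frequencyDesc") gives the most frequent label index 0.0. The plain-Python sketch below mimics that rule, breaking ties alphabetically for determinism (Spark does not guarantee a particular order among equal-frequency labels):

```python
from collections import Counter

item_ids = ['18_34', '11_8', '6_20', '7_16', '11_17']

# Most frequent label gets 0.0; break ties alphabetically here for
# determinism (the actual tie order in Spark is not guaranteed).
counts = Counter(item_ids)
ordered = sorted(counts, key=lambda label: (-counts[label], label))
item_index = {label: float(i) for i, label in enumerate(ordered)}
```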
Finally, let's transform the DataFrame using the indexer and cast user_id to an integer:
from pyspark.sql.functions import col
transformed = (indexer
    .fit(ratings_df)
    .transform(ratings_df)
    .withColumn("user_id", col("user_id").cast("integer"))
    .select("user_id", "item_id", "rating"))
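One caveat with the cast above: Spark's IntegerType is a signed 32-bit int, and cast("integer") silently yields null for out-of-range values. One of the sample user IDs is above Int.MaxValue, so for IDs like that you'd want to index user_id with a second StringIndexer instead of casting. A quick plain-Python check:

```python
INT32_MAX = 2**31 - 1  # upper bound of Spark's IntegerType

user_ids = ['97990079', '585853655', '1398696913', '612168869', '2272846159']

# These IDs would become null after cast("integer").
too_big = [u for u in user_ids if int(u) > INT32_MAX]
```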
and convert it to an RDD[Rating]:
from pyspark.mllib.recommendation import Rating
ratings_rdd = transformed.rdd.map(lambda r: Rating(r.user_id, r.item_id, r.rating))
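For reference, pyspark.mllib.recommendation.Rating is essentially a namedtuple of (user, product, rating), so the lambda above is just repacking the row's fields:

```python
from collections import namedtuple

# A stand-in for pyspark.mllib.recommendation.Rating, which has these fields.
Rating = namedtuple("Rating", ["user", "product", "rating"])

r = Rating(user=97990079, product=2, rating=2)
```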
In newer versions of Spark you can skip the RDD conversion and use ml.recommendation.ALS directly:
from pyspark.ml.recommendation import ALS
model = (ALS(userCol="user_id", itemCol="item_id", ratingCol="rating")
    .fit(transformed))
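To map predicted item indices back to the original string IDs you can use IndexToString with the labels from the fitted indexer; conceptually it is just the inverse lookup. In plain Python (the index values below are illustrative, not what StringIndexer would necessarily assign):

```python
# Illustrative label -> index mapping, as produced by a fitted StringIndexer.
item_index = {'11_17': 0.0, '11_8': 1.0, '18_34': 2.0, '6_20': 3.0, '7_16': 4.0}

# Invert it to decode model output back to the original item ids.
index_to_item = {v: k for k, v in item_index.items()}
```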