I have a dataset of <code>(user, product, review)</code> and want to feed it into MLlib's ALS algorithm. The algorithm needs users and products to be numbers, while mine are String usernames and String SKUs. Right now, I get the distinct users and SKUs, then assign numeric IDs to them outside of Spark. Is there a better way of doing this? The one approach I've thought of is to write a custom RDD that essentially enumerates 1 through <code>n</code>, then call <code>zip</code> on the two RDDs.

Starting with Spark 1.0, there are two methods that solve this easily: <ul> <li> <code>RDD.zipWithIndex</code> works just like <code>Seq.zipWithIndex</code>: it adds contiguous <code>Long</code> indices. It has to count the elements in each partition first, so your input will be evaluated twice. Cache your input RDD if you want to use this.</li> <li> <code>RDD.zipWithUniqueId</code> also gives you unique <code>Long</code> IDs, but they are not guaranteed to be contiguous. (They will only be contiguous if each partition has the same number of elements.) The upside is that it does not need to know anything about the input, so it will not cause double evaluation.</li> </ul>
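To make the difference between the two numbering schemes concrete, here is a minimal plain-Python model of them. It is a sketch, not Spark code: the list-of-lists `partitions` stands in for an RDD's partitions, and the function names `zip_with_index` and `zip_with_unique_id` are illustrative. The ID formula in the second function follows the scheme described in the Spark API documentation, where items in the k-th of n partitions get IDs k, n+k, 2n+k, and so on.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Model of RDD.zipWithIndex: contiguous indices across all partitions."""
    # First pass: count each partition to learn where its numbering starts.
    # This counting step is why the real method evaluates the input twice.
    counts = [len(p) for p in partitions]
    offsets = [0] + list(accumulate(counts))[:-1]
    # Second pass: number the elements starting from each partition's offset.
    return [[(x, off + i) for i, x in enumerate(p)]
            for p, off in zip(partitions, offsets)]


def zip_with_unique_id(partitions):
    """Model of RDD.zipWithUniqueId: unique IDs without counting anything."""
    n = len(partitions)
    # The i-th item of partition k gets ID i * n + k, so no partition needs
    # to know the size of any other partition.
    return [[(x, i * n + k) for i, x in enumerate(p)]
            for k, p in enumerate(partitions)]


# Hypothetical input: two partitions of unequal size.
uneven = [["alice", "bob", "carol"], ["dave"]]
print(zip_with_index(uneven))      # IDs 0, 1, 2, 3
print(zip_with_unique_id(uneven))  # IDs 0, 2, 4 and 1: unique, but 3 is skipped
```

With unequal partitions, the second scheme leaves gaps (no element gets ID 3 here), which is why its IDs are only contiguous when every partition holds the same number of elements. In Spark itself, the corresponding calls are `zipWithIndex()` and `zipWithUniqueId()` on the RDD of distinct users or SKUs.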

How to assign unique contiguous numbers to elements in a Spark RDD

102

answered Oct 02 '22 15:10

Daniel Darabos



Tags:

apache-spark

apache-spark-mllib

Dilum Ranatunga
