
How to assign unique contiguous numbers to elements in a Spark RDD

I have a dataset of (user, product, review), and want to feed it into mllib's ALS algorithm.

The algorithm needs users and products to be numbers, while mine are String usernames and String SKUs.

Right now, I get the distinct users and SKUs, then assign numeric IDs to them outside of Spark.

I was wondering whether there is a better way of doing this. One approach I've thought of is to write a custom RDD that essentially enumerates 1 through n, then call zip on the two RDDs.
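A rough sketch of the current driver-side approach described above (the `reviews` RDD and the sample triples are assumptions for illustration; assumes an existing SparkContext `sc`):

```scala
// Hypothetical (user, sku, review) String triples.
val reviews = sc.parallelize(Seq(
  ("alice", "SKU-1", "great"),
  ("bob",   "SKU-2", "ok"),
  ("alice", "SKU-2", "fine")))

// Collect distinct keys to the driver and enumerate them there.
val userIds: Map[String, Int] =
  reviews.map(_._1).distinct().collect().zipWithIndex.toMap
val skuIds: Map[String, Int] =
  reviews.map(_._2).distinct().collect().zipWithIndex.toMap
```

This works, but it pulls all distinct keys back to the driver, which is exactly the step the question hopes to keep inside Spark.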

Dilum Ranatunga asked May 29 '14 17:05

People also ask

How do I create a sequence number in Spark?

row_number() is a window function in Spark SQL that assigns a row number (sequence number) to each row in the result Dataset. It is used with Window.partitionBy(), which partitions the data into window frames, and an orderBy() clause, which sorts the rows within each partition.
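A minimal DataFrame sketch of row_number() over a partitioned window (the column names and sample rows are made up, and the Dataset API shown here is newer than the Spark 1.x context of the original question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val spark = SparkSession.builder
  .appName("row-number-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(("alice", 5.0), ("bob", 3.0), ("alice", 4.0))
  .toDF("user", "rating")

// One window frame per user, rows sorted by rating descending within each frame.
val w = Window.partitionBy($"user").orderBy($"rating".desc)
df.withColumn("rn", row_number().over(w)).show()
// alice's 5.0 row gets rn = 1, her 4.0 row gets rn = 2; bob's single row gets rn = 1.
```

Note that row_number() restarts at 1 in every partition frame, so it numbers rows per user here, not globally.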

Which method should be used when a given RDD is to be divided into number of partitions?

In Scala and Java, you can determine how an RDD is partitioned using its partitioner property (or partitioner() method in Java).
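For example (assumes an existing SparkContext `sc`; note that partitionBy applies only to pair RDDs):

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
  .partitionBy(new HashPartitioner(4))

pairs.partitioner       // Some(HashPartitioner) describing how keys map to partitions
pairs.getNumPartitions  // 4
```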

What is parallelize in RDD?

parallelize is a method that creates an RDD from an existing collection (e.g. an Array) in the driver. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
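A quick sketch (assumes an existing SparkContext `sc`; the numbers are arbitrary):

```scala
// Copy a driver-side collection into a distributed dataset with 2 partitions.
val rdd = sc.parallelize(Array(1, 2, 3, 4, 5), numSlices = 2)

rdd.getNumPartitions  // 2
rdd.sum()             // 15.0, computed in parallel across the partitions
```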


1 Answer

Starting with Spark 1.0 there are two methods you can use to solve this easily:

  • RDD.zipWithIndex works just like Seq.zipWithIndex: it adds contiguous (Long) numbers. It needs to count the elements in each partition first, so your input will be evaluated twice. Cache your input RDD if you want to use this.
  • RDD.zipWithUniqueId also gives you unique Long IDs, but they are not guaranteed to be contiguous. (They will only be contiguous if each partition has the same number of elements.) The upside is that this does not need to know anything about the input, so it will not cause double-evaluation.
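A minimal sketch of both methods (the `reviews` RDD and its sample triples are assumptions, not from the question; assumes an existing SparkContext `sc`):

```scala
// Hypothetical (user, sku, rating) triples.
val reviews = sc.parallelize(Seq(
  ("alice", "SKU-1", 5.0),
  ("bob",   "SKU-2", 3.0),
  ("alice", "SKU-2", 4.0)))

// Contiguous IDs 0..n-1; counts each partition first, so cache the input.
val users = reviews.map(_._1).distinct().cache()
val userIds: Map[String, Long] = users.zipWithIndex().collectAsMap().toMap

// Unique but possibly non-contiguous IDs; single pass, no caching needed.
// (id = indexWithinPartition * numPartitions + partitionIndex)
val skuIds: Map[String, Long] =
  reviews.map(_._2).distinct().zipWithUniqueId().collectAsMap().toMap
```

For ALS specifically, MLlib's Rating takes Int user and product IDs, so the Long indices must fit in an Int before converting the triples.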
Daniel Darabos answered Oct 02 '22 15:10