 

PySpark: Randomize rows in dataframe

I have a DataFrame and I want to randomize the order of its rows. I tried sampling the data with a fraction of 1, which didn't work (interestingly, this works in Pandas).

harshit asked Apr 22 '16 20:04





1 Answer

It works in Pandas because, on a local system, taking a sample is typically implemented by shuffling the data. Spark, on the other hand, avoids shuffling by performing a linear scan over the data. This means that sampling in Spark only randomizes which rows end up in the sample, not their order.
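To illustrate the difference, here is a minimal pure-Python sketch. It is a simplified model, not Spark's or Pandas' actual code: `bernoulli_sample` is a hypothetical helper that keeps each row independently during a single pass, which is roughly how scan-based sampling behaves, while the last line imitates a shuffle-based sample.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Hypothetical helper: keep each row with probability `fraction`
    during one linear scan. Rows that survive stay in their input order."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(20))

# fraction=1.0 keeps every row, so the result is the input, order unchanged.
full = bernoulli_sample(rows, 1.0, seed=42)

# A smaller fraction picks a random subset, still in the original order.
partial = bernoulli_sample(rows, 0.5, seed=42)

# Shuffle-based sampling of the whole input yields a random permutation.
shuffled = random.Random(42).sample(rows, k=len(rows))

print(full)
print(partial)
print(shuffled)
```

In this model a fraction of 1 can never reorder anything, because the scan only decides whether to keep each row and never moves it.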

You can order the DataFrame by a column of random numbers:

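from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

# Reuse the active session, or start one when running outside the shell
spark = SparkSession.builder.getOrCreate()

df = spark.range(20).toDF("x")
# Sorting by a column of random numbers shuffles the rows
df.orderBy(rand()).show(3)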

## +---+
## |  x|
## +---+
## |  2|
## |  7|
## | 14|
## +---+
## only showing top 3 rows

but it is:

  • expensive - because it requires a full shuffle, which is something you typically want to avoid.
  • suspicious - because the order of values in a DataFrame is not something you can really depend on in non-trivial cases, and since a DataFrame doesn't support indexing, a shuffled order is relatively useless without collecting the data.
zero323 answered Sep 18 '22 14:09
