 

How to shuffle the rows in a Spark dataframe?

I have a dataframe like this:

+---+---+
|_c0|_c1|
+---+---+
|1.0|4.0|
|1.0|4.0|
|2.1|3.0|
|2.1|3.0|
|2.1|3.0|
|2.1|3.0|
|3.0|6.0|
|4.0|5.0|
|4.0|5.0|
|4.0|5.0|
+---+---+

and I would like to shuffle all the rows using Spark in Scala.

How can I do this without going back to RDD?

Laure D asked Apr 26 '17 14:04


People also ask

How do you shuffle rows in Pyspark Dataframe?

To shuffle a DataFrame randomly by both rows and columns, you can use df.sample(frac=1, axis=1).sample(frac=1).reset_index(drop=True).

How do you shuffle rows in a Dataframe?

One of the easiest ways to shuffle a Pandas DataFrame is the sample method. df.sample returns a number of rows from the DataFrame in random order, so by asking it to return the entire DataFrame you get all of the rows back in a random order.

How do I shuffle Dataframe in spark?

If you want a "true" shuffle, where each row has an equal chance of ending up at any position in the dataset, you have to move data across the network. If you only need to shuffle rows within each partition, you can use df.mapPartitions instead (see the sketch below).
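
A rough sketch of that within-partition approach, assuming a DataFrame named df and a Spark 2.x/3.x release where RowEncoder(schema) yields the Encoder[Row] that mapPartitions needs:

import scala.util.Random
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder

// Shuffle rows only inside each partition: no data crosses the network,
// but each partition is materialized in memory while it is reordered.
val shuffledWithinPartitions = df.mapPartitions { rows =>
  Random.shuffle(rows.toIndexedSeq).iterator
}(RowEncoder(df.schema))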

How do I set shuffle partition in spark?

Spark's default shuffle partition setting is spark.sql.shuffle.partitions, which is set to 200 by default. You can change this value with the conf method of the SparkSession object, or through spark-submit configuration options (see the sketch below).
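
A minimal sketch of both options, assuming an active SparkSession called spark:

// Change the number of partitions Spark SQL uses for shuffles
// in joins and aggregations (the default is 200).
spark.conf.set("spark.sql.shuffle.partitions", "50")

// The equivalent setting at submit time:
//   spark-submit --conf spark.sql.shuffle.partitions=50 ...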

How to randomly shuffle Dataframe rows in pandas?

We will use the sample() method of the pandas module to randomly shuffle DataFrame rows. Algorithm: import the pandas and numpy modules, create a DataFrame, then shuffle its rows by calling sample() with the parameter frac=1, which determines what fraction of the total rows should be returned.

What is the difference between Spark shuffle and Spark data frame?

A Spark DataFrame is split into partitions, and after a shuffle operation the number and contents of those partitions generally differ from the original ones. Moving data from one partition to another so that it can be matched up, aggregated, joined, or otherwise redistributed is called a shuffle.
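
As an illustration, assuming the DataFrame from the question above is called df, an aggregation like the one below forces all rows with the same key onto the same partition, which is exactly this cross-partition movement:

// groupBy repartitions the data by _c0 before counting, i.e. it triggers a shuffle
val counts = df.groupBy("_c0").count()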

What is the use of Shuffle Index in Dataframe?

The shuffled indices are used to select rows with the .iloc[] method. You can shuffle the rows of a DataFrame by indexing with a shuffled index, for instance df.iloc[np.random.permutation(df.index)].reset_index(drop=True).


1 Answer

You need to use the orderBy method of the DataFrame:

import org.apache.spark.sql.functions.rand
val shuffledDF = dataframe.orderBy(rand())
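
rand() assigns every row a fresh random value, so each run produces a different order. As a small follow-up sketch using the same dataframe variable as above: passing a seed to rand makes the ordering repeatable for the same input data and partitioning.

// Fixing the seed gives a repeatable "random" order for the same input partitioning.
val reproducibleShuffle = dataframe.orderBy(rand(42L))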
prudenko answered Oct 17 '22 21:10