PySpark - How to get random values from a DataFrame column

I have a DataFrame with one column, and I need to select 3 random values from it in PySpark. Could anyone help me, please?

+---+
| id|
+---+
|123| 
|245| 
| 12|
|234|
+---+

Desired result:

An array with 3 random values taken from that column:

**output**: [123, 12, 234]
Thaise asked Oct 04 '17 12:10



1 Answer

You can shuffle the rows by ordering on the rand() function, then keep the first 3:

 from pyspark.sql.functions import rand
 df.select('id').orderBy(rand()).limit(3).collect()

For more information on the rand() function, check out the documentation for pyspark.sql.functions.rand.

geopet85 answered Sep 26 '22 07:09
