I have a column in a DataFrame from which I need to select 3 random values in PySpark. Could anyone help me, please?
+---+
| id|
+---+
|123|
|245|
| 12|
|234|
+---+
Desired output:
An array with 3 random values taken from that column:
**output**: [123, 12, 234]
PySpark RDDs also provide a sample() function for random sampling, as well as takeSample(), which returns the sampled elements directly as a list. RDD.sample() behaves much like DataFrame.sample() and takes similar parameters (withReplacement, fraction, seed).
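A minimal sketch of this approach, assuming the values live in a DataFrame column named id as in the question (the DataFrame construction here is only for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(123,), (245,), (12,), (234,)], ["id"])

# takeSample(withReplacement, num, seed=None) returns a plain Python list
sample = df.rdd.map(lambda row: row.id).takeSample(False, 3)
print(sample)  # e.g. [123, 12, 234]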
Separately, Python's random module provides randint() for generating whole numbers (integers): randint(0, 50) returns a random integer between 0 and 50, inclusive, and randint(1, 10) returns one between 1 and 10. The related randrange(min, max) excludes the upper bound, so randrange(0, 10) yields integers between 0 and 9.
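For illustration only (this uses Python's standard random module, not PySpark):

import random

print(random.randint(0, 50))    # integer between 0 and 50, inclusive
print(random.randrange(0, 10))  # integer between 0 and 9 (upper bound exclusive)
print(random.randint(1, 10))    # integer between 1 and 10, inclusive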
You can order the rows in random order using the rand() function first:

from pyspark.sql.functions import rand

df.select('id').orderBy(rand()).limit(3).collect()

For more information on rand(), check out pyspark.sql.functions.rand.
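Note that collect() returns Row objects rather than plain values. A minimal sketch of flattening the result into a Python list, assuming the same DataFrame df with an id column as in the question:

from pyspark.sql.functions import rand

# df is the DataFrame with the id column from the question
rows = df.select('id').orderBy(rand()).limit(3).collect()
values = [row.id for row in rows]
print(values)  # e.g. [123, 12, 234]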