Pyspark - How to get random values from a DataFrame column

Tags:

pyspark

pyspark-sql

spark-dataframe

I have a DataFrame with a single column, and I need to select 3 random values from it in PySpark. Could anyone help me, please?

+---+
| id|
+---+
|123| 
|245| 
| 12|
|234|
+---+

Desired output:

An array with 3 random values taken from that column:

**output**: [123, 12, 234]

782

asked Oct 04 '17 12:10

Thaise

1 Answer

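You can sort the rows in random order with the rand() function first, then keep the first 3:

 from pyspark.sql.functions import rand
 df.select('id').orderBy(rand()).limit(3).collect()

Note that collect() returns a list of Row objects rather than plain values. For more information on the rand() function, see pyspark.sql.functions.rand.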

177

answered Sep 26 '22 07:09

geopet85
