 

How to take a random row from a PySpark DataFrame?

How can I get a random row from a PySpark DataFrame? I only see the method sample() which takes a fraction as parameter. Setting this fraction to 1/numberOfRows leads to random results, where sometimes I won't get any row.

On RDD there is a method takeSample() that takes as a parameter the number of elements you want the sample to contain. I understand that this might be slow, as you have to count each partition, but is there a way to get something like this on a DataFrame?

asked Nov 30 '15 at 16:11 by DanT

People also ask

How do you take a random sample from Pyspark DataFrame?

PySpark RDD also provides a sample() function for random sampling, and additionally offers takeSample(), which returns a plain list of elements. The RDD sample() function performs random sampling similar to the DataFrame version and takes similar parameters, but in a different order.

How do I select rows from spark DataFrame?

Selecting rows using the filter() function: the first option you have when it comes to filtering DataFrame rows is pyspark.sql.DataFrame.filter(), which performs filtering based on the specified conditions.


1 Answer

You can simply call takeSample on the underlying RDD:

df = sqlContext.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")], ("k", "v"))
df.rdd.takeSample(False, 1, seed=0)
## [Row(k=3, v='c')]

If you don't want to collect, you can simply sample with a higher fraction and limit:

df.sample(False, 0.1, seed=0).limit(1) 

If you don't pass a seed, you should get a different DataFrame each time.

answered Sep 21 '22 at 07:09 by zero323