How can I get a random row from a PySpark DataFrame? I only see the method sample(), which takes a fraction as a parameter. Setting this fraction to 1/numberOfRows gives unpredictable results: sometimes I don't get any row at all.
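In other words, I'm doing something along these lines (a minimal sketch; df stands for my DataFrame):

# Each row is kept independently with probability 1/numberOfRows,
# so the result can contain zero, one, or several rows.
fraction = 1.0 / df.count()
df.sample(False, fraction).collect()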
On RDD there is a method takeSample() that takes as a parameter the number of elements you want the sample to contain. I understand that this might be slow, as you have to count each partition, but is there a way to get something like this on a DataFrame?
PySpark's RDD API also provides a sample() function for random sampling, as well as takeSample(), which returns a fixed-size list of elements. RDD.sample() behaves much like the DataFrame version and takes similar parameters, though in a different order.
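As a rough sketch of the two RDD calls (the parameter values are arbitrary, and df is assumed to be an existing DataFrame):

rdd = df.rdd
# sample(withReplacement, fraction, seed=None) returns a new, lazily evaluated RDD
rdd.sample(False, 0.5, seed=0)
# takeSample(withReplacement, num, seed=None) eagerly returns a list of num rows
# (or fewer, if the RDD has fewer than num elements)
rdd.takeSample(False, 2, seed=0)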
You can simply call takeSample() on the underlying RDD:

df = sqlContext.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")],
    ("k", "v"))

df.rdd.takeSample(False, 1, seed=0)
## [Row(k=3, v='c')]
If you don't want to collect the result to the driver, you can take a higher fraction and use limit:
df.sample(False, 0.1, seed=0).limit(1)
Don't pass a seed, and you should get a different DataFrame each time.
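For example, omitting the seed (a minimal sketch; the fraction 0.1 is arbitrary, and collect() is only there to materialise the result):

df.sample(False, 0.1).limit(1).collect()  # likely to differ between runs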