 

How to take a random row from a PySpark DataFrame?

How can I get a random row from a PySpark DataFrame? I only see the method sample() which takes a fraction as parameter. Setting this fraction to 1/numberOfRows leads to random results, where sometimes I won't get any row.

On RDD there is a method takeSample() that takes as a parameter the number of elements you want the sample to contain. I understand that this might be slow, as you have to count each partition, but is there a way to get something like this on a DataFrame?

asked Nov 30 '15 at 16:11 by DanT

People also ask

How do you take a random sample from Pyspark DataFrame?

PySpark RDD also provides a sample() function for random sampling, and additionally offers takeSample(), which returns a plain list of elements. The RDD sample() function performs random sampling similar to the DataFrame version and takes similar parameters, but in a different order.

How do I select rows from spark DataFrame?

Selecting rows using the filter() function: the first option you have when it comes to filtering DataFrame rows is pyspark.sql.DataFrame.filter(), which performs filtering based on the specified conditions.


1 Answer

You can simply call takeSample on the underlying RDD:

df = sqlContext.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")], ("k", "v"))
df.rdd.takeSample(False, 1, seed=0)
## [Row(k=3, v='c')]

If you don't want to collect, you can simply sample with a higher fraction and limit:

df.sample(False, 0.1, seed=0).limit(1) 

If you don't pass a seed, you should get a different DataFrame each time.

answered Sep 21 '22 at 07:09 by zero323