I have a dataframe with multiple thousands of records, and I'd like to randomly select 1000 rows into another dataframe for demoing. How can I do this in Java?
Thank you!
You can shuffle the rows and then take the top ones:
import org.apache.spark.sql.functions.rand
dataset.orderBy(rand()).limit(n)
You can try the sample() method. Unfortunately, it takes a fraction rather than a number of rows. You can write a function like this:
import org.apache.spark.sql.Dataset
def getRandom(dataset: Dataset[_], n: Int) = {
  val count = dataset.count()
  val howManyTake = if (count > n) n else count
  // withReplacement must be a Boolean (false), not 0
  dataset.sample(false, 1.0 * howManyTake / count).limit(n)
}
Explanation: sample() takes a fraction of the data. If we have 2000 rows and want 100 of them, the fraction is 100/2000 = 0.05 of the total rows. If you ask for more rows than there are in the DataFrame, the fraction is 1.0. limit() is called so that you never get more rows than you specified. Because the sampling is random, the result can also contain fewer than n rows.
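Since the question asks for Java, here is a minimal sketch of the same fraction calculation in plain Java. The class name SampleFraction and the method fractionFor are hypothetical names chosen for illustration, and the guard for an empty dataset is an added assumption that the Scala version above does not have. The Spark call appears only in a comment and is not executed here.

```java
// Hypothetical helper: computes the fraction to pass to Spark's sample().
final class SampleFraction {
    private SampleFraction() {}

    // Returns the fraction of rows to request so that roughly n rows are expected.
    // Capped at 1.0 when n exceeds the row count; 0.0 for an empty dataset
    // (assumption: avoids the 0/0 division in the original function).
    static double fractionFor(long count, int n) {
        if (count <= 0 || n <= 0) {
            return 0.0;
        }
        return Math.min(1.0, (double) n / count);
    }

    public static void main(String[] args) {
        // With a real Dataset<Row> named dataset, the usage would look like:
        //   double fraction = SampleFraction.fractionFor(dataset.count(), 1000);
        //   Dataset<Row> demo = dataset.sample(false, fraction).limit(1000);
        System.out.println(fractionFor(2000, 100));
        System.out.println(fractionFor(500, 1000));
    }
}
```

As with the Scala version, the sampled result may hold slightly fewer rows than requested, so do not rely on getting exactly n.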
Edit: I see the takeSample method in another answer. But remember:
dataset.rdd.takeSample(false, 1000, System.currentTimeMillis())
takeSample collects the sampled rows to the driver and returns them as a local array, not a DataFrame.
I would prefer this in pyspark:
df.sample(withReplacement=False, fraction=desired_fraction)
Here is doc