
Is there a way to take the first 1000 rows of a Spark Dataframe?

I am using the randomSplit function to get a small amount of a dataframe for dev purposes, and I end up just taking the first df returned by this function.

val df_subset = data.randomSplit(Array(0.00000001, 0.01), seed = 12345)(0) 

If I use df.take(1000) then I end up with an array of rows, not a dataframe, so that won't work for me.

Is there a better, simpler way to take, say, the first 1000 rows of the df and store it as another df?

asked Dec 10 '15 by Michael Discenza

People also ask

How do you show more than 20 rows in PySpark?

By default, Spark with Scala, Java, or Python (PySpark) fetches only 20 rows from DataFrame show(), and each column value is truncated to 20 characters. To display more than 20 rows, or full column values, pass arguments to the show() method.
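
For example, a minimal sketch in Scala, assuming a DataFrame named df is already in scope:

df.show(100, truncate = false)  // show 100 rows with full (untruncated) column values
df.show(50, 30)                 // show 50 rows, truncating each column to 30 characters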

What does First () do in Spark?

In Spark, the first() function returns the first element of the dataset. It is similar to take(1).
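
A quick illustration in Scala, again assuming a DataFrame named df:

val firstRow = df.first()  // a single Row
val firstArr = df.take(1)  // an Array[Row] of length 1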


1 Answer

The method you are looking for is .limit.

Returns a new Dataset by taking the first n rows. The difference between this function and head is that head returns an array while limit returns a new Dataset.

Example usage:

df.limit(1000) 
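
To keep the result as its own DataFrame, as the question asks, just assign it. A minimal sketch, assuming the original DataFrame is named data as in the question:

val df_subset = data.limit(1000)  // a new DataFrame of at most 1000 rows

Note that limit is a transformation, so it is evaluated lazily and nothing is collected to the driver until an action runs; that is what makes it preferable to take(1000) here.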
answered Sep 24 '22 by Markon