I am using the randomSplit function to get a small sample of a dataframe for dev purposes, and I end up just taking the first df returned by this function.
val df_subset = data.randomSplit(Array(0.00000001, 0.01), seed = 12345)(0)
If I use df.take(1000), I end up with an array of Rows, not a dataframe, so that won't work for me.
Is there a better, simpler way to take, say, the first 1000 rows of the df and store it as another df?
By default, Spark with Scala, Java, or Python (PySpark) fetches only 20 rows when you call show() on a DataFrame, and column values are truncated to 20 characters. To display more than 20 rows, or full column values, you need to pass arguments to the show() method.
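For example, a minimal sketch (df here stands for any existing DataFrame):

df.show(100, truncate = false)  // print 100 rows without truncating column values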
In Spark, the first() function returns the first element of the dataset. It is similar to take(1), except that take(1) returns an array containing that element.
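To illustrate the difference in return types (df again assumed to be an existing DataFrame):

val firstRow = df.first()  // a single Row
val firstArr = df.take(1)  // an Array[Row] containing one Row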
The method you are looking for is .limit.
Returns a new Dataset by taking the first n rows. The difference between this function and head is that head returns an array while limit returns a new Dataset.
Example usage:
df.limit(1000)
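A slightly fuller sketch of how this fits the original goal (the spark session and the parquet path here are assumptions for illustration):

val data = spark.read.parquet("/path/to/data")  // hypothetical input
val df_subset = data.limit(1000)                // still a DataFrame, not an Array[Row]
df_subset.count()                               // at most 1000

Since limit returns a Dataset, df_subset can be cached or transformed further like any other DataFrame.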