I want to access the first 100 rows of a Spark DataFrame and write the result back to a CSV file.

Why is take(100) basically instant, whereas

    df.limit(100)
      .repartition(1)
      .write
      .mode(SaveMode.Overwrite)
      .option("header", true)
      .option("delimiter", ";")
      .csv("myPath")

takes forever? I do not want to obtain the first 100 records per partition, but just any 100 records. Why is take() so much faster than limit()?
Although this has already been answered, I want to share what I learned.

    myDataFrame.take(10)

returns an Array of Rows. This is an action: it actually collects the data to the driver (like collect does).

    myDataFrame.limit(10)

returns a new DataFrame. This is a transformation: it only describes the computation and does not collect any data by itself.

As for why limit then takes longer here: most likely, take(n) runs a short job that scans only as many partitions as it needs (starting with one and scaling up until n rows are found), whereas the limit(100).repartition(1).write(...) pipeline executes the full query plan, applying the limit per partition and then shuffling the results into a single partition, so it can end up touching every partition of the source. In any case, this is just a basic answer to what the difference between take and limit is.
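The action-versus-transformation distinction can be illustrated without Spark at all. Below is a minimal plain-Scala analogy using lazy collection views; the object and method names are illustrative, not Spark APIs:

```scala
object TakeVsLimitAnalogy {
  // Counts how many elements have actually been computed,
  // so we can observe when evaluation happens.
  var evaluated = 0
  private def slow(i: Int): Int = { evaluated += 1; i * 2 }

  def run(): List[Int] = {
    evaluated = 0
    val data = 1 to 1000

    // "limit"-style: a lazy transformation — builds a description
    // of the computation, but computes nothing yet.
    val limited = data.view.map(slow).take(10)
    require(evaluated == 0, "the view is still lazy here")

    // "take"-style: an action — forcing the view computes
    // exactly the 10 elements that are needed, not all 1000.
    val taken = limited.toList
    require(evaluated == 10, "only the first 10 elements were evaluated")
    taken
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Like limit, building the view costs nothing; like take, forcing it does only as much work as the requested number of elements demands.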