 

Spark: access first n rows - take vs limit

I want to access the first 100 rows of a Spark DataFrame and write the result back to a CSV file.

Why is take(100) basically instant, whereas

df.limit(100)
  .repartition(1)
  .write
  .mode(SaveMode.Overwrite)
  .option("header", true)
  .option("delimiter", ";")
  .csv("myPath")

takes forever. I do not want to obtain the first 100 records per partition, but just any 100 records.

Why is take() so much faster than limit()?

asked Oct 19 '17 by Georg Heiler

People also ask

How do you use limit on Spark?

The LIMIT clause is used to constrain the number of rows returned by the SELECT statement. In general, this clause is used in conjunction with ORDER BY to ensure that the results are deterministic.
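For example, a minimal Scala sketch (the view name people and the ordering column name are illustrative assumptions, not taken from the question):

// Register the DataFrame as a SQL-queryable temporary view.
df.createOrReplaceTempView("people")

// ORDER BY makes it deterministic which 100 rows LIMIT returns.
val top100 = spark.sql("SELECT * FROM people ORDER BY name LIMIT 100")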

How can I show more than 20 rows in Spark?

By default, Spark with Scala, Java, or Python (PySpark) fetches only 20 rows with DataFrame show(), and column values are truncated to 20 characters. To display more than 20 rows, or full column values, from a Spark/PySpark DataFrame, you need to pass arguments to the show() method.
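For example, in Scala, where df is any DataFrame:

// Display up to 100 rows and keep full column values (no 20-character truncation).
df.show(100, truncate = false)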


1 Answer

Although it is already answered, I want to share what I learned.

myDataFrame.take(10) 

-> results in an Array of Rows. This is an action: it collects the data to the driver (like collect does).

myDataFrame.limit(10) 

-> results in a new DataFrame. This is a transformation: it does not collect any data by itself, and nothing is computed until an action is called on the result.
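A minimal sketch of that difference, assuming df is any DataFrame:

import org.apache.spark.sql.{DataFrame, Row}

val rows: Array[Row] = df.take(10)      // action: runs a Spark job and returns rows to the driver
val limited: DataFrame = df.limit(10)   // transformation: only builds a new lazy query plan
limited.explain()                       // inspect the plan; nothing has been executed yet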

I do not have a full explanation for why limit then takes longer, but that may have been answered above. This is just a basic answer to the difference between take and limit.
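As far as I understand it, take(n) is executed as an incremental collect-limit job that stops scanning partitions once n rows have been found, while the limit(n) write plan still schedules tasks over every input partition before shuffling down to one. A practical consequence: if the 100 rows comfortably fit on the driver, you can collect them with take and write them back as a single-partition DataFrame. A sketch, reusing spark, df, and myPath from the question:

import org.apache.spark.sql.SaveMode

// take(100) collects at most 100 rows to the driver quickly; wrapping them in a
// new single-partition DataFrame lets the write skip the repartition shuffle.
val first100 = spark.createDataFrame(
  spark.sparkContext.parallelize(df.take(100).toSeq, numSlices = 1),
  df.schema
)

first100.write
  .mode(SaveMode.Overwrite)
  .option("header", true)
  .option("delimiter", ";")
  .csv("myPath")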

answered Oct 04 '22 by Kaspatoo