I want to access the first 100 rows of a Spark DataFrame and write the result back to a CSV file.

Why is take(100) basically instant, whereas

    df.limit(100)
      .repartition(1)
      .write
      .mode(SaveMode.Overwrite)
      .option("header", true)
      .option("delimiter", ";")
      .csv("myPath")

takes forever? I do not want to obtain the first 100 records per partition, but just any 100 records. Why is take() so much faster than limit()?
Although this has already been answered, I want to share what I learned.

    myDataFrame.take(10)

returns an Array of Rows. This is an action: it actually collects the data to the driver (like collect does).

    myDataFrame.limit(10)

returns a new DataFrame. This is a transformation: it only describes the computation and does not collect any data by itself.

As for why limit then takes longer here: most likely, take(n) runs a short job that scans only as many partitions as it needs (starting with one and scaling up until n rows are found), whereas the limit(100).repartition(1).write(...) pipeline executes the full query plan, applying the limit per partition and then shuffling the results into a single partition, so it can end up touching every partition of the source. In any case, this is just a basic answer to what the difference between take and limit is.
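The action-versus-transformation distinction can be illustrated without Spark at all. Below is a minimal plain-Scala analogy using lazy collection views; the object and method names are illustrative, not Spark APIs:

```scala
object TakeVsLimitAnalogy {
  // Counts how many elements have actually been computed,
  // so we can observe when evaluation happens.
  var evaluated = 0
  private def slow(i: Int): Int = { evaluated += 1; i * 2 }

  def run(): List[Int] = {
    evaluated = 0
    val data = 1 to 1000

    // "limit"-style: a lazy transformation — builds a description
    // of the computation, but computes nothing yet.
    val limited = data.view.map(slow).take(10)
    require(evaluated == 0, "the view is still lazy here")

    // "take"-style: an action — forcing the view computes
    // exactly the 10 elements that are needed, not all 1000.
    val taken = limited.toList
    require(evaluated == 10, "only the first 10 elements were evaluated")
    taken
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Like limit, building the view costs nothing; like take, forcing it does only as much work as the requested number of elements demands.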