I have the following code:
val df_in = sqlContext.read.json(jsonFile) // the file resides in HDFS
//some operations in here to create df from df_in with two more columns "terms1" and "terms2"
val intersectUDF = udf( (seq1: Seq[String], seq2: Seq[String]) => seq1 intersect seq2 ) // intersects two sequences
val symmDiffUDF = udf( (seq1: Seq[String], seq2: Seq[String]) => (seq1 diff seq2) ++ (seq2 diff seq1) ) // computes the symmetric difference of two sequences
val df1 = (df.withColumn("termsInt", intersectUDF(df("terms1"), df("terms2")))
             .withColumn("termsDiff", symmDiffUDF(df("terms1"), df("terms2")))
             .where( size(col("termsInt")) > 0 && size(col("termsDiff")) > 0 && size(col("termsDiff")) <= 2 )
             .cache()
          ) // add the intersection and difference columns and filter the resulting DF
df1.show()
df1.count()
The app works properly and fast up to the show(), but the count() step creates 40,000 tasks. My understanding is that df1.show() should trigger the full creation of df1, so df1.count() should then be very fast. What am I missing here? Why is count() that slow?
Thank you very much in advance, Roxana
Some typical actions are: count() — computes the number of rows in the dataset represented by the DataFrame; show() — prints 20 records from the dataset to the screen; collect() — returns all the records in the dataset to the driver program.
Actions are RDD operations that return a value back to the Spark driver program, which kicks off a job to execute on the cluster. A transformation's output is an input to actions. reduce, collect, takeSample, take, first, saveAsTextFile, saveAsSequenceFile, countByKey, and foreach are common actions in Apache Spark.
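A minimal sketch of that distinction (assuming a local SparkSession named spark): transformations like map and filter only record lineage, while an action like count actually submits a job:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("lazy-demo").getOrCreate()

val rdd = spark.sparkContext.parallelize(1 to 1000)
val transformed = rdd.map(_ * 2).filter(_ % 4 == 0) // transformations: nothing runs yet

val n = transformed.count() // action: a job is submitted to the cluster here
```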
One exception is schema inference: when you ask Spark to infer the schema, that scan can take a long time, so under the hood Spark already launches a job to do the inferring. It's that simple. This belongs to the optimization and performance aspect and cannot be classified as an action or a transformation.
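One way to avoid that extra inference job is to supply the schema explicitly when reading the JSON. A sketch against the asker's file (the field names and types here are assumptions; adjust them to your data):

```scala
import org.apache.spark.sql.types._

// Hypothetical schema matching the JSON file.
val schema = StructType(Seq(
  StructField("terms1", ArrayType(StringType)),
  StructField("terms2", ArrayType(StringType))
))

// With an explicit schema, no job is launched at read time to infer it.
val df_in = sqlContext.read.schema(schema).json(jsonFile)
```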
show is indeed an action, but it is smart enough to know when it doesn't have to run everything. If you had an orderBy it would take very long too, but in this case all your operations are map operations, so there's no need to compute the whole final table. However, count needs to physically go through the whole table in order to count it, and that's why it's taking so long. You could test what I'm saying by adding an orderBy to df1's definition; then it should take long.
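The suggested test could look like this (a sketch against the asker's pipeline; the sort column is just an illustration):

```scala
import org.apache.spark.sql.functions.col

// A sort introduces a total ordering, so show() can no longer stop early:
val df1Sorted = df1.orderBy(col("terms1"))
df1Sorted.show() // now show() also has to materialize the whole table
```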
EDIT: Also, the 40k tasks are likely due to the number of partitions your DataFrame is split into. Try using df1.repartition(<a sensible number here, depending on cluster and DF size>) and then run count again.
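For example (the partition count of 200 is only an assumption; tune it to your cluster and data size):

```scala
// Fewer partitions means fewer tasks scheduled for the count job.
val df2 = df1.repartition(200).cache()
df2.count()
```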
show() by default displays only 20 rows, so if the first partitions return enough rows, the remaining partitions are not executed. Note that show has several variations: show(n) prints the first n rows, and a large n forces more (possibly all) partitions to be executed and may take more time, while show(false) only disables column truncation and still prints 20 rows. So show() equals show(20), which is a partial action.
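The common variants from the standard DataFrame API look like:

```scala
df1.show()          // same as show(20): first 20 rows, long values truncated
df1.show(100)       // first 100 rows; may need to execute more partitions
df1.show(20, false) // 20 rows, but without truncating long cell values
```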