How to get the number of records written (using DataFrameWriter's save operation)?

Tags: scala, apache-spark, apache-spark-sql

Is there any way to get the number of records written when using Spark to save records? While I know it isn't in the spec currently, I'd like to be able to do something like:

val count = df.write.csv(path)

Alternatively, being able to do an inline count of the results of a step (preferably without just using a standard accumulator) would be (almost) as effective, e.g.:

dataset.countTo(count_var).filter({function}).countTo(filtered_count_var).collect()

Any ideas?

asked May 12 '17 09:05

Loki

1 Answer

I'd use a SparkListener, which can intercept onTaskEnd or onStageCompleted events and gives you access to task metrics.

Task metrics give you the accumulators Spark uses to display metrics in the SQL tab of the web UI (in Details for Query).

web UI / Details for Query

For example, the following query:

spark.
  read.
  option("header", true).
  csv("../datasets/people.csv").
  limit(10).
  write.
  csv("people")

writes exactly 10 output rows, so Spark knows the count (and you could read it too).
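A minimal sketch of the SparkListener approach, assuming the write populates the task-level output metrics (this can depend on the Spark version and the data source). The class name RecordsWrittenListener and the output path "people-count-demo" are hypothetical, chosen only for illustration:

```scala
import java.util.concurrent.atomic.AtomicLong

import org.apache.spark.Success
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical helper (not part of Spark): adds up the records written
// by every successfully finished task it observes.
class RecordsWrittenListener extends SparkListener {
  private val total = new AtomicLong(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // Count successful tasks only; taskMetrics may be null for some task ends.
    if (taskEnd.reason == Success) {
      Option(taskEnd.taskMetrics).foreach { metrics =>
        total.addAndGet(metrics.outputMetrics.recordsWritten)
      }
    }
  }

  // Running total across all tasks seen so far.
  def recordsWritten: Long = total.get()
}

// Usage in spark-shell, where `spark` is already defined:
//   val listener = new RecordsWrittenListener
//   spark.sparkContext.addSparkListener(listener)
//   spark.range(10).write.mode("overwrite").csv("people-count-demo")
//   listener.recordsWritten
```

Listener events are delivered asynchronously, so the total may lag slightly behind the end of the write. The listener also sees every job in the application, so the count only reflects a single write if nothing else writes output in the meantime.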

You could also explore Spark SQL's QueryExecutionListener:

The interface of query execution listener that can be used to analyze execution metrics.

You can register a QueryExecutionListener using ExecutionListenerManager that's available as spark.listenerManager.

scala> :type spark.listenerManager
org.apache.spark.sql.util.ExecutionListenerManager

scala> spark.listenerManager.
clear   clone   register   unregister

I think it's closer to the "bare metal", but I haven't used it before.
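A rough sketch of what such a listener could look like, under the assumption that the executed plan of the write exposes a "numOutputRows" metric at its top-level node. Whether it does depends on the Spark version and the kind of query, so the lookup returns an Option. The class name OutputRowsListener is hypothetical:

```scala
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Hypothetical helper (not part of Spark): remembers the "numOutputRows"
// metric of the most recent successful action, if the plan exposes one.
class OutputRowsListener extends QueryExecutionListener {
  @volatile private var last: Option[Long] = None

  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    // Assumption: the metric sits on the top-level physical node.
    last = qe.executedPlan.metrics.get("numOutputRows").map(_.value)
  }

  // Failures are ignored in this sketch.
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()

  def lastOutputRows: Option[Long] = last
}

// Usage in spark-shell, where `spark` is already defined:
//   val listener = new OutputRowsListener
//   spark.listenerManager.register(listener)
//   spark.range(10).write.mode("overwrite").csv("people-count-demo")
//   listener.lastOutputRows
```

If lastOutputRows comes back as None, the metric is probably attached to a different node of the plan in your Spark version, and the plan would need to be inspected to find it.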

@D3V (in the comments section) mentioned accessing the numOutputRows SQL metric through the QueryExecution of a structured query. That's worth considering.

scala> :type q
org.apache.spark.sql.DataFrame

scala> :type q.queryExecution.executedPlan.metrics
Map[String,org.apache.spark.sql.execution.metric.SQLMetric]

q.queryExecution.executedPlan.metrics("numOutputRows").value

answered Oct 21 '22 15:10

Jacek Laskowski
