I'm wondering if there is a way to know the number of lines written by a Spark save operation. I know I could simply do a count on the RDD before writing it, but I'd like to know if there is a way to get the same information without that extra pass.
Thank you, Marco
If you really want to, you can add a custom listener and extract the number of written rows from outputMetrics. A very simple example can look like this:
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

var recordsWrittenCount = 0L

sc.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    synchronized {
      // Accumulate the records written by each finished task
      recordsWrittenCount += taskEnd.taskMetrics.outputMetrics.recordsWritten
    }
  }
})
sc.parallelize(1 to 10, 2).saveAsTextFile("/tmp/foobar")
recordsWrittenCount
// Long = 10
Note, however, that this part of the API is intended for internal usage, so it may change between Spark versions without warning.
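If you need this in more than one place, the pattern above can be wrapped in a small helper that registers the listener, runs the write action, and then deregisters it with sc.removeSparkListener (available since Spark 2.2). This is only a sketch; countRecordsWritten is a hypothetical helper name, not a Spark API:

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical helper: runs a write action and returns the number of
// records reported as written by the tasks it triggered.
def countRecordsWritten(sc: SparkContext)(writeAction: => Unit): Long = {
  var count = 0L
  val listener = new SparkListener() {
    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
      synchronized { count += taskEnd.taskMetrics.outputMetrics.recordsWritten }
  }
  sc.addSparkListener(listener)
  try writeAction
  finally sc.removeSparkListener(listener) // avoid leaking the listener
  count
}

// Usage:
// val written = countRecordsWritten(sc) {
//   sc.parallelize(1 to 10, 2).saveAsTextFile("/tmp/foobar")
// }

One caveat: listener events are delivered asynchronously on the listener bus, so in rare cases the count may be read before the last events arrive.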