
How to find Spark RDD/DataFrame size?

I know how to find a file size in Scala, but how do I find the size of an RDD/DataFrame in Spark?

Scala:

object Main extends App {
  val file = new java.io.File("hdfs://localhost:9000/samplefile.txt").toString()
  println(file.length)
}

Spark:

val distFile = sc.textFile(file)
println(distFile.length)

But when I process it in Spark, I don't get the file size. How do I find the size of an RDD?

asked Jan 26 '16 by Venu A Positive

People also ask

How do I know my Spark data frame size?

Similar to pandas in Python, you can get the size and shape of a PySpark (Spark with Python) DataFrame by running the count() action to get the number of rows and len(df.columns) to get the number of columns.
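
A minimal Scala equivalent of that row/column check (a sketch, assuming a DataFrame named df already exists in an active SparkSession):

// Assumes an existing DataFrame `df`.
val numRows = df.count()          // action: number of rows
val numCols = df.columns.length   // number of columns
println(s"shape: ($numRows, $numCols)")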

How do I check my PySpark partition size?

Similarly, in PySpark (Spark with Python) you can get the current number of partitions by calling getNumPartitions() on the RDD class; to use it with a DataFrame, first convert the DataFrame to an RDD.
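
The same check in Scala looks roughly like this (again assuming an existing DataFrame df), going through the DataFrame's underlying RDD:

// Assumes an existing DataFrame `df`.
val numPartitions = df.rdd.getNumPartitions
println(s"partitions: $numPartitions")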

How do I know how many partitions I have in RDD?

(a) If the parent RDD has a partitioner on the aggregation key(s), then the number of partitions in the aggregated RDD equals the number of partitions in the parent RDD. (b) If the parent RDD does not have a partitioner, then the number of partitions in the aggregated RDD is taken from 'spark.default.parallelism'.
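
A small Scala sketch of both cases, assuming an existing SparkContext named sc; the result of case (b) depends on your spark.default.parallelism setting:

import org.apache.spark.HashPartitioner

// Assumes an existing SparkContext `sc`.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)

// (a) Parent already partitioned on the key: the aggregation keeps 8 partitions.
val prePartitioned = pairs.partitionBy(new HashPartitioner(8))
println(prePartitioned.reduceByKey(_ + _).getNumPartitions)   // 8

// (b) No partitioner on the parent: the partition count comes from the
// default parallelism (spark.default.parallelism when it is set).
println(pairs.reduceByKey(_ + _).getNumPartitions)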


1 Answer

If you are simply looking to count the number of rows in the RDD, do:

val distFile = sc.textFile(file)
println(distFile.count)

If you are interested in the bytes, you can use the SizeEstimator:

import org.apache.spark.util.SizeEstimator
println(SizeEstimator.estimate(distFile))

https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html
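
A complementary way to see how many bytes an RDD occupies once it is actually materialized is to cache it and read the storage info reported by the SparkContext. A minimal sketch (not part of the original answer), assuming the RDD fits in memory:

// Minimal sketch: cache the RDD, force materialization, then read storage info.
distFile.cache()
distFile.count()   // any action that materializes the cached data

sc.getRDDStorageInfo
  .filter(_.id == distFile.id)
  .foreach(info => println(s"in-memory size: ${info.memSize} bytes"))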

answered Sep 20 '22 by Glennie Helles Sindholt