I know how to find the file size in Scala, but how do I find the size of an RDD/DataFrame in Spark?
Scala:
    object Main extends App {
      val file = new java.io.File("hdfs://localhost:9000/samplefile.txt").toString()
      println(file.length)
    }
Spark:
    val distFile = sc.textFile(file)
    println(distFile.length)
But when I process the file this way, I don't get its size. How do I find the size of an RDD?
Similar to pandas in Python, you can get the size and shape of a PySpark (Spark with Python) DataFrame by running the count() action to get the number of rows and len(df.columns) to get the number of columns.
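Since the question is about Scala, here is a minimal sketch of the same idea in Scala, assuming an active SparkSession named spark and reading the question's text file into a DataFrame (the variable names are illustrative):

    // Assumes an active SparkSession `spark`; the path is the one from the question.
    val df = spark.read.text("hdfs://localhost:9000/samplefile.txt")
    val numRows = df.count()          // number of rows (an action, triggers a job)
    val numCols = df.columns.length   // number of columns
    println(s"shape: ($numRows, $numCols)")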
Similarly, in PySpark you can get the current number of partitions by calling getNumPartitions() on the RDD class; to use it with a DataFrame, you first need to go through its underlying RDD.
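In Scala the same idea looks roughly like this, assuming the DataFrame df from the sketch above; a DataFrame has no getNumPartitions of its own, so you go through its underlying RDD:

    // Number of partitions backing the DataFrame `df` (df.rdd exposes the underlying RDD).
    println(df.rdd.getNumPartitions)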
(a) If the parent RDD has a partitioner on the aggregation key(s), then the number of partitions in the aggregated RDD is equal to the number of partitions in the parent RDD. (b) If the parent RDD does not have a partitioner, then the number of partitions in the aggregated RDD is taken from 'spark.default.parallelism' if it is set, otherwise from the parent RDD's partition count.
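A minimal sketch of case (a), assuming a SparkContext sc and a hypothetical pairs RDD; reduceByKey reuses the parent's partitioner, so the aggregated RDD keeps the same partition count:

    import org.apache.spark.HashPartitioner

    // Parent RDD with an explicit partitioner on the key.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val partitioned = pairs.partitionBy(new HashPartitioner(4))
    val aggregated = partitioned.reduceByKey(_ + _)

    println(partitioned.getNumPartitions)  // 4
    println(aggregated.getNumPartitions)   // 4 -- inherits the parent's partitioner
    // Without a parent partitioner (case b), the count would come from
    // spark.default.parallelism if set, otherwise from the parent's partition count.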
If you are simply looking to count the number of rows in the RDD, do:
    val distFile = sc.textFile(file)
    println(distFile.count)
If you are interested in the size in bytes, you can use SizeEstimator:
    import org.apache.spark.util.SizeEstimator
    println(SizeEstimator.estimate(distFile))
https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html
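Putting it together, a minimal end-to-end sketch assuming a running SparkContext sc and the file path from the question; note that, per the linked docs, SizeEstimator estimates the bytes an object takes up on the JVM heap, which is not the same as the file's size on disk:

    import org.apache.spark.util.SizeEstimator

    val distFile = sc.textFile("hdfs://localhost:9000/samplefile.txt")
    println(distFile.count)                    // number of lines in the file
    println(SizeEstimator.estimate(distFile))  // estimated heap size of the object, in bytes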