
How to find Spark RDD/DataFrame size?

I know how to find a file size in Scala, but how do I find the size of an RDD/DataFrame in Spark?

Scala:

object Main extends App {
  val file = new java.io.File("hdfs://localhost:9000/samplefile.txt").toString()
  println(file.length)
}

Spark:

val distFile = sc.textFile(file)
println(distFile.length)

But when I process it in Spark, I don't get the file size. How do I find the size of an RDD?

asked Jan 26 '16 by Venu A Positive

People also ask

How do I know my Spark data frame size?

Similar to pandas in Python, you can get the size and shape of a PySpark (Spark with Python) DataFrame by running the count() action to get the number of rows and len(df.columns) to get the number of columns.
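
A minimal Scala equivalent of that row/column check (a sketch, assuming a DataFrame named df already exists in an active SparkSession):

// Assumes an existing DataFrame `df`.
val numRows = df.count()          // action: number of rows
val numCols = df.columns.length   // number of columns
println(s"shape: ($numRows, $numCols)")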

How do I check my PySpark partition size?

Similarly, in PySpark (Spark with Python) you can get the current number of partitions by calling getNumPartitions() on the RDD class; to use it with a DataFrame, first convert the DataFrame to an RDD.
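
The same check in Scala looks roughly like this (again assuming an existing DataFrame df), going through the DataFrame's underlying RDD:

// Assumes an existing DataFrame `df`.
val numPartitions = df.rdd.getNumPartitions
println(s"partitions: $numPartitions")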

How do I know how many partitions I have in RDD?

(a) If the parent RDD has a partitioner on the aggregation key(s), then the number of partitions in the aggregated RDD equals the number of partitions in the parent RDD. (b) If the parent RDD does not have a partitioner, then the number of partitions in the aggregated RDD is taken from 'spark.default.parallelism'.
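
A small Scala sketch of both cases, assuming an existing SparkContext named sc; the result of case (b) depends on your spark.default.parallelism setting:

import org.apache.spark.HashPartitioner

// Assumes an existing SparkContext `sc`.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)

// (a) Parent already partitioned on the key: the aggregation keeps 8 partitions.
val prePartitioned = pairs.partitionBy(new HashPartitioner(8))
println(prePartitioned.reduceByKey(_ + _).getNumPartitions)   // 8

// (b) No partitioner on the parent: the partition count comes from the
// default parallelism (spark.default.parallelism when it is set).
println(pairs.reduceByKey(_ + _).getNumPartitions)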


1 Answer

If you are simply looking to count the number of rows in the RDD, do:

val distFile = sc.textFile(file)
println(distFile.count)

If you are interested in the bytes, you can use the SizeEstimator:

import org.apache.spark.util.SizeEstimator
println(SizeEstimator.estimate(distFile))

https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html
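
A complementary way to see how many bytes an RDD occupies once it is actually materialized is to cache it and read the storage info reported by the SparkContext. A minimal sketch (not part of the original answer), assuming the RDD fits in memory:

// Minimal sketch: cache the RDD, force materialization, then read storage info.
distFile.cache()
distFile.count()   // any action that materializes the cached data

sc.getRDDStorageInfo
  .filter(_.id == distFile.id)
  .foreach(info => println(s"in-memory size: ${info.memSize} bytes"))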

answered Sep 20 '22 by Glennie Helles Sindholt