I am new to Apache Spark (version 1.4.1). I wrote a small piece of code to read a text file and store its data in an RDD.
Is there a way I can get the size of the data in the RDD?
This is my code:
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.util.SizeEstimator
import org.apache.spark.sql.Row
object RddSize {

  def main(args: Array[String]) {
    val sc = new SparkContext("local", "data size")
    val FILE_LOCATION = "src/main/resources/employees.csv"
    val peopleRdd = sc.textFile(FILE_LOCATION)
    val newRdd = peopleRdd.filter(str => str.contains(",M,"))
    // Here I want to find the size of the remaining data
  }
}
I want to get the size of the data before the filter transformation (peopleRdd) and after it (newRdd).
Similar to Python pandas, you can get the size and shape of a PySpark (Spark with Python) DataFrame by running the count() action to get the number of rows and len(df.columns) to get the number of columns.
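A rough Scala equivalent for the question's RDDs (count() is the direct analogue of a DataFrame row count; the column count below is an assumption that every line uses "," as the delimiter):

// Number of rows (lines) before and after the filter.
println(s"peopleRdd has ${peopleRdd.count()} rows")
println(s"newRdd has ${newRdd.count()} rows")

// Approximate number of columns for an RDD of CSV lines,
// taken from the first line (assumes a consistent delimiter).
val numColumns = peopleRdd.first().split(",").length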
collect(): returns a list that contains all of the elements in this RDD. This method should only be used if the resulting array is expected to be small, as all of the data is loaded into the driver's memory.
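For a small RDD this can be used directly on the driver; a minimal sketch against the question's peopleRdd:

// Only safe for small RDDs: every element is pulled into driver memory.
val lines: Array[String] = peopleRdd.collect()
println(s"collected ${lines.length} lines, ${lines.map(_.length).sum} characters in total")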
There are multiple ways to get the RDD size:
1. Add a Spark listener to your SparkContext:

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

sc.addSparkListener(new SparkListener() {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted) {
    val rddInfos = stageCompleted.stageInfo.rddInfos
    rddInfos.foreach(info => {
      println("rdd memSize " + info.memSize)
      println("rdd diskSize " + info.diskSize)
    })
  }
})
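One caveat (an assumption about how the listener reports sizes, not stated above): memSize and diskSize are only filled in for RDDs that are actually persisted, so a typical way to exercise the listener would be:

import org.apache.spark.storage.StorageLevel

// Cache the RDD, then run an action so a stage completes and the listener fires.
newRdd.persist(StorageLevel.MEMORY_AND_DISK)
newRdd.count()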
2. Save your RDD as a text file,
myRDD.saveAsTextFile("person.txt")
and then call the Apache Spark REST API:
/applications/[app-id]/stages
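A minimal sketch of querying that endpoint from Scala, assuming the default driver UI port 4040 and a placeholder application id (neither is given above):

import scala.io.Source

// <app-id> is a placeholder; the real id is shown in the Spark UI and driver logs.
val appId = "<app-id>"
val stagesJson = Source.fromURL(s"http://localhost:4040/api/v1/applications/$appId/stages").mkString
println(stagesJson)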
3. You can also try SizeEstimator:
val rddSize = SizeEstimator.estimate(myRDD)
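Note that SizeEstimator.estimate on the RDD reference measures the RDD object on the driver rather than the distributed data. A hedged sketch that estimates the data itself, assuming each partition can be materialized in executor memory:

import org.apache.spark.util.SizeEstimator

// Estimate the in-memory size of each partition's contents, then sum the results.
val estimatedBytes = peopleRdd
  .mapPartitions(iter => Iterator(SizeEstimator.estimate(iter.toArray)))
  .reduce(_ + _)
println(s"estimated in-memory size: $estimatedBytes bytes")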
I'm not sure you need to do this. You could cache the RDD and check the size in the Spark UI. But let's say that you do want to do this programmatically; here is a solution.
def calcRDDSize(rdd: RDD[String]): Long = {
  // map each line to its size in bytes (UTF-8 is the default encoding)
  rdd.map(_.getBytes("UTF-8").length.toLong)
     .reduce(_ + _) // add the sizes together
}
You can then call this function for your two RDDs:
println(s"peopleRdd is [${calcRDDSize(peopleRdd)}] bytes in size")
println(s"newRdd is [${calcRDDSize(newRdd)}] bytes in size")
This solution should work even if the file size is larger than the memory available in the cluster.