Memory-efficient way to union a sequence of RDDs from files in Apache Spark

I'm currently trying to train a set of Word2Vec vectors on the UMBC WebBase corpus (around 30 GB of text in 400 files).

I often run into out-of-memory errors, even on machines with 100 GB or more of memory. I run Spark embedded in the application itself. I tried some tuning, but I cannot get this operation to work on more than 10 GB of text. The clear bottleneck of my implementation is the union of the previously computed RDDs; that is where the out-of-memory exception comes from.

Maybe one of you has the experience to come up with a more memory-efficient implementation than this:

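import java.io.File
import java.util.Properties

import scala.collection.parallel.ParSeq
import scala.io.Source

import edu.stanford.nlp.pipeline.StanfordCoreNLP
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object SparkJobs {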
  val conf = new SparkConf()
    .setAppName("TestApp")
    .setMaster("local[*]")
    .set("spark.executor.memory", "100g")
    .set("spark.rdd.compress", "true")

  val sc = new SparkContext(conf)


  def trainBasedOnWebBaseFiles(path: String): Unit = {
    val folder: File = new File(path)

    val files: ParSeq[File] = folder.listFiles(new TxtFileFilter).toIndexedSeq.par


    var i = 0;
    val props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit");
    props.setProperty("nthreads","2")
    val pipeline = new StanfordCoreNLP(props);

    //preprocess files parallel
    val training_data_raw: ParSeq[RDD[Seq[String]]] = files.map(file => {
      //preprocess line of file
      // file.length() is the file size in bytes; getTotalSpace() would report the size of the whole partition
      println(file.getName() + "-" + file.length())
      val rdd_lines: Iterator[Option[Seq[String]]] = for (line <- Source.fromFile(file,"utf-8").getLines) yield {
          //performs some preprocessing like tokenization, stop word filtering etc.
          processWebBaseLine(pipeline, line)    
      }
      val filtered_rdd_lines = rdd_lines.filter(line => line.isDefined).map(line => line.get).toList
      println(s"File $i done")
      i = i + 1
      sc.parallelize(filtered_rdd_lines).persist(StorageLevel.MEMORY_ONLY_SER)

    })

    val rdd_file =  sc.union(training_data_raw.seq)

    val starttime = System.currentTimeMillis()
    println("Start Training")
    val word2vec = new Word2Vec()

    word2vec.setVectorSize(100)
    val model: Word2VecModel = word2vec.fit(rdd_file)

    println("Training time: " + (System.currentTimeMillis() - starttime))
    ModelUtil.storeWord2VecModel(model, Config.WORD2VEC_MODEL_PATH)  
  }
}

asked Feb 05 '15 11:02

dice89

1 Answer

As Sarvesh points out in the comments, this is probably too much data for a single machine. Use more machines. We typically need 20–30 GB of memory to work with a 1 GB file. By this (extremely rough) estimate, you'd need 600–900 GB of memory for the 30 GB input. (You can get a more accurate estimate by loading part of the data.)
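
The rule of thumb above is simple multiplication, and the "load part of the data" suggestion amounts to a linear extrapolation. A minimal sketch of both follows; the object and method names are made up for illustration, and the 12 GB sample measurement is a hypothetical number, not a result from the question. Multiplying out the rule of thumb gives roughly 600 to 900 GB for 30 GB of input.

```scala
object MemoryEstimate {
  // Rule of thumb quoted in the answer: 20 to 30 GB of memory per 1 GB of input.
  val LowFactor = 20
  val HighFactor = 30

  /** (low, high) estimated memory need in GB for `inputGb` GB of text. */
  def fromRuleOfThumb(inputGb: Int): (Int, Int) =
    (inputGb * LowFactor, inputGb * HighFactor)

  /** Linear extrapolation from a measured sample to the full input.
    * Assumes memory use grows proportionally with input size.
    */
  def fromSample(sampleInputGb: Double, sampleMemoryGb: Double, totalInputGb: Double): Double =
    sampleMemoryGb / sampleInputGb * totalInputGb

  def main(args: Array[String]): Unit = {
    val (low, high) = fromRuleOfThumb(30)
    println(s"Rule of thumb for 30 GB of input: $low to $high GB")

    // Hypothetical measurement: a 1 GB sample occupied 12 GB once loaded.
    val extrapolated = fromSample(1.0, 12.0, 30.0)
    println(f"Extrapolated from the sample: $extrapolated%.0f GB")
  }
}
```

The sample-based figure is only as good as the sample: if file sizes or line lengths vary a lot across the corpus, measuring several files gives a safer estimate.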

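As a more general comment, I'd suggest you avoid rdd.union and sc.parallelize. sc.parallelize requires the whole collection to sit in the driver's memory first, whereas sc.textFile reads the files lazily, partition by partition. Instead, use sc.textFile with a wildcard to load all files into a single RDD.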
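
A minimal sketch of that suggestion, under some assumptions: the object name TextFileWord2Vec is made up, tokenize is a simplified hypothetical stand-in for the question's processWebBaseLine (which is not shown), and the *.txt pattern is a guess based on the TxtFileFilter name. It uses the same MLlib Word2Vec class as the question.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.rdd.RDD

object TextFileWord2Vec {
  // Hypothetical stand-in for processWebBaseLine: lowercases the line and
  // splits it on whitespace. The real preprocessing would go here.
  def tokenize(line: String): Seq[String] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  def train(sc: SparkContext, dir: String): Word2VecModel = {
    // One RDD over every matching file; lines are read lazily per partition
    // instead of being collected into a List on the driver.
    val lines: RDD[String] = sc.textFile(s"$dir/*.txt")

    // Preprocessing runs inside Spark, so no per-file RDDs and no union.
    val sentences: RDD[Seq[String]] = lines.map(tokenize).filter(_.nonEmpty)

    new Word2Vec().setVectorSize(100).fit(sentences)
  }
}
```

With this shape, a call such as TextFileWord2Vec.train(sc, path) would replace the per-file loop, the sc.parallelize calls, and the sc.union call. Whether the result fits on one machine still depends on the memory estimate discussed above.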

answered Nov 01 '22 18:11

Daniel Darabos

Tags: scala, nlp, apache-spark, word2vec, bigdata
