How to set up Apache Spark to use the local hard disk when data does not fit in RAM in local mode?

I have a 50 GB dataset which doesn't fit in the 8 GB RAM of my work computer, but it has a 1 TB local hard disk.

The below link from the official documentation mentions that Spark can use the local hard disk if data doesn't fit in memory.

http://spark.apache.org/docs/latest/hardware-provisioning.html

Local Disks

While Spark can perform a lot of its computation in memory, it still uses local disks to store data that doesn’t fit in RAM, as well as to preserve intermediate output between stages.

For me, computation time is not a priority at all; fitting the data onto a single computer's RAM/hard disk for processing is what matters, due to a lack of alternative options.

Note: I am looking for a solution that doesn't involve any of the following:

  1. Increase the RAM
  2. Sample & reduce data size
  3. Use cloud or cluster computers

My end objective is to use Spark MLlib to build machine learning models. I am looking for real-life, practical solutions where people have successfully used Spark to operate on data that doesn't fit in RAM in standalone/local mode on a single computer. Has someone done this successfully without major limitations?

Questions

  1. SAS has a similar out-of-core processing capability, with which it can use both RAM and the local hard disk for model building, etc. Can Spark be made to work in the same way when the data is larger than the RAM?

  2. SAS persists the complete dataset to the hard disk in the ".sas7bdat" format; can Spark persist data to the hard disk in a similar way?

  3. If this is possible, how do I install and configure Spark for this purpose?
Asked May 17 '16 by GeorgeOfTheRF

1 Answer

Look at http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence. You can use the various persistence (storage) levels as per your need. MEMORY_AND_DISK is what will solve your problem: partitions that don't fit in memory are spilled to the local disk. If you want to fit more into memory, use MEMORY_AND_DISK_SER, which stores the data in serialized form.
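
A minimal sketch of how that can look in local mode (not the answerer's original code; Spark 2.x+ assumed, and the input path and scratch directory are made-up examples). spark.local.dir points Spark's spill/shuffle scratch space at the 1 TB disk, and MEMORY_AND_DISK lets cached partitions that don't fit in RAM fall back to that disk:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object OutOfCoreExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")                                  // single machine, all cores
          .appName("out-of-core-example")
          .config("spark.local.dir", "/mnt/bigdisk/spark-tmp") // scratch space on the 1 TB disk (assumed path)
          .getOrCreate()

        // Read the 50 GB dataset lazily; Spark processes it partition by partition.
        val df = spark.read.option("header", "true").csv("/mnt/bigdisk/data/huge.csv")

        // Cache with spill-to-disk: partitions that don't fit in memory go to the local disk.
        // Swap in StorageLevel.MEMORY_AND_DISK_SER to keep cached partitions serialized
        // (more compact in memory, at some extra CPU cost).
        df.persist(StorageLevel.MEMORY_AND_DISK)

        println(df.count()) // materializes and caches the data

        spark.stop()
      }
    }

Give the driver as much heap as the machine allows when launching, e.g. spark-submit --driver-memory 6g; in local mode the driver JVM also does the executor's work, so that setting is the memory Spark works within before spilling to disk.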

Answered Oct 24 '22 by Preeti Khurana