I have a 50 GB dataset which doesn't fit in the 8 GB of RAM on my work computer, but that machine has a 1 TB local hard disk.
The link below, from the official documentation, mentions that Spark can use the local hard disk if the data doesn't fit in memory.
http://spark.apache.org/docs/latest/hardware-provisioning.html
Local Disks
While Spark can perform a lot of its computation in memory, it still uses local disks to store data that doesn’t fit in RAM, as well as to preserve intermediate output between stages.
For me, computation time is not a priority at all; what matters is fitting the data onto a single computer's RAM/hard disk for processing, because I have no alternative options.
Note: I am looking for a solution that does not involve the items listed below.
My end objective is to use Spark MLlib to build machine learning models. I am looking for real-life, practical solutions where people have successfully used Spark to operate on data that doesn't fit in RAM, in standalone/local mode on a single computer. Has anyone done this successfully without major limitations?
Questions
SAS has a similar out-of-core processing capability, with which it can use both RAM and the local hard disk for model building, etc. Can Spark be made to work in the same way when the data is larger than the RAM size?
SAS persists the complete dataset to the hard disk in its ".sas7bdat" format; can Spark persist data to the hard disk in a similar way?
Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
While Spark can perform a lot of its computation in memory, it still uses local disks to store data that doesn't fit in RAM, as well as to preserve intermediate output between stages.
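As a concrete illustration, here is a minimal Scala sketch of persisting the whole dataset to the local disk in Parquet format and reading it back, which is roughly the Spark counterpart of SAS writing a .sas7bdat file. The file paths and the CSV source are assumptions for illustration only, not something from your setup:

```scala
import org.apache.spark.sql.SparkSession

// Single-machine session; all local cores, no cluster required.
val spark = SparkSession.builder()
  .appName("persist-dataset")
  .master("local[*]")
  .getOrCreate()

// Spark reads the source in partitions, so the full 50 GB never has to
// be held in RAM at once (path and CSV options are hypothetical).
val df = spark.read
  .option("header", "true")
  .csv("/data/raw/my_dataset.csv")

// Write a persistent, columnar copy to the 1 TB local disk.
df.write.mode("overwrite").parquet("/data/parquet/my_dataset")

// Later runs can start from the Parquet copy instead of re-parsing the CSV.
val persisted = spark.read.parquet("/data/parquet/my_dataset")
```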
It's easy to run locally on one machine — all you need is to have Java installed on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation. Spark runs on Java 8/11/17, Scala 2.12/2.13, Python 3.7+ and R 3.5+. Java 8 prior to version 8u201 support is deprecated as of Spark 3.2.
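For the single-machine case in the question, a sketch of the local-mode setup might look like the following. The /data mount point is an assumption, and note that the driver heap size has to be set at launch time (for example with spark-submit --driver-memory 6g) rather than inside the program:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("single-machine-mllib")
  .master("local[*]")                           // run everything in one JVM
  // Point shuffle/spill files at the large local disk instead of /tmp.
  .config("spark.local.dir", "/data/spark-tmp") // hypothetical mount point
  .getOrCreate()
```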
Look at http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence. You can use various persistence levels as per your need. MEMORY_AND_DISK is what will solve your problem. If you want to save memory, use MEMORY_AND_DISK_SER, which stores the data in serialized form (more compact, at the cost of extra CPU to deserialize).
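A short sketch of what that could look like before handing the data to MLlib, reusing the SparkSession from the earlier snippet. The Parquet path, the "features"/"label" column names expected by the estimator, and the choice of LogisticRegression are assumptions for illustration:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.storage.StorageLevel

// Keep what fits in RAM; spill the remaining partitions to disk,
// stored in serialized (more compact) form.
val training = spark.read.parquet("/data/parquet/training")
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

// MLlib estimators work on the persisted DataFrame like any other;
// this one expects a "features" vector column and a "label" column.
val model = new LogisticRegression()
  .setMaxIter(10)
  .fit(training)

training.unpersist()
```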