Spark out of memory

I have a folder with 150 GB of text files (around 700 files, on average 200 MB each).

I'm using Scala to process the files and calculate some aggregate statistics at the end. I see two possible approaches:

  • manually loop through all the files, do the calculations per file, and merge the results at the end
  • read the whole folder into one RDD, do all the operations on this single RDD, and let Spark handle the parallelization

I'm leaning towards the second approach as it seems cleaner (no need for parallelization-specific code), but I'm wondering whether my scenario will fit the constraints imposed by my hardware and data. I have one workstation with 16 threads and 64 GB of RAM available (so the parallelization will be strictly local, between different processor cores). I might scale the infrastructure out with more machines later on, but for now I would just like to focus on tuning the settings for this one-workstation scenario.

The code I'm using:

  • reads TSV files and extracts the meaningful data into (String, String, String) triplets
  • afterwards performs some filtering, mapping and grouping
  • finally reduces the data and calculates some aggregates
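
Roughly, that pipeline looks like this (the path, column positions, filter and final aggregate below are illustrative placeholders, not the real code):

    // Assumes an existing SparkContext `sc`
    val lines = sc.textFile("/data/tsv/*.txt")        // whole folder as one RDD

    val triplets = lines
      .map(_.split("\t"))
      .filter(_.length >= 3)                          // drop malformed rows
      .map(f => (f(0), f(1), f(2)))

    val stats = triplets
      .map { case (k, _, _) => (k, 1L) }              // e.g. a count per key
      .reduceByKey(_ + _)

    stats.saveAsTextFile("/data/output")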

I've been able to run this code with a single file (~200 MB of data); however, when adding more data I get a java.lang.OutOfMemoryError: GC overhead limit exceeded and/or a java.lang.OutOfMemoryError: Java heap space (the application breaks with 6 GB of data, but I would like to use it with 150 GB).

I guess I will have to tune some parameters to make this work. I would appreciate any tips on how to approach this problem (and how to debug memory demands). I've tried increasing spark.executor.memory and using a smaller number of cores (the rationale being that each core needs some heap space), but this didn't solve my problems.
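
For illustration, that kind of tuning looks roughly like this (the values are placeholders, not the ones actually used):

    import org.apache.spark.{SparkConf, SparkContext}

    // Fewer local threads, so each task gets a larger share of the heap.
    val conf = new SparkConf()
      .setAppName("TsvStats")
      .setMaster("local[8]")                 // 8 of the 16 available threads
      .set("spark.executor.memory", "8g")
    val sc = new SparkContext(conf)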

I don't need the solution to be very fast (it can easily run for a few hours, even days, if needed). I'm also not caching any data, just saving it to the file system at the end. If you think it would be more feasible to just go with the manual parallelization approach, I could do that as well.

asked Jul 04 '14 by Igor
1 Answer

My team and I have successfully processed over 1 TB of CSV data on 5 machines with 32 GB of RAM each. What you can get away with depends heavily on what kind of processing you're doing, and how.

  1. Repartitioning an RDD requires additional computation and has memory overhead on top of your heap, so instead try loading the files with more parallelism from the start by decreasing the split size through TextInputFormat.SPLIT_MINSIZE and TextInputFormat.SPLIT_MAXSIZE (if you're using TextInputFormat); see the sketch after this list.

  2. Try using mapPartitions instead of map so you can handle the computation inside a partition. If the computation uses a temporary variable or instance and you're still facing out-of-memory errors, try lowering the amount of data per partition (by increasing the partition count).

  3. Increase the driver and executor memory limits using "spark.driver.memory" and "spark.executor.memory" in the Spark configuration before creating the SparkContext.
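
Putting the three tips together, a minimal sketch (assuming the new-API TextInputFormat; the paths, split size and memory values are illustrative, not recommendations):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    // Tip 3: memory limits go into the configuration before the SparkContext exists.
    val conf = new SparkConf()
      .setAppName("TsvStats")
      .set("spark.executor.memory", "6g")
      .set("spark.driver.memory", "6g")
    val sc = new SparkContext(conf)

    // Tip 1: cap the split size (the property behind TextInputFormat.SPLIT_MAXSIZE)
    // so the input is read as many small partitions; 32 MB here is illustrative.
    sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize",
      (32 * 1024 * 1024).toString)
    val lines = sc
      .newAPIHadoopFile("/data/tsv/*.txt",
        classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
      .map(_._2.toString)

    // Tip 2: mapPartitions pays any per-partition setup cost once, not once per record.
    val parsed = lines.mapPartitions { rows =>
      rows.map(_.split("\t")).filter(_.length >= 3).map(f => (f(0), f(1), f(2)))
    }

One caveat on the single-machine case: in local mode the executors run inside the driver JVM, so spark.executor.memory set from code has little effect there; the heap has to be sized when the JVM starts (for example via spark-submit --driver-memory).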

Note that Spark is a general-purpose cluster computing system, so it's inefficient (IMHO) to use Spark on a single machine.

answered Oct 04 '22 by Averman