 

How does Spark read a large file (a petabyte) when the file cannot fit into Spark's main memory?

What will happen with large files in the following cases?

1) Spark gets the data's block locations from the NameNode. Will Spark stop at that point because, according to the NameNode's information, the data is too large?

2) Spark partitions the data according to the DataNode block size, but all of the data cannot be stored in main memory. We are not using StorageLevel here, so what will happen?

3) Spark partitions the data; some of it will be stored in main memory, and once that in-memory data has been processed, Spark will load the remaining data from disk.

asked Oct 09 '17 by Arpit Rai




2 Answers

First of all, Spark only starts reading in the data when an action (like count, collect or write) is called. Once an action is called, Spark loads the data in partitions - the number of concurrently loaded partitions depends on the number of cores you have available. So in Spark you can think of 1 partition = 1 core = 1 task. Note that all concurrently loaded partitions have to fit into memory, or you will get an OOM.
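
A minimal sketch of that laziness (the SparkSession setup and the HDFS path are assumptions, not from the question):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("lazy-read-sketch")
  .getOrCreate()

// Nothing is read yet: textFile only records the lineage.
val lines = spark.sparkContext.textFile("hdfs:///data/huge-file.txt")

// Still nothing is read: filter is a lazy transformation.
val errors = lines.filter(_.contains("ERROR"))

// Only now does Spark start loading partitions, as many at a time
// as there are available cores (1 partition = 1 core = 1 task).
val numErrors = errors.count()
```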

Assuming that you have several stages, Spark then runs the transformations from the first stage on the loaded partitions only. Once it has applied the transformations to the data in the loaded partitions, it stores the output as shuffle data and then reads in more partitions. It then applies the transformations to these partitions, stores the output as shuffle data, reads in more partitions, and so forth until all data has been read.
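
A hypothetical word count illustrating such a stage boundary (the paths are made up): everything before reduceByKey runs in the first stage on the loaded partitions, and the shuffle data it writes out feeds the second stage.

```scala
spark.sparkContext
  .textFile("hdfs:///data/huge-file.txt")    // stage 1: read partitions
  .flatMap(_.split("\\s+"))                  // stage 1: transform them
  .map(word => (word, 1L))
  .reduceByKey(_ + _)                        // shuffle boundary -> stage 2
  .saveAsTextFile("hdfs:///out/word-counts") // action that triggers the job
```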

If you apply no transformations but only run an action like count, Spark will still read the data in partitions, but it will not store any data in your cluster, and if you run the count again it will read in all the data once more. To avoid reading in data several times, you can call cache or persist, in which case Spark will try to store the data in your cluster. With cache (which is the same as persist(StorageLevel.MEMORY_ONLY)) it will store all partitions in memory - if they don't fit in memory you risk an OOM. If you call persist(StorageLevel.MEMORY_AND_DISK) it will store as much as it can in memory and put the rest on disk. If the data doesn't fit on disk either, the OS will usually kill your workers.
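
A sketch of the difference (the path is hypothetical, and spark is assumed to be an existing SparkSession):

```scala
import org.apache.spark.storage.StorageLevel

val logs = spark.sparkContext.textFile("hdfs:///data/huge-file.txt")

// cache() would be shorthand for persist(StorageLevel.MEMORY_ONLY):
// logs.cache()

// MEMORY_AND_DISK keeps what fits in memory and spills the rest to disk.
logs.persist(StorageLevel.MEMORY_AND_DISK)

logs.count() // first action: reads the file and populates the cache
logs.count() // second action: served from memory/disk, no full re-read
```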

Note that Spark has its own little memory management system. Some of the memory that you assign to your Spark job is used to hold the data being worked on (execution memory), and some of the memory is used for storage if you call cache or persist.
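
For reference, a sketch of how that split is configured under Spark's unified memory manager (the values shown are the documented defaults, not tuning advice):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-config-sketch")
  // fraction of (heap - 300 MB) shared by execution and storage
  .config("spark.memory.fraction", "0.6")
  // share of that region protected for cached (storage) data
  .config("spark.memory.storageFraction", "0.5")
  .getOrCreate()
```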

I hope this explanation helps :)

answered Sep 20 '22 by Glennie Helles Sindholt


This is quoted directly from the Apache Spark FAQ (FAQ | Apache Spark):

Does my data need to fit in memory to use Spark?

No. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.

In Apache Spark, if the data does not fit into memory, Spark simply spills that data to disk.

The persist method in Apache Spark provides the following storage levels for persisting the data:

MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER (Java and Scala), MEMORY_AND_DISK_SER (Java and Scala), DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, OFF_HEAP.

The OFF_HEAP storage level is still experimental.
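
A hedged sketch of choosing among those levels for an RDD (the path is hypothetical, and spark is an existing SparkSession):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = spark.sparkContext.textFile("hdfs:///data/huge-file.txt")

rdd.persist(StorageLevel.MEMORY_AND_DISK)      // spill to disk when memory is full
// rdd.persist(StorageLevel.DISK_ONLY)         // keep partitions on disk only
// rdd.persist(StorageLevel.MEMORY_AND_DISK_2) // same, replicated on two nodes

rdd.count()     // materializes the data under the chosen level
rdd.unpersist() // releases the cached copies when no longer needed
```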

answered Sep 20 '22 by Swadeshi