 

How does Spark read a large file (a petabyte) when the file cannot fit into Spark's main memory?

What will happen with large files in the following cases?

1) Spark gets the data's block locations from the NameNode. Will Spark stop at that point because, according to the NameNode's information, the data is too large?

2) Spark partitions the data according to the DataNode block size, but all of the data cannot be stored in main memory. We are not using StorageLevel here, so what will happen?

3) Spark partitions the data; some of it will be stored in main memory, and once that in-memory data has been processed, Spark will load the remaining data from disk.

asked Oct 09 '17 by Arpit Rai




2 Answers

First of all, Spark only starts reading in the data when an action (like count, collect or write) is called. Once an action is called, Spark loads the data in partitions - the number of concurrently loaded partitions depends on the number of cores you have available. So in Spark you can think of 1 partition = 1 core = 1 task. Note that all concurrently loaded partitions have to fit into memory, or you will get an OOM.
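
A minimal sketch of that laziness (the SparkSession setup and the HDFS path are assumptions, not from the question):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("lazy-read-sketch")
  .getOrCreate()

// Nothing is read yet: textFile only records the lineage.
val lines = spark.sparkContext.textFile("hdfs:///data/huge-file.txt")

// Still nothing is read: filter is a lazy transformation.
val errors = lines.filter(_.contains("ERROR"))

// Only now does Spark start loading partitions, as many at a time
// as there are available cores (1 partition = 1 core = 1 task).
val numErrors = errors.count()
```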

Assuming that you have several stages, Spark then runs the transformations from the first stage on the loaded partitions only. Once it has applied the transformations to the data in the loaded partitions, it stores the output as shuffle data and then reads in more partitions. It then applies the transformations to these partitions, stores the output as shuffle data, reads in more partitions, and so forth until all data has been read.
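
A hypothetical word count illustrating such a stage boundary (the paths are made up): everything before reduceByKey runs in the first stage on the loaded partitions, and the shuffle data it writes out feeds the second stage.

```scala
spark.sparkContext
  .textFile("hdfs:///data/huge-file.txt")    // stage 1: read partitions
  .flatMap(_.split("\\s+"))                  // stage 1: transform them
  .map(word => (word, 1L))
  .reduceByKey(_ + _)                        // shuffle boundary -> stage 2
  .saveAsTextFile("hdfs:///out/word-counts") // action that triggers the job
```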

If you apply no transformations but only run an action like count, Spark will still read the data in partitions, but it will not store any data in your cluster, and if you run the count again it will read in all the data once more. To avoid reading in data several times, you can call cache or persist, in which case Spark will try to store the data in your cluster. With cache (which is the same as persist(StorageLevel.MEMORY_ONLY)) it will store all partitions in memory - if they don't fit in memory you risk an OOM. If you call persist(StorageLevel.MEMORY_AND_DISK) it will store as much as it can in memory and put the rest on disk. If the data doesn't fit on disk either, the OS will usually kill your workers.
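
A sketch of the difference (the path is hypothetical, and spark is assumed to be an existing SparkSession):

```scala
import org.apache.spark.storage.StorageLevel

val logs = spark.sparkContext.textFile("hdfs:///data/huge-file.txt")

// cache() would be shorthand for persist(StorageLevel.MEMORY_ONLY):
// logs.cache()

// MEMORY_AND_DISK keeps what fits in memory and spills the rest to disk.
logs.persist(StorageLevel.MEMORY_AND_DISK)

logs.count() // first action: reads the file and populates the cache
logs.count() // second action: served from memory/disk, no full re-read
```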

Note that Spark has its own little memory management system. Some of the memory that you assign to your Spark job is used to hold the data being worked on (execution memory), and some of the memory is used for storage if you call cache or persist.
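
For reference, a sketch of how that split is configured under Spark's unified memory manager (the values shown are the documented defaults, not tuning advice):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-config-sketch")
  // fraction of (heap - 300 MB) shared by execution and storage
  .config("spark.memory.fraction", "0.6")
  // share of that region protected for cached (storage) data
  .config("spark.memory.storageFraction", "0.5")
  .getOrCreate()
```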

I hope this explanation helps :)

answered Sep 20 '22 by Glennie Helles Sindholt


This is quoted directly from the Apache Spark FAQ (FAQ | Apache Spark):

Does my data need to fit in memory to use Spark?

No. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.

In Apache Spark, if the data does not fit into memory, Spark simply spills that data to disk.

The persist method in Apache Spark provides the following storage levels for persisting the data:

MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER (Java and Scala), MEMORY_AND_DISK_SER (Java and Scala), DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, OFF_HEAP.

The OFF_HEAP storage level is still experimental.
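
A hedged sketch of choosing among those levels for an RDD (the path is hypothetical, and spark is an existing SparkSession):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = spark.sparkContext.textFile("hdfs:///data/huge-file.txt")

rdd.persist(StorageLevel.MEMORY_AND_DISK)      // spill to disk when memory is full
// rdd.persist(StorageLevel.DISK_ONLY)         // keep partitions on disk only
// rdd.persist(StorageLevel.MEMORY_AND_DISK_2) // same, replicated on two nodes

rdd.count()     // materializes the data under the chosen level
rdd.unpersist() // releases the cached copies when no longer needed
```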

answered Sep 20 '22 by Swadeshi