 

Spark master memory requirements related to data size

Tags:

apache-spark

Are Spark master memory requirements related to the size of the processed data?

The Spark driver and Spark workers/executors deal with processed data directly (and execute application code), so their memory needs can be linked to the size of the processed data. But is the Spark master in any way affected by the data size? It seems to me that it isn't, because it just manages the Spark workers and doesn't work with the data itself directly.

Markus Miller asked Mar 07 '17


People also ask

How much memory do I need for Spark?

Memory. In general, Spark can run well with anywhere from 8 GiB to hundreds of gigabytes of memory per machine. In all cases, we recommend allocating at most 75% of the memory to Spark; leave the rest for the operating system and buffer cache.

How large data can Spark handle?

How large a cluster can Spark scale to? Many organizations run Spark on clusters of thousands of nodes. The largest cluster we know of has 8000 nodes. In terms of data size, Spark has been shown to work well up to petabytes.

What happens if data do not fit in-memory in Spark?

Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
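As an illustration of the spill behavior described above, here is a minimal PySpark sketch; the dataset path and the choice of storage level are assumptions, not part of the original page:

    # Partitions that do not fit in memory are spilled to disk with
    # MEMORY_AND_DISK; with MEMORY_ONLY they would be recomputed instead.
    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("storage-level-demo").getOrCreate()

    df = spark.read.parquet("/data/events")   # hypothetical input dataset
    df.persist(StorageLevel.MEMORY_AND_DISK)  # the default used by DataFrame.cache()
    df.count()                                # an action, to materialize the cache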

How is Spark executor memory determined?

According to the recommendations discussed above (the figures assume a 10-node cluster with 64 GB of RAM per node, 150 usable cores in total, and 5 cores per executor): number of available executors = total cores / cores per executor = 150/5 = 30. Leaving 1 executor for the ApplicationMaster gives --num-executors = 29. Number of executors per node = 30/10 = 3. Memory per executor = 64 GB / 3 ≈ 21 GB.
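As a rough sketch of how such figures could be applied when building a session (the exact values are cluster-specific assumptions, and the 19g heap leaves room for the off-heap executor memory overhead on top of the ~21 GB budget):

    # Hypothetical session configuration based on the sizing example above.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("executor-sizing-demo")
        .config("spark.executor.instances", "29")  # 30 executors minus 1 for the ApplicationMaster
        .config("spark.executor.cores", "5")
        .config("spark.executor.memory", "19g")    # ~21 GB per executor minus memory overhead
        .getOrCreate()
    )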


1 Answer

Spark's main data entities, like DataFrames or Datasets, are built on RDDs (Resilient Distributed Datasets). They are distributed, which means the processing generally takes place in the executors.

Some RDD actions will end with data on the driver process, though; most notably collect and the actions that use it (like show, take, or toPandas if you are using Python). collect, as the name implies, will collect some or all of the rows of the distributed dataset and materialize them in the driver process. At this point, yes, you will need to take the memory footprint of your data into account.
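To make this concrete, a small sketch (assuming an existing DataFrame df) of the actions that move rows to the driver:

    rows = df.collect()    # materializes every row in driver memory
    sample = df.take(10)   # ships only 10 rows to the driver
    df.show(20)            # fetches just the rows it prints
    pdf = df.toPandas()    # like collect(), but builds a pandas DataFrame on the driver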

This is why you will generally want to reduce the data you collect as much as possible. You can apply groupBy, filter, and many other transformations so that, if you do need to process the data in the driver, it is as refined as possible.
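A sketch of that pattern, with made-up column names: the filtering and aggregation run on the executors, and only the small summary is collected.

    from pyspark.sql import functions as F

    summary = (
        df.filter(F.col("status") == "error")
          .groupBy("service")
          .agg(F.count("*").alias("error_count"))
          .collect()  # only one row per service reaches the driver
    )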

Manu Valdés answered Oct 01 '22