Spark throws a stack overflow error when unioning a lot of RDDs

1 Answer

Use SparkContext.union(...) instead to union many RDDs at once.

You don't want to union them one at a time, since each RDD.union() call creates a new step in the lineage (an extra set of stack frames on any computation). With enough RDDs, that chain gets deep enough to overflow the stack. SparkContext.union() combines them all in a single step, which avoids the stack overflow error.
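As a rough illustration of the difference, here is a minimal PySpark sketch. The helper names union_chained and union_all are made up for this example, and sc is assumed to be an existing SparkContext. The empty-list guard is my own assumption about what a caller would want, not something the original answer covers.

```python
from functools import reduce


def union_chained(rdds):
    """Anti-pattern: fold RDD.union() pairwise over the inputs."""
    # Each call wraps the previous result in another union, so the
    # lineage depth grows with the number of input RDDs.
    return reduce(lambda acc, rdd: acc.union(rdd), rdds)


def union_all(sc, rdds):
    """Preferred: a single SparkContext.union() call over the whole list."""
    rdds = list(rdds)
    if not rdds:
        # Assumption: return an empty RDD rather than fail on an empty list.
        return sc.emptyRDD()
    # One union over all inputs, so the lineage gains a single step.
    return sc.union(rdds)
```

With a live context, something like union_all(sc, [sc.parallelize([i]) for i in range(1000)]) builds the combined RDD in one step, whereas union_chained on the same list nests one union per input.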

181

answered Nov 18 '22 10:11

Sean Owen

Tags: apache-spark, rdd

worldterminator
