 

run reduceByKey on huge data in spark

Tags:

apache-spark

I'm running reduceByKey in Spark. My program is the simplest Spark example:

val counts = textFile.flatMap(line => line.split(" "))
                     .repartition(20000)
                     .map(word => (word, 1))
                     .reduceByKey(_ + _, 10000)
counts.saveAsTextFile("hdfs://...")

but it always runs out of memory...

I'm using 50 servers, 35 executors per server, and 140 GB of memory per server.

The data volume is 8 TB of documents: 20 billion documents and about 1,000 billion words in total. After the reduce there will be roughly 100 million distinct words.

How should I configure Spark for this job?

In particular, what values should these parameters have?

1. The number of map partitions? 20000, for example?
2. The number of reduce partitions? 10000, for example?
3. Any other parameters?
asked Jun 30 '15 16:06 by user2848932


People also ask

Can we use reduceByKey on a Spark DataFrame?

To use reduceByKey with a DataFrame, the data is first parallelized/converted to a pair RDD and then reducebykey() is applied, after which the output can be displayed. With reduceByKey, unnecessary data transfer over the network does not happen; the shuffle occurs in a controlled way because values are combined within each partition first.
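For illustration, a minimal sketch of that pattern (the SparkSession setup, column name, and sample data here are assumptions, not from the snippet above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("reduceByKeyOnDF").getOrCreate()
import spark.implicits._

// Assumed sample data: one word per row.
val df = Seq("spark", "hadoop", "spark").toDF("word")

// reduceByKey is an RDD operation, so drop down to the DataFrame's underlying RDD.
val counts = df.rdd
  .map(row => (row.getString(0), 1))
  .reduceByKey(_ + _)

counts.collect().foreach(println)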

Why is groupByKey less efficient than reduceByKey on large datasets?

groupByKey can cause out-of-disk problems, because every value is sent over the network and collected on the reducer workers; you can see the sketch below. With reduceByKey, data are combined at each partition, so only one output per key per partition is sent over the network.
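As a minimal sketch of the difference (sample data assumed, with sc being the SparkContext as in the question's code):

val pairs = sc.parallelize(Seq("a", "b", "a", "c", "a")).map(w => (w, 1))

// groupByKey: every (word, 1) pair is shuffled across the network,
// and the values are only summed afterwards on the reducer side.
val viaGroupByKey = pairs.groupByKey().mapValues(_.sum)

// reduceByKey: values are pre-combined within each partition (map-side combine),
// so only one partial sum per key per partition crosses the network.
val viaReduceByKey = pairs.reduceByKey(_ + _)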

Why is groupByKey better than reduceByKey?

Both reduceByKey and groupByKey result in wide transformations, which means both trigger a shuffle operation. The key difference between reduceByKey and groupByKey is that reduceByKey does a map-side combine and groupByKey does not.

How many RDDs can cogroup() work on at once?

cogroup() can be used for much more than just implementing joins. We can also use it to implement intersect by key. Additionally, cogroup() can work on three or more RDDs at once.
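A minimal sketch with three RDDs (sample data assumed, sc being the SparkContext):

val a = sc.parallelize(Seq((1, "a1"), (2, "a2")))
val b = sc.parallelize(Seq((1, "b1"), (3, "b3")))
val c = sc.parallelize(Seq((1, "c1"), (2, "c2")))

// Groups the values of all three RDDs by key in a single pass.
// Result type: RDD[(Int, (Iterable[String], Iterable[String], Iterable[String]))]
val grouped = a.cogroup(b, c)
grouped.collect().foreach(println)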


1 Answer

It would be helpful if you posted the logs, but one option would be to specify a larger number of partitions when reading in the initial text file (e.g. sc.textFile(path, 200000)) rather than repartitioning after reading. Another important thing is to make sure that your input file is splittable (some compression options make it not splittable, and in that case Spark may have to read it on a single machine, causing OOMs).
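A sketch of that change against the code in the question (the path is left as the question's placeholder and the partition counts are only examples):

// Ask for more input partitions up front instead of repartitioning afterwards.
// The second argument to textFile is a minimum number of partitions, and it only
// helps if the input format is splittable.
val textFile = sc.textFile("hdfs://...", 200000)

val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _, 10000)
counts.saveAsTextFile("hdfs://...")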

Another option, since you aren't caching any of the data, is to reduce the amount of memory Spark sets aside for caching (controlled with spark.storage.memoryFraction). Also, since you are only working with tuples of strings, I'd recommend using the org.apache.spark.serializer.KryoSerializer serializer.
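A minimal configuration sketch for those two suggestions (the 0.1 value and app name are illustrative assumptions, not recommendations from the answer):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("wordCount")
  // Nothing is cached, so shrink the storage fraction (0.6 by default in Spark 1.x)
  // and leave more memory for the shuffle. 0.1 is an assumed value.
  .set("spark.storage.memoryFraction", "0.1")
  // Kryo serialization is more compact and faster than Java serialization
  // for simple (String, Int) tuples.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)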

answered Oct 19 '22 15:10 by Holden