I've got a big RDD (1 GB) in a YARN cluster. On the local machine that uses this cluster I have only 512 MB of memory. I'd like to iterate over the values of the RDD on my local machine. I can't use collect(), because it would create an array locally that is bigger than my heap. I need some iterative way. There is the iterator() method, but it requires additional information that I can't provide.
UPD: settled on the toLocalIterator method.
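For reference, here is a minimal runnable sketch of that approach (the data and partition count are made up for illustration); toLocalIterator pulls one partition at a time to the driver, so peak local memory is bounded by the largest partition rather than by the whole RDD:

import org.apache.spark.{SparkConf, SparkContext}

object ToLocalIteratorDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("toLocalIterator-demo").setMaster("local[2]"))
    // Many small partitions keep the per-partition footprint low on the driver
    val rdd = sc.parallelize(1 to 1000000, numSlices = 100)
    rdd.toLocalIterator.foreach { value =>
      // Process each value locally, e.g. append it to a file
      if (value % 100000 == 0) println(value)
    }
    sc.stop()
  }
}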
RDD – the RDD API is the slowest of the three at simple operations like grouping and aggregating data. DataFrame – the DataFrame API is easy to use and performs aggregation faster than both RDDs and Datasets, which makes it well suited for exploratory analysis and for computing aggregated statistics on large data sets. Dataset – Datasets perform aggregation faster than RDDs, but a bit slower than DataFrames.
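To make the comparison concrete, here is a hedged sketch (the column names and sample data are invented) of the same count-by-key written against the RDD API and the DataFrame API:

import org.apache.spark.sql.SparkSession

object GroupingComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("grouping-demo").master("local[2]").getOrCreate()
    import spark.implicits._

    val pairs = Seq(("a", 1), ("b", 2), ("a", 3))

    // RDD API: manual key-value handling, no query optimizer involved
    val rddCounts = spark.sparkContext.parallelize(pairs)
      .map { case (key, _) => (key, 1L) }
      .reduceByKey(_ + _)

    // DataFrame API: declarative grouping, planned by the Catalyst optimizer
    val dfCounts = pairs.toDF("key", "value").groupBy("key").count()

    rddCounts.collect().foreach(println)
    dfCounts.show()
    spark.stop()
  }
}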
You can save an RDD using the saveAsObjectFile and saveAsTextFile methods, and read it back using the textFile and sequenceFile functions on SparkContext.
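A small sketch of those calls (the /tmp paths are placeholders); note that output written with saveAsObjectFile can also be read back with SparkContext.objectFile, which restores the original element type:

import org.apache.spark.{SparkConf, SparkContext}

object SaveAndReadRdd {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("save-read-demo").setMaster("local[2]"))
    val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // Save as plain text and as serialized objects (paths are placeholders)
    rdd.saveAsTextFile("/tmp/demo-text")
    rdd.saveAsObjectFile("/tmp/demo-objects")

    // Read back: textFile yields strings, objectFile restores the element type
    val fromText = sc.textFile("/tmp/demo-text")
    val fromObjects = sc.objectFile[Int]("/tmp/demo-objects")

    println(s"text: ${fromText.count()}, objects: ${fromObjects.count()}")
    sc.stop()
  }
}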
In terms of data size, Spark has been shown to work well up to petabytes. It has been used to sort 100 TB of data 3X faster than Hadoop MapReduce on 1/10th of the machines, winning the 2014 Daytona GraySort Benchmark, as well as to sort 1 PB.
Update: the RDD.toLocalIterator method, which appeared after the original answer was written, is a more efficient way to do the job. It uses runJob to evaluate only a single partition at each step.
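One caveat worth knowing: since toLocalIterator fetches each partition with a separate job, an RDD produced by expensive transformations is best cached first, otherwise its lineage may be recomputed once per partition. A short sketch, where rdd stands for the large RDD from the question:

rdd.cache() // avoid recomputing the lineage for every partition
rdd.toLocalIterator.foreach { value =>
  println(value) // handle one value at a time on the driver
}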
TL;DR: use toLocalIterator. The original answer below gives a rough idea of how it works:
First of all, get the array of the RDD's partitions:
val parts = rdd.partitions
Then create smaller RDDs by filtering out everything but a single partition. Collect the data from each smaller RDD and iterate over the values of one partition at a time:
for (p <- parts) {
  val idx = p.index
  // The second argument is true to avoid reshuffling the RDD
  val partRdd = rdd.mapPartitionsWithIndex(
    (index, it) => if (index == idx) it else Iterator(),
    preservesPartitioning = true)
  // data contains all values from a single partition, in the form of an array
  val data = partRdd.collect()
  // Now you can do whatever you want with the data: iterate, save to a file, etc.
}
I didn't try this code, but it should work. Please write a comment if it won't compile. Of course, it will work only if the partitions are small enough. If they aren't, you can always increase the number of partitions with rdd.coalesce(numParts, true).