How to create Spark RDD from an iterator?

Does a source have to be re-readable (that is, readable more than once) for an RDD built on it to be resilient? In other words, since iterators are fundamentally read-once, is it even possible to create Resilient Distributed Datasets (RDDs) from iterators?

asked Jun 26 '15 12:06

Thamme Gowda

1 Answer

As somebody else said, you could do something with Spark Streaming, but with plain Spark you can't, because what you're asking goes against Spark's model. Let me explain. To distribute and parallelize work, Spark has to divide it into chunks. When reading from HDFS, that chunking is done for Spark by HDFS, since HDFS files are organized in blocks, and Spark will generally create one task per block. Iterators, however, only provide sequential access to your data, so Spark cannot organize it into chunks without reading all of it into memory.
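To make the chunking point concrete, here is a minimal sketch in plain Python (no Spark required). The helper name chunked and the batch size are hypothetical, chosen only for illustration. It shows that splitting a read-once iterator into chunks means pulling every element through memory, and that the iterator is exhausted afterwards:

```python
from itertools import islice

def chunked(iterator, size):
    """Yield lists of up to `size` items, consuming `iterator` sequentially.

    Hypothetical helper for illustration; not part of any Spark API.
    """
    it = iter(iterator)
    while True:
        batch = list(islice(it, size))  # each batch is buffered in memory
        if not batch:
            return
        yield batch

source = iter(range(10))             # stand-in for a read-once data source
batches = list(chunked(source, 4))   # every element has now been read
leftover = list(source)              # nothing is left to read a second time
```

Assuming a live SparkContext named sc, each buffered batch could then be passed to sc.parallelize(batch) and the resulting RDDs combined with sc.union(...). Note that all the data still flows through the driver, which is exactly the limitation described above.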

It may be possible to build an RDD that has a single iterable partition, but even then there is no way to know whether the implementation of the Iterable could be sent to the workers. When you use sc.parallelize(), Spark creates partitions that implement Serializable, so each partition can be sent to a different worker. The iterable could be backed by a network connection or by a file in the local filesystem, so it cannot be sent to the workers unless its contents are buffered in memory.
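The serialization problem can be illustrated by analogy in plain Python, using the standard pickle module as a stand-in for whatever serializer ships partitions to workers (an assumption for illustration; the JVM side of Spark uses its own serializers). The generator read_lines is a hypothetical substitute for an iterator over a connection or local file:

```python
import pickle

def read_lines():
    """Hypothetical stand-in for an iterator over a socket or local file."""
    yield "a"
    yield "b"
    yield "c"

live = read_lines()
try:
    pickle.dumps(live)       # a live generator holds execution state
    live_ok = True
except TypeError:
    live_ok = False          # it cannot be serialized and shipped elsewhere

buffered = list(read_lines())                     # materialize in memory first
restored = pickle.loads(pickle.dumps(buffered))   # plain data round-trips fine
```

The live iterator is tied to the process that created it, while the buffered list is just data and can be copied to another machine.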

answered Sep 18 '22 20:09

Roberto Congiu

Tags: apache-spark, spark-streaming
