 

How to get an Iterator of Rows from a DataFrame in Spark SQL

I have a Spark SQL application that returns a large number of rows that are very difficult to fit in memory, so I will not be able to use the collect function on the DataFrame. Is there a way I can get all these rows as an Iterator instead of the entire result as a list?

I am executing this Spark SQL application in yarn-client mode.

asked Oct 06 '15 by Sachin Janani


People also ask

What is the use of toLocalIterator in Spark?

The toLocalIterator method (documented for Microsoft.Spark.Sql) returns an iterator that contains all of the rows in the DataFrame. The iterator consumes only as much memory as the largest partition in the DataFrame; with prefetch enabled it may consume up to the memory of the two largest partitions.
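The wording above comes from the .NET (Microsoft.Spark) documentation, but in newer Spark versions (2.0+) the Scala Dataset API exposes the same method directly. A minimal sketch, assuming an existing SparkSession named spark; the DataFrame below is only an illustration:

import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical DataFrame; substitute the result of your own query.
val df: DataFrame = spark.range(0, 1000000).toDF("id")

// toLocalIterator fetches rows partition by partition, so the driver
// only needs to hold one partition's worth of rows at a time.
val it: java.util.Iterator[Row] = df.toLocalIterator()
while (it.hasNext) {
  val row = it.next()
  // process a single row here
}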

How do you iterate over rows in a DataFrame?

Iteration over rows using itertuples(): to iterate over rows, we apply the function itertuples(), which returns a tuple for each row in the DataFrame. The first element of the tuple is the row's index value, while the remaining elements are the row values.

What is a DataFrame in Spark?

A DataFrame is a collection of rows with a schema that is the result of executing a structured query (once it has been executed). A DataFrame uses the immutable, in-memory, resilient, distributed and parallel capabilities of RDDs, and applies a structure called a schema to the data. In Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row].
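Since the answer on this page is written in Scala, here is a minimal Scala sketch of that relationship; the data, column names and app name are made up for illustration:

import org.apache.spark.sql.{Dataset, Row, SparkSession}

val spark = SparkSession.builder().appName("schema-example").getOrCreate()
import spark.implicits._

// A DataFrame is rows plus a schema; in Spark 2.x it is literally a
// type alias for Dataset[Row], so the second assignment compiles as-is.
val df: org.apache.spark.sql.DataFrame = Seq((1, "a"), (2, "b")).toDF("id", "value")
val ds: Dataset[Row] = df

df.printSchema() // prints the schema applied to the underlying rows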

How to iterate through a DataFrame in PySpark?

Below are some examples of iterating through a DataFrame using foreach. If you have a small dataset, you can also convert the PySpark DataFrame to pandas and use pandas to iterate through it. Use the spark.sql.execution.arrow.enabled config to enable Apache Arrow with Spark.
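That question is about PySpark, but foreach behaves the same way in the Scala API used elsewhere on this page. A hedged sketch, with made-up data and column names:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("foreach-example").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// foreach runs on the executors, so nothing is collected into driver
// memory; any printed output appears in the executor logs, not locally.
df.foreach { row =>
  val id = row.getAs[Int]("id")
  val value = row.getAs[String]("value")
  // per-row side effects go here
}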


1 Answer

Generally speaking, transferring all the data to the driver is a pretty bad idea, and most of the time there is a better solution, but if you really want to do this you can use the toLocalIterator method on the underlying RDD:

val df: org.apache.spark.sql.DataFrame = ???
df.cache // Optional, to avoid repeated computation, see docs for details
val iter: Iterator[org.apache.spark.sql.Row] = df.rdd.toLocalIterator
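As a rough illustration of how that iterator might then be consumed (iter refers to the value defined just above; rows arrive partition by partition, so the driver only holds one partition's rows at a time):

// Pull rows one at a time on the driver and read values out of each Row,
// e.g. by position with get(0) or by name with getAs[T]("name").
iter.foreach { row =>
  val firstField = row.get(0)
  println(firstField)
}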
answered Sep 29 '22 by zero323