 

How to get an Iterator of Rows from a DataFrame in Spark SQL

I have a Spark SQL application that returns a large number of rows that are very difficult to fit in memory, so I will not be able to use the collect function on the DataFrame. Is there a way I can get all these rows as an Iterator instead of the entire result as a list?

I am executing this Spark SQL application in yarn-client mode.

asked Oct 06 '15 by Sachin Janani


People also ask

What is the use of toLocalIterator in Spark?

The toLocalIterator method (documented for Microsoft.Spark.Sql) returns an iterator that contains all of the rows in the DataFrame. The iterator consumes only as much memory as the largest partition in the DataFrame; with prefetch enabled it may consume up to the memory of the two largest partitions.
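The wording above comes from the .NET (Microsoft.Spark) documentation, but in newer Spark versions (2.0+) the Scala Dataset API exposes the same method directly. A minimal sketch, assuming an existing SparkSession named spark; the DataFrame below is only an illustration:

import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical DataFrame; substitute the result of your own query.
val df: DataFrame = spark.range(0, 1000000).toDF("id")

// toLocalIterator fetches rows partition by partition, so the driver
// only needs to hold one partition's worth of rows at a time.
val it: java.util.Iterator[Row] = df.toLocalIterator()
while (it.hasNext) {
  val row = it.next()
  // process a single row here
}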

How do you iterate over rows in a DataFrame?

Iteration over rows using itertuples(): to iterate over rows, we apply the function itertuples(), which returns a tuple for each row in the DataFrame. The first element of the tuple is the row's index value, while the remaining elements are the row values.

What is a DataFrame in Spark?

A DataFrame is a collection of rows with a schema that is the result of executing a structured query (once it has been executed). A DataFrame uses the immutable, in-memory, resilient, distributed and parallel capabilities of RDDs, and applies a structure called a schema to the data. In Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row].
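Since the answer on this page is written in Scala, here is a minimal Scala sketch of that relationship; the data, column names and app name are made up for illustration:

import org.apache.spark.sql.{Dataset, Row, SparkSession}

val spark = SparkSession.builder().appName("schema-example").getOrCreate()
import spark.implicits._

// A DataFrame is rows plus a schema; in Spark 2.x it is literally a
// type alias for Dataset[Row], so the second assignment compiles as-is.
val df: org.apache.spark.sql.DataFrame = Seq((1, "a"), (2, "b")).toDF("id", "value")
val ds: Dataset[Row] = df

df.printSchema() // prints the schema applied to the underlying rows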

How to iterate through a DataFrame in PySpark?

Below are some examples of iterating through a DataFrame using foreach. If you have a small dataset, you can also convert the PySpark DataFrame to pandas and use pandas to iterate through it. Use the spark.sql.execution.arrow.enabled config to enable Apache Arrow with Spark.
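That question is about PySpark, but foreach behaves the same way in the Scala API used elsewhere on this page. A hedged sketch, with made-up data and column names:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("foreach-example").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// foreach runs on the executors, so nothing is collected into driver
// memory; any printed output appears in the executor logs, not locally.
df.foreach { row =>
  val id = row.getAs[Int]("id")
  val value = row.getAs[String]("value")
  // per-row side effects go here
}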


1 Answer

Generally speaking, transferring all the data to the driver is a pretty bad idea, and most of the time there is a better solution, but if you really want to do this you can use the toLocalIterator method on the underlying RDD:

val df: org.apache.spark.sql.DataFrame = ???
df.cache // Optional, to avoid repeated computation, see docs for details
val iter: Iterator[org.apache.spark.sql.Row] = df.rdd.toLocalIterator
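As a rough illustration of how that iterator might then be consumed (iter refers to the value defined just above; rows arrive partition by partition, so the driver only holds one partition's rows at a time):

// Pull rows one at a time on the driver and read values out of each Row,
// e.g. by position with get(0) or by name with getAs[T]("name").
iter.foreach { row =>
  val firstField = row.get(0)
  println(firstField)
}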
answered Sep 29 '22 by zero323