
How to loop through each row of a DataFrame in PySpark

E.g.:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
sample = sqlContext.sql("select Name, age, city from user")
sample.show()

The above statement prints the entire table on the terminal, but I want to access each row in that table using a for or while loop to perform further calculations.

asked Apr 01 '16 by Arti Berde

People also ask

How do you iterate over rows and columns in a PySpark DataFrame?

iterrows() is a pandas method rather than a native PySpark one, so it is used by first converting the PySpark DataFrame with toPandas(). It can then be used with a for loop, yielding an (index, row) pair for each row of the DataFrame.
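A minimal sketch, assuming an active SparkSession named spark, pandas installed, and a DataFrame small enough to fit in driver memory:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('Alice', 30, 'Pune'), ('Bob', 25, 'Delhi')],
    ['Name', 'age', 'city']
)

# toPandas() pulls the whole DataFrame to the driver as a pandas DataFrame;
# iterrows() then yields (index, row) pairs
for index, row in df.toPandas().iterrows():
    print(index, row['Name'], row['age'])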

What does .collect do in PySpark?

collect() is an action on an RDD or DataFrame that retrieves the data from all partitions and brings it to the driver node/program. It is useful for retrieving every row of a result, but it loads the entire dataset into driver memory.
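A short illustration, reusing the small df built above:

# collect() materializes every row on the driver as a list of Row objects,
# so use it only when the result is small
for row in df.collect():
    print(row.Name, row.city)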

What is Rlike in PySpark?

rlike() is similar to like() but with regex (regular expression) support. It can be used in Spark SQL query expressions as well, and is similar to the regexp_like() function in SQL.
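For example, to keep only the rows whose city matches a regex (the column name and pattern here are assumptions for illustration):

# ^New anchors the regular expression at the start of the string
matched = df.filter(df.city.rlike('^New'))
matched.show()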


2 Answers

You simply cannot. DataFrames, like other distributed data structures, are not iterable and can only be accessed through dedicated higher-order functions and/or SQL methods.

You can of course collect:

for row in df.rdd.collect():
    do_something(row)

or convert to a local iterator with toLocalIterator:

for row in df.rdd.toLocalIterator():
    do_something(row)

and iterate locally as shown above, but it defeats the whole purpose of using Spark.
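If the per-row work is purely a side effect (logging, writing to an external store) and nothing needs to come back to the driver, one such higher-order function is foreach, which runs on the executors. A minimal sketch, not part of the original answer, assuming do_something is an ordinary Python function:

def do_something(row):
    # Runs on the executors, not the driver; foreach returns nothing,
    # so this is only useful for side effects
    print(row)

df.foreach(do_something)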

answered Sep 21 '22 by 2 revs, 2 users 70%


To "loop" and take advantage of Spark's parallel computation framework, you could define a custom function and use map.

def customFunction(row):
    return (row.name, row.age, row.city)

sample2 = sample.rdd.map(customFunction)

or

sample2 = sample.rdd.map(lambda x: (x.name, x.age, x.city)) 

The custom function would then be applied to every row of the DataFrame. Note that sample2 will be an RDD, not a DataFrame.
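If a DataFrame is needed again after the map, an RDD of tuples can be converted back with toDF, supplying column names. A short sketch, assuming the tuple order from the example above:

sample2_df = sample2.toDF(['name', 'age', 'city'])
sample2_df.show()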

map may be needed if you are going to perform more complex computations. If you just need to add a simple derived column, you can use withColumn, which returns a DataFrame.

sample3 = sample.withColumn('age2', sample.age + 2) 
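For conditional derived columns, the built-in functions in pyspark.sql.functions keep everything as DataFrame operations. A sketch, where the age_group column and the threshold of 18 are arbitrary illustrations:

from pyspark.sql import functions as F

# when/otherwise builds a conditional column without leaving the DataFrame API
sample4 = sample.withColumn(
    'age_group',
    F.when(F.col('age') >= 18, 'adult').otherwise('minor')
)
sample4.show()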
answered Sep 20 '22 by David