
How can I enumerate rows in groups with Spark/Python?

I'd like to enumerate grouped values just like with Pandas:

Enumerate each row for each group in a DataFrame

What is the way to do this in Spark/Python?

asked Mar 09 '16 by Gerenuk

People also ask

How do you select 10 rows in a PySpark DataFrame?

In Spark/PySpark you can use the show() action to print the top/first N (5, 10, 100, ...) rows of a DataFrame to the console or a log. There are also several Spark actions such as take(), tail(), collect(), head(), and first() that return the top or last n rows as a list of Row objects (Array[Row] in Scala).
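A rough sketch of those actions; the DataFrame contents here are made up for illustration, and an existing SparkSession spark is assumed:

df = spark.createDataFrame([(i,) for i in range(20)], ["value"])
df.show(10)               # prints the first 10 rows to the console
first_ten = df.take(10)   # first 10 rows as a list of Row objects
first_row = df.first()    # the very first Row
last_rows = df.tail(5)    # last 5 rows (Spark 3.0+)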

How does PySpark show grouped data?

Calling groupBy() on a PySpark DataFrame returns a GroupedData object, which exposes aggregate functions such as:

count() – returns the number of rows in each group.
mean() – returns the mean of the values in each group.
max() – returns the maximum of the values in each group.
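A minimal sketch, assuming a DataFrame df with a hypothetical grouping column some_column and a numeric column value:

df.groupBy("some_column").count().show()          # number of rows per group
df.groupBy("some_column").mean("value").show()    # mean of value per group
df.groupBy("some_column").max("value").show()     # maximum of value per group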


2 Answers

With the row_number window function:

from pyspark.sql.functions import row_number
from pyspark.sql import Window

# number the rows within each group, ordered inside the group
w = Window.partitionBy("some_column").orderBy("some_other_column")
df = df.withColumn("rn", row_number().over(w))
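A small usage sketch (the data and column names are made up for illustration); rn restarts at 1 for every some_column group:

df = spark.createDataFrame(
    [("a", 10), ("a", 20), ("b", 30)],
    ["some_column", "some_other_column"])
w = Window.partitionBy("some_column").orderBy("some_other_column")
df.withColumn("rn", row_number().over(w)).show()
# the rows of group "a" get rn = 1, 2; the row of group "b" gets rn = 1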
answered Sep 28 '22 by zero323


You can also achieve this at the RDD level with zipWithIndex():

rdd = sc.parallelize(['a', 'b', 'c'])
df = spark.createDataFrame(rdd.zipWithIndex())  # pairs each element with its index
df.show()

It will result in:

+---+---+
| _1| _2|
+---+---+
|  a|  0|
|  b|  1|
|  c|  2|
+---+---+

If you only need a unique ID, not a truly consecutive index, you may also use zipWithUniqueId(), which is more efficient since it is computed locally on each partition.
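A minimal sketch of the zipWithUniqueId() variant, using the same toy RDD:

rdd = sc.parallelize(['a', 'b', 'c'])
df = spark.createDataFrame(rdd.zipWithUniqueId())
df.show()
# the IDs are unique but not necessarily consecutive; they depend on how
# the elements are distributed across partitions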

answered Sep 28 '22 by Elior Malul