
How can I enumerate rows in groups with Spark/Python?

I'd like to enumerate grouped values just like with Pandas:

Enumerate each row for each group in a DataFrame

What is the way to do this in Spark/Python?

asked Mar 09 '16 by Gerenuk

People also ask

How do you select 10 rows in a PySpark DataFrame?

In Spark/PySpark you can use the show() action to print the top/first N (5, 10, 100, ...) rows of a DataFrame to the console or a log. There are also several Spark actions such as take(), tail(), collect(), head(), and first() that return the top or last n rows as a list of Row objects (Array[Row] in Scala).
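A rough sketch of those actions; the DataFrame contents here are made up for illustration, and an existing SparkSession spark is assumed:

df = spark.createDataFrame([(i,) for i in range(20)], ["value"])
df.show(10)               # prints the first 10 rows to the console
first_ten = df.take(10)   # first 10 rows as a list of Row objects
first_row = df.first()    # the very first Row
last_rows = df.tail(5)    # last 5 rows (Spark 3.0+)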

How does PySpark show grouped data?

Calling groupBy() on a PySpark DataFrame returns a GroupedData object, which exposes aggregate functions such as:

count() – returns the number of rows in each group.
mean() – returns the mean of the values in each group.
max() – returns the maximum of the values in each group.
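A minimal sketch, assuming a DataFrame df with a hypothetical grouping column some_column and a numeric column value:

df.groupBy("some_column").count().show()          # number of rows per group
df.groupBy("some_column").mean("value").show()    # mean of value per group
df.groupBy("some_column").max("value").show()     # maximum of value per group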


2 Answers

With the row_number window function:

from pyspark.sql.functions import row_number
from pyspark.sql import Window

# number the rows within each group, ordered inside the group
w = Window.partitionBy("some_column").orderBy("some_other_column")
df = df.withColumn("rn", row_number().over(w))
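A small usage sketch (the data and column names are made up for illustration); rn restarts at 1 for every some_column group:

df = spark.createDataFrame(
    [("a", 10), ("a", 20), ("b", 30)],
    ["some_column", "some_other_column"])
w = Window.partitionBy("some_column").orderBy("some_other_column")
df.withColumn("rn", row_number().over(w)).show()
# the rows of group "a" get rn = 1, 2; the row of group "b" gets rn = 1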
answered Sep 28 '22 by zero323


You can also achieve this at the RDD level with zipWithIndex():

rdd = sc.parallelize(['a', 'b', 'c'])
df = spark.createDataFrame(rdd.zipWithIndex())  # pairs each element with its index
df.show()

It will result in:

+---+---+
| _1| _2|
+---+---+
|  a|  0|
|  b|  1|
|  c|  2|
+---+---+

If you only need a unique ID, not a truly consecutive index, you may also use zipWithUniqueId(), which is more efficient since it is computed locally on each partition.
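A minimal sketch of the zipWithUniqueId() variant, using the same toy RDD:

rdd = sc.parallelize(['a', 'b', 'c'])
df = spark.createDataFrame(rdd.zipWithUniqueId())
df.show()
# the IDs are unique but not necessarily consecutive; they depend on how
# the elements are distributed across partitions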

answered Sep 28 '22 by Elior Malul