 

Pyspark add sequential and deterministic index to dataframe

I need to add an index column to a dataframe with three very simple constraints:

  • start from 0

  • be sequential

  • be deterministic

I'm sure I'm missing something obvious, because the examples I'm finding look very convoluted for such a simple task, or use non-sequential, non-deterministic, monotonically increasing IDs. I don't want to zip with index and then have to split apart the previously separate columns that end up combined into a single column, because my dataframes are in the terabytes and it just seems unnecessary. I don't need to partition by anything, nor order by anything, and the examples I'm finding do this (using window functions and row_number). All I need is a simple 0 to df.count sequence of integers. What am I missing here?


asked Sep 13 '18 by xv70


People also ask

How do I add a sequence number to a Spark DataFrame?

Adding sequential unique IDs to a Spark DataFrame is not very straightforward, especially given its distributed nature. You can do it with either zipWithIndex() or row_number() (depending on the amount and kind of data), but in every case there is a catch regarding performance.

How do you add a monotonically increasing ID in PySpark?

A column that generates monotonically increasing 64-bit integers. The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits.

Is monotonically_increasing_id deterministic?

The function is non-deterministic because its result depends on partition IDs.

What does .collect() do in PySpark?

collect() is an action on an RDD or DataFrame that retrieves its data. It gathers all the elements of every row from each partition and brings them back to the driver node/program.


1 Answer

What I mean is: how can I add a column with an ordered, monotonically increasing by 1 sequence 0:df.count? (from comments)

You can use row_number() here, but for that you'd need to specify an orderBy(). Since you don't have an ordering column, just order by monotonically_increasing_id().

from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql import Window

df = df.withColumn(
    "index",
    row_number().over(Window.orderBy(monotonically_increasing_id()))-1
)

Also, row_number() starts at 1, so you'd have to subtract 1 to have it start from 0. The last value will be df.count() - 1.


I don't want to zip with index and then have to separate the previously separated columns that are now in a single column

You can use zipWithIndex if you follow it with a call to map, to avoid having all of the separated columns turn into a single column:

cols = df.columns
df = df.rdd.zipWithIndex().map(
    lambda row: (row[1],) + tuple(row[0])
).toDF(["index"] + cols)
answered Sep 20 '22 by pault