How to add a row ID to PySpark DataFrames [duplicate]

I have a CSV file that I convert to a DataFrame (df) in PySpark. After some transformations, I want to add a column to df that is a simple row ID (running from 0 or 1 to N).

I converted df to an RDD and used zipWithIndex, then converted the resulting RDD back to a DataFrame. This approach works, but it generated 250k tasks and takes a long time to execute. Is there another way to do this that takes less runtime?

The following is a snippet of my code. The CSV file I am processing is big: it contains billions of rows.

from pyspark.sql import Row

# sc and sqlContext are assumed to be provided by the PySpark shell
debug_csv_rdd = (sc.textFile("debug.csv")
    .filter(lambda x: x.find('header') == -1)
    .map(lambda x: x.replace("NULL", "0"))
    .map(lambda p: p.split(','))
    .map(lambda x: Row(c1=int(x[0]), c2=int(x[1]), c3=int(x[2]), c4=int(x[3]))))

debug_csv_df = sqlContext.createDataFrame(debug_csv_rdd)
debug_csv_df.registerTempTable("debug_csv_table")
sqlContext.cacheTable("debug_csv_table")

r0 = sqlContext.sql("SELECT c2 FROM debug_csv_table WHERE c1 = 'str'")
r0.registerTempTable("r0_table")

# Flatten each Row to its single value, pair it with an index, and rebuild Rows
r0_1 = (r0.flatMap(lambda x: x)
    .zipWithIndex()
    .map(lambda x: Row(c1=x[0], id=int(x[1]))))

r0_df = sqlContext.createDataFrame(r0_1)
r0_df.show(10)

asked Aug 19 '15 04:08

ankit patel

1 Answer

You can also use a function from the pyspark.sql.functions package. It generates a unique ID for each row; however, the IDs will not be sequential, because they depend on how the data is partitioned. I believe it is available in Spark 1.5+.

from pyspark.sql.functions import monotonicallyIncreasingId

# This will return a new DF with all the columns + id
res = df.withColumn("id", monotonicallyIncreasingId())
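As a rough illustration of why these IDs are unique and increasing but not consecutive: the Spark documentation describes the current implementation as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The sketch below only imitates that layout in plain Python, so it runs without Spark. The helper name simulated_ids and the partition sizes are made up for illustration and are not part of the Spark API.

```python
# Pure-Python imitation of the documented ID layout (not Spark code):
# partition index in the upper 31 bits, row position within the
# partition in the lower 33 bits.

def simulated_ids(partition_sizes):
    """Return the IDs that partitions of the given sizes would produce."""
    ids = []
    for partition_index, size in enumerate(partition_sizes):
        # Each partition's IDs start at a multiple of 2**33
        base = partition_index << 33
        ids.extend(base + row for row in range(size))
    return ids

# Hypothetical layout: 3 partitions holding 2, 3 and 1 rows
ids = simulated_ids([2, 3, 1])
print(ids)
# [0, 1, 8589934592, 8589934593, 8589934594, 17179869184]
```

The gaps appear at every partition boundary, so the column works as a unique key but not as a 0..N row number.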

Edit: 19/1/2017

As commented by @Sean: from Spark 1.6 onward, use monotonically_increasing_id() instead.

answered Oct 27 '22 23:10

Arkadi T
