Spark: Merge 2 dataframes by adding row index/number on both dataframes

Tags:

Q: Is there is any way to merge two dataframes or copy a column of a dataframe to another in PySpark?

For example, I have two Dataframes:

DF1              
C1                    C2                                                        
23397414             20875.7353   
5213970              20497.5582   
41323308             20935.7956   
123276113            18884.0477   
76456078             18389.9269

the seconde dataframe

DF2
C3                       C4
2008-02-04               262.00                 
2008-02-05               257.25                 
2008-02-06               262.75                 
2008-02-07               237.00                 
2008-02-08               231.00

Then i want to add C3 of DF2 to DF1 like this:

New DF              
    C1                    C2          C3                                              
    23397414             20875.7353   2008-02-04
    5213970              20497.5582   2008-02-05
    41323308             20935.7956   2008-02-06
    123276113            18884.0477   2008-02-07
    76456078             18389.9269   2008-02-08

I hope this example was clear.

771

asked Nov 09 '16 13:11

MrGildarts

1 Answers

rownum + window function i.e solution 1 or zipWithIndex.map i.e solution 2 should help in this case.

Solution 1 : You can use window functions to get this kind of

Then I would suggest you to add rownumber as additional column name to Dataframe say df1.

  DF1              
    C1                    C2                 columnindex                                             
    23397414             20875.7353            1
    5213970              20497.5582            2
    41323308             20935.7956            3
    123276113            18884.0477            4
    76456078             18389.9269            5

the second dataframe

DF2
C3                       C4             columnindex
2008-02-04               262.00            1        
2008-02-05               257.25            2      
2008-02-06               262.75            3      
2008-02-07               237.00            4          
2008-02-08               231.00            5

Now .. do inner join of df1 and df2 that's all... you will get below ouput

something like this

from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber

w = Window().orderBy()

df1 = ....  // as showed above df1

df2 = ....  // as shown above df2


df11 =  df1.withColumn("columnindex", rowNumber().over(w))
  df22 =  df2.withColumn("columnindex", rowNumber().over(w))

newDF = df11.join(df22, df11.columnindex == df22.columnindex, 'inner').drop(df22.columnindex)
newDF.show()



New DF              
    C1                    C2          C3                                              
    23397414             20875.7353   2008-02-04
    5213970              20497.5582   2008-02-05
    41323308             20935.7956   2008-02-06
    123276113            18884.0477   2008-02-07
    76456078             18389.9269   2008-02-08

Solution 2 : Another good way(probably this is best :)) in scala, which you can translate to pyspark :

/**
* Add Column Index to dataframe 
*/
def addColumnIndex(df: DataFrame) = sqlContext.createDataFrame(
  // Add Column index
  df.rdd.zipWithIndex.map{case (row, columnindex) => Row.fromSeq(row.toSeq :+ columnindex)},
  // Create schema
  StructType(df.schema.fields :+ StructField("columnindex", LongType, false))
)

// Add index now...
val df1WithIndex = addColumnIndex(df1)
val df2WithIndex = addColumnIndex(df2)

 // Now time to join ...
val newone = df1WithIndex
  .join(df2WithIndex , Seq("columnindex"))
  .drop("columnindex")

125

answered Sep 21 '22 05:09

Ram Ghadiyaram

Related questions
                            
                                Specifying an external configuration file for Apache Spark
                            
                                PySpark 1.5 How to Truncate Timestamp to Nearest Minute from seconds
                            
                                Spark 1.6-Failed to locate the winutils binary in the hadoop binary path
                            
                                Spark - Random Number Generation
                            
                                Could not bind on a random free port error while trying to connect to spark master
                            
                                EntityTooLarge error when uploading a 5G file to Amazon S3
                            
                                How to get ID of a map task in Spark?
                            
                                pyspark matrix with dummy variables
                            
                                Spark column string replace when present in other column (row)
                            
                                Converting a Spark Dataframe to a Scala Map collection
                            
                                How to change the column type from String to Date in DataFrames?
                            
                                Remove rows from dataframe based on condition in pyspark
                            
                                Matrix Transpose on RowMatrix in Spark
                            
                                PySpark computing correlation
                            
                                How to update column based on a condition (a value in a group)?
                            
                                AuthorizationException: User not allowed to impersonate User
                            
                                How to CROSS JOIN 2 dataframe?
                            
                                Installing Apache Spark on Ubuntu 14.04
                            
                                Partition data for efficient joining for Spark dataframe/dataset
                            
                                Spark Option: inferSchema vs header = true

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Spark: Merge 2 dataframes by adding row index/number on both dataframes

Tags:

apache-spark

apache-spark-sql

pyspark