How to concatenate/append multiple Spark dataframes column wise in Pyspark?

Question

How can I do the equivalent of pandas' pd.concat([df1, df2], axis='columns') with PySpark dataframes? I searched but couldn't find a good solution.

DF1
var1
   3
   4
   5

DF2
var2    var3
  23      31
  44      45
  52      53

Expected output dataframe
var1    var2    var3
   3      23      31
   4      44      45
   5      52      53

Edited to include expected output
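
For reference, the pandas behavior the question wants to reproduce can be sketched as follows, using the DF1 and DF2 values above. This is plain pandas, not Spark. Note that pd.concat with axis='columns' matches rows on the index, which is the default 0, 1, 2 for both frames here; Spark dataframes have no such implicit row index, which is why the operation has no direct one-line counterpart there.

```python
import pandas as pd

# The two input frames from the question
df1 = pd.DataFrame({"var1": [3, 4, 5]})
df2 = pd.DataFrame({"var2": [23, 44, 52], "var3": [31, 45, 53]})

# Rows are matched by index label (0, 1, 2 in both frames)
result = pd.concat([df1, df2], axis="columns")
print(result)
```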

Devi · Accepted Answer

The PySpark equivalent of the accepted answer would be:

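from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

# In PySpark, builder is an attribute, not a method
spark = SparkSession.builder.master("local").getOrCreate()
df1 = spark.sparkContext.parallelize([(1, "a"),(2, "b"),(3, "c")]).toDF(["id", "name"])
df2 = spark.sparkContext.parallelize([(7, "x"),(8, "y"),(9, "z")]).toDF(["age", "address"])

# Combined schema: df1's fields followed by df2's fields
schema = StructType(df1.schema.fields + df2.schema.fields)
# RDD.zip pairs rows by position; both RDDs must have the same number of
# partitions and the same number of elements in each partition
df1df2 = df1.rdd.zip(df2.rdd).map(lambda x: x[0]+x[1])
spark.createDataFrame(df1df2, schema).show()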

Tags:

python

apache-spark

apache-spark-sql

pyspark

pyspark-sql

GeorgeOfTheRF
