Let's say I have a Spark dataframe df1, with several columns (among which the column id), and a dataframe df2 with two columns, id and other.
Is there a way to replicate the following command:
sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id")
by using only PySpark functions such as join(), select() and the like?
I have to implement this join in a function and I don't want to be forced to have sqlContext as a function parameter.
Here the first dataframe (dataframe1) has the columns ['ID', 'NAME', 'Address'] and the second dataframe (dataframe2) has the columns ['ID', 'Age']. Now we have to add the Age column to the first dataframe, and NAME and Address to the second dataframe; we can do this with the lit() function, which is available in pyspark.sql.functions.
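A minimal sketch of that approach, assuming two small example dataframes with exactly those columns (the data values are made up) and that the missing columns are added as null literals so both frames end up with the same schema:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# hypothetical data matching the column lists above
dataframe1 = spark.createDataFrame([(1, 'Alice', 'Street 1'), (2, 'Bob', 'Street 2')],
                                   ['ID', 'NAME', 'Address'])
dataframe2 = spark.createDataFrame([(1, 34), (2, 45)], ['ID', 'Age'])

# add the columns each frame is missing as null literal columns
dataframe1 = dataframe1.withColumn('Age', lit(None))
dataframe2 = dataframe2.withColumn('NAME', lit(None)).withColumn('Address', lit(None))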
You can select single or multiple columns of a Spark DataFrame by passing the column names you want to select to the select() function. Since DataFrames are immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame contents.
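For instance, with the df2 from the question (columns id and other):

df2.select('id').show()           # single column
df2.select('id', 'other').show()  # multiple columns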
Asterisk (*) works with alias. Ex:

from pyspark.sql.functions import *

df1 = df1.alias('df1')
df2 = df2.alias('df2')
df1.join(df2, df1.id == df2.id).select('df1.*')
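To also pull the extra column from df2, as in the SQL statement from the question, the qualified column name can simply be added to the same select (a small variation on the snippet above):

df1.join(df2, df1.id == df2.id).select('df1.*', 'df2.other')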
Not sure if the most efficient way, but this worked for me:
from pyspark.sql.functions import col

(df1.alias('a')
 .join(df2.alias('b'), col('b.id') == col('a.id'))
 .select([col('a.' + xx) for xx in df1.columns]
         + [col('b.other1'), col('b.other2')]))
The trick is in:
[col('a.' + xx) for xx in df1.columns] : all columns of a (that is, every column of df1)
[col('b.other1'), col('b.other2')] : only some columns of b
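Put together as a small self-contained example (the data and the column names other1 and other2 are just placeholders to make the snippet above runnable):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, 'x'), (2, 'y')], ['id', 'name'])
df2 = spark.createDataFrame([(1, 'a', 'b'), (2, 'c', 'd')], ['id', 'other1', 'other2'])

# all columns of df1 plus two named columns of df2
result = (df1.alias('a')
          .join(df2.alias('b'), col('b.id') == col('a.id'))
          .select([col('a.' + xx) for xx in df1.columns]
                  + [col('b.other1'), col('b.other2')]))
result.show()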