Let's say I have a Spark dataframe df1, with several columns (among which the column id), and a dataframe df2 with two columns, id and other.
Is there a way to replicate the following command:
sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id")
by using only PySpark functions such as join(), select() and the like?
I have to implement this join in a function and I don't want to be forced to have sqlContext as a function parameter.
Here the first dataframe (dataframe1) has the columns ['ID', 'NAME', 'Address'] and the second dataframe (dataframe2) has the columns ['ID', 'Age']. Now we have to add the Age column to the first dataframe, and NAME and Address to the second dataframe; we can do this with the lit() function, which is available in pyspark.sql.functions.
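A minimal sketch of that approach, assuming two small example dataframes with exactly those columns (the data values are made up) and that the missing columns are added as null literals so both frames end up with the same schema:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# hypothetical data matching the column lists above
dataframe1 = spark.createDataFrame([(1, 'Alice', 'Street 1'), (2, 'Bob', 'Street 2')],
                                   ['ID', 'NAME', 'Address'])
dataframe2 = spark.createDataFrame([(1, 34), (2, 45)], ['ID', 'Age'])

# add the columns each frame is missing as null literal columns
dataframe1 = dataframe1.withColumn('Age', lit(None))
dataframe2 = dataframe2.withColumn('NAME', lit(None)).withColumn('Address', lit(None))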
You can select single or multiple columns of a Spark DataFrame by passing the column names you want to select to the select() function. Since DataFrames are immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame contents.
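For instance, with the df2 from the question (columns id and other):

df2.select('id').show()           # single column
df2.select('id', 'other').show()  # multiple columns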
Asterisk (*) works with alias. Ex:

from pyspark.sql.functions import *

df1 = df1.alias('df1')
df2 = df2.alias('df2')
df1.join(df2, df1.id == df2.id).select('df1.*')
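To also pull the extra column from df2, as in the SQL statement from the question, the qualified column name can simply be added to the same select (a small variation on the snippet above):

df1.join(df2, df1.id == df2.id).select('df1.*', 'df2.other')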
Not sure if the most efficient way, but this worked for me:
from pyspark.sql.functions import col

(df1.alias('a')
 .join(df2.alias('b'), col('b.id') == col('a.id'))
 .select([col('a.' + xx) for xx in df1.columns]
         + [col('b.other1'), col('b.other2')]))
The trick is in:
[col('a.' + xx) for xx in df1.columns] : all columns of a (that is, every column of df1)
[col('b.other1'), col('b.other2')] : only some columns of b
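Put together as a small self-contained example (the data and the column names other1 and other2 are just placeholders to make the snippet above runnable):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, 'x'), (2, 'y')], ['id', 'name'])
df2 = spark.createDataFrame([(1, 'a', 'b'), (2, 'c', 'd')], ['id', 'other1', 'other2'])

# all columns of df1 plus two named columns of df2
result = (df1.alias('a')
          .join(df2.alias('b'), col('b.id') == col('a.id'))
          .select([col('a.' + xx) for xx in df1.columns]
                  + [col('b.other1'), col('b.other2')]))
result.show()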