I have two DataFrames in Spark SQL (D1 and D2).
I am trying to inner join them, D1.join(D2, "some column"),
and get back the data of only D1, not the complete data set.
Both D1 and D2 have the same columns.
Could someone please help me with this?
I am using Spark 1.6.
You can select single or multiple columns of a Spark DataFrame by passing the column names you want to the select() function. Since DataFrames are immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame's contents.
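For example, a minimal sketch (the DataFrame and column names here are made up, and sc is assumed to be an existing SparkContext, matching the question's Spark 1.6 setup):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
// Hypothetical DataFrame with two columns
val df = Seq((1, "Alice"), (2, "Bob")).toDF("id", "name")
// select() returns a new DataFrame; df itself is unchanged
df.select("id").show()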
Let's say you want to join on the "id" column. Then you could write:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
// Alias both DataFrames so the duplicate column names can be told apart,
// then keep only d1's columns after the join.
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select($"d1.*")
As an alternative, you could also do the following without adding aliases:
// Build the full list of d1's columns programmatically,
// so the select works whatever columns d1 has.
d1.join(d2, d1("id") === d2("id"))
  .select(d1.columns.map(c => d1(c)): _*)
You could use left_semi:
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left_semi")
Note that on Spark 1.6 the join-type string is "leftsemi"; the underscore form "left_semi" is only accepted in newer versions.
A semi join returns only the rows from the left dataset for which the join condition is met, and the result contains only the left side's columns, so no extra select is needed.
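A quick illustration with made-up data (reusing the sqlContext implicits from above):
val d1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
val d2 = Seq((2, "x"), (4, "y")).toDF("id", "value")
// Only d1's row with a matching id in d2 survives, with d1's schema.
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "leftsemi").show()
// +---+-----+
// | id|value|
// +---+-----+
// |  2|    b|
// +---+-----+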
There's also another interesting join type: left_anti, which works similarly to left_semi but keeps only those rows where the condition is not met.
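A sketch, using the same aliases as above (note that anti joins were added in Spark 2.0, so this needs a newer version than the 1.6 in the question):
// Rows of d1 whose id has no match in d2
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left_anti")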