I want to use join with 3 dataframe, but there are some columns we don't need or have some duplicate name with other dataframes, so I want to drop some columns like below:
result_df = (aa_df.join(bb_df, 'id', 'left')
.join(cc_df, 'id', 'left')
.withColumnRenamed(bb_df.status, 'user_status'))
Please note that status
column is in two dataframes, i.e. aa_df
and bb_df
.
The above doesn't work. I also tried to use withColumn
, but the new column is created, and the old column is still existed.
Method 1: Using drop() function We can join the dataframes using joins like inner join and after this join, we can use the drop method to remove one duplicate column.
Removing duplicate columns after join in PySpark If we want to drop the duplicate column, then we have to specify the duplicate column in the join function. Here we are simply using join to join two dataframes and then drop duplicate columns.
Drop multiple column in pyspark using drop() function. Drop function with list of column names as argument drops those columns.
Duplicate columns with inner Join.
I want to use join with 3 dataframe, but there are some columns we don't need or have some duplicate name with other dataframes
That's a fine use case for aliasing a Dataset using alias
or as
operators.
alias(alias: String): Dataset[T] or alias(alias: Symbol): Dataset[T] Returns a new Dataset with an alias set. Same as as.
as(alias: String): Dataset[T] or as(alias: Symbol): Dataset[T] Returns a new Dataset with an alias set.
(And honestly I did only now see the Symbol
-based variants.)
NOTE There are two as
operators, as
for aliasing and as
for type mapping. Consult the Dataset API.
After you've aliases a Dataset, you can reference columns using [alias].[columnName]
format. This is particularly handy with joins and star column dereferencing using *
.
val ds1 = spark.range(5)
scala> ds1.as('one).select($"one.*").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
val ds2 = spark.range(10)
// Using joins with aliased datasets
// where clause is in a longer form to demo how ot reference columns by alias
scala> ds1.as('one).join(ds2.as('two)).where($"one.id" === $"two.id").show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
so I want to drop some columns like below
My general recommendation is not to drop
columns, but select
what you want to include in the result. That makes life more predictable as you know what you get (not what you don't). I was told that our brains work by positives which could also make a point for select
.
So, as you asked and I showed in the above example, the result has two columns of the same name id
. The question is how to have only one.
There are at least two answers with using the variant of join
operator with the join columns or condition included (as you did show in your question), but that would not answer your real question about "dropping unwanted columns", would it?
Given I prefer select
(over drop
), I'd do the following to have a single id
column:
val q = ds1.as('one)
.join(ds2.as('two))
.where($"one.id" === $"two.id")
.select("one.*") // <-- select columns from "one" dataset
scala> q.show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
Regardless of the reasons why you asked the question (which could also be answered with the points I raised above), let me answer the (burning) question how to use withColumnRenamed
when there are two matching columns (after join
).
Let's assume you ended up with the following query and so you've got two id
columns (per join side).
val q = ds1.as('one)
.join(ds2.as('two))
.where($"one.id" === $"two.id")
scala> q.show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
withColumnRenamed
won't work for this use case since it does not accept aliased column names.
scala> q.withColumnRenamed("one.id", "one_id").show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
You could select
the columns you're interested in as follows:
scala> q.select("one.id").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
scala> q.select("two.*").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
If you are trying to rename the status
column of bb_df
dataframe then you can do so while joining as
result_df = aa_df.join(bb_df.withColumnRenamed('status', 'user_status'),'id', 'left').join(cc_df, 'id', 'left')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With