How to rename duplicated columns after join? [duplicate]

I want to join three DataFrames, but there are some columns we don't need or that have the same name as columns in the other DataFrames, so I want to drop some columns like below:

result_df = (aa_df.join(bb_df, 'id', 'left')
  .join(cc_df, 'id', 'left')
  .withColumnRenamed(bb_df.status, 'user_status'))

Please note that the status column exists in two DataFrames, aa_df and bb_df.

The above doesn't work. I also tried withColumn, but that creates a new column and the old column still exists.

asked May 11 '18 07:05

Frank

2 Answers

I want to join three DataFrames, but there are some columns we don't need or that have the same name as columns in the other DataFrames

That's a fine use case for aliasing a Dataset using the alias or as operators.

alias(alias: String): Dataset[T] or alias(alias: Symbol): Dataset[T] Returns a new Dataset with an alias set. Same as as.

as(alias: String): Dataset[T] or as(alias: Symbol): Dataset[T] Returns a new Dataset with an alias set.

(Honestly, I only just noticed the Symbol-based variants.)

NOTE: There are two as operators, one for aliasing and one for type mapping. Consult the Dataset API.

After you've aliased a Dataset, you can reference columns using the [alias].[columnName] format. This is particularly handy with joins and star column dereferencing using *.

val ds1 = spark.range(5)
scala> ds1.as('one).select($"one.*").show
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+

val ds2 = spark.range(10)
// Using joins with aliased datasets
// the where clause is in a longer form to demo how to reference columns by alias
scala> ds1.as('one).join(ds2.as('two)).where($"one.id" === $"two.id").show
+---+---+
| id| id|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
+---+---+

so I want to drop some columns like below

My general recommendation is not to drop columns but to select the ones you want in the result. That makes life more predictable, as you know what you get (not what you don't). I was told that our brains work by positives, which could also be an argument for select.

So, as you asked and as I showed in the example above, the result has two columns with the same name, id. The question is how to have only one.

There are at least two answers that use the variant of the join operator that takes the join columns or condition (as you showed in your question), but that would not answer your real question about "dropping unwanted columns", would it?

Given I prefer select (over drop), I'd do the following to have a single id column:

val q = ds1.as('one)
  .join(ds2.as('two))
  .where($"one.id" === $"two.id")
  .select("one.*") // <-- select columns from "one" dataset
scala> q.show
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+

Regardless of the reasons why you asked the question (which could also be answered with the points I raised above), let me answer the (burning) question of how to use withColumnRenamed when there are two matching columns (after a join).

Let's assume you ended up with the following query, so you've got two id columns (one per join side).

val q = ds1.as('one)
  .join(ds2.as('two))
  .where($"one.id" === $"two.id")
scala> q.show
+---+---+
| id| id|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
+---+---+

withColumnRenamed won't work for this use case since it does not accept aliased column names. The call below is a no-op, and both columns are still named id.

scala> q.withColumnRenamed("one.id", "one_id").show
+---+---+
| id| id|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
+---+---+

You could select the columns you're interested in as follows:

scala> q.select("one.id").show
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+

scala> q.select("two.*").show
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+

answered Oct 12 '22 11:10

Jacek Laskowski

If you are trying to rename the status column of the bb_df DataFrame, you can do so while joining:

result_df = aa_df.join(bb_df.withColumnRenamed('status', 'user_status'),'id', 'left').join(cc_df, 'id', 'left')

answered Oct 12 '22 11:10

Ramesh Maharjan

Tags:

apache-spark

apache-spark-sql

pyspark
