I am getting the error "sql.AnalysisException: cannot resolve column_name" when performing a join with the DataFrame API, even though the column exists and the same join works fine when written as SQL through HiveContext. The code is:
DataFrame df = df1
    .join(df2, df1.col("MERCHANT").equalTo(df2.col("MERCHANT")))
    .select(df2.col("MERCH_ID"), df1.col("MERCHANT"));
I have also tried the "alias" function, but I get the same "Can't resolve column name" problem and the following exception is thrown:
resolved attribute(s) MERCH_ID#738 missing from MERCHANT#737,MERCHANT#928,MERCH_ID#929,MER_LOC#930 in operator !Project [MERCH_ID#738,MERCHANT#737];
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
Spark Version: 1.6
The problem occurs with both the Scala and the Java Spark APIs. In Scala the issue was resolved using 'alias', but in Java I am still getting the error.
One of the simplest ways to resolve an "ambiguous column name" error, without renaming any columns, is to give the tables you want to join an alias. This tells the SQL engine unambiguously which table each column comes from.
Solution: Renaming one of the ambiguous columns to a different name will also sort out this issue. Spark has no direct method for this use case, so we need to use df.columns to find the count and index of the duplicate columns, and then rename the duplicates in the Spark DataFrame.
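As a minimal sketch of that renaming step (the de-duplication over df.columns is ordinary Java, so a plain String array stands in for the real column list here; the helper name dedup and the "_<occurrence>" suffix scheme are assumptions for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class DedupColumns {
    // Hypothetical helper: given the column names from df.columns, return a
    // copy where every repeated name gets a positional suffix, e.g.
    // ["MERCHANT", "MERCH_ID", "MERCHANT"] -> ["MERCHANT", "MERCH_ID", "MERCHANT_2"].
    public static String[] dedup(String[] columns) {
        Map<String, Integer> seen = new HashMap<>();
        String[] out = new String[columns.length];
        for (int i = 0; i < columns.length; i++) {
            // merge() returns how many times this name has now been seen.
            int count = seen.merge(columns[i], 1, Integer::sum);
            // First occurrence keeps its name; later ones get "_<occurrence>".
            out[i] = (count == 1) ? columns[i] : columns[i] + "_" + count;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] cols = {"MERCHANT", "MERCH_ID", "MERCHANT", "MER_LOC"};
        System.out.println(String.join(",", dedup(cols)));
        // prints MERCHANT,MERCH_ID,MERCHANT_2,MER_LOC
    }
}
```

In actual Spark code you would apply the renamed name on the second DataFrame before the join, e.g. df2.withColumnRenamed("MERCHANT", "MERCHANT_2"), so the joined result no longer carries two columns with the same name.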
Method 1: Using drop(). After an inner join on the common column, drop() removes the duplicate copy of that column (here, the one coming from the second DataFrame), so only one copy remains in the result.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> joinedDF = DF1.join(DF2, DF1.col("_c1").equalTo(DF2.col("_c1")))
    .drop(DF2.col("_c1"));
Here _c1 refers to column 1, the default name used when the file has no header.
From my experience it is best to avoid DataFrame.col and DataFrame.apply unless necessary for disambiguation (aliasing is still better). Try using independent Column objects instead:
import org.apache.spark.sql.functions;

DataFrame df = df1.alias("df1")
    .join(df2.alias("df2"), functions.col("df1.MERCHANT").equalTo(functions.col("df2.MERCHANT")))
    .select(functions.col("df2.MERCH_ID"), functions.col("df2.MERCHANT"));