DataFrame fail to find the column name after join condition

I am getting the error "sql.AnalysisException: cannot resolve column_name" when performing a join with the DataFrame API. The column name exists, and the same join works fine when written as SQL against a HiveContext. The code in question:

DataFrame df = df1
  .join(df2, df1.col("MERCHANT").equalTo(df2.col("MERCHANT")))
  .select(df2.col("MERCH_ID"), df1.col("MERCHANT"));

I have tried the alias function too, but got the same "cannot resolve column name" problem, with the following exception:

resolved attribute(s) MERCH_ID#738 missing from MERCHANT#737,MERCHANT#928,MERCH_ID#929,MER_LOC#930 in operator !Project [MERCH_ID#738,MERCHANT#737];

at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)

Spark Version: 1.6

The problem has been faced in both Scala and Java Spark.

In Scala the issue was resolved using alias, but in Java I am still getting the error.

abhijit nag asked Aug 25 '17 13:08

People also ask

How do you resolve ambiguous column names in join operation?

One of the simplest ways to solve an "ambiguous column name" error, without renaming any columns, is to give each table you join an alias. This tells the SQL engine unambiguously which table every column comes from.
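The aliasing idea can be shown with plain SQL (here a minimal SQLite sketch rather than Spark; the table and column names are made up to mirror the question):

```python
import sqlite3

# Two hypothetical tables that share a column name "merchant".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE df1 (merchant TEXT);
    CREATE TABLE df2 (merchant TEXT, merch_id INTEGER);
    INSERT INTO df1 VALUES ('acme');
    INSERT INTO df2 VALUES ('acme', 1);
""")

# Selecting bare "merchant" after the join would raise
# "ambiguous column name"; aliasing the tables (AS a / AS b)
# lets us qualify each reference unambiguously.
rows = conn.execute("""
    SELECT b.merch_id, a.merchant
    FROM df1 AS a
    JOIN df2 AS b ON a.merchant = b.merchant
""").fetchall()
print(rows)  # [(1, 'acme')]
```

The same pattern is what `df1.alias("df1")` / `df2.alias("df2")` achieves in the Spark answer below.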

How do you resolve ambiguous column name in PySpark?

Solution: Renaming one of the ambiguous columns to a different name will sort out this issue. Spark has no direct method for this case, so you need to use df.columns to find the count and index of the duplicated columns, and then rename the duplicates in the Spark DataFrame.
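The renaming step can be sketched in plain Python (no Spark; this only shows the logic you would apply to `df.columns` before rebuilding the DataFrame, e.g. with `toDF(*new_names)`):

```python
def dedupe_columns(columns):
    """Make column names unique by suffixing repeats with _1, _2, ..."""
    seen = {}
    out = []
    for name in columns:
        count = seen.get(name, 0)
        out.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    return out

# Column list as it might look after a join on MERCHANT:
print(dedupe_columns(["MERCHANT", "MERCH_ID", "MERCHANT"]))
# ['MERCHANT', 'MERCH_ID', 'MERCHANT_1']
```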

How do you remove duplicate columns after a join in Spark?

Method 1: Using drop(). With dataframe as the first DataFrame, dataframe1 as the second, and "inner" specifying an inner join, calling drop() on the join column removes the duplicate copy contributed by the first DataFrame.
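A minimal plain-Python sketch of the "drop the duplicate join column" idea (no Spark; the data is made up):

```python
# Two record sets sharing the join key "merchant".
df1 = [{"merchant": "acme"}]
df2 = [{"merchant": "acme", "merch_id": 1}]

joined = []
for left in df1:
    for right in df2:
        if left["merchant"] == right["merchant"]:  # inner join condition
            # Merging the dicts keeps a single copy of the join key,
            # which is what drop() achieves on the joined DataFrame.
            joined.append({**left, **right})

print(joined)  # [{'merchant': 'acme', 'merch_id': 1}]
```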


2 Answers

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> joinedDF = DF1.join(DF2, DF1.col("_c1").equalTo(DF2.col("_c1")));

Here _c1 refers to the column at index 1; Spark auto-generates the names _c0, _c1, … when the input has no header.

Arun Mohan answered Oct 20 '22 01:10


From my experience it is best to avoid DataFrame.col and DataFrame.apply unless they are necessary for disambiguation (and even then aliasing is better). Try using independent Column objects instead:

import org.apache.spark.sql.functions;

DataFrame df = df1.alias("df1")
  .join(df2.alias("df2"), functions.col("df1.MERCHANT").equalTo(functions.col("df2.MERCHANT")))
  .select(functions.col("df2.MERCH_ID"), functions.col("df2.MERCHANT"));

Alper t. Turker answered Oct 20 '22 02:10