 

How to Join Multiple Columns in Spark SQL using Java for filtering in DataFrame

  • DataFrame a contains columns x, y, z, k
  • DataFrame b contains columns x, y, a

    a.join(b, <condition in Java that joins on x and y>) ???
    

I tried using

a.join(b, a.col("x").equalTo(b.col("x")) && a.col("y").equalTo(b.col("y")), "inner")

But Java throws a compile error because && cannot be applied to Column operands.

Gokul asked Feb 04 '16


People also ask

How do I combine columns in Spark data frame?

Using concat() Function to Concatenate DataFrame Columns Spark SQL provides the concat() function to concatenate two or more DataFrame columns into a single column. It can also take columns of different data types and concatenate them into a single column; for example, it supports String, Int, Boolean, and array columns.
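A minimal runnable sketch of concat() in Java (the class name, local SparkSession, and sample fname/lname columns are invented for illustration):

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat;
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ConcatExample {
    static String fullName() {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("concat-example").getOrCreate();
        Dataset<Row> df = spark.sql("SELECT 'John' AS fname, 'Doe' AS lname");
        // concat() joins the two string columns (plus a literal space) into one column.
        String result = df
                .select(concat(col("fname"), lit(" "), col("lname")).alias("full_name"))
                .first().getString(0);
        spark.stop();
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fullName()); // prints "John Doe"
    }
}
```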

How do I select multiple columns in Spark data frame?

You can select one or more columns of a Spark DataFrame by passing the column names you want to the select() function. Since DataFrames are immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame contents.
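A short sketch of select() in Java (the sample columns and class name are invented; a local SparkSession is assumed):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SelectExample {
    static String[] selectedColumns() {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("select-example").getOrCreate();
        Dataset<Row> df = spark.sql("SELECT 1 AS x, 2 AS y, 3 AS z");
        // select() returns a new, immutable DataFrame holding only x and y.
        Dataset<Row> projected = df.select("x", "y");
        String[] cols = projected.columns();
        spark.stop();
        return cols;
    }

    public static void main(String[] args) {
        for (String c : selectedColumns()) System.out.println(c);
    }
}
```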

How do I add multiple columns to a DataFrame in Spark?

You can add multiple columns to a Spark DataFrame in several ways. If you want to add a known set of columns, you can do so easily by chaining withColumn() or using select(). However, sometimes you may need to add multiple columns after applying transformations; in that case you can use either map() or foldLeft().
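A sketch of the chained withColumn() approach in Java (sample data, column names, and class name are invented for illustration):

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WithColumnExample {
    static int total() {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("withcolumn-example").getOrCreate();
        // Each withColumn() call returns a new DataFrame, so calls can be chained.
        Dataset<Row> df = spark.sql("SELECT 1 AS x, 2 AS y")
                .withColumn("total", col("x").plus(col("y")))
                .withColumn("source", lit("example"));
        int result = df.select("total").first().getInt(0);
        spark.stop();
        return result;
    }

    public static void main(String[] args) {
        System.out.println(total()); // prints 3
    }
}
```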

How do I add a filter to a DataFrame in Spark?

Spark's filter() or where() function is used to filter rows from a DataFrame or Dataset based on one or more conditions or a SQL expression. You can use the where() operator instead of filter() if you are coming from a SQL background. Both functions operate exactly the same.
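A small sketch showing both forms in Java (the sample rows and class name are invented; a local SparkSession is assumed):

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FilterExample {
    static long[] counts() {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("filter-example").getOrCreate();
        Dataset<Row> df = spark.sql("SELECT * FROM VALUES (1), (2), (3) AS t(x)");
        // filter() with a Column expression and where() with a SQL string
        // are interchangeable and produce the same result.
        long viaFilter = df.filter(col("x").gt(1)).count();
        long viaWhere  = df.where("x > 1").count();
        spark.stop();
        return new long[] { viaFilter, viaWhere };
    }

    public static void main(String[] args) {
        long[] c = counts();
        System.out.println(c[0] + " " + c[1]); // prints "2 2"
    }
}
```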


2 Answers

Spark SQL provides a group of methods on Column, marked as java_expr_ops, which are designed for Java interoperability. It includes the and method (see also or), which can be used here:

a.col("x").equalTo(b.col("x")).and(a.col("y").equalTo(b.col("y")))
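Putting that condition into a complete join call, a minimal runnable sketch (the tiny sample rows, class name, and local SparkSession are invented; the column names x and y come from the question):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TwoColumnJoin {
    static long joinedCount() {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("two-column-join").getOrCreate();
        Dataset<Row> a = spark.sql("SELECT 1 AS x, 2 AS y, 'z1' AS z, 'k1' AS k");
        Dataset<Row> b = spark.sql("SELECT 1 AS x, 2 AS y, 'a1' AS a");
        // and() replaces Java's &&, which cannot be applied to Column operands.
        Dataset<Row> joined = a.join(
                b,
                a.col("x").equalTo(b.col("x")).and(a.col("y").equalTo(b.col("y"))),
                "inner");
        long n = joined.count();
        spark.stop();
        return n;
    }

    public static void main(String[] args) {
        System.out.println(joinedCount()); // prints 1
    }
}
```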
zero323 answered Sep 28 '22


If you want to use multiple columns for the join, you can do something like this:

a.join(b, scalaSeq, joinType)

You can store your columns in a Java List and convert it to a Scala Seq:

scalaSeq = JavaConverters.asScalaIteratorConverter(list.iterator()).asScala().toSeq();

Example: a = a.join(b, scalaSeq, "inner");

Note: a dynamic list of join columns is easily supported this way.
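A complete runnable sketch of this approach (the sample rows and class name are invented; this assumes a Scala 2.12 build of Spark, where join takes scala.collection.Seq — Scala 2.13 builds use immutable.Seq instead):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import scala.collection.JavaConverters;
import scala.collection.Seq;

public class SeqJoin {
    static String[] joinedColumns() {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("seq-join").getOrCreate();
        Dataset<Row> a = spark.sql("SELECT 1 AS x, 2 AS y, 'z1' AS z");
        Dataset<Row> b = spark.sql("SELECT 1 AS x, 2 AS y, 'a1' AS a");

        // Build the join-column list in Java, then convert it to a Scala Seq.
        List<String> joinCols = Arrays.asList("x", "y");
        Seq<String> scalaSeq =
                JavaConverters.asScalaIteratorConverter(joinCols.iterator())
                        .asScala().toSeq();

        // The Seq overload performs a USING join, so x and y appear
        // only once in the joined result.
        Dataset<Row> joined = a.join(b, scalaSeq, "inner");
        String[] cols = joined.columns();
        spark.stop();
        return cols;
    }

    public static void main(String[] args) {
        for (String c : joinedColumns()) System.out.println(c);
    }
}
```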

user15059143 answered Sep 28 '22