
Schema order change after join operation in Spark (JAVA)

I am using Spark with Java, and when I join two dataframes, the column order of the resulting schema is different.

I need to preserve the order because I want to insert the data into an HBase table afterwards.

In Scala there is a solution using a Seq, and I was wondering how to do the same in Java.

asked Feb 19 '26 18:02 by kulssaka

2 Answers

You can also create a Scala Seq in Java as follows:

import scala.collection.JavaConversions;
import scala.collection.Seq;
import static java.util.Arrays.asList;

// Wraps the Java list in a Scala Buffer (a Seq), preserving element order.
Seq<String> seq = JavaConversions.asScalaBuffer(asList("col_1", "col_2"));

Note that JavaConversions is deprecated in newer Scala versions; the explicit converters in scala.collection.JavaConverters are the recommended replacement.
answered Feb 21 '26 08:02 by Liang Zulin

The solution I found is to create an array of Column objects (from org.apache.spark.sql.Column). When you do the select, it preserves the array order. Since I never found this solution elsewhere, I am posting it here.

import static org.apache.spark.sql.functions.col;

// after making a join into my DF called "joinedDF" I do this:
// example of a schema given as a string
String schemaFull = "id_meta;source_name_meta;base_name_meta;...";
String[] strColumns = schemaFull.split(";");
org.apache.spark.sql.Column[] selectedCols = new org.apache.spark.sql.Column[strColumns.length];
for (int i = 0; i < strColumns.length; i++) {
    selectedCols[i] = col(strColumns[i]);
}
// select preserves the order of the column array
joinedDF = joinedDF.select(selectedCols);
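The reordering above relies on String.split returning the pieces in their original left-to-right order. A minimal, Spark-free sketch of that name-array construction (the schema string and class name here are made up for illustration):

```java
import java.util.Arrays;

public class SchemaOrder {
    // Split a ";"-delimited schema string into column names; String.split
    // returns the pieces in their original left-to-right order.
    static String[] columnNames(String schemaFull) {
        return schemaFull.split(";");
    }

    public static void main(String[] args) {
        // Shortened, made-up schema string for illustration.
        String schemaFull = "id_meta;source_name_meta;base_name_meta";
        System.out.println(Arrays.toString(columnNames(schemaFull)));
        // prints [id_meta, source_name_meta, base_name_meta]
    }
}
```

Each name in that array is then wrapped with col(...) and passed to select, which emits the columns in exactly the order given.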
answered Feb 21 '26 07:02 by kulssaka

