I am using Spark with Java, and when I join two dataframes, the column order of the schema in the result is different from what I expect.
I need to preserve the order because I want to insert the data into an HBase table afterwards.
In Scala there is a solution using a Seq of column names, and I was wondering how to do the same in Java?
You can also create a Scala Seq in Java as follows (note that scala.collection.JavaConversions is deprecated since Scala 2.12; scala.collection.JavaConverters is the recommended replacement):
import scala.collection.JavaConversions;
import scala.collection.Seq;
import static java.util.Arrays.asList;
Seq<String> seq = JavaConversions.asScalaBuffer(asList("col_1", "col_2"));
The solution I found is to create an array of Columns (org.apache.spark.sql.Column). When you do the select with this array, Spark preserves the array order. Since I never found this solution elsewhere, I decided to post it here.
// After making a join into my DataFrame called "joinedDF", I do this:
// (requires: import static org.apache.spark.sql.functions.col;)
// Example of building the schema from a string:
String schemaFull = "id_meta;source_name_meta;base_name_meta;...";
String[] strColumns = schemaFull.split(";");
org.apache.spark.sql.Column[] selectedCols = new org.apache.spark.sql.Column[strColumns.length];
for (int i = 0; i < strColumns.length; i++) {
    selectedCols[i] = col(strColumns[i]);
}
joinedDF = joinedDF.select(selectedCols);
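The ordering guarantee here comes from plain Java array semantics: splitting the schema string yields columns in exactly the order they appear in the string, and the select simply consumes that array. A minimal, Spark-free sketch of that ordering logic (the column names are the hypothetical ones from the snippet above):

```java
import java.util.Arrays;

public class ColumnOrderSketch {
    public static void main(String[] args) {
        // Hypothetical schema string; in practice it would match your HBase table layout.
        String schemaFull = "id_meta;source_name_meta;base_name_meta";

        // split() returns the tokens in their original left-to-right order,
        // so the resulting array fixes the column order deterministically.
        String[] strColumns = schemaFull.split(";");

        System.out.println(Arrays.toString(strColumns));
        // The Spark version maps each name through col(...) into a Column[]
        // and passes it to select(), which keeps the same order.
    }
}
```

This is only a sketch of why the approach works; the Spark snippet above is the actual solution.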