Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

join in a dataframe spark java

First of all, thank you for the time in reading my question.

My question is the following: In Spark with Java, i load in two dataframe the data of two csv files.

These dataframes will have the following information.

Dataframe Airport

Id | Name    | City
-----------------------
1  | Barajas | Madrid

Dataframe airport_city_state

City | state
----------------
Madrid | España

I want to join these two dataframes so that it looks like this:

dataframe result

Id | Name    | City   | state
--------------------------
1  | Barajas | Madrid | España

Where dfairport.city = dfaiport_city_state.city

But I can not clarify with the syntax so I can do the join correctly. A little code of how I have created the variables:

 // Load the csv, you have to specify that you have header and what delimiter you have
Dataset <Row> dfairport = Load.Csv (sqlContext, data_airport);
Dataset <Row> dfairport_city_state = Load.Csv (sqlContext,   data_airport_city_state);


// Change the name of the columns in the csv dataframe to match the columns in the database
// Once they match the name we can insert them
Dfairport
.withColumnRenamed ("leg_key", "id")
.withColumnRenamed ("leg_name", "name")
.withColumnRenamed ("leg_city", "city")

dfairport_city_state
.withColumnRenamed("city", "ciudad")
.withColumnRenamed("state", "estado");
like image 598
Alejandro Reina Avatar asked Mar 26 '17 19:03

Alejandro Reina


2 Answers

You can use join method with column name to join two dataframes, e.g.:

Dataset <Row> dfairport = Load.Csv (sqlContext, data_airport);
Dataset <Row> dfairport_city_state = Load.Csv (sqlContext,   data_airport_city_state);

Dataset <Row> joined = dfairport.join(dfairport_city_state, dfairport_city_state("City"));

There is also an overloaded version that allows you to specify the join type as third argument, e.g.:

Dataset <Row> joined = dfairport.join(dfairport_city_state, dfairport_city_state("City"), "left_outer");

Here's more on joins.

like image 69
Darshan Mehta Avatar answered Sep 25 '22 08:09

Darshan Mehta


First, thank you very much for your response.

I have tried both of my solutions but none of them work, I get the following error: The method dfairport_city_state (String) is undefined for the type ETL_Airport

I can not access a specific column of the dataframe for join.

EDIT: Already got to do the join, I put here the solution in case someone else helps;)

Thanks for everything and best regards

//Join de tablas en las que comparten ciudad
Dataset <Row> joined = dfairport.join(dfairport_city_state, dfairport.col("leg_city").equalTo(dfairport_city_state.col("city")));
like image 31
Alejandro Reina Avatar answered Sep 25 '22 08:09

Alejandro Reina