There are two DataFrames (Scala, Apache Spark 1.6.1)
1) Matches
MatchID | Player1 | Player2
--------------------------------
1 | John Wayne | John Doe
2 | Ive Fish | San Simon
2) Personal Data
Player | BirthYear
--------------------------------
John Wayne | 1986
Ive Fish | 1990
San Simon | 1974
john Doe | 1995
How could create a new DataFrame with 'BirthYear' for the both players
MatchID | Player1 | Player2 | BYear_P1 |BYear_P2 | Diff
-------------------------------------------------------------
1 | John Wayne | John Doe | 1986 | 1995 | 9
2 | Ive Fish | San Simon | 1990 | 1974 | 16
?
I tried
val df = MatchesDF.join(PersonalDF, MatchesDF("Player1") === PersonalDF("Player"))
then join again for the second player
val resDf = df.join(PersonalDF, df("Player2") === PersonalDF("Player"))
but it's VERY time consuming operation.
May be another way to do it in Scala and Apache Spark?
In order to explain join with multiple tables, we will use Inner join, this is the default join in Spark and it's mostly used, this joins two DataFrames/Datasets on key columns, and where keys don't match the rows get dropped from both datasets.
PySpark Merge Two DataFrames with Different Columns resolves columns by name (not by position). In other words, unionByName() is used to merge two DataFrame's by column names instead of by position.
Here we create an empty DataFrame where data is to be added, then we convert the data to be added into a Spark DataFrame using createDataFrame() and further convert both DataFrames to a Pandas DataFrame using toPandas() and use the append() function to add the non-empty data frame to the empty DataFrame and ignore the ...
This is a solution using spark's dataframe functions:
import sqlContext.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.abs
val matches = sqlContext.sparkContext.parallelize(Row(1, "John Wayne", "John Doe"), Row(2, "Ive Fish", "San Simon")))
val players = sqlContext.sparkContext.parallelize(Seq(
Row("John Wayne", 1986),
Row("Ive Fish", 1990),
Row("San Simon", 1974),
Row("John Doe", 1995)
))
val matchesDf = sqlContext.createDataFrame(matches, StructType(Seq(
StructField("matchId", IntegerType, nullable = false),
StructField("player1", StringType, nullable = false),
StructField("player2", StringType, nullable = false)))
).as('matches)
val playersDf = sqlContext.createDataFrame(players, StructType(Seq(
StructField("player", StringType, nullable = false),
StructField("birthYear", IntegerType, nullable = false)
))).as('players)
matchesDf
.join(playersDf, $"matches.player1" === $"players.player")
.select($"matches.matchId" as "matchId", $"matches.player1" as "player1", $"matches.player2" as "player2", $"players.birthYear" as "player1BirthYear")
.join(playersDf, $"player2" === $"players.player")
.select($"matchId" as "MatchID", $"player1" as "Player1", $"player2" as "Player2", $"player1BirthYear" as "BYear_P1", $"players.birthYear" as "BYear_P2")
.withColumn("Diff", abs('BYear_P2.minus('BYear_P1)))
.show()
+-------+----------+---------+--------+--------+----+
|MatchID| Player1| Player2|BYear_P1|BYear_P2|Diff|
+-------+----------+---------+--------+--------+----+
| 1|John Wayne| John Doe| 1986| 1995| 9|
| 2| Ive Fish|San Simon| 1990| 1974| 16|
+-------+----------+---------+--------+--------+----+
This should perform better:
case class Match(matchId: Int, player1: String, player2: String)
case class Player(name: String, birthYear: Int)
val matches = Seq(
Match(1, "John Wayne", "John Doe"),
Match(2, "Ive Fish", "San Simon")
)
val players = Seq(
Player("John Wayne", 1986),
Player("Ive Fish", 1990),
Player("San Simon", 1974),
Player("John Doe", 1995)
)
val matchesDf = sqlContext.createDataFrame(matches)
val playersDf = sqlContext.createDataFrame(players)
matchesDf.registerTempTable("matches")
playersDf.registerTempTable("players")
sqlContext.sql(
"select matchId, player1, player2, p1.birthYear, p2.birthYear, abs(p1.birthYear-p2.birthYear) " +
"from matches m inner join players p1 inner join players p2 " +
"where m.player1 = p1.name and m.player2 = p2.name").show()
+-------+----------+---------+---------+---------+---+
|matchId| player1| player2|birthYear|birthYear|_c5|
+-------+----------+---------+---------+---------+---+
| 1|John Wayne| John Doe| 1986| 1995| 9|
| 2| Ive Fish|San Simon| 1990| 1974| 16|
+-------+----------+---------+---------+---------+---+
I didn't find the way to express join of 3 tables in Scala DSL.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With