I want to join an RDD with a Cassandra table where the same key has a different name on each side. Simplified example:
case class User(id : String, name : String)
and
case class Home( address : String, user_id : String)
I would like to do:
rdd[Home].joinWithCassandraTable("testspark","user").on(SomeColumns("id"))
How can I specify the name of the field on which the join will be made? I don't want to map the RDD down to just the matching id, because I want all of the Home values to be available after the joinWithCassandraTable.
You can use the "as" syntax, just like in a select, to change how the joined columns are mapped.
An example:
sc.cassandraTable[Home]("ks","home").joinWithCassandraTable("ks","user").on(SomeColumns("id" as "user_id")).collect
This maps the "id" column from the user table to the "user_id" field of the Home case class.
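To show that every Home value is still available after the join, here is a slightly fuller sketch built on the question's case classes. It assumes the spark-cassandra-connector is on the classpath and that "id" is the partition key of testspark.user. HomeWithUser, JoinSketch, flatten and joinHomes are hypothetical names introduced only for this illustration.

```scala
import com.datastax.spark.connector._
import org.apache.spark.rdd.RDD

// Case classes from the question.
case class User(id: String, name: String)
case class Home(address: String, user_id: String)

// Hypothetical flattened result type, introduced only for this illustration.
case class HomeWithUser(address: String, userId: String, userName: String)

object JoinSketch {
  // Pure helper: combine one joined (Home, User) pair into a single record.
  def flatten(home: Home, user: User): HomeWithUser =
    HomeWithUser(home.address, home.user_id, user.name)

  // Assumes keyspace "testspark" has a "user" table whose partition key is "id".
  def joinHomes(homes: RDD[Home]): RDD[HomeWithUser] =
    homes
      .joinWithCassandraTable[User]("testspark", "user")
      .on(SomeColumns("id" as "user_id")) // table column "id" is matched against Home.user_id
      .map { case (home, user) => flatten(home, user) }
}
```

The join yields (Home, User) pairs, so there is no need to reduce the RDD to the key beforehand. The final map is optional and only there to show both sides being used together.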
You could try changing the column name when you read in the Cassandra table so that it matches the RDD field you want to join on.
For example:
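import org.apache.spark.SparkContext
import org.apache.spark.sql.SchemaRDD
import org.apache.spark.sql.cassandra.CassandraSQLContext
val sc: SparkContext = ...
val cc = new CassandraSQLContext(sc)
// Alias the user table's "id" column so it matches the Home field "user_id"
val rdd: SchemaRDD = cc.sql("SELECT id AS user_id, <other columns> FROM testspark.user WHERE ...")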