
Use Map to replace column values in Spark

I have to map a list of values in a column to other values in a Spark dataset. Think of something like this:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.lit

val translationMap: Map[Column, Column] = Map(
  lit("foo") -> lit("bar"),
  lit("baz") -> lit("bab")
)

And I have a DataFrame like this one:

import spark.implicits._  // assumes an active SparkSession named "spark"

val df = Seq("foo", "baz").toDF("mov")

So I intend to perform the translation like this:

df.select(
  col("mov"),
  translationMap(col("mov"))
)

But this piece of code fails with the following error:

key not found: movs
java.util.NoSuchElementException: key not found: movs

Is there a way to perform such a translation without chaining hundreds of whens? Keep in mind that translationMap could contain many key-value pairs.

asked Jun 12 '19 by mrbolichi

People also ask

How do I change a column value in Spark?

You can update a PySpark DataFrame column using withColumn(), select(), and sql(). Since DataFrames are distributed, immutable collections, you can't really change the column values in place; when you change a value using withColumn() or any other approach, PySpark returns a new DataFrame with the updated values.
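The same idea in Scala, as a minimal sketch (the session setup, column name, and upper transformation are illustrative, not part of the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, upper}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("foo", "baz").toDF("mov")

// withColumn does not mutate df; it returns a new DataFrame with the updated column
val updated = df.withColumn("mov", upper(col("mov")))
updated.show()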

How do I create a map column in Spark?

Using Spark DataTypes: we can create a map column type using the createMapType() function on the DataTypes class. This method takes two arguments, keyType and valueType, and both must be of a type that extends DataType.
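A minimal sketch of that in Scala (the field names are illustrative):

import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

// A map column type with string keys and string values
val mapType = DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType)

// A schema with an ordinary column plus the map column
val schema = StructType(Seq(
  StructField("id", DataTypes.IntegerType),
  StructField("translations", mapType)
))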

How do I change the data type of a column in Spark?

To change a Spark SQL DataFrame column from one data type to another, use the cast() function of the Column class; it can be applied through withColumn(), select(), selectExpr(), and SQL expressions.
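For example, a minimal sketch assuming a DataFrame df with a string column "age" (both hypothetical):

import org.apache.spark.sql.functions.col

// cast() on the Column class, applied through withColumn ...
val viaColumn = df.withColumn("age", col("age").cast("int"))

// ... or the same conversion through a SQL expression
val viaExpr = df.selectExpr("CAST(age AS INT) AS age")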

What is the function of map() in Spark?

Spark's map function takes one element as input, processes it according to custom code (specified by the developer), and returns one element at a time. map transforms an RDD of length N into another RDD of length N; the input and output RDDs will typically have the same number of records.
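A minimal sketch, assuming an active SparkSession named spark:

val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3))

// One element in, one element out: the output RDD also has three elements
val doubled = rdd.map(_ * 2)

doubled.collect()  // Array(2, 4, 6)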


1 Answer

Instead of Map[Column, Column] you should use a Column containing a map literal:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.typedLit

val translationMap: Column = typedLit(Map(
  "foo" -> "bar",
  "baz" -> "bab"
))

The rest of your code can stay as-is:

df.select(
  col("mov"),
  translationMap(col("mov"))
).show
+---+---------------------------------------+
|mov|keys: [foo,baz], values: [bar,bab][mov]|
+---+---------------------------------------+
|foo|                                    bar|
|baz|                                    bab|
+---+---------------------------------------+
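One caveat: values of mov that are absent from the map typically come back as null. If you'd rather keep the original value in that case, a sketch reusing df and translationMap from above could fall back with coalesce:

import org.apache.spark.sql.functions.{coalesce, col}

df.select(
  col("mov"),
  // keys missing from the map yield null; coalesce keeps the original value instead
  coalesce(translationMap(col("mov")), col("mov")).as("translated")
)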
answered Oct 03 '22 by user10938362