 

Scala: How can I replace a value in a DataFrame using Scala?

For example, I want to replace every value equal to 0.2 in a column with 0. How can I do that in Scala? Thanks.

Edit:

    |year| make|model| comment            |blank|
    |2012|Tesla| S   | No comment         |     |
    |1997| Ford| E350|Go get one now th...|     |
    |2015|Chevy| Volt| null               | null|

This is my DataFrame. I'm trying to change Tesla in the make column to S.

Tong asked Sep 02 '15 15:09





2 Answers

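Spark 1.6.2, Java code (sorry). This changes every instance of Tesla in the make column to S, across the whole DataFrame, without passing through an RDD:

    import static org.apache.spark.sql.functions.*;

    // withColumn returns a new DataFrame; the original is left unchanged.
    DataFrame updated = dataframe.withColumn("make",
        when(col("make").equalTo("Tesla"), "S")
            .otherwise(col("make")));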

Edited to add @marshall245's "otherwise" so that non-Tesla values aren't converted to NULL.
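For readers working in Scala, a possible alternative is `DataFrameNaFunctions.replace`, reached through `df.na`. The sketch below is not part of the answer above: it assumes Spark 2.x or later (for `SparkSession`), a local master, and a hypothetical object name `NaReplaceSketch` with sample rows loosely based on the question's table.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object NaReplaceSketch {
  // Replace "Tesla" with "S" in the "make" column only.
  // Values that do not appear as keys in the map are left as they are.
  def replaceMake(df: DataFrame): DataFrame =
    df.na.replace("make", Map("Tesla" -> "S"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark 2.x+ session; adjust master/appName as needed.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("na-replace-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data modelled on the question.
    val df = Seq(
      (2012, "Tesla", "S"),
      (1997, "Ford", "E350"),
      (2015, "Chevy", "Volt")
    ).toDF("year", "make", "model")

    replaceMake(df).show()
    spark.stop()
  }
}
```

Passing `"*"` as the column name applies the replacement map to every column of the matching type, which may be closer to "the entire dataframe" if that is what is wanted.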

Azeroth2b answered Sep 18 '22 02:09



Building off the solution from @Azeroth2b: if you want to replace only a couple of values and leave the rest unchanged, do the following. Without the otherwise(...) call, every non-matching value in the column becomes null.

    import org.apache.spark.sql.functions._

    val newsdf =
      sdf.withColumn(
        "make",
        when(col("make") === "Tesla", "S").otherwise(col("make"))
      )

Old DataFrame

    +-----+-----+
    | make|model|
    +-----+-----+
    |Tesla|    S|
    | Ford| E350|
    |Chevy| Volt|
    +-----+-----+

New DataFrame

    +-----+-----+
    | make|model|
    +-----+-----+
    |    S|    S|
    | Ford| E350|
    |Chevy| Volt|
    +-----+-----+
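The same when/otherwise pattern should also cover the numeric case from the original question (replacing 0.2 with 0). This is a minimal sketch, assuming Spark 2.x or later and a column of type Double; the helper `replaceValue`, the object name, and the `id`/`score` columns are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object ReplaceNumericValue {
  // Replace every value equal to `target` in `column` with `replacement`.
  // Non-matching values (and nulls) fall through to otherwise(...) unchanged.
  def replaceValue(df: DataFrame, column: String, target: Double, replacement: Double): DataFrame =
    df.withColumn(
      column,
      when(col(column) === target, replacement).otherwise(col(column))
    )

  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark 2.x+ session.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("replace-numeric-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val df = Seq(("a", 0.2), ("b", 0.5), ("c", 0.2)).toDF("id", "score")

    replaceValue(df, "score", 0.2, 0.0).show()
    spark.stop()
  }
}
```

Note that `===` here is an exact comparison on doubles. It matches values stored as the literal 0.2, but values produced by arithmetic (for example 0.1 + 0.1 + ... rounding slightly off) may need a tolerance check instead.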
marshall245 answered Sep 19 '22 02:09
