
Updating a dataframe column in spark

Looking at the new Spark DataFrame API, it is unclear whether it is possible to modify DataFrame columns.

How would I go about changing a value in row x column y of a dataframe?

In pandas this would be:

df.ix[x,y] = new_value 

Edit: Consolidating what was said below, you can't modify the existing dataframe as it is immutable, but you can return a new dataframe with the desired modifications.

If you just want to replace a value in a column based on a condition, like np.where:

from pyspark.sql import functions as F

update_func = (F.when(F.col('update_col') == replace_val, new_value)
               .otherwise(F.col('update_col')))
df = df.withColumn('new_column_name', update_func)

If you want to perform some operation on a column and create a new column that is added to the dataframe:

import pyspark.sql.functions as F
import pyspark.sql.types as T

def my_func(col):
    # do stuff to the value here; as an example, trim and lower-case it
    transformed_value = col.strip().lower()
    return transformed_value

# if we assume that my_func returns a string
my_udf = F.UserDefinedFunction(my_func, T.StringType())

df = df.withColumn('new_column_name', my_udf('update_col'))

If you want the new column to have the same name as the old column, you could add the additional step:

df = df.drop('update_col').withColumnRenamed('new_column_name', 'update_col') 
Luke asked Mar 17 '15



1 Answer

While you cannot modify a column as such, you may operate on a column and return a new DataFrame reflecting that change. For that you'd first create a UserDefinedFunction implementing the operation to apply and then selectively apply that function to the targeted column only. In Python:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

name = 'target_column'
udf = UserDefinedFunction(lambda x: 'new_value', StringType())
new_df = old_df.select(*[udf(column).alias(name) if column == name else column
                         for column in old_df.columns])

new_df now has the same schema as old_df (assuming that old_df.target_column was of type StringType as well) but all values in column target_column will be new_value.

karlson answered Oct 09 '22