
Updating a dataframe column in spark

Looking at the new Spark DataFrame API, it is unclear whether it is possible to modify DataFrame columns.

How would I go about changing a value in row x column y of a dataframe?

In pandas this would be:

df.ix[x,y] = new_value 

Edit: Consolidating what was said below, you can't modify the existing dataframe as it is immutable, but you can return a new dataframe with the desired modifications.

If you just want to replace a value in a column based on a condition, like np.where:

from pyspark.sql import functions as F

update_func = (F.when(F.col('update_col') == replace_val, new_value)
               .otherwise(F.col('update_col')))
df = df.withColumn('new_column_name', update_func)

If you want to perform some operation on a column and create a new column that is added to the dataframe:

import pyspark.sql.functions as F
import pyspark.sql.types as T

def my_func(col):
    # do stuff to the value here; as an example, trim and lower-case it
    transformed_value = col.strip().lower()
    return transformed_value

# if we assume that my_func returns a string
my_udf = F.UserDefinedFunction(my_func, T.StringType())

df = df.withColumn('new_column_name', my_udf('update_col'))

If you want the new column to have the same name as the old column, you could add the additional step:

df = df.drop('update_col').withColumnRenamed('new_column_name', 'update_col') 
Luke asked Mar 17 '15



1 Answer

While you cannot modify a column as such, you may operate on a column and return a new DataFrame reflecting that change. For that you'd first create a UserDefinedFunction implementing the operation to apply and then selectively apply that function to the targeted column only. In Python:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

name = 'target_column'
udf = UserDefinedFunction(lambda x: 'new_value', StringType())
new_df = old_df.select(*[udf(column).alias(name) if column == name else column
                         for column in old_df.columns])

new_df now has the same schema as old_df (assuming that old_df.target_column was of type StringType as well) but all values in column target_column will be new_value.

karlson answered Oct 09 '22