 

Assign value to specific cell in PySpark dataFrame

I want to change a value in a specific cell of my Spark DataFrame using PySpark.

Trivial example - I create a mock Spark DataFrame:

df = spark.createDataFrame(
    [
     (1, 1.87, 'new_york'), 
     (4, 2.76, 'la'), 
     (6, 3.3, 'boston'), 
     (8, 4.1, 'detroit'), 
     (2, 5.70, 'miami'), 
     (3, 6.320, 'atlanta'), 
     (1, 6.1, 'houston')
    ],
    ('variable_1', "variable_2", "variable_3")
)

Running display(df), I get this table:

variable_1   variable_2   variable_3
    1           1.87    new_york
    4           2.76    la
    6           3.3     boston
    8           4.1     detroit
    2           5.7     miami
    3           6.32    atlanta
    1           6.1     houston

Let's say, for example, I would like to assign a new value to the cell in the 4th row and 3rd column, i.e. changing detroit to new_orleans. I know assignments such as df.iloc[4, 3] = 'new_orleans' or df.loc[4, 'variable_3'] = 'new_orleans' are not valid in Spark.

A valid answer to my question using when would be:

from pyspark.sql.functions import when

targetDf = df.withColumn(
    "variable_3",
    when((df["variable_1"] == 8) & (df["variable_2"] == 4.1), "new_orleans")
    .otherwise(df["variable_3"])
)

My question is: is there a more practical way to do this in PySpark, without having to enter all the values and column names of the row in which I want to change a single cell (perhaps without using the when function at all)?

Thanks in advance for your help, and to @user9806664 for his feedback.

NuValue asked May 17 '18 13:05


People also ask

How do you assign a value to a DataFrame in PySpark?

You can update a PySpark DataFrame column using withColumn(), select() or sql(). Since DataFrames are distributed, immutable collections, you can't really change column values in place; when you change a value using withColumn() or any other approach, PySpark returns a new DataFrame with the updated values.

How do I change a column value in Spark DataFrame?

Spark's withColumn() function is used to update the value of a DataFrame column. withColumn() takes 2 arguments: first the column you want to update, and second the value you want to update it with. If the specified column name is not found, it creates a new column with the specified value.

What does describe () do in PySpark?

The DESCRIBE FUNCTION statement returns the basic metadata of an existing function. The metadata includes the function name, implementing class and usage details. If the optional EXTENDED option is specified, the basic metadata is returned along with extended usage information.


1 Answer

Spark DataFrames are immutable, don't provide random access and are, strictly speaking, unordered. As a result:

  • You cannot assign anything (because of immutability).
  • You cannot access a specific row (because there is no random access).
  • Row "indices" are not well defined (because rows are unordered).

What you can do is create a new DataFrame with a new column that replaces the existing one, using a conditional expression, which is already covered by the answers you found.

Also, monotonically_increasing_id doesn't add indices (row numbers). It adds monotonically increasing numbers that are not necessarily consecutive or starting from any particular value (in the case of empty partitions).

user9806664 answered Nov 15 '22 07:11