I want to change a value in a specific cell of my Spark DataFrame using PySpark.
Trivial example - I create a mock Spark DataFrame:
df = spark.createDataFrame(
    [
        (1, 1.87, 'new_york'),
        (4, 2.76, 'la'),
        (6, 3.3, 'boston'),
        (8, 4.1, 'detroit'),
        (2, 5.70, 'miami'),
        (3, 6.320, 'atlanta'),
        (1, 6.1, 'houston')
    ],
    ('variable_1', 'variable_2', 'variable_3')
)
Running display(df) I get this table:
variable_1 variable_2 variable_3
1 1.87 new_york
4 2.76 la
6 3.3 boston
8 4.1 detroit
2 5.7 miami
3 6.32 atlanta
1 6.1 houston
Let's say, for example, I would like to assign a new value to the cell in the 4th row and 3rd column, i.e. changing detroit to new_orleans. I know that pandas-style assignments such as df.iloc[4, 3] = 'new_orleans' or df.loc[4, 'detroit'] = 'new_orleans' are not valid in Spark.
A valid answer to my question using when would be:
from pyspark.sql.functions import when

targetDf = df.withColumn(
    "variable_3",
    when((df["variable_1"] == 8) & (df["variable_2"] == 4.1), 'new_orleans').otherwise(df["variable_3"])
)
My question is: could this be done in a more practical way in PySpark, without having to enter all the values and column names of the row in which I want to change just a single cell (and perhaps without using the when function)?
Thanks in advance for your help, and thanks to @useruser9806664 for his feedback.
You can update a PySpark DataFrame column using withColumn(), select() or sql(). Since DataFrames are distributed, immutable collections, you can't really change column values in place; when you change a value with withColumn() or any other approach, PySpark returns a new DataFrame with the updated values.
Spark's withColumn() function is used to update the value of a column. It takes two arguments: first the column you want to update, and second the value (expression) you want to update it with. If the specified column name is not found, it creates a new column with the specified value.
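For completeness, here is a minimal sketch of those three routes (withColumn(), select() and spark.sql()), assuming the mock df from the question; the temp view name cities is just an illustrative choice:

from pyspark.sql import functions as F

# withColumn(): rebuild variable_3 with a conditional expression
updated = df.withColumn(
    "variable_3",
    F.when((F.col("variable_1") == 8) & (F.col("variable_2") == 4.1), "new_orleans")
     .otherwise(F.col("variable_3"))
)

# select(): same idea, spelling out the projection explicitly
updated_select = df.select(
    "variable_1",
    "variable_2",
    F.when((F.col("variable_1") == 8) & (F.col("variable_2") == 4.1), "new_orleans")
     .otherwise(F.col("variable_3"))
     .alias("variable_3")
)

# sql(): register a temp view and express the same condition as a CASE WHEN
df.createOrReplaceTempView("cities")
updated_sql = spark.sql("""
    SELECT variable_1,
           variable_2,
           CASE WHEN variable_1 = 8 AND variable_2 = 4.1 THEN 'new_orleans'
                ELSE variable_3 END AS variable_3
    FROM cities
""")

All three return a new DataFrame; the original df is left untouched.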
Spark DataFrames are immutable, don't provide random access and are, strictly speaking, unordered. As a result, what you can do is create a new DataFrame with a new column that replaces the existing one, using some conditional expression, which is already covered by the answers you found.
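If the goal is simply to swap one literal value for another (rather than address a cell by position), DataFrame.replace() avoids when() altogether; note that it replaces every occurrence of the value in the chosen column, not a single addressable cell. A minimal sketch, again assuming the mock df from the question:

# replace every 'detroit' in variable_3 with 'new_orleans'
targetDf = df.replace('detroit', 'new_orleans', subset=['variable_3'])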
Also, monotonically_increasing_id doesn't add indices (row numbers). It adds monotonically increasing numbers, which are not necessarily consecutive and don't necessarily start from any particular value (for example, when there are empty partitions).
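If you really do want to address the "4th row" by position, one workaround is to derive a consecutive index first, for example with row_number() over monotonically_increasing_id(), and then condition on that index. This is only a sketch: the row order it relies on is whatever order the data happened to be created in, which Spark does not guarantee in general, and the unpartitioned window pulls all rows into a single partition, so it is only sensible for small DataFrames. The column name row_idx is an arbitrary choice:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Attach a consecutive 0-based index; monotonically_increasing_id() alone is only
# monotonic, not consecutive, so row_number() over it closes the gaps.
w = Window.orderBy(F.monotonically_increasing_id())
indexed = df.withColumn("row_idx", F.row_number().over(w) - 1)

# "Update" the single cell at (row index 3, column variable_3),
# still via a conditional column expression.
targetDf = indexed.withColumn(
    "variable_3",
    F.when(F.col("row_idx") == 3, "new_orleans").otherwise(F.col("variable_3"))
).drop("row_idx")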