I want to change a value in a specific cell of my Spark DataFrame using PySpark.
Trivial example - I create a mock Spark DataFrame:
df = spark.createDataFrame(
    [
        (1, 1.87, 'new_york'),
        (4, 2.76, 'la'),
        (6, 3.3, 'boston'),
        (8, 4.1, 'detroit'),
        (2, 5.70, 'miami'),
        (3, 6.320, 'atlanta'),
        (1, 6.1, 'houston')
    ],
    ('variable_1', 'variable_2', 'variable_3')
)
Running display(df) I get this table:
variable_1 variable_2 variable_3
1 1.87 new_york
4 2.76 la
6 3.3 boston
8 4.1 detroit
2 5.7 miami
3 6.32 atlanta
1 6.1 houston
Let's say, for example, I would like to assign a new value to the cell in the 4th row and 3rd column, i.e. changing detroit to new_orleans. I know that pandas-style assignments such as df.iloc[4, 3] = 'new_orleans' or df.loc[4, 'detroit'] = 'new_orleans' are not valid in Spark.
A valid answer to my question using when would be:
from pyspark.sql.functions import when

targetDf = df.withColumn(
    "variable_3",
    when((df["variable_1"] == 8) & (df["variable_2"] == 4.1), 'new_orleans').otherwise(df["variable_3"])
)
My question is: could this be done in a more practical way in PySpark, without having to enter all the values and column names of the row in which I want to change just a single cell (and perhaps without using the when function)?
Thanks in advance for your help, and thanks to @useruser9806664 for his feedback.
You can update a PySpark DataFrame column using withColumn(), select() or sql(). Since DataFrames are distributed, immutable collections, you can't really change column values in place; when you change a value with withColumn() or any other approach, PySpark returns a new DataFrame with the updated values.
Spark's withColumn() function is used to update the value of a column. It takes two arguments: first the column you want to update, and second the value (expression) you want to update it with. If the specified column name is not found, it creates a new column with the specified value.
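For completeness, here is a minimal sketch of those three routes (withColumn(), select() and spark.sql()), assuming the mock df from the question; the temp view name cities is just an illustrative choice:

from pyspark.sql import functions as F

# withColumn(): rebuild variable_3 with a conditional expression
updated = df.withColumn(
    "variable_3",
    F.when((F.col("variable_1") == 8) & (F.col("variable_2") == 4.1), "new_orleans")
     .otherwise(F.col("variable_3"))
)

# select(): same idea, spelling out the projection explicitly
updated_select = df.select(
    "variable_1",
    "variable_2",
    F.when((F.col("variable_1") == 8) & (F.col("variable_2") == 4.1), "new_orleans")
     .otherwise(F.col("variable_3"))
     .alias("variable_3")
)

# sql(): register a temp view and express the same condition as a CASE WHEN
df.createOrReplaceTempView("cities")
updated_sql = spark.sql("""
    SELECT variable_1,
           variable_2,
           CASE WHEN variable_1 = 8 AND variable_2 = 4.1 THEN 'new_orleans'
                ELSE variable_3 END AS variable_3
    FROM cities
""")

All three return a new DataFrame; the original df is left untouched.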
Spark DataFrames are immutable, don't provide random access and are, strictly speaking, unordered. As a result, what you can do is create a new DataFrame with a new column that replaces the existing one, using some conditional expression, which is already covered by the answers you found.
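If the goal is simply to swap one literal value for another (rather than address a cell by position), DataFrame.replace() avoids when() altogether; note that it replaces every occurrence of the value in the chosen column, not a single addressable cell. A minimal sketch, again assuming the mock df from the question:

# replace every 'detroit' in variable_3 with 'new_orleans'
targetDf = df.replace('detroit', 'new_orleans', subset=['variable_3'])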
Also, monotonically_increasing_id doesn't add indices (row numbers). It adds monotonically increasing numbers, which are not necessarily consecutive and don't necessarily start from any particular value (for example, when there are empty partitions).
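If you really do want to address the "4th row" by position, one workaround is to derive a consecutive index first, for example with row_number() over monotonically_increasing_id(), and then condition on that index. This is only a sketch: the row order it relies on is whatever order the data happened to be created in, which Spark does not guarantee in general, and the unpartitioned window pulls all rows into a single partition, so it is only sensible for small DataFrames. The column name row_idx is an arbitrary choice:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Attach a consecutive 0-based index; monotonically_increasing_id() alone is only
# monotonic, not consecutive, so row_number() over it closes the gaps.
w = Window.orderBy(F.monotonically_increasing_id())
indexed = df.withColumn("row_idx", F.row_number().over(w) - 1)

# "Update" the single cell at (row index 3, column variable_3),
# still via a conditional column expression.
targetDf = indexed.withColumn(
    "variable_3",
    F.when(F.col("row_idx") == 3, "new_orleans").otherwise(F.col("variable_3"))
).drop("row_idx")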