I am trying to run some code, but I am getting this error:
'DataFrame' object has no attribute '_get_object_id'
The code:
items = [(1,12),(1,float('Nan')),(1,14),(1,10),(2,22),(2,20),(2,float('Nan')),(3,300),
         (3,float('Nan'))]
sc = spark.sparkContext
rdd = sc.parallelize(items)
df = rdd.toDF(["id", "col1"])
import pyspark.sql.functions as func
means = df.groupby("id").agg(func.mean("col1"))
# The error is thrown at this line
df = df.withColumn("col1", func.when((df["col1"].isNull()), means.where(func.col("id")==df["id"])).otherwise(func.col("col1")))
You can't reference a second Spark DataFrame inside a function unless you're using a join. IIUC, you can do the following to achieve your desired result.
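(As a quick aside on why the original line blows up — this is just a minimal sketch, not part of the question's code, with 0.0 as a stand-in literal: when() expects a Column or a plain literal for its value argument, so handing it the means DataFrame makes PySpark try to pass it to the JVM as a literal, which is where the missing _get_object_id attribute gets looked up.)

import pyspark.sql.functions as func

# Fine: the value argument is a Column built from a literal
df.withColumn("col1", func.when(df["col1"].isNull(), func.lit(0.0)).otherwise(df["col1"])).show()

# Raises "'DataFrame' object has no attribute '_get_object_id'",
# because means is a DataFrame, not a Column:
# func.when(df["col1"].isNull(), means)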
Suppose that means is the following:

#means.show()
#+---+---------+
#| id|avg(col1)|
#+---+---------+
#|  1|     12.0|
#|  3|    300.0|
#|  2|     21.0|
#+---+---------+

Join df and means on the id column, then apply your when condition:
from pyspark.sql.functions import when

df.join(means, on="id")\
    .withColumn(
        "col1",
        when(
            df["col1"].isNull(),
            means["avg(col1)"]
        ).otherwise(df["col1"])
    )\
    .select(*df.columns)\
    .show()
#+---+-----+
#| id| col1|
#+---+-----+
#|  1| 12.0|
#|  1| 12.0|
#|  1| 14.0|
#|  1| 10.0|
#|  3|300.0|
#|  3|300.0|
#|  2| 21.0|
#|  2| 22.0|
#|  2| 20.0|
#+---+-----+
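A small readability tweak (just a sketch — mean_col1 is an alias I'm introducing for illustration): if you alias the aggregated column when building means, you don't have to reference the auto-generated avg(col1) name in the join version:

from pyspark.sql.functions import mean, when

# rebuild means with a friendlier column name (illustrative alias)
means = df.groupby("id").agg(mean("col1").alias("mean_col1"))

df.join(means, on="id")\
    .withColumn("col1", when(df["col1"].isNull(), means["mean_col1"]).otherwise(df["col1"]))\
    .select(*df.columns)\
    .show()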
But in this case, I'd actually recommend using a Window with pyspark.sql.functions.mean:
from pyspark.sql import Window
from pyspark.sql.functions import col, mean

df.withColumn(
    "col1",
    when(
        col("col1").isNull(),
        mean("col1").over(Window.partitionBy("id"))
    ).otherwise(col("col1"))
).show()
#+---+-----+
#| id| col1|
#+---+-----+
#|  1| 12.0|
#|  1| 10.0|
#|  1| 12.0|
#|  1| 14.0|
#|  3|300.0|
#|  3|300.0|
#|  2| 22.0|
#|  2| 20.0|
#|  2| 21.0|
#+---+-----+
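Since the window computes the per-id mean in place, there's no second DataFrame to reference and no explicit join. If you find when/otherwise verbose, the same replacement can be written with coalesce — a sketch of an equivalent form, assuming the missing values really are nulls:

from pyspark.sql import Window
from pyspark.sql.functions import coalesce, col, mean

# coalesce picks col1 when it's non-null, otherwise the per-id window mean
df.withColumn(
    "col1",
    coalesce(col("col1"), mean("col1").over(Window.partitionBy("id")))
).show()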