
Spark join throws 'function' object has no attribute '_get_object_id' error. How could I fix it?

I am making a query in Spark on Databricks, and I am having a problem when trying to join two dataframes. The two dataframes I have are the following:

  • "names_df" which has 2 columns: "ID", "title" that refer to the id and the title of films.

    +-------+-----------------------------+
    |ID     |title                        |
    +-------+-----------------------------+
    |1      |Toy Story                    |
    |2      |Jumanji                      |
    |3      |Grumpier Old Men             |
    +-------+-----------------------------+
    
  • "info" which has 3 columns: "movieId", "count", "average" that refer to the id of the film, the number of ranks that it has, and the average of those ratings.

    +-------+-----+------------------+
    |movieId|count|average           |
    +-------+-----+------------------+
    |1831   |7463 |2.5785207021305103|
    |431    |8946 |3.695059244355019 |
    |631    |2193 |2.7273141814865483|
    +-------+-----+------------------+
    

This "info" dataframe was created this way:

from pyspark.sql import functions as F  # needed for F.count / F.avg
info = ratings_df.groupBy('movieId').agg(F.count(ratings_df.rating).alias("count"), F.avg(ratings_df.rating).alias("average"))

Where "ratings_df" is another dataframe that contains 3 columns: "userId", "movieId" and "rating", that refer to the id of the user that voted, the id of the film that the user voted to, and the rating for that film:

+-------+-------+-------------+
|userId |movieId|rating       |
+-------+-------+-------------+
|1      |2      |3.5          |
|1      |29     |3.5          |
|1      |32     |3.5          |
+-------+-------+-------------+

I am trying to join these two dataframes to get another one with the columns "movieId", "title", "count" and "average":

+-------+-----------------------------+-----+-------+
|average|title                        |count|movieId|
+-------+-----------------------------+-----+-------+
|5.0    |Ella Lola, a la Trilby (1898)|1    |94431  |
|5.0    |Serving Life (2011)          |1    |129034 |
|5.0    |Diplomatic Immunity (2009? ) |1    |107434 |
+-------+-----------------------------+-----+-------+

So the operation I performed was the following:

movie_names_df = info.join(movies_df, info.movieId == movies_df.ID, "inner").select(movies_df.title, info.average, info.movieId, info.count).show()

The problem is that I get the following error message:

AttributeError: 'function' object has no attribute '_get_object_id'

I know this error occurs because Spark considers info.count to be a function (the DataFrame's count method) and not the column I defined previously.
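
Just to double-check that interpretation, a quick sketch along these lines (using the same "info" dataframe as above) shows what the two access styles resolve to:

    # Dot access finds the DataFrame.count method; bracket access returns the column.
    print(type(info.count))      # <class 'method'>
    print(type(info["count"]))   # <class 'pyspark.sql.column.Column'>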

So, how could I make that join correctly to get what I want?

Thank you so much!

asked Sep 07 '16 by jartymcfly


People also ask

How do I join two DataFrames in PySpark?

Summary: PySpark DataFrames have a join method which takes three parameters: the DataFrame on the right side of the join, which fields are being joined on, and the type of join (inner, outer, left_outer, right_outer, leftsemi). You call the join method from the left-side DataFrame object, such as df1.join(df2, df1.
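
A minimal, self-contained sketch of that call pattern (the dataframes and the "id" column below are made up for illustration, not taken from the question):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "left_val"])
    df2 = spark.createDataFrame([(1, "x"), (3, "y")], ["id", "right_val"])

    # Right-side DataFrame, join condition, and join type, in that order.
    df1.join(df2, df1.id == df2.id, "inner").show()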

What is withColumn in PySpark?

In PySpark, the withColumn() function is a widely used DataFrame transformation that is used to change the value of an existing column, convert its datatype, or create a new column.
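
A small illustrative sketch (the dataframe and column names here are assumptions, not from the question):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "3.5"), (2, "4.0")], ["movieId", "rating"])

    # Convert the datatype of an existing column, then derive a new one from it.
    df = df.withColumn("rating", F.col("rating").cast("double"))
    df = df.withColumn("rating_x2", F.col("rating") * 2)
    df.show()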

What is an anti join in PySpark?

The left anti join in PySpark is similar to the join functionality, but it returns only columns from the left DataFrame for non-matched records.
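
A short sketch of a left anti join (again with made-up dataframes):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    customers = spark.createDataFrame([(1,), (2,), (3,)], ["customer_id"])
    orders = spark.createDataFrame([(1, "A"), (3, "B")], ["customer_id", "order"])

    # Keep only customers with no matching order; only left-side columns are returned.
    customers.join(orders, "customer_id", "left_anti").show()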


1 Answer

Adding a comment as an answer since it solved the problem: count is effectively a protected name in the DataFrame API (DataFrame.count is a method), so naming a column count is risky. In your case you can work around the error by using bracket-based column access instead of dot notation, e.g.

info["count"]

answered Nov 05 '22 by LiMuBei