
Spark join throws 'function' object has no attribute '_get_object_id' error. How could I fix it?

I am making a query in Spark on Databricks, and I am having a problem when trying to join two dataframes. The two dataframes I have are the following:

  • "names_df" which has 2 columns: "ID", "title" that refer to the id and the title of films.

    +-------+-----------------------------+
    |ID     |title                        |
    +-------+-----------------------------+
    |1      |Toy Story                    |
    |2      |Jumanji                      |
    |3      |Grumpier Old Men             |
    +-------+-----------------------------+
    
  • "info" which has 3 columns: "movieId", "count", "average" that refer to the id of the film, the number of ranks that it has, and the average of those ratings.

    +-------+-----+------------------+
    |movieId|count|average           |
    +-------+-----+------------------+
    |1831   |7463 |2.5785207021305103|
    |431    |8946 |3.695059244355019 |
    |631    |2193 |2.7273141814865483|
    +-------+-----+------------------+
    

This "info" dataframe was created this way:

from pyspark.sql import functions as F  # needed for F.count / F.avg
info = ratings_df.groupBy('movieId').agg(F.count(ratings_df.rating).alias("count"), F.avg(ratings_df.rating).alias("average"))

Where "ratings_df" is another dataframe that contains 3 columns: "userId", "movieId" and "rating", that refer to the id of the user that voted, the id of the film that the user voted to, and the rating for that film:

+-------+-------+-------------+
|userId |movieId|rating       |
+-------+-------+-------------+
|1      |2      |3.5          |
|1      |29     |3.5          |
|1      |32     |3.5          |
+-------+-------+-------------+

I am trying to join these two dataframes to get another one with the columns "movieId", "title", "count" and "average":

+-------+-----------------------------+-----+-------+
|average|title                        |count|movieId|
+-------+-----------------------------+-----+-------+
|5.0    |Ella Lola, a la Trilby (1898)|1    |94431  |
|5.0    |Serving Life (2011)          |1    |129034 |
|5.0    |Diplomatic Immunity (2009? ) |1    |107434 |
+-------+-----------------------------+-----+-------+

So the operation I performed was the following:

movie_names_df = info.join(movies_df, info.movieId == movies_df.ID, "inner").select(movies_df.title, info.average, info.movieId, info.count).show()

The problem is that I get the following error message:

AttributeError: 'function' object has no attribute '_get_object_id'

I know this error occurs because Spark considers info.count to be a function (the DataFrame's count method) and not the column I defined previously.
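
Just to double-check that interpretation, a quick sketch along these lines (using the same "info" dataframe as above) shows what the two access styles resolve to:

    # Dot access finds the DataFrame.count method; bracket access returns the column.
    print(type(info.count))      # <class 'method'>
    print(type(info["count"]))   # <class 'pyspark.sql.column.Column'>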

So, how could I make that join correctly to get what I want?

Thank you so much!

asked Sep 07 '16 by jartymcfly


People also ask

How do I join two DataFrames in PySpark?

Summary: PySpark DataFrames have a join method which takes three parameters: the DataFrame on the right side of the join, which fields are being joined on, and the type of join (inner, outer, left_outer, right_outer, leftsemi). You call the join method from the left-side DataFrame object, such as df1.join(df2, df1.
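
A minimal, self-contained sketch of that call pattern (the dataframes and the "id" column below are made up for illustration, not taken from the question):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "left_val"])
    df2 = spark.createDataFrame([(1, "x"), (3, "y")], ["id", "right_val"])

    # Right-side DataFrame, join condition, and join type, in that order.
    df1.join(df2, df1.id == df2.id, "inner").show()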

What is withColumn in PySpark?

In PySpark, the withColumn() function is a widely used DataFrame transformation that is used to change the value of an existing column, convert its datatype, or create a new column.
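
A small illustrative sketch (the dataframe and column names here are assumptions, not from the question):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "3.5"), (2, "4.0")], ["movieId", "rating"])

    # Convert the datatype of an existing column, then derive a new one from it.
    df = df.withColumn("rating", F.col("rating").cast("double"))
    df = df.withColumn("rating_x2", F.col("rating") * 2)
    df.show()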

What is an anti join in PySpark?

The left anti join in PySpark is similar to the join functionality, but it returns only columns from the left DataFrame for non-matched records.
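
A short sketch of a left anti join (again with made-up dataframes):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    customers = spark.createDataFrame([(1,), (2,), (3,)], ["customer_id"])
    orders = spark.createDataFrame([(1, "A"), (3, "B")], ["customer_id", "order"])

    # Keep only customers with no matching order; only left-side columns are returned.
    customers.join(orders, "customer_id", "left_anti").show()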


1 Answer

Adding a comment as an answer since it solved the problem: count is effectively a protected name in the DataFrame API (DataFrame.count is a method), so naming a column count is risky. In your case you can work around the error by using bracket-based column access instead of dot notation, e.g.

info["count"]

answered Nov 05 '22 by LiMuBei