I am making a query in Spark in Databricks, and I have a problema when I am trying to make a join between two dataframes. The two dataframes that I have are the next ones:
"names_df" which has 2 columns: "ID", "title" that refer to the id and the title of films.
+-------+-----------------------------+
|ID |title |
+-------+-----------------------------+
|1 |Toy Story |
|2 |Jumanji |
|3 |Grumpier Old Men |
+-------+-----------------------------+
"info" which has 3 columns: "movieId", "count", "average" that refer to the id of the film, the number of ranks that it has, and the average of those ratings.
+-------+-----+------------------+
|movieId|count|average |
+-------+-----+------------------+
|1831 |7463 |2.5785207021305103|
|431 |8946 |3.695059244355019 |
|631 |2193 |2.7273141814865483|
+-------+-----+------------------+
This "info" dataframe was created this way:
info = ratings_df.groupBy('movieId').agg(F.count(ratings_df.rating).alias("count"), F.avg(ratings_df.rating).alias("average"))
Where "ratings_df" is another dataframe that contains 3 columns: "userId", "movieId" and "rating", that refer to the id of the user that voted, the id of the film that the user voted to, and the rating for that film:
+-------+-------+-------------+
|userId |movieId|rating |
+-------+-------+-------------+
|1 |2 |3.5 |
|1 |29 |3.5 |
|1 |32 |3.5 |
+-------+-------+-------------+
I am trying to make a join between these two dataframes to get another one with those columns: "movieId", "title", "count", "average":
+-------+-----------------------------+-----+-------+
|average|title |count|movieId|
+-------+-----------------------------+-----+-------+
|5.0 |Ella Lola, a la Trilby (1898)|1 |94431 |
|5.0 |Serving Life (2011) |1 |129034 |
|5.0 |Diplomatic Immunity (2009? ) |1 |107434 |
+-------+-----------------------------+-----+-------+
So the operation I did was the next one:
movie_names_df = info.join(movies_df, info.movieId == movies_df.ID, "inner").select(movies_df.title, info.average, info.movieId, info.count).show()
The problem is that I get the next error message:
AttributeError: 'function' object has no attribute '_get_object_id'
And I know that this error occurs because it consider that info.count is a function, and not an attribute, as I defined previously.
So, how could I make that join correctly to get what I want?
Thank you so much!
Summary: Pyspark DataFrames have a join method which takes three parameters: DataFrame on the right side of the join, Which fields are being joined on, and what type of join (inner, outer, left_outer, right_outer, leftsemi). You call the join method from the left side DataFrame object such as df1. join(df2, df1.
In PySpark, the withColumn() function is widely used and defined as the transformation function of the DataFrame which is further used to change the value, convert the datatype of an existing column, create the new column etc.
The left anti join in PySpark is similar to the join functionality, but it returns only columns from the left DataFrame for non-matched records.
Adding comment as answer since it solved the problem. count
is somewhat of a protected keyword in DataFrame API, so naming columns count is dangerous. In your case you could circumvent the error by not using the dot notation, but bracket based column access, e.g.
info["count"]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With