I want to pivot a Spark DataFrame. I referred to the PySpark documentation, and based on the pivot
function, the clue is .groupBy('name').pivot('name', values=None)
. Here's my dataset:
In[75]: spDF.show()
Out[75]:
+-----------+-----------+
|customer_id| name|
+-----------+-----------+
| 25620| MCDonnalds|
| 25620| STARBUCKS|
| 25620| nan|
| 25620| nan|
| 25620| MCDonnalds|
| 25620| nan|
| 25620| MCDonnalds|
| 25620|DUNKINDONUT|
| 25620| LOTTERIA|
| 25620| nan|
| 25620| MCDonnalds|
| 25620|DUNKINDONUT|
| 25620|DUNKINDONUT|
| 25620| nan|
| 25620| nan|
| 25620| nan|
| 25620| nan|
| 25620| LOTTERIA|
| 25620| LOTTERIA|
| 25620| STARBUCKS|
+-----------+-----------+
only showing top 20 rows
And then I try to pivot the table on name:
In [96]:
spDF.groupBy('name').pivot('name', values=None)
Out[96]:
<pyspark.sql.group.GroupedData at 0x7f0ad03750f0>
And when I try to show it:
In [98]:
spDF.groupBy('name').pivot('name', values=None).show()
Out [98]:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-98-94354082e956> in <module>()
----> 1 spDF.groupBy('name').pivot('name', values=None).show()
AttributeError: 'GroupedData' object has no attribute 'show'
I don't know why a 'GroupedData' object can't be shown. What should I do to solve the issue?
The pivot() method returns a GroupedData object, just like groupBy(). You cannot use show() on a GroupedData object without first applying an aggregate function (such as sum() or even count()) to it.
Let's create some test data that resembles your dataset:
data = [
("123", "McDonalds"),
("123", "Starbucks"),
("123", "McDonalds"),
("777", "McDonalds"),
("777", "McDonalds"),
("777", "Dunkin")
]
df = spark.createDataFrame(data, ["customer_id", "name"])
df.show()
+-----------+---------+
|customer_id| name|
+-----------+---------+
| 123|McDonalds|
| 123|Starbucks|
| 123|McDonalds|
| 777|McDonalds|
| 777|McDonalds|
| 777| Dunkin|
+-----------+---------+
Let's pivot the dataset so the customer_ids are columns:
df.groupBy("name").pivot("customer_id").count().show()
+---------+----+----+
| name| 123| 777|
+---------+----+----+
|McDonalds| 2| 2|
|Starbucks| 1|null|
| Dunkin|null| 1|
+---------+----+----+
Now let's pivot the DataFrame so the restaurant names are columns:
df.groupBy("customer_id").pivot("name").count().show()
+-----------+------+---------+---------+
|customer_id|Dunkin|McDonalds|Starbucks|
+-----------+------+---------+---------+
| 777| 1| 2| null|
| 123| null| 2| 1|
+-----------+------+---------+---------+
Code like df.groupBy("name").show() errors out with the same AttributeError: 'GroupedData' object has no attribute 'show' message. You can only call the methods defined in the pyspark.sql.GroupedData class on instances of the GroupedData class.