
'GroupedData' object has no attribute 'show' when pivoting a Spark DataFrame

I want to pivot a Spark DataFrame. Going by the PySpark documentation for the pivot function, the pattern to use is .groupBy('name').pivot('name', values=None). Here's my dataset:

 In[75]:  spDF.show()
 Out[75]:

+-----------+-----------+
|customer_id|       name|
+-----------+-----------+
|      25620| MCDonnalds|
|      25620|  STARBUCKS|
|      25620|        nan|
|      25620|        nan|
|      25620| MCDonnalds|
|      25620|        nan|
|      25620| MCDonnalds|
|      25620|DUNKINDONUT|
|      25620|   LOTTERIA|
|      25620|        nan|
|      25620| MCDonnalds|
|      25620|DUNKINDONUT|
|      25620|DUNKINDONUT|
|      25620|        nan|
|      25620|        nan|
|      25620|        nan|
|      25620|        nan|
|      25620|   LOTTERIA|
|      25620|   LOTTERIA|
|      25620|  STARBUCKS|
+-----------+-----------+
only showing top 20 rows

Then I try to pivot the table on name:

In [96]:
spDF.groupBy('name').pivot('name', values=None)
Out[96]:
<pyspark.sql.group.GroupedData at 0x7f0ad03750f0>

And when I try to show the result:

In [98]:
spDF.groupBy('name').pivot('name', values=None).show()
Out [98]:

    ---------------------------------------------------------------------------
AttributeError                       Traceback (most recent call last)
<ipython-input-98-94354082e956> in <module>()
----> 1 spDF.groupBy('name').pivot('name', values=None).show()
AttributeError: 'GroupedData' object has no attribute 'show'

I don't know why 'GroupedData' can't be shown. What should I do to solve this?

Nabih Bawazir asked Aug 13 '18 11:08


2 Answers

The pivot() method returns a GroupedData object, just like groupBy(). A GroupedData object has no show() method; you first have to apply an aggregate function to it (such as sum() or count()), which returns a DataFrame that you can then show.

See this article for more information

ech0 answered Sep 21 '22 20:09


Let's create some test data that resembles your dataset:

data = [
    ("123", "McDonalds"),
    ("123", "Starbucks"),
    ("123", "McDonalds"),
    ("777", "McDonalds"),
    ("777", "McDonalds"),
    ("777", "Dunkin")
]
df = spark.createDataFrame(data, ["customer_id", "name"])
df.show()
+-----------+---------+
|customer_id|     name|
+-----------+---------+
|        123|McDonalds|
|        123|Starbucks|
|        123|McDonalds|
|        777|McDonalds|
|        777|McDonalds|
|        777|   Dunkin|
+-----------+---------+

Let's pivot the dataset so the customer_ids are columns:

df.groupBy("name").pivot("customer_id").count().show()

+---------+----+----+
|     name| 123| 777|
+---------+----+----+
|McDonalds|   2|   2|
|Starbucks|   1|null|
|   Dunkin|null|   1|
+---------+----+----+

Now let's pivot the DataFrame so the restaurant names are columns:

df.groupBy("customer_id").pivot("name").count().show()

+-----------+------+---------+---------+
|customer_id|Dunkin|McDonalds|Starbucks|
+-----------+------+---------+---------+
|        777|     1|        2|     null|
|        123|  null|        2|        1|
+-----------+------+---------+---------+
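To see why an aggregate is needed at all, it can help to spell out in plain Python roughly what groupBy("customer_id").pivot("name").count() works out for the test data above. This is only a conceptual sketch, not how Spark implements it; the names cell_counts, names, and pivoted are made up for the illustration.

```python
from collections import Counter

data = [
    ("123", "McDonalds"),
    ("123", "Starbucks"),
    ("123", "McDonalds"),
    ("777", "McDonalds"),
    ("777", "McDonalds"),
    ("777", "Dunkin"),
]

# Aggregation step: several input rows collapse into one number per
# (customer_id, name) pair. Without it, a cell would have no single value.
cell_counts = Counter(data)

# The distinct pivot values become the column names.
names = sorted({name for _, name in data})

# One output row per customer_id; a combination that never occurs stays
# None, which Spark displays as null.
pivoted = {
    customer_id: {name: cell_counts.get((customer_id, name)) for name in names}
    for customer_id in sorted({cid for cid, _ in data})
}

for customer_id, row in pivoted.items():
    print(customer_id, row)
```

The resulting cells match the pivoted table shown above: customer 123 has two McDonalds rows and one Starbucks row, customer 777 has two McDonalds rows and one Dunkin row.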

Code like df.groupBy("name").show() fails with AttributeError: 'GroupedData' object has no attribute 'show', because groupBy() returns a GroupedData object rather than a DataFrame. You can only call methods defined in the pyspark.sql.GroupedData class on instances of GroupedData.

Powers answered Sep 20 '22 20:09