I am currently trying to alias the columns I get after pivoting on a value in a PySpark dataframe. The problem is that the column names I pass to the alias calls are not being applied.
A concrete example:
Starting from this dataframe:
import pyspark.sql.functions as func

# Assumes an existing SparkContext `sc` (and SQLContext), as in a standard PySpark shell session
df = sc.parallelize([
    (217498, 100000001, 'A'), (217498, 100000025, 'A'), (217498, 100000124, 'A'),
    (217498, 100000152, 'B'), (217498, 100000165, 'C'), (217498, 100000177, 'C'),
    (217498, 100000182, 'A'), (217498, 100000197, 'B'), (217498, 100000210, 'B'),
    (854123, 100000005, 'A'), (854123, 100000007, 'A')
]).toDF(["user_id", "timestamp", "actions"])
which gives
+-------+---------+-------+
|user_id|timestamp|actions|
+-------+---------+-------+
| 217498|100000001|      A|
| 217498|100000025|      A|
| 217498|100000124|      A|
| 217498|100000152|      B|
| 217498|100000165|      C|
| 217498|100000177|      C|
| 217498|100000182|      A|
| 217498|100000197|      B|
| 217498|100000210|      B|
| 854123|100000005|      A|
| 854123|100000007|      A|
+-------+---------+-------+
The problem is that calling
df = df.groupby('user_id') \
    .pivot('actions') \
    .agg(func.count('timestamp').alias('ts_count'),
         func.mean('timestamp').alias('ts_mean'))
gives the column names
df.columns
['user_id',
'A_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L',
'A_(avg(timestamp),mode=Complete,isDistinct=false) AS ts_mean#5',
'B_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L',
'B_(avg(timestamp),mode=Complete,isDistinct=false) AS ts_mean#5',
'C_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L',
'C_(avg(timestamp),mode=Complete,isDistinct=false) AS ts_mean#5']
which are completely impractical.
I could clean up my column names with a regex, or by using withColumnRenamed(). However, these are workarounds that could easily break after a Spark update.
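For illustration, here is a minimal sketch of that rename workaround (the regex is my assumption, reverse-engineered from the generated names shown above, so it is exactly the kind of fragile parsing I would like to avoid):

import re

# Assumed shape: '<value>_(<agg expr>,mode=...,isDistinct=...) AS <alias>#<id>'
# Keep the pivot value and the alias; drop everything in between.
cleaned = [re.sub(r'_\(.*\) AS (\w+)#\d+L?$', r'_\1', c) for c in df.columns]
df = df.toDF(*cleaned)
# e.g. 'A_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L' -> 'A_ts_count'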
To sum it up: how can I use the columns generated by the pivot without having to parse generated names like 'A_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L'?
Any help would be appreciated! Thanks
This is happening because the column you are pivoting on doesn't have distinct values per group. That results in duplicate column names when the pivot occurs, so Spark generates those long names to keep them distinct. You need to group on your pivot column before you pivot, so that the values in the pivot column (actions) are distinct; see the sketch below.
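A minimal sketch of my reading of that suggestion (using first() to carry the pre-aggregated values through the pivot is my assumption, not something stated above):

# Pre-aggregate so there is one row per (user_id, actions) pair
pre = df.groupby('user_id', 'actions') \
    .agg(func.count('timestamp').alias('ts_count'),
         func.mean('timestamp').alias('ts_mean'))

# Pivot the already-aggregated rows; first() just picks the single value per cell
pivoted = pre.groupby('user_id').pivot('actions') \
    .agg(func.first('ts_count'), func.first('ts_mean'))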
Let me know if you need more help, @hyperc54!