Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pivoting a pandas dataframe with duplicate index values

Tags:

python

pandas

I have a data frame which has rows for each user joining my site and making a purchase.

+---+-----+--------------------+---------+--------+-----+
|   | uid |        msg         |  _time  | gender | age |
+---+-----+--------------------+---------+--------+-----+
| 0 |   1 | confirmed_settings | 1/29/15 | M      |  37 |
| 1 |   1 | sale               | 4/13/15 | M      |  37 |
| 2 |   3 | confirmed_settings | 4/19/15 | M      |  35 |
| 3 |   4 | confirmed_settings | 2/21/15 | M      |  21 |
| 4 |   5 | confirmed_settings | 3/28/15 | M      |  18 |
| 5 |   4 | sale               | 3/15/15 | M      |  21 |
+---+-----+--------------------+---------+--------+-----+

I would like to change the dataframe so that each row is unique for a uid and there is a columns called sale and confirmed_settings which have the timestamp of the action. Note that not every user has a sale, but every user has a confirmed_settings. Like below:

+---+-----+--------------------+---------+---------+--------+-----+
|   | uid | confirmed_settings |  sale   |  _time  | gender | age |
+---+-----+--------------------+---------+---------+--------+-----+
| 0 |   1 | 1/29/15            | 4/13/15 | 1/29/15 | M      |  37 |
| 1 |   3 | 4/19/15            | null    | 4/19/15 | M      |  35 |
| 2 |   4 | 2/21/15            | 3/15/15 | 2/21/15 | M      |  21 |
| 3 |   5 | 3/28/15            | null    | 3/28/15 | M      |  18 |
+---+-----+--------------------+---------+---------+--------+-----+

To do this, I am trying:

df1 = df.pivot(index='uid', columns='msg', values='_time').reset_index()
df1 = df1.merge(df[['uid', 'gender', 'age']].drop_duplicates(), on='uid')

But I get this error: ValueError: Index contains duplicate entries, cannot reshape

How can I pivot a df with duplicate index values to transform my dataframe?


Edit: df1 = df.pivot_table(index='uid', columns='msg', values='_time').reset_index()

gives this error DataError: No numeric types to aggregate but im not even sure that is the right path to go on.

like image 282
metersk Avatar asked Apr 28 '15 18:04

metersk


1 Answers

I suspect there are indeed duplicate uid-msg entries/keys (e.g. uid 2 has 2 confirmed_settings entries under msg), which you alluded to in the comments for fixxxer's answer. If there are, you can't use pivot, because you can't tell it how to treat the different values encountered during aggregation (count? max? mean? sum?). Note that the Index error is an error on the Index of the resulting pivoted table df1, not the original DataFrame df.

pivot_table lets you do it however with the aggfunc argument. How about something like this?

df1 = df.pivot_table(index = 'uid', columns = 'msg', values = '_time', aggfunc = len)

This will help you figure out which user-msg records have duplicate entries (anything with over 1), and after cleaning them out, you can use pivot on df to successfully pivot _time.

like image 187
selwyth Avatar answered Oct 21 '22 13:10

selwyth