I have a data frame which has rows for each user joining my site and making a purchase.
+---+-----+--------------------+---------+--------+-----+
| | uid | msg | _time | gender | age |
+---+-----+--------------------+---------+--------+-----+
| 0 | 1 | confirmed_settings | 1/29/15 | M | 37 |
| 1 | 1 | sale | 4/13/15 | M | 37 |
| 2 | 3 | confirmed_settings | 4/19/15 | M | 35 |
| 3 | 4 | confirmed_settings | 2/21/15 | M | 21 |
| 4 | 5 | confirmed_settings | 3/28/15 | M | 18 |
| 5 | 4 | sale | 3/15/15 | M | 21 |
+---+-----+--------------------+---------+--------+-----+
I would like to change the dataframe so that each row is unique for a uid and there is a columns called sale
and confirmed_settings
which have the timestamp of the action. Note that not every user has a sale
, but every user has a confirmed_settings
. Like below:
+---+-----+--------------------+---------+---------+--------+-----+
| | uid | confirmed_settings | sale | _time | gender | age |
+---+-----+--------------------+---------+---------+--------+-----+
| 0 | 1 | 1/29/15 | 4/13/15 | 1/29/15 | M | 37 |
| 1 | 3 | 4/19/15 | null | 4/19/15 | M | 35 |
| 2 | 4 | 2/21/15 | 3/15/15 | 2/21/15 | M | 21 |
| 3 | 5 | 3/28/15 | null | 3/28/15 | M | 18 |
+---+-----+--------------------+---------+---------+--------+-----+
To do this, I am trying:
df1 = df.pivot(index='uid', columns='msg', values='_time').reset_index()
df1 = df1.merge(df[['uid', 'gender', 'age']].drop_duplicates(), on='uid')
But I get this error: ValueError: Index contains duplicate entries, cannot reshape
How can I pivot a df with duplicate index values to transform my dataframe?
Edit:
df1 = df.pivot_table(index='uid', columns='msg', values='_time').reset_index()
gives this error DataError: No numeric types to aggregate
but im not even sure that is the right path to go on.
I suspect there are indeed duplicate uid
-msg
entries/keys (e.g. uid
2 has 2 confirmed_settings entries under msg
), which you alluded to in the comments for fixxxer's answer. If there are, you can't use pivot
, because you can't tell it how to treat the different values encountered during aggregation (count? max? mean? sum?). Note that the Index error is an error on the Index of the resulting pivoted table df1
, not the original DataFrame df
.
pivot_table
lets you do it however with the aggfunc
argument. How about something like this?
df1 = df.pivot_table(index = 'uid', columns = 'msg', values = '_time', aggfunc = len)
This will help you figure out which user-msg records have duplicate entries (anything with over 1), and after cleaning them out, you can use pivot
on df
to successfully pivot _time
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With