I am trying to unstack a multi-index with pandas and I am keep getting:
ValueError: Index contains duplicate entries, cannot reshape
Given a dataset with four columns:
I first set a three-level multi-index:
In [37]: e.set_index(['id', 'date', 'location'], inplace=True)
In [38]: e
Out[38]:
value
id date location
id1 2014-12-12 loc1 16.86
2014-12-11 loc1 17.18
2014-12-10 loc1 17.03
2014-12-09 loc1 17.28
Then I try to unstack the location:
In [39]: e.unstack('location')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-39-bc1e237a0ed7> in <module>()
----> 1 e.unstack('location')
...
C:\Anaconda\envs\sandbox\lib\site-packages\pandas\core\reshape.pyc in _make_selectors(self)
143
144 if mask.sum() < len(self.index):
--> 145 raise ValueError('Index contains duplicate entries, '
146 'cannot reshape')
147
ValueError: Index contains duplicate entries, cannot reshape
What is going on here?
To fix this error, we can use the pivot_table() function with a specific aggfunc argument to aggregate the data values in a certain way. Notice that we don't receive an error this time. The values in the DataFrame show the sum of points for each combination of team and position.
Use DataFrame. drop_duplicates() to Drop Duplicate and Keep First Rows. You can use DataFrame. drop_duplicates() without any arguments to drop rows with the same values on all columns.
Here's an example DataFrame which show this, it has duplicate values with the same index. The question is, do you want to aggregate these or keep them as multiple rows?
In [11]: df
Out[11]:
0 1 2 3
0 1 2 a 16.86
1 1 2 a 17.18
2 1 4 a 17.03
3 2 5 b 17.28
In [12]: df.pivot_table(values=3, index=[0, 1], columns=2, aggfunc='mean') # desired?
Out[12]:
2 a b
0 1
1 2 17.02 NaN
4 17.03 NaN
2 5 NaN 17.28
In [13]: df1 = df.set_index([0, 1, 2])
In [14]: df1
Out[14]:
3
0 1 2
1 2 a 16.86
a 17.18
4 a 17.03
2 5 b 17.28
In [15]: df1.unstack(2)
ValueError: Index contains duplicate entries, cannot reshape
One solution is to reset_index
(and get back to df
) and use pivot_table
.
In [16]: df1.reset_index().pivot_table(values=3, index=[0, 1], columns=2, aggfunc='mean')
Out[16]:
2 a b
0 1
1 2 17.02 NaN
4 17.03 NaN
2 5 NaN 17.28
Another option (if you don't want to aggregate) is to append a dummy level, unstack it, then drop the dummy level...
There's a far more simpler solution to tackle this.
The reason why you get ValueError: Index contains duplicate entries, cannot reshape
is because, once you unstack "Location
", then the remaining index columns "id
" and "date
" combinations are no longer unique.
You can avoid this by retaining the default index column (row #) and while setting the index using "id
", "date
" and "location
", add it in "append
" mode instead of the default overwrite mode.
So use,
e.set_index(['id', 'date', 'location'], append=True)
Once this is done, your index columns will still have the default index along with the set indexes. And unstack
will work.
Let me know how it works out.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With