
error reindex from a duplicate axis in groupby

Tags:

python

pandas

pandas-groupby

The by = "B" group has duplicated index values in both case1 and case2.

Why does case1 work when case2 does not?

case1

import pandas as pd

df1 = pd.DataFrame({"a":[0,100,200],  "by":["A","B","B"]}, index=[0,1,1])
df1.groupby("by").diff()
# result is okay

case2

df2 = pd.DataFrame({"a":[0,100,200],  "by":["C","B","B"]}, index=[0,1,1])
df2.groupby("by").diff()  
# throws ValueError: cannot reindex from a duplicate axis


asked Jun 19 '20 15:06

junliang

1 Answer

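Your problem is solved by turning off the sort option of groupby, i.e. passing sort=False.

df1 = pd.DataFrame({"a":[0,100,200],  "by":["C","B","B"]}, index=[0,1,1])
df1.groupby("by", sort=False).diff()  # no ValueError any more
print(df1)

Result (print(df1) shows the input frame; diff() returns a new object and leaves df1 unchanged):

     a by
0    0  C
1  100  B
1  200  B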

Explanation:

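With sort=True (the default), groupby orders the groups by key, alphabetically for letters, and the per-group results of diff() seem to be put back together in that order. As long as this order matches the order of the rows in the frame, the duplicated index does no harm, for instance:

A ---> 1

B ---> 2

B ---> 3

Even with two B's the numbering is possible, because the second B comes right after the first B. The combined result is then already in the original row order, so pandas apparently does not have to reindex it. For example, the chunks of code below work perfectly: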

import pandas as pd

# THE CODE BELOW WORKS PERFECTLY
df1 = pd.DataFrame({"a":[0,100,90],  "by":["A","B","B"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)

df1 = pd.DataFrame({"a":[0,100,90],  "by":["B","C","C"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)

df1 = pd.DataFrame({"a":[0,100,90],  "by":["C","D","D"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)

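Because D comes after C, C comes after B and so on, the sorted group order is also the row order in each of these frames.

What breaks this is a frame whose rows are not in alphabetical group order, such as D C C: you cannot assign 1 to D and then 2 to C. Sorting puts the C rows before the D row, so the result no longer lines up with the original index 0, 1, 1 and would have to be reindexed, which fails when the labels are duplicated.

The chunks of code below generate errors: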

# EVERY CHUNK OF CODE BELOW GENERATES AN ERROR
df1 = pd.DataFrame({"a":[0,100,90],  "by":["B","A","A"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)
# builtins.ValueError: cannot reindex from a duplicate axis

df1 = pd.DataFrame({"a":[0,100,90],  "by":["D","C","C"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)
# builtins.ValueError: cannot reindex from a duplicate axis

df1 = pd.DataFrame({"a":[0,100,90],  "by":["E","D","D"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)
# builtins.ValueError: cannot reindex from a duplicate axis
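The ordering argument above can be checked by walking through the groups and collecting the index labels in the order groupby visits them. This is only a sketch to illustrate the explanation, not a look at pandas internals; the helper name visited_index is made up for this example.

```python
import pandas as pd

def visited_index(df, sort):
    """Index labels in the order groupby("by") walks through the groups."""
    labels = []
    for _, group in df.groupby("by", sort=sort):
        labels.extend(group.index.tolist())
    return labels

# Rows already in alphabetical group order (a "working" frame from above).
ok = pd.DataFrame({"a": [0, 100, 90], "by": ["A", "B", "B"]}, index=[0, 1, 1])
# Rows not in alphabetical group order (a "failing" frame from above).
bad = pd.DataFrame({"a": [0, 100, 90], "by": ["B", "A", "A"]}, index=[0, 1, 1])

print(visited_index(ok, sort=True))    # [0, 1, 1] -> same as the original index
print(visited_index(bad, sort=True))   # [1, 1, 0] -> differs from the original index
print(visited_index(bad, sort=False))  # [0, 1, 1] -> same as the original index
```

Only the second call yields labels in a different order than the frame's own index, which matches the cases where the ValueError is reported. Whether the error is actually raised may depend on the pandas version in use.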

To go further: Let's consider these 2 chunks and their results:

df1 = pd.DataFrame({"a":[0,100,200],  "by":["E","D","F"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)
# builtins.ValueError: cannot reindex from a duplicate axis

with only a change to the index:

df1 = pd.DataFrame({"a":[0,100,200],  "by":["E","D","F"]}, index=[0,1,2])
df1.groupby("by").diff()
print(df1)

#      a by
# 0    0  E
# 1  100  D
# 2  200  F

Even though E D F is not in alphabetical order, the second chunk works. What matters seems to be the index: 0, 1, 2 is unique, so pandas can put the sorted groups back in the original row order, whereas with 0, 1, 1 the labels are duplicated and that step fails.
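The observation that a unique index avoids the error suggests a possible alternative to sort=False: make the index unique before grouping and restore the original labels afterwards. This is a sketch under that assumption, not something the question asked for; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 100, 200], "by": ["C", "B", "B"]}, index=[0, 1, 1])

# Replace the duplicated labels 0, 1, 1 with the unique labels 0, 1, 2.
unique = df.reset_index(drop=True)

# Default sort=True is fine here because the index is unique.
diffs = unique.groupby("by").diff()

# Rows are still in the original order, so the old labels can be put back by position.
diffs.index = df.index
print(diffs)
```

The differences are NaN for the first row of each group and 100.0 for the second B row.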

In conclusion, you have to deactivate sorting by passing sort=False to groupby, so that pandas does not try to reorder the groups.
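To actually use the differences, the return value of diff() has to be kept, since it does not modify the frame in place. A minimal sketch, assuming the result should be stored in a new column (the column name a_diff is made up for this example):

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 100, 200], "by": ["C", "B", "B"]}, index=[0, 1, 1])

# sort=False keeps the groups in order of first appearance: C, then B.
diffs = df.groupby("by", sort=False).diff()

# Assign by position (to_numpy) so pandas does not try to align
# the new column on the duplicated index labels.
df["a_diff"] = diffs["a"].to_numpy()
print(df)
```

Here a_diff is NaN for the C row and the first B row, and 100.0 for the second B row.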


answered Sep 22 '22 08:09

Laurent B.
