Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

error reindex from a duplicate axis in groupby

by = "B" block has duplicated indices both in case1 and case2,

why case1 work but case2 does not.

case1

df1 = pd.DataFrame({"a":[0,100,200],  "by":["A","B","B"]}, index=[0,1,1])
df1.groupby("by").diff()  
# result is okay

case2

df2 = pd.DataFrame({"a":[0,100,200],  "by":["C","B","B"]}, index=[0,1,1])
df2.groupby("by").diff()  
# throws ValueError: cannot reindex from a duplicate axis
like image 658
junliang Avatar asked Jun 19 '20 15:06

junliang


People also ask

How do you solve Cannot reindex from a duplicate axis?

Prevent. In order to make sure your DataFrame cannot contain duplicate values in the index, you can set allows_duplicate_labels flag to False for preventing the assignment of duplicate values.

Can Pandas index have duplicates?

Indicate duplicate index values. Duplicated values are indicated as True values in the resulting array. Either all duplicates, all except the first, or all except the last occurrence of duplicates can be indicated. The value or values in a set of duplicates to mark as missing.

What is the purpose of reindex () function?

The reindex() method allows you to change the row indexes, and the columns labels. Note: The values are set to NaN if the new index is not the same as the old.


1 Answers

Your problem is solved by turning off the sort property of groupby.

df1 = pd.DataFrame({"a":[0,100,200],  "by":["C","B","B"]}, index=[0,1,1])
df1.groupby("by", sort=False).diff()
print(df1)

Result:

     a by
0    0  C
1  100  B
1  200  B

Explanation:

Even if you "cannot reindex from a duplicate axis", Pandas tries to do it by assigning a rank to letters by their alphabetical order when the sort property is activated, for instance :

A ---> 1

B ---> 2

B ---> 3

Even if we have 2 B's the incrementation is possible by considering the second B comes logically after the first B. For example the chunks of code below works perfectly:

import pandas as pd

# THE CODE BELOW WORKS PERFECTLY
df1 = pd.DataFrame({"a":[0,100,90],  "by":["A","B","B"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)

df1 = pd.DataFrame({"a":[0,100,90],  "by":["B","C","C"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)

df1 = pd.DataFrame({"a":[0,100,90],  "by":["C","D","D"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)

Because D comes after C, C comes after B and so on... Pandas tries to find a logic

what is illogical considering the alphabetical order is this: DCC ---> You could not assign 1 to D thus 2 to C.

Chunks of code below generate errors:

# EVERY CHUNK OF CODE BELOW GENERATES AN ERROR
df1 = pd.DataFrame({"a":[0,100,90],  "by":["B","A","A"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)
# builtins.ValueError: cannot reindex from a duplicate axis

df1 = pd.DataFrame({"a":[0,100,90],  "by":["D","C","C"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)
# builtins.ValueError: cannot reindex from a duplicate axis

df1 = pd.DataFrame({"a":[0,100,90],  "by":["E","D","D"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)
# builtins.ValueError: cannot reindex from a duplicate axis

To go further: Let's consider these 2 chunks and their results:

df1 = pd.DataFrame({"a":[0,100,200],  "by":["E","D","F"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)
# builtins.ValueError: cannot reindex from a duplicate axis

with only a change on index...

df1 = pd.DataFrame({"a":[0,100,200],  "by":["E","D","F"]}, index=[0,1,2])
df1.groupby("by").diff()
print(df1)

#      a by
# 0    0  E
# 1  100  D
# 2  200  F

Even if EDF is not the alphabetical order, Pandas seems to go for a sort logic by using the index... index is 011 in the first case with no sorting logic that's not the case with 012

In conclusion you have to desactivate sorting by turning it to False property to prevent Pandas sorting attempts

like image 195
Laurent B. Avatar answered Sep 22 '22 08:09

Laurent B.