by = "B" block has duplicated indices both in case1 and case2,
why case1 work but case2 does not.
case1
df1 = pd.DataFrame({"a":[0,100,200], "by":["A","B","B"]}, index=[0,1,1])
df1.groupby("by").diff()
# result is okay
case2
df2 = pd.DataFrame({"a":[0,100,200], "by":["C","B","B"]}, index=[0,1,1])
df2.groupby("by").diff()
# throws ValueError: cannot reindex from a duplicate axis
Prevent. In order to make sure your DataFrame cannot contain duplicate values in the index, you can set allows_duplicate_labels flag to False for preventing the assignment of duplicate values.
Indicate duplicate index values. Duplicated values are indicated as True values in the resulting array. Either all duplicates, all except the first, or all except the last occurrence of duplicates can be indicated. The value or values in a set of duplicates to mark as missing.
The reindex() method allows you to change the row indexes, and the columns labels. Note: The values are set to NaN if the new index is not the same as the old.
Your problem is solved by turning off the sort property of groupby.
df1 = pd.DataFrame({"a":[0,100,200], "by":["C","B","B"]}, index=[0,1,1])
df1.groupby("by", sort=False).diff()
print(df1)
Result:
a by
0 0 C
1 100 B
1 200 B
Explanation:
Even if you "cannot reindex from a duplicate axis", Pandas tries to do it by assigning a rank to letters by their alphabetical order when the sort property is activated, for instance :
A ---> 1
B ---> 2
B ---> 3
Even if we have 2 B's the incrementation is possible by considering the second B comes logically after the first B. For example the chunks of code below works perfectly:
import pandas as pd
# THE CODE BELOW WORKS PERFECTLY
df1 = pd.DataFrame({"a":[0,100,90], "by":["A","B","B"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)
df1 = pd.DataFrame({"a":[0,100,90], "by":["B","C","C"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)
df1 = pd.DataFrame({"a":[0,100,90], "by":["C","D","D"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)
Because D comes after C, C comes after B and so on... Pandas tries to find a logic
what is illogical considering the alphabetical order is this: DCC ---> You could not assign 1 to D thus 2 to C.
Chunks of code below generate errors:
# EVERY CHUNK OF CODE BELOW GENERATES AN ERROR
df1 = pd.DataFrame({"a":[0,100,90], "by":["B","A","A"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)
# builtins.ValueError: cannot reindex from a duplicate axis
df1 = pd.DataFrame({"a":[0,100,90], "by":["D","C","C"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)
# builtins.ValueError: cannot reindex from a duplicate axis
df1 = pd.DataFrame({"a":[0,100,90], "by":["E","D","D"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)
# builtins.ValueError: cannot reindex from a duplicate axis
To go further: Let's consider these 2 chunks and their results:
df1 = pd.DataFrame({"a":[0,100,200], "by":["E","D","F"]}, index=[0,1,1])
df1.groupby("by").diff()
print(df1)
# builtins.ValueError: cannot reindex from a duplicate axis
with only a change on index...
df1 = pd.DataFrame({"a":[0,100,200], "by":["E","D","F"]}, index=[0,1,2])
df1.groupby("by").diff()
print(df1)
# a by
# 0 0 E
# 1 100 D
# 2 200 F
Even if EDF is not the alphabetical order, Pandas seems to go for a sort logic by using the index... index is 011 in the first case with no sorting logic that's not the case with 012
In conclusion you have to desactivate sorting by turning it to False property to prevent Pandas sorting attempts
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With