If I have a dataframe with duplicates in the index, how would I create a set of dataframes with no duplicates in the index?
More precisely, given the dataframe:
   a  b
1  1  6
1  2  7
2  3  8
2  4  9
2  5  0
I would want as output a list of dataframes:
   a  b
1  1  6
2  3  8

   a  b
1  2  7
2  4  9

   a  b
2  5  0
This needs to be scalable to as many dataframes as needed based on the number of duplicates.
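For reference, here is one way to build the example dataframe above (values copied from the table shown):

```python
import pandas as pd

# Example dataframe with duplicated labels 1 and 2 in the index
df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [6, 7, 8, 9, 0]},
                  index=[1, 1, 2, 2, 2])
```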
df = df.reset_index()
dfs = []
while not df.empty:
    # keep the first occurrence of each index label
    dfs.append(df[~df.duplicated('index', keep='first')].set_index('index'))
    # drop those rows and repeat on what remains
    df = df[df.duplicated('index', keep='first')]
# dfs now holds all your dataframes
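A self-contained run of that loop on the example data (the dataframe is rebuilt here so the snippet runs on its own):

```python
import pandas as pd

# Example dataframe with duplicate index labels
df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [6, 7, 8, 9, 0]},
                  index=[1, 1, 2, 2, 2])

# Peel off the first occurrence of each index label until nothing is left
df = df.reset_index()
dfs = []
while not df.empty:
    dfs.append(df[~df.duplicated('index', keep='first')].set_index('index'))
    df = df[df.duplicated('index', keep='first')]

# dfs[0] holds the first occurrence of each label, dfs[1] the second, and so on
```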
Use GroupBy.cumcount to build custom groups, then convert the groups to a dictionary:
dfs = dict(tuple(df.groupby(df.groupby(level=0).cumcount())))
print (dfs)
{0:    a  b
1  1  6
2  3  8, 1:    a  b
1  2  7
2  4  9, 2:    a  b
2  5  0}
print (dfs[0])
   a  b
1  1  6
2  3  8
Or convert to a list of DataFrames:
dfs = [x for i, x in df.groupby(df.groupby(level=0).cumcount())]
print (dfs)
[   a  b
1  1  6
2  3  8,    a  b
1  2  7
2  4  9,    a  b
2  5  0]
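To see why this works: the inner cumcount call numbers the repeated index labels 0, 1, 2, ... and grouping by that numbering collects one occurrence of each label per group. A self-contained sketch on the example data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [6, 7, 8, 9, 0]},
                  index=[1, 1, 2, 2, 2])

# Each row gets its occurrence number within its index label
occurrence = df.groupby(level=0).cumcount()
print(occurrence.tolist())  # [0, 1, 0, 1, 2]

# Grouping by that numbering yields one dataframe per occurrence level
dfs = [x for _, x in df.groupby(occurrence)]
```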
Another approach is to use GroupBy.nth:
import numpy as np

# group rows by index label; the i-th output takes the i-th row of each group
g = df.groupby(df.index)
# largest number of repetitions of any index label
cnt = np.bincount(df.index).max()
dfs = [g.nth(i) for i in range(cnt)]
Output:
[   a  b
1  1  6
2  3  8,
    a  b
1  2  7
2  4  9,
    a  b
2  5  0]
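Note that np.bincount only works when the index holds small non-negative integers. As a variation (not part of the original answer), df.index.value_counts().max() gives the same repetition count for any hashable index labels:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [6, 7, 8, 9, 0]},
                  index=[1, 1, 2, 2, 2])

g = df.groupby(df.index)
# value_counts works for any hashable index labels, not just small ints
cnt = df.index.value_counts().max()
dfs = [g.nth(i) for i in range(cnt)]
```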