How can I sort a DataFrame so that rows in the duplicate column are "recycled"?
For example, my original DataFrame looks like this:
In [3]: df
Out[3]: 
    A  B
0  r1  0
1  r1  1
2  r2  2
3  r2  3
4  r3  4
5  r3  5
I would like it to turn to:
In [3]: df_sorted
Out[3]: 
    A  B
0  r1  0
2  r2  2
4  r3  4
1  r1  1
3  r2  3
5  r3  5
Rows are sorted such that rows in columns A are in a "recycled" fashion.
I have searched APIs in Pandas, but it seems there isn't any proper method to do so. I can write a complicated function to accomplish this, but just wondering is there any smart way or existing pandas method can do this? Thanks a lot in advance.
Update:
Apologies for a wrong statement. In my real problem, column B contains string values.
DataFrame. duplicated() method is used to find duplicate rows in a DataFrame. It returns a boolean series which identifies whether a row is duplicate or unique. In this article, you will learn how to use this method to identify the duplicate rows in a DataFrame.
You can sort pandas DataFrame by one or multiple (one or more) columns using sort_values() method and by ascending or descending order. To specify the order, you have to use ascending boolean property; False for descending and True for ascending. By default, it is set to True.
You can use cumcount for counting duplicates in column A, then sort_values first by A (in sample not necessary, in real data maybe important) and then by C. Last remove column C by drop:
df['C'] = df.groupby('A')['A'].cumcount()
df.sort_values(by=['C', 'A'], inplace=True)
print (df)
    A  B  C
0  r1  0  0
2  r2  2  0
4  r3  4  0
1  r1  1  1
3  r2  3  1
5  r3  5  1
df.drop('C', axis=1, inplace=True)
print (df)
    A  B
0  r1  0
2  r2  2
4  r3  4
1  r1  1
3  r2  3
5  r3  5
Timings:
Small df (len(df)=6)
In [26]: %timeit (jez(df))
1000 loops, best of 3: 2 ms per loop
In [27]: %timeit (boud(df1))
100 loops, best of 3: 2.52 ms per loop
Large df (len(df)=6000)
In [23]: %timeit (jez(df))
100 loops, best of 3: 3.44 ms per loop
In [28]: %timeit (boud(df1))
100 loops, best of 3: 2.52 ms per loop
Code for timing:
df = pd.concat([df]*1000).reset_index(drop=True) 
df1 = df.copy()
def jez(df):
    df['C'] = df.groupby('A')['A'].cumcount()
    df.sort_values(by=['C', 'A'], inplace=True)
    df.drop('C', axis=1, inplace=True)
    return (df)
def boud(df):
    df['C'] = df.groupby('A')['B'].rank()
    df = df.sort_values(['C', 'A'])
    df.drop('C', axis=1, inplace=True)
    return (df)
100 loops, best of 3: 4.29 ms per loop
                        If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With