Sort rows of DataFrame by duplicate

How can I sort a DataFrame so that duplicate values in a column are "recycled", i.e. the rows cycle through the distinct values of that column?

For example, my original DataFrame looks like this:

In [3]: df
Out[3]: 
    A  B
0  r1  0
1  r1  1
2  r2  2
3  r2  3
4  r3  4
5  r3  5

I would like to turn it into:

In [3]: df_sorted
Out[3]: 
    A  B
0  r1  0
2  r2  2
4  r3  4
1  r1  1
3  r2  3
5  r3  5

The rows are ordered so that the values in column A cycle through r1, r2, r3 in a "recycled" fashion.

I have searched the pandas API, but there does not seem to be a method that does this directly. I could write a complicated function to accomplish it, but I am wondering whether there is a smarter way or an existing pandas method for this. Thanks a lot in advance.

Update: Apologies for an incorrect statement. In my real problem, column B contains string values.

asked Aug 15 '16 04:08

Xer

1 Answer

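You can use cumcount to number the occurrences of each value in column A and store the result in a helper column C. Then use sort_values to sort by C first and by A second (sorting by A is not necessary for the sample, but may be important for real data). Finally, remove column C with drop: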

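# Number each occurrence of a value within its group in column A (0, 1, ...)
df['C'] = df.groupby('A')['A'].cumcount()
# Sort by occurrence number first, then by A within each round
df.sort_values(by=['C', 'A'], inplace=True)
print(df)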
    A  B  C
0  r1  0  0
2  r2  2  0
4  r3  4  0
1  r1  1  1
3  r2  3  1
5  r3  5  1

# Remove the helper column again
df.drop('C', axis=1, inplace=True)
print(df)
    A  B
0  r1  0
2  r2  2
4  r3  4
1  r1  1
3  r2  3
5  r3  5
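
The cumcount approach above can also be wrapped in a small helper that leaves the original frame untouched. This is only a sketch: the function name round_robin_sort and the temporary column name _round are made up for this illustration and are not part of pandas. The sample frame uses string values in B, as in the question's update; that should not matter, because cumcount only looks at the grouping column A.

```python
import pandas as pd

def round_robin_sort(df, col):
    """Return a sorted copy of df in which the values of `col` cycle."""
    # 0 for the first occurrence of each value, 1 for the second, and so on
    occurrence = df.groupby(col).cumcount()
    return (df.assign(_round=occurrence)   # hypothetical helper column
              .sort_values(by=['_round', col])
              .drop(columns='_round'))

# Sample data from the question, with strings in B (assumed values)
df = pd.DataFrame({'A': ['r1', 'r1', 'r2', 'r2', 'r3', 'r3'],
                   'B': ['a', 'b', 'c', 'd', 'e', 'f']})
df_sorted = round_robin_sort(df, 'A')
print(df_sorted)
```

Because assign, sort_values and drop all return new objects here, df keeps its original order and columns.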

Timings:

Small df (len(df)=6)

In [26]: %timeit (jez(df))
1000 loops, best of 3: 2 ms per loop

In [27]: %timeit (boud(df1))
100 loops, best of 3: 2.52 ms per loop

Large df (len(df)=6000)

In [23]: %timeit (jez(df))
100 loops, best of 3: 3.44 ms per loop

In [28]: %timeit (boud(df1))
100 loops, best of 3: 2.52 ms per loop

Code for timing:

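import pandas as pd

# Enlarge the sample frame to 6000 rows for the "large" timings
df = pd.concat([df]*1000).reset_index(drop=True)
df1 = df.copy()

def jez(df):
    # cumcount approach: number the occurrences within each group of A
    df['C'] = df.groupby('A')['A'].cumcount()
    df.sort_values(by=['C', 'A'], inplace=True)
    df.drop('C', axis=1, inplace=True)
    return df

def boud(df):
    # rank approach: rank the values of B within each group of A
    df['C'] = df.groupby('A')['B'].rank()
    df = df.sort_values(['C', 'A'])
    df.drop('C', axis=1, inplace=True)
    return df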
100 loops, best of 3: 4.29 ms per loop
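
Outside IPython, a comparable measurement could be sketched with the standard timeit module. Note that jez sorts its argument in place, so repeated runs on the same frame would sort already-sorted data; the hypothetical helper best_ms below therefore hands each call a fresh copy, which means the copy cost is included in the reported time. The frame contents are taken from the question, and the numbers will differ between machines and pandas versions.

```python
import timeit
import pandas as pd

base = pd.DataFrame({'A': ['r1', 'r1', 'r2', 'r2', 'r3', 'r3'],
                     'B': [0, 1, 2, 3, 4, 5]})
large = pd.concat([base] * 1000).reset_index(drop=True)

def jez(df):
    df['C'] = df.groupby('A')['A'].cumcount()
    df.sort_values(by=['C', 'A'], inplace=True)
    df.drop('C', axis=1, inplace=True)
    return df

def boud(df):
    df['C'] = df.groupby('A')['B'].rank()
    df = df.sort_values(['C', 'A'])
    df.drop('C', axis=1, inplace=True)
    return df

def best_ms(func, frame, number=10, repeat=3):
    # Hypothetical helper: best average time per call in milliseconds.
    # Each call gets its own copy so in-place changes do not carry over.
    times = timeit.repeat(lambda: func(frame.copy()),
                          number=number, repeat=repeat)
    return min(times) / number * 1000

for name, func in [('jez', jez), ('boud', boud)]:
    print(name,
          'small:', round(best_ms(func, base), 2), 'ms,',
          'large:', round(best_ms(func, large), 2), 'ms')
```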

answered Nov 10 '22 23:11

jezrael

Tags: python, sorting, pandas, dataframe, duplicates