Apply a function on elements in a Pandas column, grouped on another column

Tags: python, pandas, distance

I have a dataset with several columns. I want to calculate a similarity score on a particular column (for example "fName"), but only between rows that share the same "_id".

         _id      fName        lName    age
0       ABCD     Andrew       Schulz    
1       ABCD    Andreww                  23
2       DEFG       John          boy
3       DEFG      Johnn          boy     14
4       CDGH        Bob        TANNA     13
5       ABCD.     Peter        Parker    45
6       DEFGH     Clark          Kent    25

What I am looking for is whether, for the same id, there are similar entries, so that I can remove them based on a threshold score. For example, if I run it on the column "fName", I should be able to reduce the DataFrame to the following, based on a score threshold:

         _id      fName        lName   age
0       ABCD     Andrew       Schulz    23
2       DEFG       John          boy    14
4       CDGH        Bob        TANNA    13
5       ABCD      Peter       Parker    45
6       DEFG      Clark         Kent    25

I intend to use pyjarowinkler. If I just had two independent columns to compare (without any grouping), this is how I would use it:

    from pyjarowinkler import distance

    df['score'] = [distance.get_jaro_distance(x, y) for x, y in zip(df['name_1'], df['name_2'])]
    df = df[df['score'] > 0.87]

Can someone suggest a Pythonic and fast way of doing this?

UPDATE

I have tried using the recordlinkage library for this, and I have ended up with a DataFrame called 'matches' that contains the pairs of indexes of similar rows. Now I just want to combine the data.

    import recordlinkage

    # Indexation step
    indexer = recordlinkage.Index()
    indexer.block(left_on='_id')
    candidate_links = indexer.index(df)

    # Comparison step
    compare_cl = recordlinkage.Compare()
    compare_cl.string('fName', 'fName', method='jarowinkler', threshold=threshold, label='full_name')

    features = compare_cl.compute(candidate_links, df)

    # Classification step
    matches = features[features.sum(axis=1) >= 1]
    print(len(matches))

This is how matches looks:

index1   index2          fName
0           1             1.0
2           3             1.0

I need a way to combine each pair of similar rows into a single row that takes its data from both rows.
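One possible way to combine such pairs is sketched below. It is only an illustration under assumptions: the toy frame mirrors rows 0 to 3 of the example, `pairs` is a hand-built stand-in for `matches.index` (assumed to hold `(kept_row, duplicate_row)` tuples), and no row is assumed to appear in more than one pair.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring rows 0-3 of the example (assumption: missing cells are NaN/None).
df = pd.DataFrame(
    {
        "_id": ["ABCD", "ABCD", "DEFG", "DEFG"],
        "fName": ["Andrew", "Andreww", "John", "Johnn"],
        "lName": ["Schulz", None, "boy", "boy"],
        "age": [np.nan, 23, np.nan, 14],
    }
)

# Hypothetical stand-in for `matches.index`: (kept_row, duplicate_row) pairs.
pairs = pd.MultiIndex.from_tuples([(0, 1), (2, 3)])

merged = df.copy()
for keep, dup in pairs:
    for col in merged.columns:
        # Fill a gap in the kept row only when the duplicate row has a value.
        if pd.isna(merged.at[keep, col]) and not pd.isna(merged.at[dup, col]):
            merged.at[keep, col] = merged.at[dup, col]

# Drop the rows that were folded into their partners.
merged = merged.drop(index=pairs.get_level_values(1).unique())
print(merged)
```

If a row can appear in several pairs (for example 0-1 and 1-2), the pairs would first have to be grouped into clusters, which this sketch does not attempt.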

asked Jun 24 '20 05:06

Sushant

1 Answer

I just wanted to clarify a few things about your question. I couldn't ask in the comments because of my low reputation.

Like here if i run it for col "fName". I should be able to reduce this dataframe to based on a score threshold:

So basically your function would return a DataFrame containing the first row in each group (by ID)? That would result in the DataFrame listed below.

         _id      fName        lName   age
0       ABCD     Andrew       Schulz    23
2       DEFG       John          boy    14
4       CDGH        Bob        TANNA    13
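If the goal is instead to keep one row per cluster of similar names within each id (so that "Peter" and "Clark" survive), a rough sketch could look like the following. Several things here are assumptions rather than part of the question: `difflib.SequenceMatcher` from the standard library is swapped in for Jaro-Winkler, the helper names `similarity` and `dedupe_group` are made up for illustration, the threshold of 0.85 is arbitrary, the ids of rows 5 and 6 are taken to be "ABCD" and "DEFG" as in the expected output, and "fName" is assumed to have no missing values.

```python
from difflib import SequenceMatcher

import numpy as np
import pandas as pd


def similarity(a, b):
    # Stand-in similarity in [0, 1]; NOT Jaro-Winkler.
    return SequenceMatcher(None, a, b).ratio()


def dedupe_group(group, col="fName", threshold=0.85):
    kept = {}  # original index label -> merged record (dict)
    for idx, row in group.iterrows():
        for rep in kept.values():
            if similarity(rep[col], row[col]) >= threshold:
                # Similar to an already kept row: only use it to fill gaps.
                for key, value in row.items():
                    if pd.isna(rep[key]):
                        rep[key] = value
                break
        else:
            # Not similar to anything kept so far: keep it as a new record.
            kept[idx] = row.to_dict()
    return pd.DataFrame(list(kept.values()), index=list(kept.keys()),
                        columns=group.columns)


df = pd.DataFrame(
    {
        "_id": ["ABCD", "ABCD", "DEFG", "DEFG", "CDGH", "ABCD", "DEFG"],
        "fName": ["Andrew", "Andreww", "John", "Johnn", "Bob", "Peter", "Clark"],
        "lName": ["Schulz", None, "boy", "boy", "TANNA", "Parker", "Kent"],
        "age": [np.nan, 23, np.nan, 14, 13, 45, 25],
    }
)

# Compare names only within the same _id, then restore the original row order.
parts = [dedupe_group(g) for _, g in df.groupby("_id", sort=False)]
result = pd.concat(parts).sort_index()
print(result)
```

On this toy data the sketch keeps rows 0, 2, 4, 5 and 6, with the missing ages of rows 0 and 2 filled in from their near-duplicates. The comparison is quadratic within each group, so it is only reasonable when the groups are small.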

answered Oct 16 '22 03:10

Piyush
