 

Apply a function on elements in a Pandas column, grouped on another column

I have a dataset with several columns. I want to calculate a similarity score on one particular column (a name column such as "fName"), but only between rows that share the same "_id".

         _id      fName        lName    age
0       ABCD     Andrew       Schulz    
1       ABCD    Andreww                  23
2       DEFG       John          boy
3       DEFG      Johnn          boy     14
4       CDGH        Bob        TANNA     13
5       ABCD.     Peter        Parker    45
6       DEFGH     Clark          Kent    25

What I am looking for is whether, for the same id, I am getting similar entries, so that I can remove those entries based on a threshold score. For example, if I run it on the "fName" column, I should be able to reduce this dataframe, based on a score threshold, to:

         _id      fName        lName   age
0       ABCD     Andrew       Schulz    23
2       DEFG       John          boy    14
4       CDGH        Bob        TANNA    13
5       ABCD      Peter       Parker    45
6       DEFG      Clark         Kent    25

I intend to use pyjarowinkler. If I only had two independent columns to compare (without any grouping), this is how I would use it:

    from pyjarowinkler import distance

    df['score'] = [distance.get_jaro_distance(x, y) for x, y in zip(df['name_1'], df['name_2'])]
    df = df[df['score'] > 0.87]

Can someone suggest a Pythonic and fast way of doing this?

UPDATE

So, I have tried using the recordlinkage library for this, and I have ended up with a dataframe called 'matches' that contains the pairs of indexes of similar rows. Now I just want to combine the data.

    import recordlinkage

    # Indexation step
    indexer = recordlinkage.Index()
    indexer.block(left_on='_id')
    candidate_links = indexer.index(df)

    # Comparison step
    compare_cl = recordlinkage.Compare()
    compare_cl.string('fName', 'fName', method='jarowinkler', threshold=threshold, label='full_name')

    features = compare_cl.compute(candidate_links, df)

    # Classification step
    matches = features[features.sum(axis=1) >= 1]
    print(len(matches))

This is how matches looks:

index1   index2          fName
0           1             1.0
2           3             1.0

I need someone to suggest a way to combine the similar rows so that the merged row takes its data from each of them.
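
One possible way to turn pairs like these into groups of rows is a small union-find pass that gives every row a cluster label. This is only a sketch under assumptions: `cluster_labels` is a hypothetical helper, not part of recordlinkage, and it assumes the dataframe has a default 0..n-1 integer index so that the pair values are row positions.

```python
def cluster_labels(pairs, n_rows):
    """Map each row position to the smallest row position in its cluster."""
    parent = list(range(n_rows))

    def find(i):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller position as root so the first row names the cluster.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return [find(i) for i in range(n_rows)]


# Pairs as they appear in the index of `matches` above: (0, 1) and (2, 3).
labels = cluster_labels([(0, 1), (2, 3)], n_rows=7)
print(labels)
```

With pandas, the pairs could come from `list(matches.index)`, and the labels could then be passed to `df.groupby(labels).first()`, which takes the first non-null value of each column per cluster. That only fills gaps if the blank cells are real missing values (NaN) rather than empty strings.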

Sushant asked Jun 24 '20 05:06




1 Answer

I just wanted to clear up some doubts regarding your question. I couldn't raise them in the comments due to low reputation.

For example, if I run it on the "fName" column, I should be able to reduce this dataframe, based on a score threshold, to:

So basically your function would return a DataFrame containing the first row of each group (by ID)? That would give the following DataFrame:

         _id      fName        lName   age
0       ABCD     Andrew       Schulz    23
2       DEFG       John          boy    14
4       CDGH        Bob        TANNA    13
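
If the intent is instead to keep the first row of each set of similar names within an id, a minimal sketch could look like the one below. Several things here are assumptions: `first_of_similar` is a hypothetical helper, `difflib.SequenceMatcher` from the standard library stands in for the Jaro-Winkler score (the scores differ, so the 0.87 threshold would need retuning), and the ids are written without the trailing "." and "H" that appear in the question's input table.

```python
from difflib import SequenceMatcher


def first_of_similar(ids, names, threshold=0.87):
    """Return positions of rows to keep: the first of each set of similar names per id."""
    kept_names = {}      # id -> names already kept for that id
    keep_positions = []
    for pos, (key, name) in enumerate(zip(ids, names)):
        seen = kept_names.setdefault(key, [])
        # Stand-in similarity; a Jaro-Winkler score could be swapped in here.
        is_duplicate = any(
            SequenceMatcher(None, name, other).ratio() >= threshold
            for other in seen
        )
        if not is_duplicate:
            seen.append(name)
            keep_positions.append(pos)
    return keep_positions


# Sample values taken from the question, with the ids normalised (assumption).
ids = ["ABCD", "ABCD", "DEFG", "DEFG", "CDGH", "ABCD", "DEFG"]
names = ["Andrew", "Andreww", "John", "Johnn", "Bob", "Peter", "Clark"]
print(first_of_similar(ids, names))
```

On a dataframe this could be used as `df.iloc[first_of_similar(df['_id'], df['fName'])]`. Note that it only drops rows; it does not copy the age from a dropped row into the kept one.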

Piyush answered Oct 16 '22 03:10