Grouping and comparing groups using pandas

I have data that looks like:

Identifier  Category1 Category2 Category3 Category4 Category5
1000           foo      bat       678         a.x       ld
1000           foo      bat       78          l.o       op
1000           coo      cat       678         p.o       kt
1001           coo      sat       89          a.x       hd
1001           foo      bat       78          l.o       op
1002           foo      bat       678         a.x       ld
1002           foo      bat       78          l.o       op
1002           coo      cat       678         p.o       kt

What I am trying to do is compare 1000 to 1001, to 1002, and so on. The output I want the code to give is: 1000 is the same as 1002. The approach I wanted to use was:

1. First, group all the rows for each identifier into separate dataframes (maybe?). For example, df1 would be all rows pertaining to identifier 1000 and df2 would be all rows pertaining to identifier 1002. (**Please note that I want the code to do this itself, as there are millions of rows, as opposed to me writing code to manually compare identifiers.**) I have tried the groupby feature of pandas. It handles the grouping well, but I do not know how to compare the resulting groups.
2. Compare each of the groups/sub-dataframes.

One method I was thinking of was reading each row of a particular identifier into an array/vector and comparing arrays/vectors using a comparison metric (Manhattan distance, cosine similarity etc).

Any help is appreciated, I am very new to Python. Thanks in advance!

asked Jun 11 '17 00:06

S.k.S

1 Answer

You could do something like the following:

import pandas as pd

input_file = pd.read_csv("input.csv")
columns = ['Category1','Category2','Category3','Category4','Category5']

duplicate_entries = {}

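# Each item yielded by groupby is a (key, DataFrame) tuple.
for group in input_file.groupby('Identifier'):
    # Convert the group's rows to a tuple of tuples so that they are
    # hashable and can be used as a dict key.
    lines = [tuple(y) for y in group[1].loc[:, columns].values.tolist()]
    key = tuple(lines)

    if key not in duplicate_entries:
        duplicate_entries[key] = []

    # Identifiers whose rows are identical (in the same order) share a key.
    duplicate_entries[key].append(group[0])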

The values of duplicate_entries are then lists of Identifiers whose rows are identical:

list(duplicate_entries.values())
> [[1000, 1002], [1001]]

EDIT:

To get only the Identifiers that have duplicates, filter the values of the dict (iterating over the dict itself would yield its keys, not the lists of Identifiers):

all_dup = [dup for dup in duplicate_entries.values() if len(dup) > 1]
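As a self-contained sketch of the same idea, the snippet below runs on the sample data from the question, inlined through io.StringIO so that no input.csv is needed. It assumes the data is whitespace-separated, and it sorts each group's rows before building the key so that two identifiers match even when their rows appear in a different order. That sorting step is an assumption about what "the same" should mean, not something the question states. The names groups_by_content and same are only illustrative.

```python
import io
import pandas as pd

# Sample data from the question, inlined so the example runs on its own.
raw = """Identifier Category1 Category2 Category3 Category4 Category5
1000 foo bat 678 a.x ld
1000 foo bat 78 l.o op
1000 coo cat 678 p.o kt
1001 coo sat 89 a.x hd
1001 foo bat 78 l.o op
1002 foo bat 678 a.x ld
1002 foo bat 78 l.o op
1002 coo cat 678 p.o kt
"""
# Assumption: columns are separated by arbitrary whitespace.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
columns = ["Category1", "Category2", "Category3", "Category4", "Category5"]

groups_by_content = {}
for identifier, rows in df.groupby("Identifier"):
    # Build a hashable, order-insensitive fingerprint of the group's rows.
    key = tuple(sorted(rows[columns].itertuples(index=False, name=None)))
    groups_by_content.setdefault(key, []).append(int(identifier))

# Keep only the fingerprints shared by more than one identifier.
same = [ids for ids in groups_by_content.values() if len(ids) > 1]
print(same)  # [[1000, 1002]]
```

Because each group is reduced to a single dict key, the data is scanned once rather than every pair of identifiers being compared, which matters with millions of rows.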

Explaining the indices (sorry I didn't explain them before): iterating over the result of df.groupby yields tuples in which the first entry is the key of the group (in this case the 'Identifier' value) and the second is a DataFrame holding the rows of that group. So the rows to compare come from group[1], and the 'Identifier' for that group is at group[0]. Because the duplicate_entries dict should store the identifier of each group, group[0] is what gets appended.
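A minimal sketch of what the iteration yields, using a tiny made-up DataFrame (the data and the name seen are illustrative only):

```python
import pandas as pd

# Tiny illustrative frame; not the real data.
df = pd.DataFrame({
    "Identifier": [1000, 1000, 1001],
    "Category1": ["foo", "coo", "coo"],
})

seen = []
for group in df.groupby("Identifier"):
    key = group[0]   # the Identifier value for this group
    rows = group[1]  # a DataFrame with only that group's rows
    seen.append((int(key), type(rows).__name__, len(rows)))

print(seen)  # [(1000, 'DataFrame', 2), (1001, 'DataFrame', 1)]
```

The same loop is often written with tuple unpacking, as in `for key, rows in df.groupby("Identifier"):`, which avoids the numeric indices.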

answered Sep 23 '22 15:09

Raquel Guimarães
