 

Grouping and comparing groups using pandas

I have data that looks like:

Identifier  Category1 Category2 Category3 Category4 Category5
1000           foo      bat       678         a.x       ld
1000           foo      bat       78          l.o       op
1000           coo      cat       678         p.o       kt
1001           coo      sat       89          a.x       hd
1001           foo      bat       78          l.o       op
1002           foo      bat       678         a.x       ld
1002           foo      bat       78          l.o       op
1002           coo      cat       678         p.o       kt

What I am trying to do is compare 1000 to 1001, to 1002, and so on. The output I want the code to give is: 1000 is the same as 1002. The approach I wanted to use was:

  1. First group all rows by identifier into separate dataframes (maybe?). For example, df1 would be all rows pertaining to identifier 1000 and df2 would be all rows pertaining to identifier 1002. (Please note that I want the code to do this itself, as there are millions of rows, rather than me writing code to compare identifiers manually.) I have tried the groupby feature of pandas; it does the grouping well, but I do not know how to compare the resulting groups.
  2. Compare each of the groups/sub-data frames.

One method I was thinking of was reading each row of a particular identifier into an array/vector and comparing arrays/vectors using a comparison metric (Manhattan distance, cosine similarity etc).
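To make that idea concrete, here is a minimal sketch of the row-as-vector comparison. The column names come from the sample above; `pd.factorize` is just one hypothetical way to turn the string categories into numbers, and this only works when every identifier has the same number of rows in the same order:

```python
import pandas as pd
import numpy as np

# Toy subset of the sample data above
df = pd.DataFrame({
    'Identifier': [1000, 1000, 1001, 1001],
    'Category1': ['foo', 'foo', 'coo', 'foo'],
    'Category3': [678, 78, 89, 78],
})

# Encode each category column as integer codes so rows become numeric vectors
encoded = df.drop(columns='Identifier').apply(lambda c: pd.factorize(c)[0])

# Stack each identifier's rows into one flat vector
vectors = {ident: encoded.loc[idx].to_numpy().ravel()
           for ident, idx in df.groupby('Identifier').groups.items()}

# Manhattan distance between two identifiers; 0 would mean identical rows
dist = np.abs(vectors[1000] - vectors[1001]).sum()
```
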

Any help is appreciated, I am very new to Python. Thanks in advance!

asked Jun 11 '17 by S.k.S



1 Answer

You could do something like the following:

import pandas as pd

input_file = pd.read_csv("input.csv")
columns = ['Category1','Category2','Category3','Category4','Category5']

duplicate_entries = {}

for group in input_file.groupby('Identifier'):
    # group is a (key, sub-DataFrame) tuple; transform each row to a tuple
    # so the whole block of rows can be used as a dict key
    lines = [tuple(y) for y in group[1].loc[:, columns].values.tolist()]
    key = tuple(lines)

    if key not in duplicate_entries:
        duplicate_entries[key] = []

    duplicate_entries[key].append(group[0])

The values of duplicate_entries will then hold the lists of Identifiers that share identical rows:

duplicate_entries.values()
> [[1000, 1002], [1001]]
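To produce the exact output the question asks for ("1000 is the same as 1002"), those value lists can be turned into messages. A small sketch, using a hand-built stand-in for the duplicate_entries dict above:

```python
# Stand-in for the duplicate_entries dict built above: row-tuples -> identifiers
duplicate_entries = {('rows-a',): [1000, 1002], ('rows-b',): [1001]}

messages = []
for ids in duplicate_entries.values():
    if len(ids) > 1:
        # Report every later identifier as a duplicate of the first one
        first, *rest = ids
        messages.extend(f"{first} is the same as {other}" for other in rest)

print(messages)
```
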

EDIT:

To get only the entries that have duplicates, you could have something like:

all_dup = [ids for ids in duplicate_entries.values() if len(ids) > 1]

Explaining the indices (sorry I didn't explain this before): iterating over the df.groupby result yields tuples whose first entry is the key of the group (in this case the 'Identifier') and whose second entry is the sub-DataFrame of that group's rows. So to get the rows that make up an entry we use group[1], and the 'Identifier' for that group is found at group[0]. Since the duplicate_entries dict should collect the identifiers, group[0] is what gets appended.
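A minimal demonstration of that tuple structure (toy data, not the original file):

```python
import pandas as pd

df = pd.DataFrame({'Identifier': [1000, 1000, 1001],
                   'Category1': ['foo', 'foo', 'coo']})

for group in df.groupby('Identifier'):
    key, sub_df = group      # group[0] is the key, group[1] the group's rows
    print(key, len(sub_df))  # prints "1000 2" then "1001 1"
```
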

answered Sep 23 '22 by Raquel Guimarães