I have data that looks like:
Identifier Category1 Category2 Category3 Category4 Category5
1000 foo bat 678 a.x ld
1000 foo bat 78 l.o op
1000 coo cat 678 p.o kt
1001 coo sat 89 a.x hd
1001 foo bat 78 l.o op
1002 foo bat 678 a.x ld
1002 foo bat 78 l.o op
1002 coo cat 678 p.o kt
What i am trying to do is compare 1000 to 1001 and to 1002 and so on. The output I want the code to give is : 1000 is the same as 1002. So, the approach I wanted to use was:
One method I was thinking of was reading each row of a particular identifier into an array/vector and comparing arrays/vectors using a comparison metric (Manhattan distance, cosine similarity etc).
Any help is appreciated, I am very new to Python. Thanks in advance!
Pandas groupby is used for grouping the data according to the categories and apply a function to the categories. It also helps to aggregate data efficiently. Pandas dataframe. groupby() function is used to split the data into groups based on some criteria.
The Hello, World! of pandas GroupBy You call . groupby() and pass the name of the column that you want to group on, which is "state" . Then, you use ["last_name"] to specify the columns on which you want to perform the actual aggregation. You can pass a lot more than just a single column name to .
What is the difference between the pivot_table and the groupby? The groupby method is generally enough for two-dimensional operations, but pivot_table is used for multi-dimensional grouping operations.
You could do something like the following:
import pandas as pd
input_file = pd.read_csv("input.csv")
columns = ['Category1','Category2','Category3','Category4','Category5']
duplicate_entries = {}
for group in input_file.groupby('Identifier'):
# transforming to tuples so that it can be used as keys on a dict
lines = [tuple(y) for y in group[1].loc[:,columns].values.tolist()]
key = tuple(lines)
if key not in duplicate_entries:
duplicate_entries[key] = []
duplicate_entries[key].append(group[0])
Then the duplicate_entries
values will have the list of duplicate Identifiers
duplicate_entries.values()
> [[1000, 1002], [1001]]
EDIT:
To get only the entries that have duplicates, you could have something like:
all_dup = [dup for dup in duplicate_entries if len(dup) > 1]
Explaining the indices (sorry I didn't explained it before): Iterating through the df.groupby
outcome gives a tuple where the first entry is the key of the group (in this case it would be a 'Identifier') and the second one is a Series of the grouped dataframes. So to get the lines that contain the duplicate entries we'd use [1]
and the 'Identifier' for that group is found at [0]
. Because on the duplicate_entries
array we'd like the identifier of that entry, using group[0]
would get us that.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With