I have a dataframe with a names field like this:
print(df)
names
--------------------------------
0 U.S.A.
1 United States of America
2 USA
4 US America
5 Kenyan Footbal League
6 Kenyan Football League
7 Kenya Football League Assoc.
8 Kenya Footbal League Association
9 Tata Motors
10 Tat Motor
11 Tata Motors Ltd.
12 Tata Motor Limited
13 REL
14 Reliance Limited
15 Reliance Co.
Now I want to group all these similar names into one category, so that the final dataframe looks something like this:
print(df)
names group_name
---------------------------------------------
0 U.S.A. USA
1 United States of America USA
2 USA USA
4 US America USA
5 Kenyan Footbal League Kenya Football League
6 Kenyan Football League Kenya Football League
7 Kenya Football League Assoc. Kenya Football League
8 Kenya Footbal League Association Kenya Football League
9 Tata Motors Tata Motors
10 Tat Motor Tata Motors
11 Tata Motors Ltd. Tata Motors
12 Tata Motor Limited Tata Motors
13 REL Reliance
14 Reliance Limited Reliance
15 Reliance Co. Reliance
Now, this is just 16 records, so it's easy to look up all the possible names and their anomalies and create a mapping dictionary by hand. But in reality I have a dataframe with about 5800 unique names (NOTE: 'USA' and 'U.S.A.' are counted as different entities in that count).
So is there any programmatic approach to tackle such a scenario?
I tried fuzzy matching with the difflib and fuzzywuzzy libraries, but even their final results are not reliable. Oftentimes difflib would match names purely on shared words like 'limited' or 'association', even though they referred to two different entities with only 'association' or 'limited' in common.
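To make the failure mode concrete, here is a small sketch with plain difflib (the pair choice is mine, for illustration): a genuinely similar pair and a pair that merely shares the word 'Limited' both score well above zero, so a single ratio cutoff misfires easily.

```python
import difflib

# A genuinely similar pair scores high:
good = difflib.SequenceMatcher(None, "Tata Motors", "Tat Motor").ratio()
print(round(good, 2))      # 0.9

# But two unrelated companies that merely share the word 'Limited'
# still score substantially, inflated by the common suffix alone:
spurious = difflib.SequenceMatcher(None, "Reliance Limited", "Tata Motor Limited").ratio()
print(round(spurious, 2))  # 0.53
```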
Any help is appreciated.
EDIT:
Even if I create a list of stop words such as 'association', 'limited', 'corporation', 'group', etc., there is a chance of missing these stop words when they are written differently. For instance, if 'association' and 'limited' appear as 'assoc.', 'ltd' or 'ltd.', I may miss adding some of these forms to the stop-word list.
I have already tried topic modelling with LDA and NMF; the results were pretty similar to what I had achieved earlier with the difflib and fuzzywuzzy libraries. And yes, I did all the preprocessing (lower-casing, lemmatization, extra-whitespace handling) before any of these approaches.
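One way to make the stop-word idea more robust is to expand known abbreviations before stripping suffix words, so that 'assoc.', 'ltd' and 'ltd.' all collapse to the same token first. A minimal sketch, assuming a hand-built abbreviation map (the map and suffix list below are illustrative, not exhaustive):

```python
import re

# Illustrative abbreviation map and suffix list -- extend for your data.
SUFFIX_MAP = {
    r"\bassoc\b": "association",
    r"\bltd\b": "limited",
    r"\bco\b": "company",
    r"\bcorp\b": "corporation",
}
SUFFIX_WORDS = {"association", "limited", "company", "corporation", "group"}

def normalize(name: str) -> str:
    """Lower-case, strip periods, expand abbreviations, then drop suffix words."""
    s = name.lower().replace(".", "").strip()
    for pattern, full in SUFFIX_MAP.items():
        s = re.sub(pattern, full, s)
    kept = [t for t in s.split() if t not in SUFFIX_WORDS]
    return " ".join(kept) or s  # keep the expanded form if only suffixes remain

print(normalize("Kenya Football League Assoc."))  # kenya football league
print(normalize("Reliance Ltd."))                 # reliance
print(normalize("Tata Motor Limited"))            # tata motor
```

This way only the canonical suffix words need to be listed as stop words; the map absorbs the spelling variants.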
Late answer, after focusing on this for an hour: you can use difflib.SequenceMatcher and keep only the matches whose ratio is greater than 0.6. After normalizing the names column, I drop the last word of each matching name and take the longest remaining string, which gives your desired result:
import difflib

df2 = df.copy()
# Normalize first: collapse the 'America' variants and the 'REL' abbreviation,
# and strip periods (regex=False so '.' is treated literally, not as a regex).
df2.loc[df2.names.str.contains('America'), 'names'] = 'US'
df2['names'] = df2.names.str.replace('.', '', regex=False).str.lstrip()
df2.loc[df2.names.str.contains('REL'), 'names'] = 'Reliance'
# For each name, take every name whose SequenceMatcher ratio exceeds 0.6,
# drop its last word, and keep the longest remaining candidate.
df['group_name'] = df2.names.apply(
    lambda x: max(sorted(i.rsplit(None, 1)[0] for i in df2.names.tolist()
                         if difflib.SequenceMatcher(None, x, i).ratio() > 0.6),
                  key=len))
print(df)
Output:
names group_name
0 U.S.A. USA
1 United States of America USA
2 USA USA
3 US America USA
4 Kenyan Footbal League Kenya Football League
5 Kenyan Football League Kenya Football League
6 Kenya Football League Assoc. Kenya Football League
7 Kenya Footbal League Association Kenya Football League
8 Tata Motors Tata Motors
9 Tat Motor Tata Motors
10 Tata Motors Ltd. Tata Motors
11 Tata Motor Limited Tata Motors
12 REL Reliance
13 Reliance Limited Reliance
14 Reliance Co. Reliance
This is my best-effort solution.
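For what it's worth, the threshold idea also generalizes beyond per-row lookups: treat each name as a node, link any pair whose ratio clears 0.6, and label each connected component with its longest member. A rough sketch with a union-find over a subset of the names (the 0.6 threshold is taken from the code above; the longest-member label choice is naive):

```python
import difflib
from itertools import combinations

names = [
    "Tata Motors", "Tat Motor", "Tata Motors Ltd.", "Tata Motor Limited",
    "Reliance Limited", "Reliance Co.",
]

# Union-find over name indices.
parent = list(range(len(names)))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

# Link every pair whose similarity clears the threshold.
for i, j in combinations(range(len(names)), 2):
    if difflib.SequenceMatcher(None, names[i].lower(), names[j].lower()).ratio() > 0.6:
        union(i, j)

# Each connected component becomes one group, labelled by its longest member.
groups = {}
for i, name in enumerate(names):
    groups.setdefault(find(i), []).append(name)
for members in groups.values():
    print(max(members, key=len), "<-", members)
```

On this subset it yields one Tata component and one Reliance component; transitive links (A~B and B~C) pull names into the same group even when A and C score below the threshold directly.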