I have a dataframe with a names field like this:
print(df)
names
--------------------------------
0 U.S.A.
1 United States of America
2 USA
4 US America
5 Kenyan Footbal League
6 Kenyan Football League
7 Kenya Football League Assoc.
8 Kenya Footbal League Association
9 Tata Motors
10 Tat Motor
11 Tata Motors Ltd.
12 Tata Motor Limited
13 REL
14 Reliance Limited
15 Reliance Co.
Now I want to group all these similar names into one category, so that the final dataframe looks something like this:
print(df)
names group_name
---------------------------------------------
0 U.S.A. USA
1 United States of America USA
2 USA USA
4 US America USA
5 Kenyan Footbal League Kenya Football League
6 Kenyan Football League Kenya Football League
7 Kenya Football League Assoc. Kenya Football League
8 Kenya Footbal League Association Kenya Football League
9 Tata Motors Tata Motors
10 Tat Motor Tata Motors
11 Tata Motors Ltd. Tata Motors
12 Tata Motor Limited Tata Motors
13 REL Reliance
14 Reliance Limited Reliance
15 Reliance Co. Reliance
Now, this is just 16 records, so it's easy to look up all the possible names and their anomalies and create a mapping dictionary by hand. But in reality I have a dataframe with about 5800 unique names (NOTE: 'USA' and 'U.S.A.' are counted as different entities in that count).
So is there any programmatic approach to tackle such a scenario?
I tried fuzzy matching with the difflib and fuzzywuzzy libraries, but even their final results are not reliable. Oftentimes difflib would match names purely on shared words like 'limited' or 'association', even though they referred to two different entities with only 'association' or 'limited' in common.
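To make the failure mode concrete, here is a small sketch with plain difflib (the pair choice is mine, for illustration): a genuinely similar pair and a pair that merely shares the word 'Limited' both score well above zero, so a single ratio cutoff misfires easily.

```python
import difflib

# A genuinely similar pair scores high:
good = difflib.SequenceMatcher(None, "Tata Motors", "Tat Motor").ratio()
print(round(good, 2))      # 0.9

# But two unrelated companies that merely share the word 'Limited'
# still score substantially, inflated by the common suffix alone:
spurious = difflib.SequenceMatcher(None, "Reliance Limited", "Tata Motor Limited").ratio()
print(round(spurious, 2))  # 0.53
```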
Any help is appreciated.
EDIT:
Even if I create a list of stop words such as 'association', 'limited', 'corporation', 'group', etc., there is a chance of missing these stop words when they are written differently. For instance, if 'association' and 'limited' appear as 'assoc.', 'ltd' or 'ltd.', I may miss adding some of these forms to the stop-word list.
I have already tried topic modelling with LDA and NMF; the results were pretty similar to what I had achieved earlier with the difflib and fuzzywuzzy libraries. And yes, I did all the preprocessing (lower-casing, lemmatization, extra-whitespace handling) before any of these approaches.
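One way to make the stop-word idea more robust is to expand known abbreviations before stripping suffix words, so that 'assoc.', 'ltd' and 'ltd.' all collapse to the same token first. A minimal sketch, assuming a hand-built abbreviation map (the map and suffix list below are illustrative, not exhaustive):

```python
import re

# Illustrative abbreviation map and suffix list -- extend for your data.
SUFFIX_MAP = {
    r"\bassoc\b": "association",
    r"\bltd\b": "limited",
    r"\bco\b": "company",
    r"\bcorp\b": "corporation",
}
SUFFIX_WORDS = {"association", "limited", "company", "corporation", "group"}

def normalize(name: str) -> str:
    """Lower-case, strip periods, expand abbreviations, then drop suffix words."""
    s = name.lower().replace(".", "").strip()
    for pattern, full in SUFFIX_MAP.items():
        s = re.sub(pattern, full, s)
    kept = [t for t in s.split() if t not in SUFFIX_WORDS]
    return " ".join(kept) or s  # keep the expanded form if only suffixes remain

print(normalize("Kenya Football League Assoc."))  # kenya football league
print(normalize("Reliance Ltd."))                 # reliance
print(normalize("Tata Motor Limited"))            # tata motor
```

This way only the canonical suffix words need to be listed as stop words; the map absorbs the spelling variants.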
Late answer, after focusing on this for an hour: you can use difflib.SequenceMatcher and keep only the matches whose ratio is greater than 0.6. After normalizing the names column, I drop the last word of each matching name and take the longest remaining string, which gives your desired result:
import difflib

df2 = df.copy()
# Normalize first: collapse the 'America' variants and the 'REL' abbreviation,
# and strip periods (regex=False so '.' is treated literally, not as a regex).
df2.loc[df2.names.str.contains('America'), 'names'] = 'US'
df2['names'] = df2.names.str.replace('.', '', regex=False).str.lstrip()
df2.loc[df2.names.str.contains('REL'), 'names'] = 'Reliance'
# For each name, take every name whose SequenceMatcher ratio exceeds 0.6,
# drop its last word, and keep the longest remaining candidate.
df['group_name'] = df2.names.apply(
    lambda x: max(sorted(i.rsplit(None, 1)[0] for i in df2.names.tolist()
                         if difflib.SequenceMatcher(None, x, i).ratio() > 0.6),
                  key=len))
print(df)
Output:
names group_name
0 U.S.A. USA
1 United States of America USA
2 USA USA
3 US America USA
4 Kenyan Footbal League Kenya Football League
5 Kenyan Football League Kenya Football League
6 Kenya Football League Assoc. Kenya Football League
7 Kenya Footbal League Association Kenya Football League
8 Tata Motors Tata Motors
9 Tat Motor Tata Motors
10 Tata Motors Ltd. Tata Motors
11 Tata Motor Limited Tata Motors
12 REL Reliance
13 Reliance Limited Reliance
14 Reliance Co. Reliance
This is my best-effort solution.
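For what it's worth, the threshold idea also generalizes beyond per-row lookups: treat each name as a node, link any pair whose ratio clears 0.6, and label each connected component with its longest member. A rough sketch with a union-find over a subset of the names (the 0.6 threshold is taken from the code above; the longest-member label choice is naive):

```python
import difflib
from itertools import combinations

names = [
    "Tata Motors", "Tat Motor", "Tata Motors Ltd.", "Tata Motor Limited",
    "Reliance Limited", "Reliance Co.",
]

# Union-find over name indices.
parent = list(range(len(names)))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

# Link every pair whose similarity clears the threshold.
for i, j in combinations(range(len(names)), 2):
    if difflib.SequenceMatcher(None, names[i].lower(), names[j].lower()).ratio() > 0.6:
        union(i, j)

# Each connected component becomes one group, labelled by its longest member.
groups = {}
for i, name in enumerate(names):
    groups.setdefault(find(i), []).append(name)
for members in groups.values():
    print(max(members, key=len), "<-", members)
```

On this subset it yields one Tata component and one Reliance component; transitive links (A~B and B~C) pull names into the same group even when A and C score below the threshold directly.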