Below is a sample of my df
name
A S BITO
A S KIGEL
A S NATURENERGI
A S NATURENERGIE
A S NATURENERGIE
A S P BU SERVICE POWER P
A S P BU SERVICE POWER P
A S P BU SERVICE POWER PETER GMBH
A S P GMBH
A RESE LAND
A RITTER WITH SA
A RITTER WITH SA
A RITTER WITH SA
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER WITH MASCHINE
A RITTER WITH MASCHINE SA CO
A RITTER WITH MASCHINE SA CO
The aim is to replace the name by unique value with most occurence
Below is a list of unique values
name occurences
A S BITO 1
A S KIGEL 1
A S NATURENERGI 1
A S NATURENERGIE 2
A S P BU SERVICE POWER P 2
A S P BU SERVICE POWER PETER GMBH 1
A S P GMBH 1
A RESE LAND 1
A RITTER WITH SA 3
A RITTER SA CO 4
A RITTER WITH MASCHINE 1
A RITTER WITH MASCHINE SA CO 2
As you can see with the DF, some names can be grouped.
However, due to misspelling, there aren't.
The desired output would look like this
name
A S BITO
A S KIGEL
A S NATURENERGIE
A S NATURENERGIE
A S NATURENERGIE
A S P BU SERVICE POWER P
A S P BU SERVICE POWER P
A S P BU SERVICE POWER P
A S P GMBH
A RESE LAND
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
A RITTER SA CO
Below is the code tried
df['name'] = df['name'].replace('A S NATURENERGI', 'A S NATURENERGIE')
df['name'] = df['name'].replace('A S P BU SERVICE POWER PETER GMBH', 'A S P BU SERVICE POWER P')
df['name'] = df['name'].replace('A RITTER WITH SA', 'A RITTER SA CO')
df['name'] = df['name'].replace('A RITTER WITH MASCHINE', 'A RITTER SA CO')
df['name'] = df['name'].replace('A RITTER WITH MASCHINE SA CO ', 'A RITTER SA CO')
However, it might not be the best way to deal with that.
Thus, I was thinking about using difflib and computing a matching score.
The next step will be to replace the name with the highest score match.
f = partial(difflib.get_close_matches, possibilities= df['name'].tolist(), n=1) #
matches = df['name'].map(f).str[0].fillna('')
scores = [difflib.SequenceMatcher(None, x, y).ratio() for x, y in zip(matches, df['name'])]
df_diff = df.assign(best=matches, score=scores)
The downside of this method is that I will retrieve exactly the same name...
So, if anyone has some ideas, thanks a lot!
I create a custom function which iterates mapping in a pandas series:
import difflib
def similarity_replace(series):
reverse_map = {}
diz_map = {}
for i,s in series.iteritems():
diz_map[s] = s.replace(" ", "")
reverse_map[s.replace(" ", "")] = s
best_match = {}
uni = list(set(diz_map.values()))
for w in uni:
best_match[w] = sorted(difflib.get_close_matches(w, uni, n=3, cutoff=0.6), key=len)[0]
return series.map(diz_map).map(best_match).map(reverse_map)
here the example:
name = pd.Series(['A S BITO',
'A S KIGEL',
'A S NATURENERGI',
'A S NATURENERGIE',
'A S NATURENERGIE',
'A S P BU SERVICE POWER P',
'A S P BU SERVICE POWER P',
'A S P BU SERVICE POWER PETER GMBH',
'A S P GMBH',
'A RESE LAND',
'A RITTER WITH SA',
'A RITTER WITH SA',
'A RITTER WITH SA',
'A RITTER SA CO',
'A RITTER SA CO',
'A RITTER SA CO',
'A RITTER SA CO',
'A RITTER WITH MASCHINE',
'A RITTER WITH MASCHINE SA CO',
'A RITTER WITH MASCHINE SA CO'])
similarity_replace(similarity_replace(name))
output:
0 A S BITO
1 A S KIGEL
2 A S NATURENERGI
3 A S NATURENERGI
4 A S NATURENERGI
5 A S P BU SERVICE POWER P
6 A S P BU SERVICE POWER P
7 A S P BU SERVICE POWER P
8 A S P GMBH
9 A RESE LAND
10 A RITTER SA CO
11 A RITTER SA CO
12 A RITTER SA CO
13 A RITTER SA CO
14 A RITTER SA CO
15 A RITTER SA CO
16 A RITTER SA CO
17 A RITTER SA CO
18 A RITTER SA CO
19 A RITTER SA CO
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With