Problem processing visually identical characters (umlauts)

This is probably a more general character-encoding issue, but since I came across it while coding an outer join of two dataframes, I am posting it with a Python code example.

Bottom line, the question is: why is the ö in one file technically not the identical character as the ö in another, and how can I make sure both are identical not only visually but also technically? If you copy-paste both characters into a text editor and search for one of them, you will never find both!

Here is the Python example, which attempts a simple outer join of two dataframes on the column 'filename' (shown here as CSV data):

df1:

filename;abstract
problematic_ö.txt;abc
non-problematic_ö.txt;yxz

df2:

bytes;filename
374;problematic_ö.txt
128;non-problematic_ö.txt

Python code:

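import csv
import pandas as pd

# Read both semicolon-separated files; the first row holds the column names
df1 = pd.read_csv('df1.csv', header=0, sep=';')
df2 = pd.read_csv('df2.csv', header=0, sep=';')

print(df1)
print(df2)

# Outer join on the shared 'filename' column; '_merge' records where each row came from
df_outerjoin = pd.merge(df1, df2, how='outer', on='filename', indicator=True)
df_outerjoin.to_csv('df_outerjoin.csv', sep=';', index=False, header=True, quoting=csv.QUOTE_NONNUMERIC)

print(df_outerjoin)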

Output:

#               filename   abstract     bytes        _merge
1      problematic_ö.txt        abc       NaN     left_only
2  non-problematic_ö.txt        yxz     128.0          both
3      problematic_ö.txt        NaN     374.0    right_only

So the 'ö' in the problematic filename isn't recognised as the same character as 'ö' in the non-problematic filename.

What is happening here?

What can I do to overcome this issue: can I do something "smart" by importing the data files with a special encoding setting, or will I have to do a dumb search and replace?

Madamadam asked Nov 17 '25 11:11


2 Answers

There are several ways to represent the same character in Unicode. In your case, the 'ö' in the problematic filename is actually two Unicode code points: 'o' (U+006F, Latin Small Letter O) followed by the combining character U+0308 (Combining Diaeresis). The non-problematic filename, on the other hand, uses the precomposed character 'ö' (U+00F6, Latin Small Letter O with Diaeresis), which is a single Unicode code point.
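To check which representation a given string actually uses, you can list its code points. This is only a minimal sketch: the helper name `describe` is made up for illustration, and the two strings are written with escape sequences so that the difference survives copy-paste.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    # 'describe' is a hypothetical helper name, not part of any library
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

decomposed = "o\u0308"   # 'o' followed by a combining diaeresis
precomposed = "\u00f6"   # single precomposed character

print(len(decomposed), describe(decomposed))
print(len(precomposed), describe(precomposed))
```

The first string has length 2 and the second has length 1, even though both render as the same glyph.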

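You can use the standard library's unicodedata module, specifically unicodedata.normalize. Normalizing both strings to the same form (for example NFC, which composes such sequences into the precomposed character where one exists) makes them compare equal.

It works like this: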

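import unicodedata

a = "o\u0308"  # decomposed: 'o' followed by U+0308 (combining diaeresis)
b = "\u00f6"   # precomposed: U+00F6 (o with diaeresis)
print(a == b)  # the code point sequences differ

# Normalize both strings to the composed form (NFC)
a = unicodedata.normalize('NFC', a)
b = unicodedata.normalize('NFC', b)

print(a == b)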

Output:

False
True

Nejc answered Nov 20 '25 02:11


Rather than calling unicodedata yourself, you can use the pandas method Series.str.normalize(form), so something like:

df1['filename'] = df1['filename'].str.normalize('NFC')
df2['filename'] = df2['filename'].str.normalize('NFC')

Do this before the outer join.
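As a rough end-to-end sketch of that approach, the example below builds the two dataframes in memory instead of reading the CSV files. It assumes the decomposed 'ö' is in df1 (the question does not say which file holds it), and `normalize_key` is a hypothetical helper name.

```python
import pandas as pd

# Assumption: df1 holds the decomposed form (o + U+0308) in the problematic name
df1 = pd.DataFrame({
    "filename": ["problematic_o\u0308.txt", "non-problematic_\u00f6.txt"],
    "abstract": ["abc", "yxz"],
})
df2 = pd.DataFrame({
    "bytes": [374, 128],
    "filename": ["problematic_\u00f6.txt", "non-problematic_\u00f6.txt"],
})

def normalize_key(df, column="filename", form="NFC"):
    """Return a copy of df with the join column Unicode-normalized."""
    out = df.copy()
    out[column] = out[column].str.normalize(form)
    return out

# Without normalization the problematic filename shows up twice
before = pd.merge(df1, df2, how="outer", on="filename", indicator=True)

# With both key columns normalized to NFC the rows line up
after = pd.merge(normalize_key(df1), normalize_key(df2),
                 how="outer", on="filename", indicator=True)

print(before["_merge"].value_counts())
print(after)
```

Normalizing a copy keeps the original dataframes untouched, which may help if the raw filenames are still needed to open the files on disk.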

Andj answered Nov 20 '25 00:11