Problem processing visually identical characters (umlauts)

Question

This is probably a more general character-encoding issue, but since I ran into it while coding an outer join of two dataframes, I am posting it with a Python code example.

On the bottom line the question is: why is ö technically not the identical character as ö and how can I make sure, both are not only visually identical but also technically? If you copy-paste both characters in a text editor and do a search for one of them, you will never find both!

Here is the Python example, which attempts a simple outer join of two dataframes on the column 'filename' (shown here as CSV data):

df1:

filename;abstract
problematic_ö.txt;abc
non-problematic_ö.txt;yxz

df2:

bytes;filename
374;problematic_ö.txt
128;non-problematic_ö.txt

Python code:

import csv
import pandas as pd

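# Both files are semicolon-separated and have a header row
df1 = pd.read_csv('df1.csv', header=0, sep=';')
df2 = pd.read_csv('df2.csv', header=0, sep=';')

print(df1)
print(df2)

# Outer join on the shared 'filename' column; indicator=True adds the _merge column
df_outerjoin = pd.merge(df1, df2, on='filename', how='outer', indicator=True)
df_outerjoin.to_csv('df_outerjoin.csv', sep=';', index=False, header=True, quoting=csv.QUOTE_NONNUMERIC)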

print(df_outerjoin)

Output:

#               filename   abstract     bytes        _merge
1      problematic_ö.txt        abc       NaN     left_only
2  non-problematic_ö.txt        yxz     128.0          both
3      problematic_ö.txt        NaN     374.0    right_only

So the 'ö' in the problematic filename isn't recognised as the same character as 'ö' in the non-problematic filename.

What is happening here?

What can I do to overcome this issue? Can I do something "smart" by importing the data files with a special encoding setting, or will I have to do a dumb search and replace?

Nejc · Accepted Answer

Unicode can represent the same visible character in more than one way. In your case, the 'ö' in the problematic filename is actually stored as two Unicode code points: 'o' (U+006F, Latin Small Letter O) followed by the combining character U+0308 (Combining Diaeresis). The non-problematic filename, on the other hand, uses the precomposed character U+00F6 (Latin Small Letter O with Diaeresis), which is a single Unicode code point. The two render identically, but they are different strings, so a comparison or a join key match fails.
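To see the difference for yourself, you can list the code points in each string. This is only a small sketch: the helper name describe is made up for illustration, and the two strings are written with escape sequences so that an editor or clipboard cannot silently convert one form into the other.

```python
import unicodedata

# Escapes make the two forms unambiguous in source code.
decomposed = "o\u0308"    # 'o' followed by COMBINING DIAERESIS
precomposed = "\u00f6"    # LATIN SMALL LETTER O WITH DIAERESIS


def describe(text):
    """Return one 'U+XXXX NAME' entry per code point in text (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


print(describe(decomposed))
# ['U+006F LATIN SMALL LETTER O', 'U+0308 COMBINING DIAERESIS']
print(describe(precomposed))
# ['U+00F6 LATIN SMALL LETTER O WITH DIAERESIS']

# Both forms are valid UTF-8, they just encode to different bytes.
print(decomposed.encode("utf-8"))   # b'o\xcc\x88'
print(precomposed.encode("utf-8"))  # b'\xc3\xb6'
```

Since both byte sequences are perfectly valid UTF-8, changing the encoding argument when reading the files should not make the two filenames compare equal. The strings have to be normalized after decoding.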

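You can fix this with unicodedata.normalize from the standard-library unicodedata module. Normalizing both strings to the same form (here NFC, the composed form) makes them compare equal.

It works like this: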

import unicodedata

a = "ö"
b = "ö"
print(a == b)

a = unicodedata.normalize('NFC', a)
b = unicodedata.normalize('NFC', b)

print(a == b)

Output:

False
True
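If you need this comparison in several places, it can be wrapped in a small function. The sketch below assumes a helper called same_text (not part of any library) and uses escape sequences for the two filenames. It also shows that NFD, the decomposed form, works just as well, as long as both sides use the same form.

```python
import unicodedata


def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form (hypothetical helper)."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


decomposed = "problematic_o\u0308.txt"   # 'o' + combining diaeresis
precomposed = "problematic_\u00f6.txt"   # single precomposed code point

print(decomposed == precomposed)                  # False: raw code points differ
print(same_text(decomposed, precomposed))         # True: both composed (NFC)
print(same_text(decomposed, precomposed, "NFD"))  # True: both decomposed (NFD)
```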

Andj · Answer

Rather than calling unicodedata yourself, you can use the pandas method Series.str.normalize(form) on the key column, for example:

df1['filename'] = df1['filename'].str.normalize('NFC')
df2['filename'] = df2['filename'].str.normalize('NFC')

Do this before the outer join.
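Put together, the whole thing might look like the sketch below. The two dataframes are built in memory as stand-ins for df1.csv and df2.csv. Which of the two files holds the decomposed 'ö' is an assumption made for the demo, since it cannot be told from the printed data.

```python
import pandas as pd

# In-memory stand-ins for df1.csv and df2.csv (assumed contents).
# Assumption: df1 stores the problematic name in decomposed form.
df1 = pd.DataFrame({
    "filename": ["problematic_o\u0308.txt", "non-problematic_\u00f6.txt"],
    "abstract": ["abc", "yxz"],
})
df2 = pd.DataFrame({
    "bytes": [374, 128],
    "filename": ["problematic_\u00f6.txt", "non-problematic_\u00f6.txt"],
})

# Without normalization the problematic name appears twice (3 rows).
raw = pd.merge(df1, df2, on="filename", how="outer", indicator=True)
print(len(raw))

# Normalize the join key in both frames to NFC, then merge again.
for df in (df1, df2):
    df["filename"] = df["filename"].str.normalize("NFC")

merged = pd.merge(df1, df2, on="filename", how="outer", indicator=True)
print(merged)
```

After normalization every row should be marked both in the _merge column, and the result has two rows instead of three.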

Tags: python, character-encoding, utf

Madamadam
