This is probably a more general character-encoding issue, but since I came across it while coding an outer join of two dataframes, I am posting it with a Python code example.
The bottom line is: why is ö technically not the same character as ö, and how can I make sure that both are identical not only visually but also technically?
If you copy-paste both characters into a text editor and search for one of them, the search will not find the other!
Here is the Python example, which tries to do a simple outer join of two dataframes on the column 'filename' (the data is presented here as CSV):
df1:
filename;abstract
problematic_ö.txt;abc
non-problematic_ö.txt;yxz
df2:
bytes;filename
374;problematic_ö.txt
128;non-problematic_ö.txt
Python code:
import csv
import pandas as pd
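# Both files are semicolon-separated with a header row.
df1 = pd.read_csv('df1.csv', header=0, sep=';')
df2 = pd.read_csv('df2.csv', header=0, sep=';')
print(df1)
print(df2)
# Outer join on the shared 'filename' column; indicator=True adds the '_merge' column.
df_outerjoin = pd.merge(df1, df2, on='filename', how='outer', indicator=True)
df_outerjoin.to_csv('df_outerjoin.csv', sep=';', index=False, header=True, quoting=csv.QUOTE_NONNUMERIC)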
print(df_outerjoin)
Output:
# filename abstract bytes _merge
1 problematic_ö.txt abc NaN left_only
2 non-problematic_ö.txt yxz 128.0 both
3 problematic_ö.txt NaN 374.0 right_only
So the 'ö' in the problematic filename isn't recognised as the same character as 'ö' in the non-problematic filename.
What is happening here?
What can I do to overcome this issue? Can I do something "smart" by importing the data files with a special encoding setting, or will I have to do a dumb search and replace?
Unicode can represent the same visible character in more than one way. In your situation, the problematic filename contains an 'ö' that is actually made up of two Unicode code points: 'o' (U+006F, Latin Small Letter O) followed by the combining character U+0308 (Combining Diaeresis). The non-problematic filename, on the other hand, uses the precomposed character 'ö' (U+00F6, Latin Small Letter O with Diaeresis), which is a single Unicode code point. The two sequences render identically, but a plain string comparison sees them as different.
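To see the difference for yourself, you can print the code points that make up each string. This is a minimal sketch using only the standard library; the helper name `describe` is made up for illustration, and the two strings are written with escape sequences so that copy-pasting cannot silently change them.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

decomposed = "o\u0308"   # 'o' followed by a combining diaeresis (two code points)
precomposed = "\u00f6"   # single precomposed character

print(len(decomposed), describe(decomposed))
# 2 [('U+006F', 'LATIN SMALL LETTER O'), ('U+0308', 'COMBINING DIAERESIS')]
print(len(precomposed), describe(precomposed))
# 1 [('U+00F6', 'LATIN SMALL LETTER O WITH DIAERESIS')]
```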
You can fix this with unicodedata.normalize from the standard-library unicodedata module, which converts both strings to the same normalization form.
It works like this:
import unicodedata
a = "o\u0308"  # 'o' + combining diaeresis (two code points)
b = "\u00f6"   # precomposed 'ö' (one code point)
print(a == b)
a = unicodedata.normalize('NFC', a)
b = unicodedata.normalize('NFC', b)
print(a == b)
Output:
False
True
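If you first want to find out which values are affected, a check along these lines may help. It is only a sketch: the sample list is hypothetical (with the first entry deliberately written in decomposed form), and unicodedata.is_normalized requires Python 3.8 or newer.

```python
import unicodedata

# Hypothetical sample values; the first one uses 'o' + combining diaeresis.
filenames = ["problematic_o\u0308.txt", "non-problematic_\u00f6.txt"]

# Keep only the names that are not already in NFC form.
not_nfc = [name for name in filenames if not unicodedata.is_normalized("NFC", name)]
print(len(not_nfc))  # 1
```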
Alternatively, pandas provides the method Series.str.normalize(form), so you don't have to call unicodedata yourself. Normalize the key column in both dataframes before doing the outer join:
df1['filename'] = df1['filename'].str.normalize('NFC')
df2['filename'] = df2['filename'].str.normalize('NFC')
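Putting it together, the following sketch rebuilds the two dataframes from the question in memory and shows the join matching once the key column is normalized. Which of the two files holds the decomposed form is an assumption here (df1 is used for illustration), and the strings are written with escapes to keep the two forms distinct.

```python
import pandas as pd

# Assumption: df1 holds the decomposed spelling, df2 the precomposed one.
df1 = pd.DataFrame({
    "filename": ["problematic_o\u0308.txt", "non-problematic_\u00f6.txt"],
    "abstract": ["abc", "yxz"],
})
df2 = pd.DataFrame({
    "bytes": [374, 128],
    "filename": ["problematic_\u00f6.txt", "non-problematic_\u00f6.txt"],
})

# Bring the join key to the same normalization form in both dataframes.
for df in (df1, df2):
    df["filename"] = df["filename"].str.normalize("NFC")

merged = pd.merge(df1, df2, on="filename", how="outer", indicator=True)
print(merged["_merge"].tolist())  # ['both', 'both']
```

If you would rather normalize while loading the files, read_csv accepts a converters argument that maps a column name to a function, which could apply unicodedata.normalize to 'filename' at import time.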