I downloaded a dataset of Facebook messages, and the text was formatted like this:
f\u00c3\u00b8rste student
It's supposed to be første student
but I can't seem to decode it correctly.
I tried:
text = 'f\u00c3\u00b8rste student'  # named text to avoid shadowing the built-in str
print(text)
# første student
text = 'f\u00c3\u00b8rste student'
print(text.encode('utf-8'))
# b'f\xc3\x83\xc2\xb8rste student'
But it didn't work.
To undo whatever encoding foul-up has taken place, first convert the characters to the bytes with the same ordinals by encoding as ISO-8859-1 (Latin-1), then decode those bytes as UTF-8:
>>> 'f\u00c3\u00b8rste student'.encode('iso-8859-1').decode('utf-8')
'første student'
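If you need to apply this to a whole dataset, some strings may already be correct, and the round trip would raise an exception on them. Here is a minimal sketch of a wrapper that handles that case. The name fix_mojibake is hypothetical, and it assumes the only damage is UTF-8 bytes that were wrongly decoded as Latin-1:

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1 (assumed damage)."""
    try:
        # Map each character back to the byte with the same ordinal,
        # then decode those bytes as the UTF-8 they originally were.
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The string either contains characters outside Latin-1 or does not
        # form valid UTF-8 after the round trip, so leave it unchanged.
        return text


print(fix_mojibake("f\u00c3\u00b8rste student"))  # første student
print(fix_mojibake("første student"))             # første student (unchanged)
```

Falling back to the original string is a design choice: a correctly decoded 'ø' becomes the single byte 0xF8, which is not valid UTF-8, so the decode fails and the text is returned as-is instead of crashing the run.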