
Python3: Convert Latin-1 to UTF-8 [duplicate]

My code looks like the following:

import codecs
import glob
import os

for file in glob.iglob(os.path.join(dir, '*.txt')):
    print(file)
    with codecs.open(file, encoding='latin-1') as f:
        infile = f.read()

with codecs.open('test.txt', mode='w', encoding='utf-8') as f:
    f.write(infile)

The files I work with are encoded in Latin-1 (I could not open them as UTF-8, obviously), but I want to write the resulting files in UTF-8.

But this:

<Trans audio_filename="VALE_M11_070.MP3" xml:lang="español">
<Datos clave_texto=" VALE_M11_070" tipo_texto="entrevista_semidirigida">
<Corpus corpus="PRESEEA" subcorpus="ESESUMA" ciudad="Valencia" pais="España"/>

Instead becomes this (in gedit):

<Trans audio_filename="VALE_M11_070.MP3" xml:lang="espa뇃漀氀∀㸀ഀ਀㰀䐀愀琀`漀猀 挀氀愀瘀攀开琀攀砀琀漀㴀∀ 嘀䄀䰀䔀开䴀㄀㄀开 㜀

If I print it in the terminal, it shows up normally.

Even more confusing is what I get when I open the resulting file with LibreOffice Writer:

<#T#r#a#n#s# (and so on)

So how do I properly convert a Latin-1 string to a UTF-8 string? In Python 2 it's easy, but in Python 3 it seems confusing to me.

I have already tried these in different combinations:

#infile = bytes(infile,'utf-8').decode('utf-8')
#infile = infile.encode('utf-8').decode('utf-8')
#infile = bytes(infile,'utf-8').decode('utf-8')

But somehow I always end up with the same weird output.

Thanks in advance!

Edit: This question is different from the questions linked in the comment, as it concerns Python 3, not Python 2.7.

I.P. asked Nov 09 '16 17:11


2 Answers

I have found a partial workaround for this. It is not what you want or need, but it might point others in the right direction...

# First read the file as Latin-1
with open("file_name", "r", encoding="latin-1") as txt:  # r = read, w = write & a = append
    items = txt.readlines()

# and write the changes back to the file as UTF-8
with open("file_name", "w", encoding="utf-8") as output:
    for string_fin in items:
        # "Ã©" is what a UTF-8 "é" looks like when decoded as Latin-1
        string_fin = string_fin.replace("Ã©", "é")
        # "Ã«" is what a UTF-8 "ë" looks like when decoded as Latin-1
        string_fin = string_fin.replace("Ã«", "ë")

        # this works if not too much needs changing...

        output.write(string_fin)
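
As a more general alternative to replacing characters one by one, the whole conversion could be sketched as below: decode each file as Latin-1 and write the same text back out as UTF-8, one output file per input file. The function name `convert_latin1_to_utf8` and both directory arguments are hypothetical, and the sketch assumes the inputs really are Latin-1.

```python
import glob
import os


def convert_latin1_to_utf8(src_dir, dst_dir):
    """Re-encode every *.txt file in src_dir from Latin-1 to UTF-8.

    The function name and both directory arguments are hypothetical.
    """
    os.makedirs(dst_dir, exist_ok=True)
    written = []
    for path in sorted(glob.glob(os.path.join(src_dir, "*.txt"))):
        # Decode the bytes on disk as Latin-1 into a Python str;
        # newline="" keeps the original line endings untouched
        with open(path, encoding="latin-1", newline="") as src:
            text = src.read()
        # Write the same text under the same file name, encoded as UTF-8
        out_path = os.path.join(dst_dir, os.path.basename(path))
        with open(out_path, "w", encoding="utf-8", newline="") as dst:
            dst.write(text)
        written.append(out_path)
    return written
```

For example, `convert_latin1_to_utf8("corpus", "corpus_utf8")` would leave the originals untouched and return the list of paths it wrote (both directory names are made up).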

*note for detection
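
On detection: as a rough sketch, the encoding could be guessed from the raw bytes before decoding. A stray symbol between every character, like the LibreOffice output in the question, may hint that the file is actually UTF-16 rather than Latin-1. The helper name `guess_encoding` is hypothetical, and this is only a heuristic that checks for a byte order mark and then for valid UTF-8.

```python
import codecs


def guess_encoding(raw):
    """Guess an encoding for raw bytes (hypothetical helper, heuristic only)."""
    # A byte order mark at the start is the strongest hint
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Otherwise, see whether the bytes happen to be valid UTF-8
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so decoding never fails;
        # that also means it cannot confirm that the guess is right
        return "latin-1"
```

The result can be passed straight to `open(path, encoding=...)` after reading the first bytes of the file in binary mode.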

Flummox - don't be evil SE answered Nov 08 '22 04:11


For Python 3.6:

your_str = your_str.encode('utf-8').decode('latin-1')
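
To see what this kind of round trip does, here is a small sketch using a sample string (the variable names are made up for illustration). In Python 3 a `str` holds Unicode characters, and an encoding only comes into play when converting to or from `bytes`, so whether a line like the one above helps depends on what the string already contains.

```python
original = "España"

# Encode to UTF-8 bytes, then read those bytes back as Latin-1:
# "ñ" is two bytes in UTF-8 (0xC3 0xB1), so it becomes two characters
reinterpreted = original.encode("utf-8").decode("latin-1")
print(reinterpreted)  # EspaÃ±a

# The reverse pair undoes that reinterpretation
restored = reinterpreted.encode("latin-1").decode("utf-8")

# Encoding and decoding with the same codec leaves a str unchanged,
# which is why the attempts in the question had no visible effect
unchanged = original.encode("utf-8").decode("utf-8")
```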
Frenzi answered Nov 08 '22 06:11