
Python string replace for UTF-16-LE file

Tags: python, string

Python 2.6

Python's string.replace() does not seem to work on a UTF-16-LE file. I can think of two approaches:

  1. Find a Python module that can handle Unicode string manipulation.
  2. Convert the target Unicode file to ASCII, use string.replace(), then convert it back. But I am worried that this may cause data loss.

Can the community suggest a good way to solve this? Thanks.

EDIT: My code looks like this:

infile = open(inputfilename)
for s in infile:
 outfile.write(s.replace(targetText, replaceText))

It looks like the for loop parses the lines correctly. Did I make a mistake here?

EDIT2:

I've read the Python Unicode tutorial, tried the code below, and got it working. However, I'm wondering whether there is a better way to do this. Can anyone help? Thanks.

import codecs

infile = codecs.open(infilename, 'r', encoding='utf-16-le')

newlines = []
for line in infile:
    newlines.append(line.replace(originalText,replacementText))

outfile = codecs.open(outfilename, 'w', encoding='utf-16-le')
outfile.writelines(newlines)

Do I need to close infile or outfile?

Stan asked Dec 17 '22 20:12


2 Answers

You don't have a Unicode file. There is no such thing (unless you are the author of NotePad, which conflates "Unicode" and "UTF-16LE").

Please read the Python Unicode HOWTO and Joel on Unicode.

Update: I'm glad the suggested reading helped you. Here's a better version of your code:

import codecs

infile = codecs.open(infilename, 'r', encoding='utf-16-le')
outfile = codecs.open(outfilename, 'w', encoding='utf-16-le')
for line in infile:
    fixed_line = line.replace(originalText,replacementText)
    # no need to save up all the output lines in a list
    outfile.write(fixed_line)
infile.close()
outfile.close()

It's always a good habit to release resources (e.g. close files) immediately when you are finished with them. More importantly, with output files, the directory is usually not updated until you close the file.

Read up on the "with" statement to find out about even better practice with file handling.
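
As a rough sketch of what that could look like, here is the same loop with both files managed by "with". The function name replace_in_file and its parameters are made up for illustration; the nested form is used because a single "with" holding several files needs Python 2.7 or later.

```python
import codecs


def replace_in_file(in_path, out_path, old, new, encoding='utf-16-le'):
    """Copy in_path to out_path, replacing old with new on every line.

    replace_in_file is a hypothetical helper name, not part of any library.
    """
    # Each file is closed automatically when its with block exits,
    # even if an exception is raised part-way through the loop.
    with codecs.open(in_path, 'r', encoding=encoding) as infile:
        with codecs.open(out_path, 'w', encoding=encoding) as outfile:
            for line in infile:
                outfile.write(line.replace(old, new))
```

You would call it as replace_in_file(infilename, outfilename, originalText, replacementText), with the search and replacement text given as unicode strings.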

John Machin answered Dec 30 '22 22:12


Python 3

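In Python 3, if you open a file in text mode (the default) without an encoding argument, Python uses the platform's preferred encoding (locale.getpreferredencoding(False)), which is UTF-8 on most Linux systems. For example, with Python 3.6: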

>>> open('/etc/hosts')
<_io.TextIOWrapper name='/etc/hosts' mode='r' encoding='UTF-8'>

A method like file.readlines() returns str objects, and in Python 3 str is Unicode text. If you open the file in binary mode, the behavior is close to Python 2:

>>> open('/etc/hosts', 'rb')
<_io.BufferedReader name='/etc/hosts'>

In this case readlines will return bytes objects and you must decode in order to get unicode:

>>> type(open('/etc/hosts', 'rb').readline())
bytes

>>> type(open('/etc/hosts', 'rb').readline().decode('utf-8'))
str

You can open your file with a different encoding by passing the encoding argument:

>>> open('/etc/hosts', encoding='ascii')
<_io.TextIOWrapper name='/etc/hosts' mode='r' encoding='ascii'>
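
For the UTF-16-LE file in the question, a minimal Python 3 sketch might look like the following. The function name replace_utf16le is invented for this example, and passing newline='' is an assumption that you want the file's existing line endings left untouched.

```python
def replace_utf16le(in_path, out_path, old, new):
    """Rewrite a UTF-16-LE text file, replacing old with new line by line.

    replace_utf16le is a hypothetical helper name used for illustration.
    """
    # newline='' turns off newline translation, so '\r\n' or '\n' endings
    # are read and written back exactly as they appear in the input.
    with open(in_path, encoding='utf-16-le', newline='') as infile, \
         open(out_path, 'w', encoding='utf-16-le', newline='') as outfile:
        for line in infile:
            # line is a str (Unicode text), so replace works on characters.
            outfile.write(line.replace(old, new))
```

Note that the 'utf-16-le' codec neither strips nor writes a byte order mark. If the input starts with one, it passes through as the character U+FEFF at the start of the first line; the plain 'utf-16' codec handles the mark itself.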

Python 2 (this is a very old answer)

Python 2 does not care about encoding; a file is just a stream of bytes. A method like file.readlines() returns str objects, not unicode, even if you open the file in text mode. You can convert each line to a unicode object using str.decode('your-file-encoding').

>>> f = open('/etc/issue')
>>> l = f.readline()
>>> l
'Ubuntu 10.04.1 LTS \\n \\l\n'
>>> type(l)
<type 'str'>
>>> u = l.decode('utf-8')
>>> type(u)
<type 'unicode'>
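
Applied to the UTF-16-LE case from the question, the same decode step suggests why a byte-level replace finds nothing: every ASCII character is stored as two bytes, so the plain byte string never appears in the file data. A small sketch with made-up sample text (it runs on Python 2.6+ and Python 3):

```python
# Sample data standing in for the raw bytes read from a UTF-16-LE file.
data = u'foo bar'.encode('utf-16-le')

# Each character occupies two bytes (for example 'f' is stored as 'f\x00'),
# so searching the raw bytes for the ASCII byte string fails.
found_in_bytes = b'foo' in data

# Decode first, replace on the unicode text, then encode back.
text = data.decode('utf-16-le')
fixed = text.replace(u'foo', u'baz').encode('utf-16-le')
```

Here found_in_bytes is False, while the decoded text is replaced as expected.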

You can get results similar to Python 3 using codecs.open instead of just open.

Paulo Scardine answered Dec 30 '22 21:12