Python 2.6
Using Python's string.replace() doesn't seem to work on a UTF-16-LE file. I can think of a couple of ways to do it, but can the community suggest a good way to solve this? Thanks.
EDIT: My code looks like this:
infile = open(inputfilename)
outfile = open(outputfilename, 'w')
for s in infile:
    outfile.write(s.replace(targetText, replaceText))
It looks like the for loop can parse the lines correctly. Did I make any mistakes here?
EDIT2:
I've read the Python Unicode tutorial and tried the code below, and got it working. However, I'm just wondering if there's a better way to do this. Can anyone help? Thanks.
import codecs

infile = codecs.open(infilename, 'r', encoding='utf-16-le')
newlines = []
for line in infile:
    newlines.append(line.replace(originalText, replacementText))
outfile = codecs.open(outfilename, 'w', encoding='utf-16-le')
outfile.writelines(newlines)
Do I need to close infile or outfile?
You don't have a Unicode file. There is no such thing (unless you are the author of NotePad, which conflates "Unicode" and "UTF-16LE").
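Concretely, the replace() in your first snippet most likely finds nothing to replace. A quick illustration (not part of your code): UTF-16-LE stores each ASCII character as two bytes (the character followed by a NUL byte), so the raw bytes of the file never contain the plain one-byte-per-character search string.

raw = u'hello world'.encode('utf-16-le')
print(repr(raw))                               # 'h\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00'
print(b'hello' in raw)                         # False - the ASCII pattern never matches
print(raw.replace(b'hello', b'howdy') == raw)  # True - replace() changed nothing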
Please read the Python Unicode HOWTO and Joel on Unicode.
Update: I'm glad the suggested reading helped you. Here's a better version of your code:
import codecs

infile = codecs.open(infilename, 'r', encoding='utf-16-le')
outfile = codecs.open(outfilename, 'w', encoding='utf-16-le')
for line in infile:
    fixed_line = line.replace(originalText, replacementText)
    # no need to save up all the output lines in a list
    outfile.write(fixed_line)
infile.close()
outfile.close()
It's always a good habit to release resources (e.g. close files) immediately when you are finished with them. More importantly, with output files, the directory is usually not updated until you close the file.
Read up on the "with" statement to find out about even better practice with file handling.
In Python 3 (3.6 here), a file opened in text mode (the default) is decoded for you, using a default encoding taken from your locale (UTF-8 on this system):
>>> open('/etc/hosts')
<_io.TextIOWrapper name='/etc/hosts' mode='r' encoding='UTF-8'>
A function like file.readlines() will return str objects, and in Python 3 strings are Unicode. If you open the file in binary mode, the behaviour is almost like Python 2's:
>>> open('/etc/hosts', 'rb')
<_io.BufferedReader name='/etc/hosts'>
In this case readlines() will return bytes objects, and you must decode them to get text (str):
>>> type(open('/etc/hosts', 'rb').readline())
bytes
>>> type(open('/etc/hosts', 'rb').readline().decode('utf-8'))
str
You can open your file with a different encoding via the encoding argument:
>>> open('/etc/hosts', encoding='ascii')
<_io.TextIOWrapper name='/etc/hosts' mode='r' encoding='ascii'>
Python 2 does not care about encoding; a file is just a stream of bytes. A function like file.readlines() will return str objects, not unicode, even if you open the file in text mode. You can convert each line to a unicode object using str.decode('your-file-encoding'):
>>> f = open('/etc/issue')
>>> l = f.readline()
>>> l
'Ubuntu 10.04.1 LTS \\n \\l\n'
>>> type(l)
<type 'str'>
>>> u = l.decode('utf-8')
>>> type(u)
<type 'unicode'>
You can get results similar to Python 3 by using codecs.open instead of plain open.
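For instance, here is a rough Python 2 equivalent of the /etc/issue example above, assuming the file is UTF-8 encoded:

import codecs

# codecs.open decodes for you, so each line comes back as a unicode object
# instead of a raw byte string
f = codecs.open('/etc/issue', 'r', encoding='utf-8')
line = f.readline()
print(type(line))   # <type 'unicode'>
f.close()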