Reading UTF-8 file with codecs in IronPython

I have a .csv file encoded in UTF-8 that contains both Latin and Cyrillic characters:

;F1;F2;abcdefg3;F200
;ABSOLUTE;NOMINAL;NOMINAL;NOMINAL
o1;1;USA;Новосибирск;1223

I'm trying to execute the following script in IronPython 2.7.1:

import codecs

f = codecs.open(r"file.csv", "rb", "utf-8")
f.next()

The call to f.next() raises an exception:

Traceback (most recent call last):
  File "c:\Program Files\Microsoft Visual Studio 10.0\Common7\IDE\Extensions\Microsoft\Python Tools for Visual Studio\1.1\visualstudio_py_repl.py", line 492, in run_file_as_main
    code.Execute(self.exec_mod)
  File "<string>", line 4, in <module>
  File "C:\Program Files\IronPython 2.7.1\Lib\codecs.py", line 684, in next
    return self.reader.next()
  File "C:\Program Files\IronPython 2.7.1\Lib\codecs.py", line 615, in next
    line = self.readline()
  File "C:\Program Files\IronPython 2.7.1\Lib\codecs.py", line 530, in readline
    data = self.read(readsize, firstline=True)
  File "C:\Program Files\IronPython 2.7.1\Lib\codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeEncodeError: ('unknown', '\x00', 0, 1, '')

The same script works correctly in CPython 2.7. In IronPython 2.7.1, the following script, which calls f.readlines() instead of f.next(), also works fine:

import codecs

f = codecs.open(r"file.csv", "rb", "utf-8")
f.readlines()

Does anybody know what might cause this behaviour?

Rustam Miftakhutdinov asked Apr 12 '12 12:04

1 Answer

This looks like it could be a bug in how next() is handled by the codecs module in IronPython. Could you please open an issue and attach the files needed to reproduce it?
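Until that is sorted out, one possible workaround is to rely on the question's own observation that readlines() succeeds where next() fails. The sketch below is only that: it has not been verified on IronPython 2.7.1, and the helper names write_sample and read_lines are made up for illustration. It recreates the three sample rows from the question as a UTF-8 file (which could also serve as the reproduction file for the issue) and then reads them back with readlines() rather than next().

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

# The three sample rows from the question, as Unicode text.
SAMPLE = (
    u";F1;F2;abcdefg3;F200\n"
    u";ABSOLUTE;NOMINAL;NOMINAL;NOMINAL\n"
    u"o1;1;USA;Новосибирск;1223\n"
)


def write_sample(path):
    """Hypothetical helper: write the sample rows to `path` as UTF-8."""
    f = codecs.open(path, "wb", "utf-8")
    try:
        f.write(SAMPLE)
    finally:
        f.close()


def read_lines(path):
    """Hypothetical helper: read every line with readlines().

    This avoids f.next(), the call that raised in the question.
    The whole file is loaded into memory, so it suits small files only.
    """
    f = codecs.open(path, "rb", "utf-8")
    try:
        return f.readlines()
    finally:
        f.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "file.csv")
    write_sample(path)
    lines = read_lines(path)
    header, rows = lines[0], lines[1:]  # same effect as skipping one line with next()
    for row in rows:
        print(repr(row.rstrip(u"\n").split(u";")))
```

Slicing the list returned by readlines() replaces the single f.next() call that was used to skip the header. The trade-off is that the file is no longer read lazily.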

Jeff Hardy answered Oct 21 '22 16:10