I have a .csv file encoded in UTF-8 that contains both Latin and Cyrillic characters.
;F1;F2;abcdefg3;F200
;ABSOLUTE;NOMINAL;NOMINAL;NOMINAL
o1;1;USA;Новосибирск;1223
I'm trying to execute the following script in IronPython 2.7.1:
import codecs
f = codecs.open(r"file.csv", "rb", "utf-8")
f.next()
During the execution of f.next() an exception occurs:
Traceback (most recent call last):
File "c:\Program Files\Microsoft Visual Studio 10.0\Common7\IDE\Extensions\Microsoft\Python Tools for Visual Studio\1.1\visualstudio_py_repl.py", line 492, in run_file_as_main
code.Execute(self.exec_mod)
File "<string>", line 4, in <module>
File "C:\Program Files\IronPython 2.7.1\Lib\codecs.py", line 684, in next
return self.reader.next()
File "C:\Program Files\IronPython 2.7.1\Lib\codecs.py", line 615, in next
line = self.readline()
File "C:\Program Files\IronPython 2.7.1\Lib\codecs.py", line 530, in readline
data = self.read(readsize, firstline=True)
File "C:\Program Files\IronPython 2.7.1\Lib\codecs.py", line 477, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeEncodeError: ('unknown', '\x00', 0, 1, '')
In CPython 2.7, however, the same script works correctly. The following script also works fine in IronPython 2.7.1:
import codecs
f = codecs.open(r"file.csv", "rb", "utf-8")
f.readlines()
Does anybody know what might cause such strange behaviour?
In Java, InputStreamReader accepts a charset that it uses to decode a byte stream into a character stream. Passing StandardCharsets.UTF_8 to the InputStreamReader constructor lets us read data from a UTF-8 file.
To decode a byte string encoded in UTF-8, we can use the decode() method defined on strings. This method accepts two arguments, encoding and errors. encoding names the encoding of the data to be decoded, and errors decides how to handle errors that arise during decoding.
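As a small illustration of decode(), here is a sketch in Python. The byte values are an assumption chosen for the example (the UTF-8 encoding of the first four Cyrillic letters of the city name in the sample CSV); note that the second argument is named errors in Python.

```python
# UTF-8 bytes for the first four letters of the city name in the sample row.
raw = b"\xd0\x9d\xd0\xbe\xd0\xb2\xd0\xbe"

# Default error handling is "strict": invalid input raises UnicodeDecodeError.
text = raw.decode("utf-8")

# With errors="replace", an invalid byte becomes U+FFFD instead of raising.
lossy = b"\xd0\x9d\xff".decode("utf-8", "replace")

# Check that strict decoding really rejects the invalid byte.
try:
    b"\xd0\x9d\xff".decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

The same calls should behave identically in CPython 2.7 and 3.x; whether IronPython 2.7.1 matches is not something this sketch verifies.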
UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for “Unicode Transformation Format”, and the '8' means that 8-bit values are used in the encoding.
Looks like it could be a bug in how next() handles codecs. Can you please open an issue with the files to reproduce attached?
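Until that is resolved, a possible workaround is to avoid next() and go through readlines(), which the question reports as working in IronPython 2.7.1. The helper name iter_decoded_lines below is hypothetical, and this is only a sketch: it has not been checked on IronPython, and it reads the whole file into memory, so it suits small files only.

```python
import codecs


def iter_decoded_lines(path, encoding="utf-8"):
    """Yield decoded lines of a file without calling StreamReader.next().

    Hypothetical helper: relies on readlines(), which the question
    reports as working where next() fails.
    """
    f = codecs.open(path, "rb", encoding)
    try:
        # readlines() decodes the whole file at once and keeps line endings.
        for line in f.readlines():
            yield line
    finally:
        f.close()
```

Usage would look like `for line in iter_decoded_lines(r"file.csv"): ...`, with each line arriving as a unicode string that still carries its trailing newline.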