Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

codecs.open(utf-8) fails to read plain ASCII file

I have a plain ASCII file. When I try to open it with codecs.open(..., "utf-8"), I am unable to read single characters. ASCII is a subset of UTF-8, so why can't codecs open such a file in UTF-8 mode?

# test.py

import codecs

f = codecs.open("test.py", "r", "utf-8")

# ASCII is supposed to be a subset of UTF-8:
# http://www.fileformat.info/info/unicode/utf8.htm

assert len(f.read(1)) == 1 # OK
f.readline()
c = f.read(1)
print len(c)
print "'%s'" % c
assert len(c) == 1 # fails

# max% p test.py
# 63
# '
# import codecs
#
# f = codecs.open("test.py", "r", "utf-8")
#
# # ASC'
# Traceback (most recent call last):
#   File "test.py", line 15, in <module>
#     assert len(c) == 1 # fails
# AssertionError
# max%

system:

Linux max 4.4.0-89-generic #112~14.04.1-Ubuntu SMP Tue Aug 1 22:08:32 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Of course it works with regular open. It also works if I remove the "utf-8" option. Also what does 63 mean? That's like the middle of the 3rd line. I don't get it.

like image 719
personal_cloud Avatar asked Sep 27 '17 00:09

personal_cloud


1 Answers

Found your problem:

When passed an encoding, codecs.open returns a StreamReaderWriter, which is really just a wrapper around (not a subclass of; it's a "composed of" relationship, not inheritance) StreamReader and StreamWriter. Problem is:

  1. StreamReaderWriter provides a "normal" read method (that is, it takes a size parameter and that's it)
  2. It delegates to the internal StreamReader.read method, where the size argument is only a hint as to the number of bytes to read, but not a limit; the second argument, chars, is a strict limiter, but StreamReaderWriter never passes that argument along (it doesn't accept it)
  3. When size hinted, but not capped using chars, if StreamReader has buffered data, and it's large enough to match the size hint StreamReader.read blindly returns the contents of the buffer, rather than limiting it in any way based on the size hint (after all, only chars imposes a maximum return size)

The API of StreamReader.read and the meaning of size/chars for the API is the only documented thing here; the fact that codecs.open returns StreamReaderWriter is not contractual, nor is the fact that StreamReaderWriter wraps StreamReader, I just used ipython's ?? magic to read the source code of the codecs module to verify this behavior. But documented or not, that's what it's doing (feel free to read the source code for StreamReaderWriter, it's all Python level, so it's easy).

The best solution is to switch to io.open, which is faster and more correct in every standard case (codecs.open supports the weirdo codecs that don't convert between bytes [Py2 str] and str [Py2 unicode], but rather, handle str to str or bytes to bytes encodings, but that's an incredibly limited use case; most of the time, you're converting between bytes and str). All you need to do is import io instead of codecs, and change the codecs.open line to:

f = io.open("test.py", encoding="utf-8")

The rest of your code can remain unchanged (and will likely run faster to boot).

As an alternative, you could explicitly bypass StreamReaderWriter to get the StreamReader's read method and pass the limiting argument directly, e.g. change:

c = f.read(1)

to:

# Pass second, character limiting argument after size hint
c = f.reader.read(6, 1)  # 6 is sort of arbitrary; should ensure a full char read in one go

I suspect Python Bug #8260, which covers intermingling readline and read on codecs.open created file objects, applies here, officially, it's "fixed", but if you read the comments, the fix wasn't complete (and may not be possible to complete given the documented API); arbitrarily weird combinations of read and readline will be able to break it.

Again, just use io.open; as long as you're on Python 2.6 or higher, it's available, and it's just plain better.

like image 136
ShadowRanger Avatar answered Oct 23 '22 03:10

ShadowRanger