Why does codecs.iterdecode() eat empty strings?

Tags: python, unicode, utf-8, python-2.7, codec

Why do the following two decoding methods return different results?

>>> import codecs
>>>
>>> data = ['', '', 'a', '']
>>> list(codecs.iterdecode(data, 'utf-8'))
[u'a']
>>> [codecs.decode(i, 'utf-8') for i in data]
[u'', u'', u'a', u'']

Is this a bug or expected behavior? My Python version is 2.7.13.

asked May 11 '17 00:05

Cheng Lian

1 Answer

This is normal. iterdecode takes an iterator over encoded chunks and returns an iterator over decoded chunks, but it doesn't promise a one-to-one correspondence. All it guarantees is that the concatenation of all output chunks is a valid decoding of the concatenation of all input chunks.
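To make that guarantee concrete, here is a minimal sketch. It assumes Python 3, where the input chunks must be bytes, rather than the 2.7 session in the question; the variable names are only illustrative.

```python
import codecs

# Python 3 assumption: encoded chunks are bytes, decoded chunks are str.
chunks = [b'', b'', b'a', b'']

# iterdecode may emit fewer chunks than it was given.
decoded_chunks = list(codecs.iterdecode(chunks, 'utf-8'))

# What is preserved is the concatenation, not the chunk boundaries.
joined_output = ''.join(decoded_chunks)
joined_input = b''.join(chunks).decode('utf-8')

print(decoded_chunks)
print(joined_output == joined_input)
```

The number of output chunks differs from the number of input chunks, but joining both sides gives the same text.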

If you look at the source code, you'll see it's explicitly discarding empty output chunks:

def iterdecode(iterator, encoding, errors='strict', **kwargs):
    """
    Decoding iterator.
    Decodes the input strings from the iterator using an IncrementalDecoder.
    errors and kwargs are passed through to the IncrementalDecoder
    constructor.
    """
    decoder = getincrementaldecoder(encoding)(errors, **kwargs)
    for input in iterator:
        output = decoder.decode(input)
        if output:
            yield output
    output = decoder.decode("", True)
    if output:
        yield output
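If you do need one output per input chunk, one option is to drop the `if output:` check inside the loop. The following is a sketch under assumptions, not part of the standard library: `iterdecode_keep_empty` is a hypothetical name, and the code targets Python 3 (bytes in, str out).

```python
import codecs

def iterdecode_keep_empty(chunks, encoding, errors='strict'):
    """Hypothetical variant of codecs.iterdecode that keeps empty outputs."""
    decoder = codecs.getincrementaldecoder(encoding)(errors)
    for chunk in chunks:
        # Yield every result, including '', so outputs line up with inputs.
        yield decoder.decode(chunk)
    # Flush with final=True so truncated input is still reported;
    # only yield the tail if the flush actually produced text.
    tail = decoder.decode(b'', True)
    if tail:
        yield tail

result = list(iterdecode_keep_empty([b'', b'', b'a', b''], 'utf-8'))
print(result)
```

Note that an output chunk still need not be the decoding of the matching input chunk: bytes of a character split across chunks are buffered and come out with a later chunk.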

Be aware that the reason iterdecode exists, and the reason you wouldn't just call decode on all the chunks yourself, is that the decoding process is stateful. The UTF-8 encoded form of one character might be split over multiple chunks. Other codecs might have really weird stateful behavior, like maybe a byte sequence that inverts the case of all characters until you see that byte sequence again.
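The split-character case can be sketched like this. It again assumes Python 3, and the euro sign is just an arbitrary three-byte example character.

```python
import codecs

# U+20AC (euro sign) encodes to three bytes in UTF-8: e2 82 ac.
euro_bytes = '\u20ac'.encode('utf-8')

# Split the character across two chunks, as a network read might.
chunks = [euro_bytes[:2], euro_bytes[2:]]

# The incremental decoder buffers the incomplete prefix and emits the
# character once the last byte arrives.
streamed = list(codecs.iterdecode(chunks, 'utf-8'))

# Decoding each chunk on its own fails, because neither chunk is valid
# UTF-8 by itself.
try:
    [chunk.decode('utf-8') for chunk in chunks]
    per_chunk_failed = False
except UnicodeDecodeError:
    per_chunk_failed = True

print(streamed, per_chunk_failed)
```

Here the first chunk produces an empty output, which is exactly the kind of chunk iterdecode discards.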

answered Oct 21 '22 15:10

user2357112 supports Monica
