
Confusion about StringIO, cStringIO and BytesIO

Tags: python, stringio, bytesio, cstringio

I have googled and searched SO for the differences between these buffer modules. However, I still don't understand them very well, and I think some of the posts I read are out of date.

In Python 2.7.11, I downloaded a binary file of a specific format using r = requests.get(url). Then I passed StringIO.StringIO(r.content), cStringIO.StringIO(r.content) and io.BytesIO(r.content) to a function designed for parsing the content.

All three approaches work. That is, even though the file is binary, StringIO can still be used. Why?

My other question concerns their efficiency.

    In [1]: import StringIO, cStringIO, io

    In [2]: from numpy import random

    In [3]: x = random.random(1000000)

    In [4]: %timeit y = cStringIO.StringIO(x)
    1000000 loops, best of 3: 736 ns per loop

    In [5]: %timeit y = StringIO.StringIO(x)
    1000 loops, best of 3: 283 µs per loop

    In [6]: %timeit y = io.BytesIO(x)
    1000 loops, best of 3: 1.26 ms per loop

As illustrated above, in construction speed cStringIO > StringIO > BytesIO.

I found a post mentioning that io.BytesIO always makes a new copy of the data, which costs more time. But other posts say this was fixed in later Python versions.

So, can anyone make a thorough comparison between these IOs, in both latest Python 2.x and 3.x?

Some of the references I found:

https://trac.edgewall.org/ticket/12046

io.StringIO requires a unicode string. io.BytesIO requires a bytes string. StringIO.StringIO allows either unicode or bytes string. cStringIO.StringIO requires a string that is encoded as a bytes string.

But cStringIO.StringIO('abc') doesn't raise any error.

https://review.openstack.org/#/c/286926/1

The StringIO class is the wrong class to use for this, especially considering that subunit v2 is binary and not a string.

http://comments.gmane.org/gmane.comp.python.devel/148717

cStringIO.StringIO(b'data') didn't copy the data while io.BytesIO(b'data') makes a copy (even if the data is not modified later).

That thread includes a patch from 2014 that fixes this.

Lots of SO posts not listed here.

Here are the Python 2.7 results for Eric's example:

    %timeit cStringIO.StringIO(u_data)
    1000000 loops, best of 3: 488 ns per loop
    %timeit cStringIO.StringIO(b_data)
    1000000 loops, best of 3: 448 ns per loop
    %timeit StringIO.StringIO(u_data)
    1000000 loops, best of 3: 1.15 µs per loop
    %timeit StringIO.StringIO(b_data)
    1000000 loops, best of 3: 1.19 µs per loop
    %timeit io.StringIO(u_data)
    1000 loops, best of 3: 304 µs per loop
    # %timeit io.StringIO(b_data)  # error
    # %timeit io.BytesIO(u_data)   # error
    %timeit io.BytesIO(b_data)
    10000 loops, best of 3: 77.5 µs per loop

In 2.7, cStringIO.StringIO and StringIO.StringIO are far faster to construct than the io classes.


asked May 26 '16 13:05

ddzzbbwwmm

1 Answer

You should use io.StringIO for handling unicode objects and io.BytesIO for handling bytes objects in both Python 2 and 3, for forwards-compatibility (these two are all that Python 3 has to offer).
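As a minimal sketch of that recommendation: the payload bytes and the helper name read_magic below are made up for illustration, standing in for r.content and the parsing function from the question.

```python
import io


def read_magic(stream, size=4):
    """Hypothetical parsing step: read the first `size` bytes of a binary stream."""
    return stream.read(size)


# Stand-in for r.content, which requests returns as a bytes object.
payload = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16

# Binary data goes through io.BytesIO.
binary_stream = io.BytesIO(payload)
magic = read_magic(binary_stream)

# Text (unicode) goes through io.StringIO.
text_stream = io.StringIO(u"caf\u00e9\nsecond line\n")
first_line = text_stream.readline()

print(repr(magic))
print(repr(first_line))
```

The same code runs unchanged on Python 2.7 and 3, which is the point of preferring the io classes.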

Here's a better test (for Python 2 and 3) that doesn't include the cost of converting from numpy to str/bytes:

    import numpy as np
    import string
    b_data = np.random.choice(list(string.printable), size=1000000).tobytes()
    u_data = b_data.decode('ascii')
    u_data = u'\u2603' + u_data[1:]  # add a non-ascii character

And then:

    import io
    %timeit io.StringIO(u_data)
    %timeit io.StringIO(b_data)
    %timeit io.BytesIO(u_data)
    %timeit io.BytesIO(b_data)

In Python 2, you can also test:

    import StringIO, cStringIO
    %timeit cStringIO.StringIO(u_data)
    %timeit cStringIO.StringIO(b_data)
    %timeit StringIO.StringIO(u_data)
    %timeit StringIO.StringIO(b_data)

Some of these will fail, either with a TypeError or with a complaint about non-ASCII characters.
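To see which combinations fail without aborting the session, a small loop can record the exception type instead of raising. This is only a sketch with tiny stand-in data rather than the 1,000,000-character strings, it assumes Python 3 for the output shown by print, and it covers only the io classes, since StringIO and cStringIO do not exist in Python 3.

```python
import io

# Tiny stand-ins for the benchmark data (assumption: size does not
# affect which type combinations are accepted).
b_data = b"hello"
u_data = u"\u2603ello"  # starts with a non-ASCII character

results = {}
for cls in (io.StringIO, io.BytesIO):
    for name, data in (("u_data", u_data), ("b_data", b_data)):
        key = "%s(%s)" % (cls.__name__, name)
        try:
            cls(data)
            results[key] = "ok"
        except (TypeError, UnicodeError) as exc:
            # Record the failure kind instead of crashing.
            results[key] = type(exc).__name__

for key in sorted(results):
    print(key, results[key])
```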

Python 3.5 results:

    >>> %timeit io.StringIO(u_data)
    100 loops, best of 3: 8.61 ms per loop
    >>> %timeit io.StringIO(b_data)
    TypeError: initial_value must be str or None, not bytes
    >>> %timeit io.BytesIO(u_data)
    TypeError: a bytes-like object is required, not 'str'
    >>> %timeit io.BytesIO(b_data)
    The slowest run took 6.79 times longer than the fastest. This could mean that an intermediate result is being cached
    1000000 loops, best of 3: 344 ns per loop

Python 2.7 results (run on a different machine):

    >>> %timeit io.StringIO(u_data)
    1000 loops, best of 3: 304 µs per loop
    >>> %timeit io.StringIO(b_data)
    TypeError: initial_value must be unicode or None, not str
    >>> %timeit io.BytesIO(u_data)
    TypeError: 'unicode' does not have the buffer interface
    >>> %timeit io.BytesIO(b_data)
    10000 loops, best of 3: 77.5 µs per loop

    >>> %timeit cStringIO.StringIO(u_data)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2603' in position 0: ordinal not in range(128)
    >>> %timeit cStringIO.StringIO(b_data)
    1000000 loops, best of 3: 448 ns per loop
    >>> %timeit StringIO.StringIO(u_data)
    1000000 loops, best of 3: 1.15 µs per loop
    >>> %timeit StringIO.StringIO(b_data)
    1000000 loops, best of 3: 1.19 µs per loop
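If IPython and numpy are not at hand, roughly the same measurement can be sketched with the standard library alone. The helper best_time and the loop counts are my own choices for illustration, not part of the test above, and the numbers it prints will differ by machine and Python version.

```python
import io
import random
import string
import timeit

random.seed(0)
# 1,000,000 printable ASCII characters, built without numpy.
u_data = u"".join(random.choice(string.printable) for _ in range(1000000))
b_data = u_data.encode("ascii")
u_data = u"\u2603" + u_data[1:]  # add a non-ASCII character


def best_time(func, number=20, repeat=3):
    """Best per-call time in seconds over `repeat` runs of `number` calls."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


# Only the valid combinations are timed; the mismatched ones raise.
timings = {
    "io.StringIO(u_data)": best_time(lambda: io.StringIO(u_data)),
    "io.BytesIO(b_data)": best_time(lambda: io.BytesIO(b_data)),
}

for label, seconds in sorted(timings.items()):
    print("%-22s %12.3f us per call" % (label, seconds * 1e6))
```

Taking the minimum over several repeats mirrors the "best of 3" figure that %timeit reports.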

answered Sep 26 '22 17:09

Eric
