I have googled and also searched on SO for the differences between these buffer modules. However, I still don't understand them very well, and I think some of the posts I read are out of date.
In Python 2.7.11, I downloaded a binary file of a specific format using r = requests.get(url). Then I passed StringIO.StringIO(r.content), cStringIO.StringIO(r.content) and io.BytesIO(r.content) to a function designed for parsing the content.
All three of these work. I mean, even though the file is binary, it's still feasible to use StringIO. Why?
My other question concerns their efficiency.
    In [1]: import StringIO, cStringIO, io
    In [2]: from numpy import random
    In [3]: x = random.random(1000000)
    In [4]: %timeit y = cStringIO.StringIO(x)
    1000000 loops, best of 3: 736 ns per loop
    In [5]: %timeit y = StringIO.StringIO(x)
    1000 loops, best of 3: 283 µs per loop
    In [6]: %timeit y = io.BytesIO(x)
    1000 loops, best of 3: 1.26 ms per loop
As illustrated above, in terms of speed, cStringIO > StringIO > BytesIO.
I found someone mentioning that io.BytesIO always makes a new copy, which costs more time. But some other posts mention that this was fixed in later Python versions.
So, can anyone make a thorough comparison between these IOs, in both the latest Python 2.x and 3.x?
Some of the references I found:
io.StringIO requires a unicode string. io.BytesIO requires a bytes string. StringIO.StringIO allows either unicode or bytes string. cStringIO.StringIO requires a string that is encoded as a bytes string.
But cStringIO.StringIO('abc') doesn't raise any error.
https://review.openstack.org/#/c/286926/1
The StringIO class is the wrong class to use for this, especially considering that subunit v2 is binary and not a string.
http://comments.gmane.org/gmane.comp.python.devel/148717
cStringIO.StringIO(b'data') didn't copy the data while io.BytesIO(b'data') makes a copy (even if the data is not modified later).
A patch fixing this was posted in that thread in 2014.
Here are the Python 2.7 results for Eric's example:
    %timeit cStringIO.StringIO(u_data)
    1000000 loops, best of 3: 488 ns per loop
    %timeit cStringIO.StringIO(b_data)
    1000000 loops, best of 3: 448 ns per loop
    %timeit StringIO.StringIO(u_data)
    1000000 loops, best of 3: 1.15 µs per loop
    %timeit StringIO.StringIO(b_data)
    1000000 loops, best of 3: 1.19 µs per loop
    %timeit io.StringIO(u_data)
    1000 loops, best of 3: 304 µs per loop
    # %timeit io.StringIO(b_data)
    # error
    # %timeit io.BytesIO(u_data)
    # error
    %timeit io.BytesIO(b_data)
    10000 loops, best of 3: 77.5 µs per loop
As for 2.7, cStringIO.StringIO and StringIO.StringIO are far more efficient than the io classes.
StringIO gives you file-like access to strings, so you can use an existing module that deals with a file and change almost nothing and make it work with strings. For example, say you have a logger that writes things to a file and you want to instead send the log output over the network.
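As a rough sketch of that logger scenario in Python 3, the standard logging module can be pointed at an io.StringIO buffer instead of a file, and the captured text can then be forwarded wherever it is needed. The logger name and the message are made up for illustration, and the actual network send is left out.

```python
import io
import logging

# In-memory text buffer standing in for the log file.
buffer = io.StringIO()

# StreamHandler accepts any object that provides write() and flush().
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))

logger = logging.getLogger("stringio_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output away from the root logger
logger.addHandler(handler)

logger.info("hello")

# Everything the logger wrote is now available as an ordinary string.
captured = buffer.getvalue()
payload = captured.encode("utf-8")  # bytes that could be sent over a socket
```

The code that does the logging is unchanged; only the handler's target differs from the file-based setup.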
The StringIO module provides an in-memory file-like object. This object can be used as input or output for most functions that expect a standard file object. When the StringIO object is created, it is initialized by passing a string to the constructor. If no string is passed, the StringIO will start empty.
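A small sketch of both cases, written against io.StringIO in Python 3 (an assumption on my part, since the paragraph above describes the Python 2 StringIO module); the sample strings are arbitrary.

```python
import io

# Initialized with a string: reading starts at position 0.
src = io.StringIO("first line\nsecond line\n")
first = src.readline()  # reads up to and including the first newline
rest = src.read()       # reads whatever is left

# No initial value: the buffer starts empty and can be written to.
dst = io.StringIO()
dst.write("abc")
dst.write("def")
contents = dst.getvalue()  # the full buffer, regardless of the position
```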
A string is a collection of letters, words or other characters. It is one of the primitive data structures and a building block for data manipulation. Python has a built-in string class named str. Python strings are "immutable", which means they cannot be changed after they are created.
TextIOWrapper, which extends TextIOBase, is a buffered text interface to a buffered raw stream (BufferedIOBase). Finally, StringIO is an in-memory stream for text.
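To make the layering concrete, here is a minimal Python 3 sketch that puts a TextIOWrapper on top of an io.BytesIO (which is itself a BufferedIOBase). The sample text and the choice of UTF-8 are assumptions for illustration.

```python
import io

# Reading: bytes go in at the bottom, decoded text comes out at the top.
raw = io.BytesIO("snow \u2603\n".encode("utf-8"))
text = io.TextIOWrapper(raw, encoding="utf-8")
line = text.readline()

# Writing: text goes in at the top, encoded bytes land in the BytesIO.
out = io.BytesIO()
writer = io.TextIOWrapper(out, encoding="utf-8")
writer.write("\u2603")
writer.flush()  # push the wrapper's pending text down to the byte buffer
encoded = out.getvalue()
```

io.StringIO skips the byte layer entirely and holds text directly, which is why it accepts only text.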
You should use io.StringIO for handling unicode objects and io.BytesIO for handling bytes objects in both Python 2 and 3, for forwards-compatibility (these two are all that Python 3 has to offer).
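A short sketch of that rule, with made-up sample data: each io class accepts only its own type, and handing it the other one raises TypeError. The rejects helper is a hypothetical name introduced just for this example.

```python
import io

b_data = b"\x00\x01\xff"   # arbitrary binary data
u_data = u"\u2603 text"    # text with a non-ASCII character

byte_stream = io.BytesIO(b_data)   # bytes go into BytesIO
text_stream = io.StringIO(u_data)  # text goes into StringIO


def rejects(factory, value):
    """Return True if factory(value) raises TypeError."""
    try:
        factory(value)
    except TypeError:
        return True
    return False


bytes_in_stringio = rejects(io.StringIO, b_data)
text_in_bytesio = rejects(io.BytesIO, u_data)
```

Both flags come out True, which matches the TypeError lines in the timing results further down.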
Here's a better test (for Python 2 and 3) that doesn't include conversion costs from numpy to str/bytes:
    import numpy as np
    import string
    b_data = np.random.choice(list(string.printable), size=1000000).tobytes()
    u_data = b_data.decode('ascii')
    u_data = u'\u2603' + u_data[1:]  # add a non-ascii character
And then:
    import io
    %timeit io.StringIO(u_data)
    %timeit io.StringIO(b_data)
    %timeit io.BytesIO(u_data)
    %timeit io.BytesIO(b_data)
In python 2, you can also test:
    import StringIO, cStringIO
    %timeit cStringIO.StringIO(u_data)
    %timeit cStringIO.StringIO(b_data)
    %timeit StringIO.StringIO(u_data)
    %timeit StringIO.StringIO(b_data)
Some of these will crash, complaining about non-ASCII characters or about receiving the wrong type.
Python 3.5 results:
    >>> %timeit io.StringIO(u_data)
    100 loops, best of 3: 8.61 ms per loop
    >>> %timeit io.StringIO(b_data)
    TypeError: initial_value must be str or None, not bytes
    >>> %timeit io.BytesIO(u_data)
    TypeError: a bytes-like object is required, not 'str'
    >>> %timeit io.BytesIO(b_data)
    The slowest run took 6.79 times longer than the fastest. This could mean that an intermediate result is being cached
    1000000 loops, best of 3: 344 ns per loop
Python 2.7 results (run on a different machine):
    >>> %timeit io.StringIO(u_data)
    1000 loops, best of 3: 304 µs per loop
    >>> %timeit io.StringIO(b_data)
    TypeError: initial_value must be unicode or None, not str
    >>> %timeit io.BytesIO(u_data)
    TypeError: 'unicode' does not have the buffer interface
    >>> %timeit io.BytesIO(b_data)
    10000 loops, best of 3: 77.5 µs per loop
    >>> %timeit cStringIO.StringIO(u_data)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2603' in position 0: ordinal not in range(128)
    >>> %timeit cStringIO.StringIO(b_data)
    1000000 loops, best of 3: 448 ns per loop
    >>> %timeit StringIO.StringIO(u_data)
    1000000 loops, best of 3: 1.15 µs per loop
    >>> %timeit StringIO.StringIO(b_data)
    1000000 loops, best of 3: 1.19 µs per loop
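If IPython's %timeit is not available, a comparable measurement can be sketched with the standard timeit module. This is only an approximation of the test above: the data here is built without numpy, the best_time helper and the repeat counts are my own choices, and the absolute numbers will differ by machine and Python version.

```python
import io
import timeit

# One million bytes, and one million characters with a non-ASCII first one.
b_data = b"x" * 1000000
u_data = u"\u2603" + u"x" * 999999


def best_time(func, repeat=3, number=20):
    """Best per-call time in seconds over `repeat` runs of `number` calls."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


bytes_time = best_time(lambda: io.BytesIO(b_data))
text_time = best_time(lambda: io.StringIO(u_data))

print("io.BytesIO(b_data):  %.3g s per call" % bytes_time)
print("io.StringIO(u_data): %.3g s per call" % text_time)
```

Taking the minimum over several runs mirrors the "best of 3" figure that %timeit reports.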