Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Numpy array from cStringIO object and avoiding copies

This to understand things better. It is not an actual problem that I need to fix. A cstringIO object is supposed to emulate a string, file and also an iterator over the lines. Does it also emulate a buffer ? In anycase ideally one should be able to construct a numpy array as follows

import numpy as np
import cstringIO

c = cStringIO.StringIO('\x01\x00\x00\x00\x01\x00\x00\x00')

#Trying the iterartor abstraction
b = np.fromiter(c,int)
# The above fails with: ValueError: setting an array element with a sequence.

#Trying the file abstraction
b = np.fromfile(c,int)
# The above fails with: IOError: first argument must be an open file

#Trying the sequence abstraction
b = np.array(c, int)
# The above fails with: TypeError: long() argument must be a string or a number 

#Trying the string abstraction
b = np.fromstring(c)
#The above fails with: TypeError: argument 1 must be string or read-only buffer

b = np.fromstring(c.getvalue(), int)  # does work

My question is why does it behave this way.

The practical problem where this came up is the following: I have a iterator which yields a tuple. I am interested in making a numpy array from one of the components of the tuple with as little copying and duplication as possible. My first cut was to keep writing the interesting components of the yielded tuple into a StringIO object and then use its memory buffer for the array. I can of course use getvalue() but will create and return a copy. What would be a good way to avoid the extra copying.

like image 604
san Avatar asked Jun 24 '11 06:06

san


People also ask

Does reshape make a copy?

With a compatible order, reshape does not produce a copy.

How can you shallow copy the data in NumPy view () in method is method or assignments?

copy() is supposed to create a shallow copy of its argument, but when applied to a NumPy array it creates a shallow copy in sense B, i.e. the new array gets its own copy of the data buffer, so changes to one array do not affect the other. x_copy = x. copy() is all you need to make a copy of x .

Does NumPy indexing create a copy?

Fancy indexing. NumPy arrays can be indexed with slices, but also with boolean or integer arrays (masks). This method is called fancy indexing. It creates copies not views.

Which function of NumPy should be used to make deep copy of an array?

copy() function creates a deep copy. It is a complete copy of the array and its data, and doesn't share with the original array.


2 Answers

The problem seems to be that numpy doesn't like being given characters instead of numbers. Remember, in Python, single characters and strings have the same type — numpy must have some type detection going on under the hood, and takes '\x01' to be a nested sequence.

The other problem is that a cStringIO iterates over its lines, not its characters.

Something like the following iterator should get around both of these problems:

def chariter(filelike):
    octet = filelike.read(1)
    while octet:
        yield ord(octet)
        octet = filelike.read(1)

Use it like so (note the seek!):

c.seek(0)
b = np.fromiter(chariter(c), int)
like image 94
detly Avatar answered Sep 26 '22 01:09

detly


As cStringIO does not implement the buffer interface, if its getvalue returns a copy of the data, then there is no way to get its data without copying.

If getvalue returns the buffer as a string without making a copy, numpy.frombuffer(x.getvalue(), dtype='S1') will give a (read-only) numpy array referring to the string, without an additional copy.


The reason why np.fromiter(c, int) and np.array(c, int) do not work is that cStringIO, when iterated, returns a line at a time, similarly as files:

>>> list(iter(c))
['\x01\x00\x00\x00\x01\x00\x00\x00']

Such a long string cannot be converted to a single integer.

***

It's best not to worry too much about making copies unless it really turns out to be a problem. The reason is that the extra overhead in e.g. using a generator and passing it to numpy.fromiter may be actually larger than what is involved in constructing a list, and then passing that to numpy.array --- making the copies may be cheap compared to Python runtime overhead.

However, if the issue is with memory, then one solution is to put the items directly into the final Numpy array. If you know the size beforehand, you can pre-allocate it. If the size is unknown, you can use the .resize() method in the array to grow it as needed.

like image 27
pv. Avatar answered Sep 22 '22 01:09

pv.