Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to pipe binary data into numpy arrays without tmp storage?

There are several similar questions but none of them answers this simple question directly:

How can i catch a commands output and stream that content into numpy arrays without creating a temporary string object to read from?

So, what I would like to do is this:

import subprocess
import numpy
import StringIO

def parse_header(fileobject):
    # this function moves the filepointer and returns a dictionary
    d = do_some_parsing(fileobject)
    return d

sio = StringIO.StringIO(subprocess.check_output(cmd))
d = parse_header(sio)
# now the file pointer is at the start of data, parse_header takes care of that.
# ALL of the data is now available in the next line of sio
dt = numpy.dtype([(key, 'f8') for key in d.keys()])

# i don't know how do make this work:
data = numpy.fromxxxx(sio , dt)

# if i would do this, I create another copy besides the StringIO object, don't I?
# so this works, but isn't this 'bad' ?
datastring = sio.read()
data = numpy.fromstring(datastring, dtype=dt)

I tried it with StringIO and cStringIO but both are not accepted by numpy.frombuffer and numpy.fromfile.

Using StringIO object I first have to read the stream into a string and then use numpy.fromstring, but I would like to avoid creating the intermediate object (several Gigabytes).

An alternative for me would be if I can stream sys.stdin into numpy arrays, but that does not work with numpy.fromfile either (seek needs to be implemented).

Are there any work-arounds for this? I can't be the first one trying this (unless this is a PEBKAC case?)

Solution: This is the current solution, it's a mix of unutbu's instruction how to use the Popen with PIPE and the hint of eryksun to use bytearray, so I don't know who to accept!? :S

proc = sp.Popen(cmd, stdout = sp.PIPE, shell=True)
d = parse_des_header(proc.stdout)
rec_dtype = np.dtype([(key,'f8') for key in d.keys()])
data = bytearray(proc.stdout.read())
ndata = np.frombuffer(data, dtype = rec_dtype)

I didn't check if the data is really not creating another copy, don't know how. But what I noticed that this works much faster than everything I tried before, so many thanks to both the answers' authors!

Update 2022: I just tried above solution steps without the bytearray() step and it just works fine. Thanks to Python 3 I guess?

like image 204
K.-Michael Aye Avatar asked Oct 24 '12 23:10

K.-Michael Aye


1 Answers

You can use Popen with stdout=subprocess.PIPE. Read in the header, then load the rest into a bytearray to use with np.frombuffer.

Additional comments based on your edit:

If you're going to call proc.stdout.read(), it's equivalent to using check_output(). Both create a temporary string. If you preallocate data, you could use proc.stdout.readinto(data). Then if the number of bytes read into data is less than len(data), free the excess memory, else extend data by whatever is left to be read.

data = bytearray(2**32) # 4 GiB
n = proc.stdout.readinto(data)
if n < len(data):
    data[n:] = ''        
else:
    data += proc.stdout.read()

You could also come at this starting with a pre-allocated ndarray ndata and use buf = np.getbuffer(ndata). Then readinto(buf) as above.

Here's an example to show that the memory is shared between the bytearray and the np.ndarray:

>>> data = bytearray('\x01')
>>> ndata = np.frombuffer(data, np.int8)
>>> ndata
array([1], dtype=int8)
>>> ndata[0] = 2
>>> data
bytearray(b'\x02')
like image 68
Eryk Sun Avatar answered Sep 20 '22 10:09

Eryk Sun