There are several similar questions, but none of them answers this simple question directly:
How can I capture a command's output and stream that content into numpy arrays without creating a temporary string object to read from?
So, what I would like to do is this:
import subprocess
import numpy
import StringIO

def parse_header(fileobject):
    # This function moves the file pointer past the header
    # and returns a dictionary describing the record layout.
    d = do_some_parsing(fileobject)
    return d

sio = StringIO.StringIO(subprocess.check_output(cmd))

# The file pointer is now at the start of the data;
# parse_header takes care of that.
d = parse_header(sio)

# ALL of the data is now available in the next read of sio.
dt = numpy.dtype([(key, 'f8') for key in d.keys()])

# I don't know how to make this work:
data = numpy.fromxxxx(sio, dt)

# If I do this instead, I create another copy besides the
# StringIO object, don't I? So this works, but isn't it 'bad'?
datastring = sio.read()
data = numpy.fromstring(datastring, dtype=dt)
I tried it with StringIO and cStringIO, but neither is accepted by numpy.frombuffer or numpy.fromfile.
Using a StringIO object I first have to read the stream into a string and then use numpy.fromstring, but I would like to avoid creating that intermediate object (several gigabytes).
An alternative for me would be streaming sys.stdin into numpy arrays, but that does not work with numpy.fromfile either (it needs seek to be implemented).
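For reference, this is roughly what those failures look like (a minimal sketch on Python 3, where io.BytesIO stands in for cStringIO and the buffer contents are made up):

import io
import numpy as np

buf = io.BytesIO(b'\x00' * 16)

try:
    np.frombuffer(buf, dtype='f8')   # frombuffer wants a bytes-like object
except TypeError as e:
    print(e)                         # "a bytes-like object is required, ..."

try:
    np.fromfile(buf, dtype='f8')     # fromfile wants a real file (fileno/seek)
except Exception as e:               # exact exception varies by numpy version
    print(type(e).__name__, e)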
Are there any workarounds for this? I can't be the first one trying this (unless this is a PEBKAC case?)
Solution: This is the current solution; it's a mix of unutbu's instructions on how to use Popen with PIPE and eryksun's hint to use bytearray, so I don't know whose answer to accept!? :S
import subprocess as sp
import numpy as np

proc = sp.Popen(cmd, stdout=sp.PIPE, shell=True)
d = parse_des_header(proc.stdout)  # consumes only the header bytes
rec_dtype = np.dtype([(key, 'f8') for key in d.keys()])
data = bytearray(proc.stdout.read())          # one mutable copy of the payload
ndata = np.frombuffer(data, dtype=rec_dtype)  # wraps data without another copy
I didn't check whether the data really avoids another copy, as I don't know how to verify that. But I did notice that this works much faster than everything I tried before, so many thanks to both answers' authors!
Update 2022: I just tried the above solution steps without the bytearray() step and it works just fine. Thanks to Python 3, I guess?
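For concreteness, a minimal sketch of that Python 3 variant (cmd and parse_des_header are the same placeholders as above):

import subprocess as sp
import numpy as np

proc = sp.Popen(cmd, stdout=sp.PIPE, shell=True)
d = parse_des_header(proc.stdout)
rec_dtype = np.dtype([(key, 'f8') for key in d.keys()])

# On Python 3, read() returns bytes, which frombuffer accepts directly;
# since bytes are immutable, the resulting array is read-only.
ndata = np.frombuffer(proc.stdout.read(), dtype=rec_dtype)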
You can use Popen with stdout=subprocess.PIPE. Read in the header, then load the rest into a bytearray to use with np.frombuffer.
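In outline (a sketch using the question's cmd and parse_header placeholders):

import subprocess
import numpy as np

proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
d = parse_header(proc.stdout)            # consume only the header
dt = np.dtype([(key, 'f8') for key in d.keys()])
buf = bytearray(proc.stdout.read())      # one mutable copy of the payload
data = np.frombuffer(buf, dtype=dt)      # zero-copy view onto buf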
Additional comments based on your edit:
If you're going to call proc.stdout.read(), it's equivalent to using check_output(): both create a temporary string. If you preallocate data, you could use proc.stdout.readinto(data). Then, if the number of bytes read into data is less than len(data), free the excess memory; else extend data by whatever is left to be read.
data = bytearray(2**32)  # preallocate 4 GiB
n = proc.stdout.readinto(data)
if n < len(data):
    del data[n:]                 # free the unused tail (works on Python 2 and 3)
else:
    data += proc.stdout.read()   # buffer was full; append whatever remains
You could also come at this starting with a pre-allocated ndarray ndata and use buf = np.getbuffer(ndata). Then readinto(buf) as above.
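Here is a sketch of that variant for Python 3, where np.getbuffer does not exist and a uint8 view of the array serves the same purpose; the record layout, count, and cmd are made-up assumptions:

import subprocess
import numpy as np

NRECORDS = 1_000_000                        # assumed upper bound on records
dt = np.dtype([('x', 'f8'), ('y', 'f8')])   # hypothetical record layout

ndata = np.empty(NRECORDS, dtype=dt)        # pre-allocated target array
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
parse_header(proc.stdout)                   # question's placeholder; skips the header
nbytes = proc.stdout.readinto(ndata.view(np.uint8))  # fill the array's memory directly
ndata = ndata[:nbytes // dt.itemsize]       # trim to the records actually read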
Here's an example to show that the memory is shared between the bytearray and the np.ndarray:
>>> data = bytearray(b'\x01')
>>> ndata = np.frombuffer(data, np.int8)
>>> ndata
array([1], dtype=int8)
>>> ndata[0] = 2
>>> data
bytearray(b'\x02')
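Incidentally, the same sharing explains the 2022 update above: np.frombuffer also accepts an immutable bytes object directly; you simply get a read-only array, so the bytearray() wrapper only matters if you intend to modify the result.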