I'm trying to construct an np.array by sampling from a Python generator that yields one row of the array per invocation of next. Here is some sample code:
import numpy as np
# Assumed source of shuffle and randint; the original snippet does not show this import.
from numpy.random import shuffle, randint

data = np.eye(9)
labels = np.array([0,0,0,1,1,1,2,2,2])

def extract_one_class(X,labels,y):
    """ Take an array of data X, a column vector array of labels, and one particular label y. Return an array of all instances in X that have label y """
    return X[np.nonzero(labels[:] == y)[0],:]

def generate_points(data, labels, size):
    """ Generate and return 'size' pairs of points drawn from different classes """
    label_alphabet = np.unique(labels)
    assert(label_alphabet.size > 1)
    for _ in range(size):
        # Shuffle the labels in place and take the first two as the two classes.
        shuffle(label_alphabet)
        first_class = extract_one_class(data,labels,label_alphabet[0])
        second_class = extract_one_class(data,labels,label_alphabet[1])
        pair = np.hstack((first_class[randint(0,first_class.shape[0]),:],second_class[randint(0,second_class.shape[0]),:]))
        yield pair

points = np.fromiter(generate_points(data,labels,5),dtype = np.dtype('f8',(2*data.shape[1],1)))
The extract_one_class function returns a subset of data: all data points belonging to one class label. I would like to have points be an np.array with shape = (size,data.shape[1]). Currently the code snippet above returns an error:
ValueError: setting an array element with a sequence.
The documentation of fromiter says that it returns a one-dimensional array. Yet others have used fromiter to construct record arrays in numpy before (e.g. http://iam.al/post/21116450281/numpy-is-my-homeboy).
Am I off the mark in assuming I can generate an array in this fashion? Or is my numpy just not quite right?
As you've noticed, the documentation of np.fromiter explains that the function creates a 1D array. You won't be able to create a 2D array that way, and @unutbu's method of returning a 1D array that you reshape afterwards is a safe bet.
However, you can indeed create structured arrays using fromiter, as illustrated by:
>>> a = zip((1,2,3),(10,20,30))
>>> r = np.fromiter(a,dtype=[('',int),('',int)])
>>> r
array([(1, 10), (2, 20), (3, 30)],
      dtype=[('f0', '<i8'), ('f1', '<i8')])
Note, however, that r.shape is (3,): r is really nothing but a 1D array of records, each record being composed of two integers. Because all the fields have the same dtype, we can take a view of r as a 2D array:
>>> r.view((int,2))
array([[ 1, 10],
       [ 2, 20],
       [ 3, 30]])
So, yes, you could try to use np.fromiter with a dtype like [('',int)]*data.shape[1]: you'll get a 1D array of length size, which you can then view as a 2D array with .view((int, data.shape[1])). You can use floats instead of ints; the important part is that all fields have the same dtype.
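To make the idea concrete, here is a minimal sketch of the structured-dtype route with float fields. The generator generate_rows and the sizes used are hypothetical stand-ins for generate_points and the real data; the only requirement carried over from above is that each yielded record is a tuple and that every field has the same dtype.

```python
import numpy as np

def generate_rows(n_rows, n_cols):
    """Hypothetical stand-in for generate_points: yields one row per call to next."""
    for i in range(n_rows):
        # Each record must be a tuple so np.fromiter can fill a structured element.
        yield tuple(float(i * n_cols + j) for j in range(n_cols))

size, n_cols = 4, 3

# One float field per column; the empty names are replaced by 'f0', 'f1', ...
row_dtype = [('', float)] * n_cols

# The result is a 1D array of records, one record per generated row.
records = np.fromiter(generate_rows(size, n_cols), dtype=row_dtype, count=size)
print(records.shape)  # (4,)

# All fields share one dtype, so the records can be viewed as a 2D float array.
points = records.view((float, n_cols))
print(points.shape)  # (4, 3)
```

If the generator yields NumPy arrays rather than tuples, wrapping each row in tuple(...) before yielding should be enough for this approach.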
If you really want to, you can use a more complex dtype. Consider for example:
a = zip((1,2,3),(10,20,30))  # rebuild the iterator, since the earlier one has been consumed
r = np.fromiter(((_,) for _ in a),dtype=[('',(int,2))])
Here, you get a 1D structured array with one field, the field consisting of an array of 2 integers. Note the use of (_,) to make sure that each record is passed as a tuple (otherwise np.fromiter chokes). But do you need that complexity?
Note also that since you know the length of the array beforehand (it's size), you should use the optional count argument of np.fromiter for more efficiency.
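As a small illustration, the keyword in the np.fromiter signature is spelled count. The generator below is a hypothetical example, not part of the original code.

```python
import numpy as np

size = 5
squares = (float(i * i) for i in range(size))

# With count=size, fromiter can allocate the output once instead of growing it.
arr = np.fromiter(squares, dtype=float, count=size)
print(arr.shape)  # (5,)
```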
You could modify generate_points to yield single floats instead of np.arrays, use np.fromiter to form a 1D array, and then use .reshape(size, -1) to make it a 2D array.
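size = 5
# Assumes generate_points has been modified to yield one float at a time.
points = np.fromiter(
    generate_points(data, labels, size), dtype=float).reshape(size, -1)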
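For completeness, here is a self-contained sketch of this flatten-then-reshape approach. It is one possible rewrite rather than the original function: the name generate_point_values is hypothetical, and the use of np.random.default_rng (NumPy 1.17 or later) in place of shuffle and randint is an assumption made to keep the example reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

data = np.eye(9)
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

def generate_point_values(data, labels, size):
    """Yield the values of 'size' pairs of points from different classes, one float at a time."""
    label_alphabet = np.unique(labels)
    assert label_alphabet.size > 1
    for _ in range(size):
        # Pick two distinct labels, then one random row from each class.
        first_label, second_label = rng.choice(label_alphabet, size=2, replace=False)
        first_class = data[labels == first_label]
        second_class = data[labels == second_label]
        pair = np.hstack((first_class[rng.integers(first_class.shape[0])],
                          second_class[rng.integers(second_class.shape[0])]))
        # Flatten the row: hand the values to the consumer one scalar at a time.
        for value in pair:
            yield value

size = 5
row_length = 2 * data.shape[1]  # each pair concatenates two rows of data
points = np.fromiter(generate_point_values(data, labels, size),
                     dtype=float, count=size * row_length).reshape(size, -1)
print(points.shape)  # (5, 18)
```

Each row of points holds two concatenated rows of data, so its length is 2 * data.shape[1].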