I'm creating several numpy arrays from a list of numpy arrays, like so:
seq_length = 1500
seq_diff = 200 # difference between start of two sequences
# x and y are 2D numpy arrays
x_seqs = [x[i:i+seq_length,:] for i in range(0, seq_diff*(len(x) // seq_diff), seq_diff)]
y_seqs = [y[i:i+seq_length,:] for i in range(0, seq_diff*(len(y) // seq_diff), seq_diff)]
boundary1 = int(0.7 * len(x_seqs)) # 70% is training set
boundary2 = int(0.85 * len(x_seqs)) # 15% validation, 15% test
x_train = np.array(x_seqs[:boundary1])
y_train = np.array(y_seqs[:boundary1])
x_valid = np.array(x_seqs[boundary1:boundary2])
y_valid = np.array(y_seqs[boundary1:boundary2])
x_test = np.array(x_seqs[boundary2:])
y_test = np.array(y_seqs[boundary2:])
I'd like to end up with 6 arrays of shape (n, 1500, 300), where n is 70%, 15% and 15% of my data for the training, validation and test arrays, respectively.
This is where it goes wrong: the _train and _valid arrays turn out fine, but the _test arrays are one-dimensional arrays of arrays. That is:
x_train.shape is (459, 1500, 300)
x_valid.shape is (99, 1500, 300)
x_test.shape is (99,)
But printing x_test verifies that it contains the correct elements - i.e. it's a 99-element long array of (1500, 300) arrays.
Why do the _test matrices get the wrong shape, while the _train and _valid matrices don't?
The items in x_seqs vary in length: your range runs past len(x) - seq_length, so the last slices come out shorter than seq_length. When the items are all the same length, np.array can make a 3-D array from them; when they differ, it makes a 1-D object array of the pieces. Look at the dtype of x_test, and at [len(i) for i in x_test].
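You can see the two behaviours side by side. This is a minimal sketch with made-up shapes; note that NumPy 1.24+ refuses to build the ragged array unless you pass dtype=object explicitly (older versions fell back to it silently, which is what happened to your x_test):

```python
import numpy as np

# Equal-length 2-D blocks stack cleanly into a 3-D array.
same = [np.zeros((4, 3)) for _ in range(5)]
stacked = np.array(same)
print(stacked.shape)   # (5, 4, 3)
print(stacked.dtype)   # float64

# Ragged blocks cannot form a rectangular array, so NumPy
# produces a 1-D object array holding the pieces instead.
ragged = [np.zeros((4, 3)), np.zeros((2, 3))]
obj = np.array(ragged, dtype=object)
print(obj.shape)                # (2,)
print(obj.dtype)                # object
print([len(a) for a in obj])    # [4, 2]
```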
I took your code and added:
x=np.zeros((2000,10))
y=x.copy()
...
print([len(i) for i in x_seqs])
print(x_train.shape)
print(x_valid.shape)
print(x_test.shape)
and got:
1520:~/mypy$ python3 stack40643639.py
[1500, 1500, 1500, 1400, 1200, 1000, 800, 600, 400, 200]
(7,)
(1, 600, 10)
(2,)
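One way to avoid the object array (a sketch, not the only fix) is to only start a window where a full seq_length slice still fits, so every sequence has the same shape. Here x is the same stand-in zeros array as in the test above:

```python
import numpy as np

seq_length = 1500
seq_diff = 200
x = np.zeros((2000, 10))  # stand-in data

# Start a window only where a full seq_length slice fits,
# so every item in x_seqs has shape (seq_length, 10).
starts = range(0, len(x) - seq_length + 1, seq_diff)
x_seqs = [x[i:i + seq_length, :] for i in starts]

x_arr = np.array(x_seqs)
print(x_arr.shape)  # (3, 1500, 10)
```

The cost is that you discard the trailing partial windows; if you need them, the alternative is to pad the short sequences up to seq_length before calling np.array.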