My question is how to efficiently expand an array by copying it many times. I am trying to expand my survey sample to the full-size dataset by copying every sample N times, where N is the influence factor assigned to that sample. I wrote two nested loops to do this (script pasted below). It works, but it is slow: my sample size is 20,000, and I am trying to expand it to a full size of 3 million rows. Is there a function I can use instead? Thank you for your help!
----My script----
import numpy as np

lines = np.asarray(person.read().split('\n'))
df_array = np.asarray(lines[0].split(' '))
for j in range(1, len(lines) - 1):          # skip the empty element after the trailing newline
    subarray = np.asarray(lines[j].split(' '))
    factor = int(round(float(subarray[-1])))
    for i in range(factor):                 # append `factor` copies of the row
        df_array = np.vstack((df_array, subarray))
print(len(df_array))
First, you can load the whole file at once with numpy.loadtxt.
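For example, a minimal sketch ('survey.txt' is a hypothetical filename; this assumes the file is whitespace-delimited and entirely numeric, which is what numpy.loadtxt expects by default):
>>> import numpy as np
>>> data = np.loadtxt('survey.txt')   # hypothetical file; whitespace-delimited by default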
Then, to repeat each row according to its last column, use numpy.repeat:
>>> data = np.array([[1, 2, 3],
...                  [4, 5, 6]])
>>> np.repeat(data, data[:, -1], axis=0)
array([[1, 2, 3],
       [1, 2, 3],
       [1, 2, 3],
       [4, 5, 6],
       [4, 5, 6],
       [4, 5, 6],
       [4, 5, 6],
       [4, 5, 6],
       [4, 5, 6]])
Finally, if the factors in data[:, -1] are floats, round them first: replace data[:, -1] with np.round(data[:, -1]).astype(int), since np.repeat requires integer counts.
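Putting the two steps together (continuing the hypothetical 'survey.txt' example from above):
>>> reps = np.round(data[:, -1]).astype(int)   # per-row repeat counts as integers
>>> full = np.repeat(data, reps, axis=0)       # 20,000 samples -> ~3 million rows
>>> len(full) == reps.sum()
True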
Stacking numpy arrays over and over is inefficient because they are not designed for dynamic growth. Every call to np.vstack allocates a brand-new array and copies all of the data accumulated so far, so the total cost of the loop grows quadratically with the number of rows.
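To see the difference, here is a rough timing sketch (the row count and exact numbers are illustrative and depend on the machine, but growing via vstack is dramatically slower as the array gets large):
import numpy as np
import timeit

rows = [np.arange(10, dtype=float) for _ in range(2000)]

def grow_vstack():
    out = rows[0]
    for r in rows[1:]:
        out = np.vstack((out, r))   # copies everything accumulated so far, every iteration
    return out

def grow_list():
    return np.array(rows)           # one pass, one allocation

print(timeit.timeit(grow_vstack, number=3))   # cost grows quadratically with row count
print(timeit.timeit(grow_list, number=3))     # cost grows linearly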
Collect the rows in a list instead and build the array once at the end, for example with a generator like this:
import numpy as np

def upsample(stream):
    # Yield each record `factor` times, where the factor is the last field on the line.
    for line in stream:
        rec = line.strip().split()
        if not rec:                 # skip blank lines (e.g. from a trailing newline)
            continue
        factor = int(round(float(rec[-1])))
        for i in range(factor):
            yield rec

df_array = np.array(list(upsample(person)))
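A hedged usage sketch, again with the hypothetical filename 'survey.txt' (person in the original code is an already-open file object):
with open('survey.txt') as person:               # hypothetical input file
    df_array = np.array(list(upsample(person)))

print(len(df_array))    # note: rows are strings; use df_array.astype(float) for numbers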