Fastest way to grow a numpy numeric array

Requirements:
I need to grow an array arbitrarily large from data.
I can guess the size (roughly 100-200), with no guarantee that the array will fit every time.
Once it is grown to its final size, I need to perform numeric computations on it, so I'd prefer to eventually get to a 2-D numpy array.
Speed is critical. As an example, for one of 300 files, the update() method is called 45 million times (takes 150s or so) and the finalize() method is called 500k times (takes a total of 106s) ... taking a total of 250s or so.

Here is my code:

def __init__(self):
    self.data = []

def update(self, row):
    self.data.append(row)

def finalize(self):
    dx = np.array(self.data)

Other things I tried include the following code ... but this is waaaaay slower.

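class A:
    def __init__(self):
        self.data = np.array([])

    def update(self, row):
        # np.append returns a new array, so the result must be reassigned
        self.data = np.append(self.data, row)

    def finalize(self):
        # 5 values per row; integer division keeps the shape an int
        dx = np.reshape(self.data, (self.data.shape[0] // 5, 5))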

Here is a schematic of how this is called:

for i in range(500000):
    ax = A()
    for j in range(200):
        ax.update([1, 2, 3, 4, 5])
    ax.finalize()
    # some processing on ax


asked Aug 20 '11 18:08

fodon

1 Answer

I tried a few different things, with timing.

import numpy as np

The method you mention as slow: (32.094 seconds)

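class A:

    def __init__(self):
        self.data = np.array([])

    def update(self, row):
        # each call allocates a new array and copies the old contents
        self.data = np.append(self.data, row)

    def finalize(self):
        # // keeps the shape an integer under Python 3
        return np.reshape(self.data, (self.data.shape[0] // 5, 5))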
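A likely reason for the gap: np.append does not grow an array in place. It allocates a new array and copies the old contents on every call, so the total copying grows quadratically with the number of appends. A small sketch of that behavior (the names a and b are only illustrative):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.append(a, [4.0, 5.0])   # returns a new, larger array

# The original array is untouched, so the result has to be reassigned.
print(a.shape)                 # (3,)
print(b.shape)                 # (5,)
print(np.shares_memory(a, b))  # False: b is a fresh copy
```

This is also why calling np.append without assigning the result leaves self.data empty.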

Regular ol Python list: (0.308 seconds)

class B:

    def __init__(self):
        self.data = []

    def update(self, row):
        for r in row:
            self.data.append(r)

    def finalize(self):
        # // keeps the shape an integer under Python 3
        return np.reshape(self.data, (len(self.data) // 5, 5))

Trying to implement an arraylist in numpy: (0.362 seconds)

class C:

    def __init__(self):
        self.data = np.zeros((100,))
        self.capacity = 100
        self.size = 0

    def update(self, row):
        for r in row:
            self.add(r)

    def add(self, x):
        if self.size == self.capacity:
            # buffer is full: allocate 4x the space and copy the old data over
            self.capacity *= 4
            newdata = np.zeros((self.capacity,))
            newdata[:self.size] = self.data
            self.data = newdata

        self.data[self.size] = x
        self.size += 1

    def finalize(self):
        data = self.data[:self.size]
        # // keeps the shape an integer under Python 3
        return np.reshape(data, (len(data) // 5, 5))
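Since the goal is a 2-D array, the same arraylist idea could also be applied to whole rows instead of single elements. The sketch below is not part of the timings above: RowBuffer, ncols, and the default capacity are hypothetical names and values, and it assumes every row has the same number of columns. Whether it beats a plain list would need to be measured.

```python
import numpy as np

class RowBuffer:
    """Hypothetical 2-D variant of class C: grows by whole rows."""

    def __init__(self, ncols=5, capacity=256):
        self.data = np.empty((capacity, ncols))
        self.size = 0  # number of rows filled so far

    def update(self, row):
        if self.size == self.data.shape[0]:
            # Buffer is full: double the row capacity and copy the filled rows.
            newdata = np.empty((2 * self.data.shape[0], self.data.shape[1]))
            newdata[:self.size] = self.data
            self.data = newdata
        self.data[self.size] = row
        self.size += 1

    def finalize(self):
        # A view of the filled rows; no reshape needed.
        return self.data[:self.size]


buf = RowBuffer(capacity=2)  # tiny capacity to force two growth steps
for j in range(5):
    buf.update([j, 2, 3, 4, 5])
out = buf.finalize()
print(out.shape)  # (5, 5)
```

Growing geometrically keeps the number of reallocations logarithmic in the final size, which is the same trade-off class C makes with its factor of 4.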

And this is how I timed it:

x = C()
for i in range(100000):  # xrange in Python 2
    x.update([i])
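The loop above does not show how the clock was read. One way to reproduce the measurement is sketched below, assuming time.perf_counter around the update loop only; ListGrower is a stand-in for class B and time_updates is a hypothetical helper, neither of which comes from the original timing run.

```python
import time
import numpy as np

class ListGrower:
    """Stand-in for class B above, written for Python 3."""

    def __init__(self):
        self.data = []

    def update(self, row):
        for r in row:
            self.data.append(r)

    def finalize(self):
        return np.reshape(self.data, (len(self.data) // 5, 5))


def time_updates(cls, n=100000):
    """Return (seconds spent in n single-element updates, finalized array)."""
    x = cls()
    start = time.perf_counter()
    for i in range(n):
        x.update([i])
    elapsed = time.perf_counter() - start
    return elapsed, x.finalize()


elapsed, result = time_updates(ListGrower)
print(f"{elapsed:.3f} s, shape {result.shape}")
```

Absolute numbers will differ by machine and Python version, so only the relative ordering of the three classes is worth comparing.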

So it looks like regular old Python lists are pretty good ;)
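For the call pattern in the question (200 rows of five values per object), the list approach can keep each row intact and convert once at the end. This is a minimal, untimed sketch; ListOfRows is a hypothetical name, and it assumes every row has the same length so that np.array produces a 2-D result.

```python
import numpy as np

class ListOfRows:
    def __init__(self):
        self.data = []

    def update(self, row):
        self.data.append(row)  # one append per row, not per element

    def finalize(self):
        # A single conversion; equal-length rows give a 2-D array.
        return np.array(self.data, dtype=float)


ax = ListOfRows()
for j in range(200):
    ax.update([1, 2, 3, 4, 5])
dx = ax.finalize()
print(dx.shape)  # (200, 5)
```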

answered Oct 10 '22 19:10

Owen
