How to create a NumPy array from a large list of lists in Python

I have a list of lists with 1,200 rows and 500,000 columns. How do I convert it into a NumPy array?

I've read the answers to "Bypass 'Array is too big' python error", but they did not help.

I tried to convert the list directly into a NumPy array:

import random
import numpy as np
lol = [[random.uniform(0,1) for j in range(500000)] for i in range(1200)]
np.array(lol)

[Error]:

ValueError: array is too big.

Then I tried pandas:

import random
import pandas as pd
lol = [[random.uniform(0,1) for j in range(500000)] for i in range(1200)]
pd.lib.to_object_array(lol).astype(float)

[Error]:

ValueError: array is too big.

I've also tried hdf5 as @askewchan suggested:

import h5py
filearray = h5py.File('project.data','w')
data = filearray.create_dataset('tocluster',(len(data),len(data[0])),dtype='f')
data[...] = data

[Error]:

    data[...] = data
  File "/usr/lib/python2.7/dist-packages/h5py/_hl/dataset.py", line 367, in __setitem__
    val = numpy.asarray(val, order='C')
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 460, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/usr/lib/python2.7/dist-packages/h5py/_hl/dataset.py", line 455, in __array__
    arr = numpy.empty(self.shape, dtype=self.dtype if dtype is None else dtype)
ValueError: array is too big.

This post shows that I can store a huge NumPy array on disk: "Python: how to store a numpy multidimensional array in PyTables?". But I can't even get my list of lists into a NumPy array in the first place =(

On a system with 32 GB of RAM and 64-bit Python, your code:

import random
import numpy as np
lol = [[random.uniform(0,1) for j in range(500000)] for i in range(1200)]
np.array(lol)

works just fine for me, but it's probably not the best route to take. This is the kind of thing PyTables was built for. Since you're dealing with homogeneous data, you can use the Array class or, better yet, the CArray class (which supports compression). This can be done as follows:

import numpy as np
import tables as pt

# Create container
h5 = pt.open_file('myarray.h5', 'w')
filters = pt.Filters(complevel=6, complib='blosc')
carr = h5.create_carray('/', 'carray', atom=pt.Float32Atom(), shape=(1200, 500000), filters=filters)

# Fill the array one row at a time, so only a single row is held in memory
m, n = carr.shape
for j in range(m):
    carr[j, :] = np.random.randn(n)

h5.close() # "myarray.h5" (~2.2 GB)

# Open file
h5 = pt.open_file('myarray.h5', 'r')
carr = h5.root.carray
# Display some numbers from array
print(carr[973:975, :4])
print(carr.dtype)

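If you print carr.flavor it will return 'numpy'. You can index and slice carr much like a NumPy array, and the slices come back as NumPy arrays. The data is stored on disk, but access is still quite fast. To store your own list of lists rather than random numbers, assign carr[j, :] = lol[j] inside the fill loop.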
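As a rough check on the sizes involved (the 32 GB machine and the ~2.2 GB file mentioned above), the raw footprint of a dense 1200 x 500000 array can be worked out by hand. This is only a back-of-the-envelope sketch: the helper name `raw_bytes` is made up for illustration, and the 32-bit comparison is an assumption about one common cause of "array is too big", not something the question states about the asker's setup.

```python
import sys

ROWS, COLS = 1200, 500000
GIB = 2 ** 30

def raw_bytes(rows, cols, itemsize):
    """Bytes needed for a dense rows x cols array with the given item size."""
    return rows * cols * itemsize

float64_bytes = raw_bytes(ROWS, COLS, 8)  # np.float64: 8 bytes per element
float32_bytes = raw_bytes(ROWS, COLS, 4)  # np.float32: 4 bytes per element

print("float64: %.2f GiB" % (float64_bytes / GIB))  # float64: 4.47 GiB
print("float32: %.2f GiB" % (float32_bytes / GIB))  # float32: 2.24 GiB

# A 32-bit process can address at most 2**32 bytes in total, so the
# float64 version cannot fit there regardless of installed RAM.
fits_in_32bit = float64_bytes < 2 ** 32
print("float64 array fits in a 32-bit address space:", fits_in_32bit)

# Quick check of the interpreter running this snippet.
print("this interpreter is 64-bit:", sys.maxsize > 2 ** 32)
```

The float32 figure is in line with the ~2.2 GB file size noted in the PyTables example, which is plausible because random data compresses poorly. The list of lists itself needs considerably more memory than the array, since every element is a separate Python float object.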

With h5py / hdf5:

import numpy as np
import h5py

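# Smaller stand-in for the real data (5,000 columns, not 500,000);
# zeros rather than uninitialized memory so the output below is reproducible
lol = np.zeros((1200, 5000)).tolist()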

f = h5py.File('big.hdf5', 'w')
bd = f.create_dataset('big_dataset', (len(lol), len(lol[0])), dtype='f')
bd[...] = lol

Then, I believe you can access your big dataset bd as if it were an array, but it is stored and accessed from disk, not memory:

In [14]: bd[0, 1:10]
Out[14]:
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.], dtype=float32)

And you can have several 'datasets' in the one file (multiple arrays).

abd = f.create_dataset('another_big_dataset', (len(lol), len(lol[0])), dtype='f')
abd[...] = lol
abd[...] = abd[...] + 10  # read into memory, add 10, write back to disk

Then:

In [24]: abd[:3, :10]
Out[24]: 
array([[ 10.,  10.,  10.,  10.,  10.,  10.,  10.,  10.,  10.,  10.],
       [ 10.,  10.,  10.,  10.,  10.,  10.,  10.,  10.,  10.,  10.],
       [ 10.,  10.,  10.,  10.,  10.,  10.,  10.,  10.,  10.,  10.]], dtype=float32)

In [25]: bd[:3, :10]
Out[25]: 
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]], dtype=float32)

My computer can't handle your example, so I can't test this with an array of your size, but I hope it works!

Depending on what you want to do with your array, you might have more luck with PyTables, which does a lot more than h5py.
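If assigning the whole list at once with `bd[...] = lol` is itself too large (h5py converts the right-hand side to a NumPy array in memory first), one possible workaround is to write the rows in blocks. The following is a minimal sketch under that assumption, not a tested solution at the full 1200 x 500000 size. The function name `write_rows_to_hdf5`, the `block` parameter, and the temporary file location are all made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

def write_rows_to_hdf5(lol, path, name="big_dataset", block=100):
    """Write a list of lists to an HDF5 dataset, a block of rows at a time."""
    n_rows, n_cols = len(lol), len(lol[0])
    with h5py.File(path, "w") as f:
        dset = f.create_dataset(name, shape=(n_rows, n_cols), dtype="f")
        for start in range(0, n_rows, block):
            stop = min(start + block, n_rows)
            # Only this block of rows is converted to a NumPy array in memory.
            dset[start:stop, :] = np.asarray(lol[start:stop], dtype=np.float32)

# Small stand-in for the real list of lists.
small = [[float(i * 10 + j) for j in range(10)] for i in range(25)]
path = os.path.join(tempfile.mkdtemp(), "big.hdf5")
write_rows_to_hdf5(small, path, block=4)

with h5py.File(path, "r") as f:
    first_rows = f["big_dataset"][:2, :3]  # reads only this slice from disk
print(first_rows)
```

A larger `block` means fewer writes but a bigger temporary array per step, so the value is a trade-off to tune against available memory.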

See also:
Python Numpy Very Large Matrices
exporting from/importing to numpy, scipy in SQLite and HDF5 formats

Have you tried assigning a dtype? This works for me.

import random
import numpy as np
lol = [[random.uniform(0,1) for j in range(500000)] for i in range(1200)]
ar = np.array(lol, dtype=np.float64)
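A variation on the same idea, offered as a sketch rather than a guaranteed fix: allocate the target array once with an explicit dtype and copy the rows in one at a time. Choosing float32 instead of float64 halves the memory needed, assuming single precision is acceptable for your data. The helper name `list_of_lists_to_array` is hypothetical, and the example uses a small stand-in list so it runs quickly.

```python
import random

import numpy as np

def list_of_lists_to_array(lol, dtype=np.float32):
    """Copy a rectangular list of lists into one preallocated 2-D array."""
    n_rows, n_cols = len(lol), len(lol[0])
    out = np.empty((n_rows, n_cols), dtype=dtype)  # single allocation up front
    for i, row in enumerate(lol):
        # Raises ValueError if a row does not have exactly n_cols items.
        out[i, :] = row
    return out

# Small stand-in for the 1200 x 500000 list in the question.
small = [[random.uniform(0, 1) for _ in range(50)] for _ in range(12)]
ar = list_of_lists_to_array(small)
print(ar.shape, ar.dtype)
```

If the allocation in `np.empty` itself fails, the array does not fit in the process's address space at that dtype, and an on-disk container such as the HDF5 options in the other answers is the more realistic route.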

Another option is to use Blaze (http://blaze.pydata.org/):

import random
import blaze
lol = [[random.uniform(0,1) for j in range(500000)] for i in range(1200)]
ar = blaze.array(lol)

Tags: python, arrays, pandas, numpy, pytables

Asked by: alvas

3 Answers: Joel Vroom, askewchan, Michael WS
