Best way to create a NumPy array from a dictionary?

Tags:

python

numpy

I'm just starting with NumPy so I may be missing some core concepts...

What's the best way to create a NumPy array from a dictionary whose values are lists?

Something like this:

d = { 1: [10,20,30] , 2: [50,60], 3: [100,200,300,400,500] }

Should turn into something like:

data = [
  [10,20,30,?,?],
  [50,60,?,?,?],
  [100,200,300,400,500]
]

I'm going to do some basic statistics on each row, e.g.:

deviations = numpy.std(data, axis=1)

Questions:

What's the best / most efficient way to create the numpy.array from the dictionary? The dictionary is large: a couple of million keys, each with ~20 items.
The number of values in each 'row' is different. If I understand correctly, NumPy wants uniform size, so what do I fill in for the missing items to make std() happy?

Update: One thing I forgot to mention: while the Python techniques are reasonable (e.g. looping over a few million items is fast), they're constrained to a single CPU. NumPy operations scale nicely to the hardware and hit all the CPUs, so they're attractive.

627

asked Mar 02 '09 06:03

Parand

1 Answer

You don't need to create NumPy arrays to call numpy.std(). You can call numpy.std() in a loop over all the values of your dictionary. Each list will be converted to a NumPy array on the fly to compute the standard deviation.

The downside of this method is that the main loop runs in Python and not in C. But I guess this should be fast enough: you still compute each std at C speed, and you save a lot of memory because you don't have to store padding values (such as 0) to make the variable-size rows uniform.
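A minimal sketch of that loop, assuming the sample dictionary `d` from the question. The dict comprehension and the name `deviations` are only illustrative choices, not part of the original answer:

```python
import numpy as np

# Sample dictionary from the question; rows have different lengths.
d = {1: [10, 20, 30], 2: [50, 60], 3: [100, 200, 300, 400, 500]}

# One np.std call per row. Each list is converted to an array on the fly,
# so no padding is needed for the shorter rows.
deviations = {key: float(np.std(values)) for key, values in d.items()}
```

The result maps each key to the standard deviation of its own row, so the key association survives even though the rows never share an array.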

If you want to further optimize this, you can store your values in a list of NumPy arrays, so that you do the Python list -> NumPy array conversion only once.
If you find that this is still too slow, try using Psyco to optimize the Python loop.
If this is still too slow, try using Cython together with the numpy module. This tutorial claims impressive speed improvements for image processing. Or simply program the whole std function in Cython (see this for benchmarks and examples with the sum function).
An alternative to Cython would be to use SWIG with numpy.i.
If you want to use only NumPy and have everything computed at C level, try grouping all the records of the same size together in different arrays and call numpy.std() on each of them. It should look like the following example.

Example with O(N) complexity:

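import numpy

# d is the dictionary from the question (key -> list of values).
list_size_1 = []
list_size_2 = []
for row in d.values():  # d.itervalues() on Python 2
    if len(row) == 1:
        list_size_1.append(row)
    elif len(row) == 2:
        list_size_2.append(row)

# reshape keeps each array 2-D even when a group is empty
list_size_1 = numpy.array(list_size_1).reshape(-1, 1)
list_size_2 = numpy.array(list_size_2).reshape(-1, 2)

# one standard deviation per row, computed at C level
std_1 = numpy.std(list_size_1, axis=1)
std_2 = numpy.std(list_size_2, axis=1)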
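The example above handles only rows of length 1 and 2. A possible generalization to any row length is sketched below; it assumes the dictionary `d` from the question (with one extra illustrative row under key 4), and the names `keys_by_len`, `rows_by_len` and `deviations` are hypothetical helpers rather than anything from the original answer:

```python
from collections import defaultdict

import numpy as np

# Sample data: the question's dictionary plus one extra row (key 4, assumed)
# so that at least one length group holds more than one row.
d = {1: [10, 20, 30], 2: [50, 60], 3: [100, 200, 300, 400, 500], 4: [1, 2, 3]}

# Bucket keys and rows by row length so each bucket forms a rectangular array.
keys_by_len = defaultdict(list)
rows_by_len = defaultdict(list)
for key, row in d.items():
    keys_by_len[len(row)].append(key)
    rows_by_len[len(row)].append(row)

# One vectorized np.std call per distinct length, mapped back to the keys.
deviations = {}
for length, rows in rows_by_len.items():
    stds = np.std(np.array(rows), axis=1)
    deviations.update(zip(keys_by_len[length], stds.tolist()))
```

With rows of roughly 20 items, the number of distinct lengths stays small, so almost all of the work happens inside a handful of `np.std` calls instead of one call per key.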

141

answered Sep 19 '22 14:09

10 revs
