The actual problem I have is that I want to store a long sorted list of (float, str) tuples in RAM. A plain list doesn't fit in my 4 GB of RAM, so I thought I could use two numpy.ndarrays.
The source of the data is an iterable of 2-tuples. NumPy has a fromiter function, but how can I use it? The number of items in the iterable is unknown, and I can't consume it into a list first due to memory limitations. I thought of itertools.tee, but it seems to add a lot of memory overhead here.
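Here is roughly what I mean by the tee approach (just a sketch; pairs stands in for my actual iterable). Because the first clone is exhausted before the second one starts, tee has to buffer the entire stream:

import itertools
import numpy as np

pairs = ((float(i), str(i)) for i in range(10))  # stand-in for the real iterable

it1, it2 = itertools.tee(pairs)
floats = np.fromiter((f for f, s in it1), dtype='<f8')    # exhausts it1 first...
strings = np.fromiter((s for f, s in it2), dtype='|S20')  # ...so tee buffered every item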
What I guess I could do is consume the iterator in chunks and append those chunks to the arrays. Then my question is: how do I do that efficiently? Should I maybe make two 2D arrays and add rows to them? (Then later I'd need to convert them to 1D.)
Or maybe there's a better approach? All I really need is to search through an array of strings by the value of the corresponding number in logarithmic time (that's why I want to sort by the float value) and to keep it as compact as possible.
P.S. The iterable is not sorted.
Here is a way to build N separate arrays out of a generator of N-tuples:
import numpy as np
import itertools as IT

def gendata():
    # You, of course, have a different gendata...
    N = 100
    for i in range(N):
        yield (np.random.random(), str(i))

def fromiter(iterable, dtype, chunksize=7):
    # Build one 1D array per field of `dtype` from an iterable of tuples.
    chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
    result = [chunk[name].copy() for name in chunk.dtype.names]
    size = len(chunk)
    while True:
        chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
        N = len(chunk)
        if N == 0:
            break
        newsize = size + N
        for arr, name in zip(result, chunk.dtype.names):
            col = chunk[name]
            arr.resize(newsize, refcheck=False)  # grow the array in place
            arr[size:] = col
        size = newsize
    return result

x, y = fromiter(gendata(), '<f8,|S20')
order = np.argsort(x)
x = x[order]
y = y[order]

# Some pseudo-random value in x
N = 10
val = x[N]
print(x[N], y[N])
# 0.049875262239617246 b'46'
idx = x.searchsorted(val)
print(x[idx], y[idx])
# 0.049875262239617246 b'46'
The fromiter function above reads the iterable in chunks (of size chunksize). It calls the NumPy array method resize to extend the resultant arrays as necessary.
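For illustration, here is how ndarray.resize behaves on its own (a minimal sketch; the toy array a is just for demonstration):

import numpy as np

a = np.arange(4)             # array([0, 1, 2, 3])
a.resize(6, refcheck=False)  # grows the buffer in place when possible
print(a)                     # [0 1 2 3 0 0] -- new slots are zero-filled

Note that growing by exactly one chunk at a time means each resize may have to copy the whole buffer; if you end up with very many chunks, a geometric growth policy (e.g. doubling the capacity and trimming at the end) would amortize that cost.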
I used a small default chunksize since I was testing this code on small data. You, of course, will want to either change the default chunksize or pass a chunksize parameter with a larger value.
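For example (the 65536 here is an arbitrary illustration, not a tuned number):

x, y = fromiter(gendata(), '<f8,|S20', chunksize=65536)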