Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

selecting records in a python structure based on a related structure

In my real problem, I'll have two tables of information (x,y). x will have ~2.6 million records and y will have ~10K; the two tables have a many to one (x->y) relationship. I want to subset x based on y.

The posts that I thought matched best were this and that and also this. I settled on numpy arrays. I'm open to using other data structures; I was just trying to pick something that would scale. Am I using an appropriate approach? Are there other posts that cover this? I didn't want to have to use a database since I'm only doing this once.

The following code tries to illustrate what I'm trying to do.

import numpy, copy
x=numpy.array([(1,'a'), (1, 'b'), (3,'a'), (3, 'b'), (3, 'c'), (4, 'd')], dtype=[('id', int),('category', str, 22)]  )
y=numpy.array([('a', 3.2, 0), ('b', -1, 0), ('c', 0, 0), ('d', 100, 0)], dtype=[('category', str, 20), ('value', float), ('output', int)] )
for id, category in x:
    if y[y['category']==category]['value'][0] > 3:
        y[y['category']==category]['output']=numpy.array(copy.deepcopy(id))
like image 934
Roland Avatar asked Jan 18 '26 05:01

Roland


2 Answers

You have to be careful when you are trying to index with a boolean array (y['category']==category) to modify the original array (y) because 'fancy indexing' returns a copy (not a view), so modifying the copy will not change your original array y. If you are just doing this on an ordinary array it works fine (this confused me in the past). But with a structured array like you're using, it won't be a view even when used as assignment, if you use the mask then index again with a fieldname. It sounds confusing, but it won't work as you've written it, notice that y is unchanged before and after:

for i, category in x:
    c = y['category']==category   #generate the mask once
    if y[c]['value'][0] > 3:
        print 'before:', y[c]['output']
        y[c]['output'] = i
        print 'after:', y[c]['output']

#output:
#before: [0]
#after: [0]
#before: [0]
#after: [0]
#before: [0]
#after: [0]

If you get a view using field access then get the fancy indexing on that view, you will get a setitem call that works:

for i, category in x:
    c = y['category']==category   #generate the mask once
    if y[c]['value'][0] > 3:
        print 'before:', y[c]['output']
        y['output'][c] = i
        print 'after:', y[c]['output']

#output:
#before: [0]
#after: [1]
#before: [1]
#after: [3]
#before: [0]
#after: [4]

As you see, I removed your copy too. i (or id, which I did not use since id is a function) is simply an integer, which does not need to be copied. If you do need to copy something, you might be better off using the numpy copy instead of the standard library copy, as in

y[...]['output'] = np.array(id, copy=True)

or

y[...]['output'] = np.copy(id)

In fact, copy=True should be default, so ... = np.array(id) is probably sufficient, but I am not an authority on copying.

like image 97
askewchan Avatar answered Jan 19 '26 19:01

askewchan


You have 2.6 million records which each (potentially) overwrite one of 10K records. So there could be a lot of overwriting going on. Each time you write to the same location, all the previous work done at that location was for naught.

So you could make things more efficient by looping through y (10K unique? categories) instead of looping through x (2.6M records).

import numpy as np
x = np.array([(1,'a'), (1, 'b'), (3,'a'), (3, 'b'), (3, 'c'), (4, 'd')], dtype=[('id', int),('category', str, 22)]  )
y = np.array([('a', 3.2, 0), ('b', -1, 0), ('c', 0, 0), ('d', 100, 0)], dtype=[('category', str, 20), ('value', float), ('output', int)] )

for idx in np.where(y['value'] > 3)[0]:
    row = y[idx]
    category = row['category']
    # Only the last record in `x` of the right category affects `y`.
    # So find the id value for that last record in `x`
    idval = x[x['category'] == category]['id'][-1]
    y[idx]['output'] = idval

print(y)

yields

[('a', 3.2, 3) ('b', -1.0, 0) ('c', 0.0, 0) ('d', 100.0, 4)]
like image 43
unutbu Avatar answered Jan 19 '26 19:01

unutbu



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!