In my real problem, I'll have two tables of information (x,y). x will have ~2.6 million records and y will have ~10K; the two tables have a many to one (x->y) relationship. I want to subset x based on y.
The posts that I thought matched best were this and that and also this. I settled on numpy arrays. I'm open to using other data structures; I was just trying to pick something that would scale. Am I using an appropriate approach? Are there other posts that cover this? I didn't want to have to use a database since I'm only doing this once.
The following code tries to illustrate what I'm trying to do.
import numpy, copy
x=numpy.array([(1,'a'), (1, 'b'), (3,'a'), (3, 'b'), (3, 'c'), (4, 'd')], dtype=[('id', int),('category', str, 22)] )
y=numpy.array([('a', 3.2, 0), ('b', -1, 0), ('c', 0, 0), ('d', 100, 0)], dtype=[('category', str, 20), ('value', float), ('output', int)] )
for id, category in x:
if y[y['category']==category]['value'][0] > 3:
y[y['category']==category]['output']=numpy.array(copy.deepcopy(id))
You have to be careful when you are trying to index with a boolean array (y['category']==category) to modify the original array (y) because 'fancy indexing' returns a copy (not a view), so modifying the copy will not change your original array y. If you are just doing this on an ordinary array it works fine (this confused me in the past). But with a structured array like you're using, it won't be a view even when used as assignment, if you use the mask then index again with a fieldname. It sounds confusing, but it won't work as you've written it, notice that y is unchanged before and after:
for i, category in x:
c = y['category']==category #generate the mask once
if y[c]['value'][0] > 3:
print 'before:', y[c]['output']
y[c]['output'] = i
print 'after:', y[c]['output']
#output:
#before: [0]
#after: [0]
#before: [0]
#after: [0]
#before: [0]
#after: [0]
If you get a view using field access then get the fancy indexing on that view, you will get a setitem call that works:
for i, category in x:
c = y['category']==category #generate the mask once
if y[c]['value'][0] > 3:
print 'before:', y[c]['output']
y['output'][c] = i
print 'after:', y[c]['output']
#output:
#before: [0]
#after: [1]
#before: [1]
#after: [3]
#before: [0]
#after: [4]
As you see, I removed your copy too. i (or id, which I did not use since id is a function) is simply an integer, which does not need to be copied. If you do need to copy something, you might be better off using the numpy copy instead of the standard library copy, as in
y[...]['output'] = np.array(id, copy=True)
or
y[...]['output'] = np.copy(id)
In fact, copy=True should be default, so ... = np.array(id) is probably sufficient, but I am not an authority on copying.
You have 2.6 million records which each (potentially) overwrite one of 10K records. So there could be a lot of overwriting going on. Each time you write to the same location, all the previous work done at that location was for naught.
So you could make things more efficient by looping through y (10K unique? categories) instead of looping through x (2.6M records).
import numpy as np
x = np.array([(1,'a'), (1, 'b'), (3,'a'), (3, 'b'), (3, 'c'), (4, 'd')], dtype=[('id', int),('category', str, 22)] )
y = np.array([('a', 3.2, 0), ('b', -1, 0), ('c', 0, 0), ('d', 100, 0)], dtype=[('category', str, 20), ('value', float), ('output', int)] )
for idx in np.where(y['value'] > 3)[0]:
row = y[idx]
category = row['category']
# Only the last record in `x` of the right category affects `y`.
# So find the id value for that last record in `x`
idval = x[x['category'] == category]['id'][-1]
y[idx]['output'] = idval
print(y)
yields
[('a', 3.2, 3) ('b', -1.0, 0) ('c', 0.0, 0) ('d', 100.0, 4)]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With