I have a numpy array of tuples:
trainY = np.array([('php', 'image-processing', 'file-upload', 'upload', 'mime-types'),
('firefox',), ('r', 'matlab', 'machine-learning'),
('c#', 'url', 'encoding'), ('php', 'api', 'file-get-contents'),
('proxy', 'active-directory', 'jmeter'), ('core-plot',),
('c#', 'asp.net', 'windows-phone-7'),
('.net', 'javascript', 'code-generation'),
('sql', 'variables', 'parameters', 'procedure', 'calls')], dtype=object)
I am given a list of indices that selects a subset of this array:
x = [0, 4]
and a string:
label = 'php'
I want to count the number of times the label 'php' occurs in this subset of the array. In this case, the answer would be 2.
Notes:
1) A label appears at most once in a tuple.
2) A tuple has a length from 1 to 5.
3) The length of the list x is typically 7-50.
4) The length of trainY is approximately 0.8 million.
My current code to do this is:
sum([1 for n in x if label in trainY[n]])
This is currently a performance bottleneck in my program, and I'm looking for a way to make it much faster. I think we could skip the loop over x and do a vectorised lookup on trainY, such as trainY[x], but I couldn't get anything to work.
Thank you.
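I think a Counter may be a good option in this case. Note that the snippet below counts labels across all of trainY; to count only within the subset, iterate over trainY[x] instead.
from collections import Counter
# Count every label across all tuples in trainY.
c = Counter(i for j in trainY for i in j)
print(c['php'])  # Prints 2 for the sample array
print(c.most_common(5))  # Print the 5 most common items.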
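If many (x, label) queries are run against the same trainY, another option is to build an inverted index once and then answer each query with set membership tests. The following is only a minimal sketch under that assumption; the names rows_with and count_label are made up for illustration and do not come from the question.

```python
from collections import defaultdict

import numpy as np

trainY = np.array([('php', 'image-processing', 'file-upload', 'upload', 'mime-types'),
                   ('firefox',), ('r', 'matlab', 'machine-learning'),
                   ('c#', 'url', 'encoding'), ('php', 'api', 'file-get-contents'),
                   ('proxy', 'active-directory', 'jmeter'), ('core-plot',),
                   ('c#', 'asp.net', 'windows-phone-7'),
                   ('.net', 'javascript', 'code-generation'),
                   ('sql', 'variables', 'parameters', 'procedure', 'calls')],
                  dtype=object)

# One-time pass over trainY: map each label to the set of row indices
# whose tuple contains it (hypothetical helper structure).
rows_with = defaultdict(set)
for i, tags in enumerate(trainY):
    for tag in tags:
        rows_with[tag].add(i)


def count_label(label, x):
    """Count how many indices in x point at a tuple containing label."""
    rows = rows_with.get(label, ())  # unknown labels match no rows
    return sum(1 for n in x if n in rows)


print(count_label('php', [0, 4]))  # prints 2
```

Each query then costs one set lookup per index in x and never scans a tuple. Whether this beats the original loop depends on how many queries reuse the index, so it is worth timing on the real data.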