
Equivalent of Matlab 'ismember' in numpy (Python)? [duplicate]

I am struggling to find a NumPy equivalent for a particular Matlab coding "pattern" using ismember.

Unfortunately, this code tends to be where most of the time is spent in my Matlab scripts, so I want to find an efficient NumPy equivalent.

The basic pattern consists of mapping a subset onto a larger grid. I have a set of key-value pairs stored as parallel arrays, and I want to insert these values into a larger list of key-value pairs stored in the same way.

For concreteness say I have quarterly GDP data that I map onto a monthly time grid as follows.

quarters = [200712 200803 200806 200809 200812 200903];
gdp_q = [10.1 10.5 11.1 11.8 10.9 10.3];
months = 200801 : 200812;
gdp_m = NaN(size(months));
[tf, loc] = ismember(quarters, months);
gdp_m(loc(tf)) = gdp_q(tf);

Note that not all the quarters appear in the list of months so both the tf and the loc variables are required.

I have seen similar questions on StackOverflow, but they either give a pure Python solution (here) or use numpy without returning the loc argument (here).

In my application area this code pattern arises over and over again and uses up most of the CPU time of my functions, so an efficient solution is really crucial for me.

Comments or redesign suggestions are also welcome.

snth asked Nov 26 '10 16:11


2 Answers

If months is sorted, use np.searchsorted. Otherwise, sort and then use np.searchsorted:

import numpy as np
quarters = np.array([200712, 200803, 200806, 200809, 200812, 200903])
months = np.arange(200801, 200813)
loc = np.searchsorted(months, quarters)

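np.searchsorted returns the insertion position, not a membership test, so a value that is absent from months still gets an index. If some of your data may fall outside the range of months, check for that afterwards. The range check below is enough here because months has no gaps; for a grid with gaps you would also need to confirm that months[loc] equals the corresponding quarter: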

valid = (quarters <= months.max()) & (quarters >= months.min())
loc = loc[valid]
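
Putting the steps above together, here is a minimal sketch of an ismember-style helper that returns both tf and loc. The function name ismember and its internals are illustrative assumptions, not a NumPy API; it assumes b is non-empty and returns 0-based indices, and loc is only meaningful where tf is True.

```python
import numpy as np

def ismember(a, b):
    """Hypothetical helper: return (tf, loc) with b[loc[tf]] == a[tf]."""
    a = np.asarray(a)
    b = np.asarray(b)                      # assumed non-empty
    order = np.argsort(b, kind="stable")   # lets b be unsorted
    b_sorted = b[order]
    pos = np.searchsorted(b_sorted, a)     # insertion positions in the sorted copy
    pos = np.clip(pos, 0, len(b) - 1)      # keep every position indexable
    tf = b_sorted[pos] == a                # True only for exact matches
    loc = order[pos]                       # positions in the original b
    return tf, loc

quarters = np.array([200712, 200803, 200806, 200809, 200812, 200903])
gdp_q = np.array([10.1, 10.5, 11.1, 11.8, 10.9, 10.3])
months = np.arange(200801, 200813)

tf, loc = ismember(quarters, months)
gdp_m = np.full(months.shape, np.nan)      # NaN where no quarterly value exists
gdp_m[loc[tf]] = gdp_q[tf]
```

Comparing b_sorted[pos] with a replaces the range check, so the same helper also covers grids with gaps.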

This is an O(N log N) solution. If this is still a big deal in your programme in terms of run time, you might just do this one subroutine in C(++) using a hashing scheme, which would be O(N) (as well as avoiding some constant factors, of course).
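
To illustrate the hashing idea without leaving Python, here is a rough sketch that uses a dict as the hash table. The name ismember_dict is hypothetical, and the sentinel -1 for missing keys is a choice made for this example. The lookup is O(N) on average, but the Python-level loops carry a large constant factor, so whether it beats np.searchsorted for a given input size is something to measure rather than assume.

```python
import numpy as np

def ismember_dict(a, b):
    """Hypothetical helper: hash-based (tf, loc); loc is -1 where tf is False."""
    first_index = {}
    for i, key in enumerate(np.asarray(b).tolist()):
        first_index.setdefault(key, i)     # keep the lowest index for duplicate keys
    loc = np.array(
        [first_index.get(key, -1) for key in np.asarray(a).tolist()],
        dtype=np.intp,
    )
    tf = loc >= 0
    return tf, loc

quarters = np.array([200712, 200803, 200806, 200809, 200812, 200903])
months = np.arange(200801, 200813)
tf, loc = ismember_dict(quarters, months)
```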

luispedro answered Oct 17 '22 13:10


I think you can redesign the original MATLAB code sample you give so that it doesn't use the ISMEMBER function. This may speed up the MATLAB code and make it easier to reimplement in Python if you still want to:

quarters = [200712 200803 200806 200809 200812 200903];
gdp_q = [10.1 10.5 11.1 11.8 10.9 10.3];

monthStart = 200801;              %# Starting month value
monthEnd = 200812;                %# Ending month value
nMonths = monthEnd-monthStart+1;  %# Number of months
gdp_m = NaN(1,nMonths);           %# Initialize gdp_m

quarters = quarters-monthStart+1;  %# Shift quarter values so they can be
                                   %#   used as indices into gdp_m
index = (quarters >= 1) & (quarters <= nMonths);  %# Logical index of quarters
                                                  %#   within month range
gdp_m(quarters(index)) = gdp_q(index);  %# Move values from gdp_q to gdp_m
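
A possible NumPy translation of the same offset-indexing idea is sketched below. One assumption is added that is not in the MATLAB code: the YYYYMM codes are first converted to a running month count with a hypothetical helper month_number, because raw YYYYMM values are only contiguous within a single calendar year (200812 is followed by 200901, a jump of 89).

```python
import numpy as np

def month_number(yyyymm):
    """Hypothetical helper: convert YYYYMM codes to a running month count."""
    yyyymm = np.asarray(yyyymm)
    return (yyyymm // 100) * 12 + (yyyymm % 100) - 1

quarters = np.array([200712, 200803, 200806, 200809, 200812, 200903])
gdp_q = np.array([10.1, 10.5, 11.1, 11.8, 10.9, 10.3])

month_start = 200801                       # first month of the grid
month_end = 200812                         # last month of the grid
n_months = int(month_number(month_end) - month_number(month_start)) + 1
gdp_m = np.full(n_months, np.nan)          # NaN where no quarterly value exists

# 0-based offset of each quarter into the monthly grid
idx = month_number(quarters) - month_number(month_start)
in_range = (idx >= 0) & (idx < n_months)   # quarters that fall inside the grid
gdp_m[idx[in_range]] = gdp_q[in_range]
```

This avoids any search or sort, at the cost of requiring a regularly spaced grid.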

gnovice answered Oct 17 '22 14:10