Using Series.map with a Series argument, I can take the elements of a Series and use them as indices into another Series. I want to do the same thing with some columns of a DataFrame, using each row as a set of index levels into a MultiIndex-ed Series. Here is an example:
>>> d = pandas.DataFrame([["A", 1], ["B", 2], ["C", 3]], columns=["X", "Y"])
>>> d
X Y
0 A 1
1 B 2
2 C 3
[3 rows x 2 columns]
>>> s = pandas.Series(np.arange(9), index=pandas.MultiIndex.from_product([["A", "B", "C"], [1, 2, 3]]))
>>> s
A 1 0
2 1
3 2
B 1 3
2 4
3 5
C 1 6
2 7
3 8
dtype: int32
What I would like is to be able to do d.map(s), so that each row of d is taken as a tuple to index into the MultiIndex of s. That is, I want the same result as this:
>>> s.loc[[("A", 1), ("B", 2), ("C", 3)]]
A 1 0
B 2 4
C 3 8
dtype: int32
However, DataFrame, unlike Series, has no map method. The other obvious alternative, s.ix[d], gives me the error "Cannot index with multidimensional key", so this is apparently not supported either.
I know I can do it by converting the DataFrame to a list of lists, or by using a row-wise apply to grab each item one by one, but isn't there any way to do it without that amount of overhead? How can I do the equivalent of Series.map on multiple columns at once?
You could create a MultiIndex from the DataFrame's values and then index with loc (or ix) using that:
In [11]: mi = pd.MultiIndex.from_arrays(d.values.T)
In [12]: s.loc[mi] # can use ix too
Out[12]:
A 1 0
B 2 4
C 3 8
dtype: int64
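On newer pandas (0.24+), the same index can also be built straight from the DataFrame with MultiIndex.from_frame, which additionally carries the column names over as level names. A minimal sketch of the same lookup:

```python
import numpy as np
import pandas as pd

d = pd.DataFrame([["A", 1], ["B", 2], ["C", 3]], columns=["X", "Y"])
s = pd.Series(np.arange(9),
              index=pd.MultiIndex.from_product([["A", "B", "C"], [1, 2, 3]]))

# Build the lookup index directly from the DataFrame's columns;
# level names are taken from "X" and "Y".
mi = pd.MultiIndex.from_frame(d)
result = s.loc[mi]
print(result.tolist())  # [0, 4, 8]
```

This avoids going through d.values.T, which would also upcast mixed-dtype columns to object.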
This is pretty efficient:
In [21]: s = pandas.Series(np.arange(1000*1000), index=pandas.MultiIndex.from_product([range(1000), range(1000)]))
In [22]: d = pandas.DataFrame(list(zip(range(1000), range(1000))), columns=["X", "Y"])
In [23]: %timeit mi = pd.MultiIndex.from_arrays(d.values.T); s.loc[mi]
100 loops, best of 3: 2.77 ms per loop
In [24]: %timeit s.apply(lambda x: x + 1) # at least compared to apply
1 loops, best of 3: 3.14 s per loop
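One caveat: loc raises a KeyError if any row of d is missing from s's index. When missing keys should yield NaN instead, a merge against the reset Series is one alternative; the name "val" below is just a label chosen for this example, and "level_0"/"level_1" are the default column names reset_index assigns to unnamed index levels:

```python
import numpy as np
import pandas as pd

# ("D", 9) is deliberately not present in s's index
d = pd.DataFrame([["A", 1], ["B", 2], ["D", 9]], columns=["X", "Y"])
s = pd.Series(np.arange(9),
              index=pd.MultiIndex.from_product([["A", "B", "C"], [1, 2, 3]]))

# Flatten the Series: unnamed index levels become "level_0"/"level_1" columns
flat = s.rename("val").reset_index()

# Left-merge keeps every row of d; unmatched rows get NaN
out = d.merge(flat, left_on=["X", "Y"],
              right_on=["level_0", "level_1"], how="left")["val"]
print(out.tolist())  # [0.0, 4.0, nan]
```

Note that the NaN forces the result column to float, which is the usual pandas trade-off for missing values in an integer column.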