Let's say we have a DataFrame
with multiple levels of column headers.
level_0 A B C
level_1 P P P
level_2 x y x y x y
0 -1.027155 0.667489 0.314387 -0.428607 1.277167 -1.328771
1 0.223407 -1.713410 0.480903 -3.517518 -1.412756 0.718804
I want to select a list of columns from a named level.
required_columns = ['A', 'B']
required_level = 'level_0'
Method 1: (deprecated in favor of df.loc)
print df.select(lambda x: x[0] in required_columns, axis=1)
The problem with this is that I have to specify the level with 0. It fails if I use the name of the level.
Method 2:
print df.xs('A', level=required_level, axis=1)
The problem with this is that I can only specify a single value. It fails if I use ['A', 'B'].
Method 3:
print df.ix[:, df.columns.get_level_values(required_level).isin(required_columns)]
This works, but isn't as concise as the previous two methods! :)
Question:
How can I get method 1 or 2 to work? Or, is there a more pythonic way?
The MWE:
import pandas as pd
import numpy as np
header = pd.MultiIndex.from_product([['A', 'B', 'C'],
['P'],
['x', 'y']],
names=['level_0',
'level_1',
'level_2'])
df = pd.DataFrame(
np.random.randn(2, 6),
columns=header
)
required_columns = ['A', 'B']
required_level = 'level_0'
print df
print df.select(lambda x: x[0] in required_columns, axis=1)
print df.xs('A', level=required_level, axis=1)
print df.ix[:, df.columns.get_level_values(required_level).isin(required_columns)]
Related questions:
How to subset the data frame (DataFrame) by column value and name in R? By using R base df[] notation, or subset() you can easily subset the R Data Frame (data. frame) by column value or by column name.
If you have a DataFrame and would like to access or select a specific few rows/columns from that DataFrame, you can use square brackets or other advanced methods such as loc and iloc .
You can use reindex
:
df.reindex(columns=required_columns, level=required_level)
The resulting output:
level_0 A B
level_1 P P
level_2 x y x y
0 -1.265558 0.681565 -0.553084 -1.340652
1 1.705043 -0.512333 -0.785326 0.968391
Have you considered using IndexSlice
? It generally requires the columns to first be sorted (in the original dataframe, they were already sorted).
df.sort_index(axis=1, inplace=True)
>>> df.loc[:, pd.IndexSlice[required_columns, :, :]]
# Output:
# level_0 A B
# level_1 P P
# level_2 x y x y
# 0 0.079368 -1.083421 0.129979 -0.558004
# 1 -0.157843 -1.176632 -0.219833 0.935364
Update
The method you choose really depends why you are selecting your data in the first place and whether or not you need to modify your original data via your selection.
First, to make the example a little more challenging, let's use a MultiIndex dataframe that has the same values across different levels and that is unsorted.
required_columns = ['A', 'B'] # Per original question.
required_level = 'level_0' # Per original question.
np.random.seed(0)
idx = pd.MultiIndex.from_product([list('BAC'), list('AB')], names=['level_0', 'level_1'])
df = pd.DataFrame(np.random.randn(2, len(idx)), columns=idx)
>>> df
# Output:
# level_0 B A C
# level_1 A B A B A B
# 0 1.764052 0.400157 0.978738 2.240893 1.867558 -0.977278
# 1 0.950088 -0.151357 -0.103219 0.410599 0.144044 1.454274
Return a copy of the data
If you only need to view the data, either directly or for subsequent calculations in a pipeline, then the reindex
method mentioned by @root and discussed here in the documentation is a good option.
df2 = df.reindex(columns=required_columns, level=required_level)
>>> df2
# Output:
# level_0 A B
# level_1 A B A B
# 0 0.978738 2.240893 1.764052 0.400157
# 1 -0.103219 0.410599 0.950088 -0.151357
However, if you try to modify this dataframe, the changes won't be reflected in your original.
df2.iloc[0, 0] = np.nan
>>> df # Check values in original dataframe. None are `NaN`.
# Output:
# level_0 B A C
# level_1 A B A B A B
# 0 1.764052 0.400157 0.978738 2.240893 1.867558 -0.977278
# 1 0.950088 -0.151357 -0.103219 0.410599 0.144044 1.454274
Modify the data
An alternative method is to use boolean indexing with loc
. You can use a conditional list comprehension to select the desired columns together with get_level_values
:
cols = [col in required_columns for col in df.columns.get_level_values(required_level)]
>>> df.loc[:, cols]
# Output:
# level_0 B A
# level_1 A B A B
# 0 1.764052 0.400157 0.978738 2.240893
# 1 0.950088 -0.151357 -0.103219 0.410599
If you are slicing the index instead of the columns, then one would obviously need to change df.columns.get_level_values
to df.index.get_level_values
in the code snippet above.
You can also modify the original data using loc
:
df2 = df.copy()
df2.loc[:, cols] = 1
>>> df2
# Output:
# level_0 B A C
# level_1 A B A B A B
# 0 1 1 1 1 1.867558 -0.977278
# 1 1 1 1 1 0.144044 1.454274
Conclusion
Although select
is a good option for returning a view of your multi-indexed data, boolean indexing using loc
allows you to view or modify your data.
Instead of Method 1 or Method 2, I would use the loc
approach described above.
As of pandas 0.20.0, the ix
method has been deprecated. I would not recommend Method 3.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With