How to select a subset of values from a named column level in a DataFrame?

Let's say we have a DataFrame with multiple levels of column headers.

level_0         A                   B                   C          
level_1         P                   P                   P          
level_2         x         y         x         y         x         y
0       -1.027155  0.667489  0.314387 -0.428607  1.277167 -1.328771
1        0.223407 -1.713410  0.480903 -3.517518 -1.412756  0.718804

I want to select a list of columns from a named level.

required_columns = ['A', 'B']
required_level = 'level_0'

Method 1: (deprecated in favor of df.loc)

print(df.select(lambda x: x[0] in required_columns, axis=1))

The problem with this is that I have to specify the level with 0. It fails if I use the name of the level.

Method 2:

print(df.xs('A', level=required_level, axis=1))

The problem with this is that I can only specify a single value. It fails if I use ['A', 'B'].

Method 3:

print(df.ix[:, df.columns.get_level_values(required_level).isin(required_columns)])

This works, but isn't as concise as the previous two methods! :)

Question:

How can I get method 1 or 2 to work? Or, is there a more pythonic way?

The MWE:

import pandas as pd
import numpy as np

header = pd.MultiIndex.from_product([['A', 'B', 'C'],
                                     ['P'],
                                     ['x', 'y']],
                                    names=['level_0',
                                           'level_1',
                                           'level_2'])
df = pd.DataFrame(
    np.random.randn(2, 6),
    columns=header
)

required_columns = ['A', 'B']
required_level = 'level_0'

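print(df)
# Method 1 (DataFrame.select was removed in pandas 1.0)
print(df.select(lambda x: x[0] in required_columns, axis=1))
# Method 2
print(df.xs('A', level=required_level, axis=1))
# Method 3 (the ix indexer was removed in pandas 1.0)
print(df.ix[:, df.columns.get_level_values(required_level).isin(required_columns)])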

Related questions:

pandas dataframe select columns in multiindex
Giving a column multiple indexes/headers

You can use reindex:

df.reindex(columns=required_columns, level=required_level)

The resulting output:

level_0         A                   B          
level_1         P                   P          
level_2         x         y         x         y
0       -1.265558  0.681565 -0.553084 -1.340652
1        1.705043 -0.512333 -0.785326  0.968391
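As a self-contained sketch of the call above: the frame is rebuilt from the question's MWE, and the fixed seed is an assumption added here only so the run is reproducible. It selects the level_0 labels by name and prints what survived.

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame; the seed is an addition for reproducibility.
np.random.seed(0)
header = pd.MultiIndex.from_product(
    [['A', 'B', 'C'], ['P'], ['x', 'y']],
    names=['level_0', 'level_1', 'level_2'],
)
df = pd.DataFrame(np.random.randn(2, 6), columns=header)

required_columns = ['A', 'B']
required_level = 'level_0'

# Keep only the columns whose level_0 label appears in required_columns.
subset = df.reindex(columns=required_columns, level=required_level)

print(subset.columns.get_level_values(required_level).unique().tolist())
print(subset.shape)
```

The level is addressed by name here, which is what Method 1 could not do, and a list of labels is accepted, which is what Method 2 could not do.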

Have you considered using IndexSlice? It generally requires the columns to first be sorted (in the original dataframe, they were already sorted).

df.sort_index(axis=1, inplace=True)
>>> df.loc[:, pd.IndexSlice[required_columns, :, :]]
# Output:
# level_0         A                   B          
# level_1         P                   P          
# level_2         x         y         x         y
# 0        0.079368 -1.083421  0.129979 -0.558004
# 1       -0.157843 -1.176632 -0.219833  0.935364
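IndexSlice addresses levels by position, not by name. One possible workaround, sketched below, is to look up the position of the named level and build the selector tuple by hand. The helper name slicer_for_level is hypothetical (it is not part of pandas), and the seed is an assumption added for reproducibility.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
header = pd.MultiIndex.from_product(
    [['A', 'B', 'C'], ['P'], ['x', 'y']],
    names=['level_0', 'level_1', 'level_2'],
)
df = pd.DataFrame(np.random.randn(2, 6), columns=header).sort_index(axis=1)


def slicer_for_level(columns, level_name, values):
    """Hypothetical helper: build a per-level selector from a level name."""
    position = list(columns.names).index(level_name)
    # Start with "take everything" on every level, then narrow the named one.
    parts = [slice(None)] * columns.nlevels
    parts[position] = list(values)
    return tuple(parts)


key = slicer_for_level(df.columns, 'level_0', ['A', 'B'])
subset = df.loc[:, key]
print(subset.columns.tolist())
```

The tuple built here is the same object that pd.IndexSlice[['A', 'B'], :, :] evaluates to, so the selection behaves like the snippet above.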

Update

The method you choose really depends on why you are selecting your data in the first place and whether or not you need to modify your original data via your selection.

First, to make the example a little more challenging, let's use a MultiIndex dataframe that has the same values across different levels and that is unsorted.

required_columns = ['A', 'B']  # Per original question.
required_level = 'level_0'  # Per original question.

np.random.seed(0)
idx = pd.MultiIndex.from_product([list('BAC'), list('AB')], names=['level_0', 'level_1'])
df = pd.DataFrame(np.random.randn(2, len(idx)), columns=idx)
>>> df
# Output:
# level_0         B                   A                   C          
# level_1         A         B         A         B         A         B
# 0        1.764052  0.400157  0.978738  2.240893  1.867558 -0.977278
# 1        0.950088 -0.151357 -0.103219  0.410599  0.144044  1.454274

Return a copy of the data

If you only need to view the data, either directly or for subsequent calculations in a pipeline, then the reindex method mentioned by @root and discussed in the pandas documentation is a good option.

df2 = df.reindex(columns=required_columns, level=required_level)
>>> df2
# Output:
# level_0         A                   B          
# level_1         A         B         A         B
# 0        0.978738  2.240893  1.764052  0.400157
# 1       -0.103219  0.410599  0.950088 -0.151357

However, if you try to modify this dataframe, the changes won't be reflected in your original.

df2.iloc[0, 0] = np.nan
>>> df  # Check values in original dataframe.  None are `NaN`.
# Output:
# level_0         B                   A                   C          
# level_1         A         B         A         B         A         B
# 0        1.764052  0.400157  0.978738  2.240893  1.867558 -0.977278
# 1        0.950088 -0.151357 -0.103219  0.410599  0.144044  1.454274

Modify the data

An alternative is boolean indexing with loc. You can build a boolean mask with a list comprehension over get_level_values and pass it to loc to select the desired columns:

cols = [col in required_columns for col in df.columns.get_level_values(required_level)]
>>> df.loc[:, cols]
# Output:
# level_0         B                   A          
# level_1         A         B         A         B
# 0        1.764052  0.400157  0.978738  2.240893
# 1        0.950088 -0.151357 -0.103219  0.410599

If you are slicing the index instead of the columns, change df.columns.get_level_values to df.index.get_level_values in the snippet above and pass the mask in the row position of loc.
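A minimal sketch of the row-wise variant, assuming the same labels are moved onto the index. The frame df_rows and the column names 'u' and 'v' are hypothetical, and Index.isin is used in place of the list comprehension to build the mask.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
idx = pd.MultiIndex.from_product(
    [list('BAC'), list('AB')], names=['level_0', 'level_1']
)
# Same labels as the answer's frame, but placed on the rows instead.
df_rows = pd.DataFrame(np.random.randn(len(idx), 2), index=idx, columns=['u', 'v'])

required_rows = ['A', 'B']
required_level = 'level_0'

# Index.isin returns a boolean array, one entry per row.
row_mask = df_rows.index.get_level_values(required_level).isin(required_rows)

# The mask goes in the row position of loc; the columns are taken whole.
subset = df_rows.loc[row_mask, :]
print(subset)
```

As with the column version, the rows keep their original order (B before A) because a boolean mask only filters and does not reorder.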

You can also assign through loc to modify the selected columns (a copy is used here so that the original df is left untouched):

df2 = df.copy()
df2.loc[:, cols] = 1
>>> df2
# Output:
# level_0  B     A            C          
# level_1  A  B  A  B         A         B
# 0        1  1  1  1  1.867558 -0.977278
# 1        1  1  1  1  0.144044  1.454274
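To check that assignment through loc really writes into the frame it is called on, the sketch below assigns directly to df and verifies both the selected and the unselected columns. The value 1.0 (a float, chosen so the column dtype does not need to change) and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
idx = pd.MultiIndex.from_product(
    [list('BAC'), list('AB')], names=['level_0', 'level_1']
)
df = pd.DataFrame(np.random.randn(2, len(idx)), columns=idx)

mask = df.columns.get_level_values('level_0').isin(['A', 'B'])

# Snapshot of the columns that should stay as they are.
unselected_before = df.loc[:, ~mask].copy()

# Assignment through loc writes into df itself.
df.loc[:, mask] = 1.0

print((df.loc[:, mask] == 1.0).all().all())
print(df.loc[:, ~mask].equals(unselected_before))
```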

Conclusion

Although reindex is a good option for returning a copy of your multi-indexed data for viewing, boolean indexing using loc allows you to view or modify your data.

Instead of Method 1 or Method 2, I would use the loc approach described above.

As of pandas 0.20.0, the ix indexer has been deprecated (it was removed entirely in pandas 1.0), so I would not recommend Method 3 as written.
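For readers who still want something close to Methods 2 and 3, here is one way they might be carried over to current pandas. This is a sketch, not the only option: Method 3 swaps ix for loc, and Method 2 calls xs once per label with drop_level=False and concatenates the pieces. The seed and the names method_2 and method_3 are assumptions added here.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
header = pd.MultiIndex.from_product(
    [['A', 'B', 'C'], ['P'], ['x', 'y']],
    names=['level_0', 'level_1', 'level_2'],
)
df = pd.DataFrame(np.random.randn(2, 6), columns=header)

required_columns = ['A', 'B']
required_level = 'level_0'

# Method 3 with loc in place of the removed ix indexer.
mask = df.columns.get_level_values(required_level).isin(required_columns)
method_3 = df.loc[:, mask]

# Method 2 for several labels: one xs call per label, keeping the level
# in the result, then gluing the pieces together column-wise.
method_2 = pd.concat(
    [
        df.xs(label, level=required_level, axis=1, drop_level=False)
        for label in required_columns
    ],
    axis=1,
)

print(method_3)
print(method_2)
```

The loc version is shorter and needs a single pass over the columns, so the xs variant is mostly of interest when the per-label pieces are needed separately anyway.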

Tags: python, pandas, dataframe, multi-index

Asked by bluprince13. Answers by root and Alexander.