
How to select a subset of values from a named column level in a DataFrame?

Let's say we have a DataFrame with multiple levels of column headers.

level_0         A                   B                   C          
level_1         P                   P                   P          
level_2         x         y         x         y         x         y
0       -1.027155  0.667489  0.314387 -0.428607  1.277167 -1.328771
1        0.223407 -1.713410  0.480903 -3.517518 -1.412756  0.718804

I want to select a list of columns from a named level.

required_columns = ['A', 'B']
required_level = 'level_0'

Method 1: (deprecated in favor of df.loc)

print(df.select(lambda x: x[0] in required_columns, axis=1))

The problem with this is that I have to refer to the level by its position (x[0]). It fails if I use the name of the level.

Method 2:

print(df.xs('A', level=required_level, axis=1))

The problem with this is that I can only specify a single value. It fails if I use ['A', 'B'].

Method 3:

print(df.ix[:, df.columns.get_level_values(required_level).isin(required_columns)])

This works, but isn't as concise as the previous two methods! :)

Question:

How can I get method 1 or 2 to work? Or, is there a more pythonic way?

The MWE:

import pandas as pd
import numpy as np

header = pd.MultiIndex.from_product([['A', 'B', 'C'],
                                     ['P'],
                                     ['x', 'y']],
                                    names=['level_0',
                                           'level_1',
                                           'level_2'])
df = pd.DataFrame(
    np.random.randn(2, 6),
    columns=header
)

required_columns = ['A', 'B']
required_level = 'level_0'

print(df)
print(df.select(lambda x: x[0] in required_columns, axis=1))
print(df.xs('A', level=required_level, axis=1))
print(df.ix[:, df.columns.get_level_values(required_level).isin(required_columns)])

Related questions:

  1. pandas dataframe select columns in multiindex
  2. Giving a column multiple indexes/headers
asked Aug 06 '17 20:08 by bluprince13



2 Answers

You can use reindex:

df.reindex(columns=required_columns, level=required_level)

The resulting output:

level_0         A                   B          
level_1         P                   P          
level_2         x         y         x         y
0       -1.265558  0.681565 -0.553084 -1.340652
1        1.705043 -0.512333 -0.785326  0.968391 
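
As a self-contained check of the reindex approach, here is a minimal sketch on the question's column layout. The integer data (np.arange) and the variable names by_name and by_position are assumptions made for illustration so the result is reproducible; they are not part of the original answer. It also shows that the level can be given either by name or by position.

```python
import numpy as np
import pandas as pd

# Same three-level column header as in the question.
header = pd.MultiIndex.from_product(
    [["A", "B", "C"], ["P"], ["x", "y"]],
    names=["level_0", "level_1", "level_2"],
)
# Deterministic integer data instead of random numbers (illustrative only).
df = pd.DataFrame(np.arange(12).reshape(2, 6), columns=header)

required_columns = ["A", "B"]

# Select by level name ...
by_name = df.reindex(columns=required_columns, level="level_0")
# ... or by level position; both should pick the same columns.
by_position = df.reindex(columns=required_columns, level=0)

print(by_name)
```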
answered Sep 20 '22 23:09 by root


Have you considered using IndexSlice? It generally requires the columns to first be sorted (in the original dataframe, they were already sorted).

df.sort_index(axis=1, inplace=True)
>>> df.loc[:, pd.IndexSlice[required_columns, :, :]]
# Output:
# level_0         A                   B          
# level_1         P                   P          
# level_2         x         y         x         y
# 0        0.079368 -1.083421  0.129979 -0.558004
# 1       -0.157843 -1.176632 -0.219833  0.935364
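
The IndexSlice call above hard-codes the position of the level. One possible way to drive it from the level name instead is sketched below; slicer_for_level is a hypothetical helper written for this example, not a pandas function, and the integer data is an assumption chosen for reproducibility. It relies on the fact that pd.IndexSlice[...] simply produces a tuple with one entry per level.

```python
import numpy as np
import pandas as pd

# Unsorted first level ("B", "A", "C") to show why sorting comes first.
idx = pd.MultiIndex.from_product(
    [list("BAC"), ["P"], ["x", "y"]],
    names=["level_0", "level_1", "level_2"],
)
df = pd.DataFrame(np.arange(12).reshape(2, 6), columns=idx)


def slicer_for_level(columns, level_name, labels):
    """Hypothetical helper: build a per-level indexer tuple that selects
    `labels` on the named level and everything on the other levels."""
    return tuple(
        labels if name == level_name else slice(None)
        for name in columns.names
    )


# Sort the columns before slicing, as the answer recommends.
df_sorted = df.sort_index(axis=1)
key = slicer_for_level(df_sorted.columns, "level_0", ["A", "B"])
selected = df_sorted.loc[:, key]

print(selected)
```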

Update

The method you choose depends on why you are selecting the data in the first place and on whether you need to modify your original data through the selection.

First, to make the example a little more challenging, let's use a MultiIndex dataframe that has the same values across different levels and that is unsorted.

required_columns = ['A', 'B']  # Per original question.
required_level = 'level_0'  # Per original question.

np.random.seed(0)
idx = pd.MultiIndex.from_product([list('BAC'), list('AB')], names=['level_0', 'level_1'])
df = pd.DataFrame(np.random.randn(2, len(idx)), columns=idx)
>>> df
# Output:
# level_0         B                   A                   C          
# level_1         A         B         A         B         A         B
# 0        1.764052  0.400157  0.978738  2.240893  1.867558 -0.977278
# 1        0.950088 -0.151357 -0.103219  0.410599  0.144044  1.454274

Return a copy of the data

If you only need to view the data, either directly or for subsequent calculations in a pipeline, then the reindex method mentioned by @root and discussed in the pandas documentation is a good option.

df2 = df.reindex(columns=required_columns, level=required_level)
>>> df2
# Output:
# level_0         A                   B          
# level_1         A         B         A         B
# 0        0.978738  2.240893  1.764052  0.400157
# 1       -0.103219  0.410599  0.950088 -0.151357

However, if you try to modify this dataframe, the changes won't be reflected in your original.

df2.iloc[0, 0] = np.nan
>>> df  # Check values in original dataframe.  None are `NaN`.
# Output:
# level_0         B                   A                   C          
# level_1         A         B         A         B         A         B
# 0        1.764052  0.400157  0.978738  2.240893  1.867558 -0.977278
# 1        0.950088 -0.151357 -0.103219  0.410599  0.144044  1.454274

Modify the data

An alternative method is to use boolean indexing with loc. You can use a conditional list comprehension to select the desired columns together with get_level_values:

cols = [col in required_columns for col in df.columns.get_level_values(required_level)]
>>> df.loc[:, cols]
# Output:
# level_0         B                   A          
# level_1         A         B         A         B
# 0        1.764052  0.400157  0.978738  2.240893
# 1        0.950088 -0.151357 -0.103219  0.410599

If you are slicing the index instead of the columns, change df.columns.get_level_values to df.index.get_level_values in the code snippet above and put the mask in the row position of loc.
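
For the row-index case, a minimal sketch could look like the following. The frame df_rows, its column names "u" and "v", and the integer data are assumptions made up for this illustration; the MultiIndex reuses the unsorted levels from the example above.

```python
import numpy as np
import pandas as pd

required_rows = ["A", "B"]
required_level = "level_0"

# Put the MultiIndex on the rows instead of the columns.
idx = pd.MultiIndex.from_product(
    [list("BAC"), list("AB")], names=["level_0", "level_1"]
)
df_rows = pd.DataFrame(
    np.arange(12).reshape(6, 2), index=idx, columns=["u", "v"]
)

# Boolean mask over the rows, built from the named index level.
row_mask = df_rows.index.get_level_values(required_level).isin(required_rows)

# The mask goes in the row position of loc; the original row order is kept.
selected_rows = df_rows.loc[row_mask, :]

print(selected_rows)
```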

You can also assign through loc to modify the selected columns (done here on a copy, df2, so that the example data stays intact):

df2 = df.copy()
df2.loc[:, cols] = 1
>>> df2
# Output:
# level_0  B     A            C          
# level_1  A  B  A  B         A         B
# 0        1  1  1  1  1.867558 -0.977278
# 1        1  1  1  1  0.144044  1.454274

Conclusion

Although reindex is a good option for returning a copy of your multi-indexed data, boolean indexing using loc allows you to view or modify your data.

Instead of Method 1 or Method 2, I would use the loc approach described above.

As of pandas 0.20.0, the ix indexer has been deprecated (it was later removed in pandas 1.0), so I would not recommend Method 3.
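
For reference, Method 3 carries over to loc almost unchanged. The sketch below assumes the question's column layout with deterministic integer data (an illustrative choice) and uses Index.isin to build the mask, then assigns through the same mask to show that the original frame is modified.

```python
import numpy as np
import pandas as pd

header = pd.MultiIndex.from_product(
    [["A", "B", "C"], ["P"], ["x", "y"]],
    names=["level_0", "level_1", "level_2"],
)
df = pd.DataFrame(np.arange(12).reshape(2, 6), columns=header)

required_columns = ["A", "B"]
required_level = "level_0"

# Same boolean mask as in Method 3, applied with loc instead of ix.
mask = df.columns.get_level_values(required_level).isin(required_columns)
subset = df.loc[:, mask]
print(subset)

# Assigning through loc with the same mask changes df itself.
df.loc[:, mask] = 0
print(df)
```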

answered Sep 23 '22 23:09 by Alexander