Selecting across multiple columns with Python pandas?


I encourage you to pose these questions on the mailing list, but in any case, this is still very much a low-level affair that means working with the underlying NumPy arrays. For example, to select the rows where the value in any column exceeds, say, 1.5:

In [11]: df
Out[11]: 
            A        B        C        D      
2000-01-03 -0.59885 -0.18141 -0.68828 -0.77572
2000-01-04  0.83935  0.15993  0.95911 -1.12959
2000-01-05  2.80215 -0.10858 -1.62114 -0.20170
2000-01-06  0.71670 -0.26707  1.36029  1.74254
2000-01-07 -0.45749  0.22750  0.46291 -0.58431
2000-01-10 -0.78702  0.44006 -0.36881 -0.13884
2000-01-11  0.79577 -0.09198  0.14119  0.02668
2000-01-12 -0.32297  0.62332  1.93595  0.78024
2000-01-13  1.74683 -1.57738 -0.02134  0.11596
2000-01-14 -0.55613  0.92145 -0.22832  1.56631
2000-01-17 -0.55233 -0.28859 -1.18190 -0.80723
2000-01-18  0.73274  0.24387  0.88146 -0.94490
2000-01-19  0.56644 -0.49321  1.17584 -0.17585
2000-01-20  1.56441  0.62331 -0.26904  0.11952
2000-01-21  0.61834  0.17463 -1.62439  0.99103
2000-01-24  0.86378 -0.68111 -0.15788 -0.16670
2000-01-25 -1.12230 -0.16128  1.20401  1.08945
2000-01-26 -0.63115  0.76077 -0.92795 -2.17118
2000-01-27  1.37620 -1.10618 -0.37411  0.73780
2000-01-28 -1.40276  1.98372  1.47096 -1.38043
2000-01-31  0.54769  0.44100 -0.52775  0.84497
2000-02-01  0.12443  0.32880 -0.71361  1.31778
2000-02-02 -0.28986 -0.63931  0.88333 -2.58943
2000-02-03  0.54408  1.17928 -0.26795 -0.51681
2000-02-04 -0.07068 -1.29168 -0.59877 -1.45639
2000-02-07 -0.65483 -0.29584 -0.02722  0.31270
2000-02-08 -0.18529 -0.18701 -0.59132 -1.15239
2000-02-09 -2.28496  0.36352  1.11596  0.02293
2000-02-10  0.51054  0.97249  1.74501  0.20525
2000-02-11  0.10100  0.27722  0.65843  1.73591

In [12]: df[(df.values > 1.5).any(axis=1)]
Out[12]: 
            A       B       C        D     
2000-01-05  2.8021 -0.1086 -1.62114 -0.2017
2000-01-06  0.7167 -0.2671  1.36029  1.7425
2000-01-12 -0.3230  0.6233  1.93595  0.7802
2000-01-13  1.7468 -1.5774 -0.02134  0.1160
2000-01-14 -0.5561  0.9215 -0.22832  1.5663
2000-01-20  1.5644  0.6233 -0.26904  0.1195
2000-01-28 -1.4028  1.9837  1.47096 -1.3804
2000-02-10  0.5105  0.9725  1.74501  0.2052
2000-02-11  0.1010  0.2772  0.65843  1.7359
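The same "any column exceeds a threshold" filter can also be written without reaching for `.values`, because comparing a DataFrame to a scalar gives back a boolean DataFrame. The following is a minimal sketch under stated assumptions: the small frame is made-up illustration data, not the frame printed above.

```python
import pandas as pd

# Hypothetical sample data, chosen only to illustrate the technique.
df = pd.DataFrame(
    {"A": [0.2, 2.8, -0.5], "B": [1.7, -0.1, 0.3], "C": [0.1, 0.4, 1.2]},
    index=pd.to_datetime(["2000-01-03", "2000-01-04", "2000-01-05"]),
)

# df > 1.5 is an element-wise boolean DataFrame; any(axis=1) reduces it
# to one boolean per row, True where at least one column exceeds 1.5.
mask = (df > 1.5).any(axis=1)
any_above = df[mask]

# all(axis=1) is the counterpart: keep rows where every column passes.
all_above_zero = df[(df > 0).all(axis=1)]
```

Swapping `any` for `all` changes the filter from "at least one column" to "every column".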

Multiple conditions have to be combined using & or | (and parentheses!):

In [13]: df[(df['A'] > 1) | (df['B'] < -1)]
Out[13]: 
            A        B       C        D     
2000-01-05  2.80215 -0.1086 -1.62114 -0.2017
2000-01-13  1.74683 -1.5774 -0.02134  0.1160
2000-01-20  1.56441  0.6233 -0.26904  0.1195
2000-01-27  1.37620 -1.1062 -0.37411  0.7378
2000-02-04 -0.07068 -1.2917 -0.59877 -1.4564
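The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python, so an unparenthesized `df['A'] > 1 | df['B'] < -1` is not parsed as two separate comparisons and will typically fail with an error. Here is a short sketch on made-up data (the frame is hypothetical) showing the and, or, and not forms side by side:

```python
import pandas as pd

# Hypothetical data: one row for each combination of the two conditions.
df = pd.DataFrame({"A": [2.0, 0.5, 1.5, -0.1], "B": [0.0, -1.5, -2.0, 0.3]})

# Each comparison is wrapped in parentheses before being combined.
either = df[(df["A"] > 1) | (df["B"] < -1)]      # logical OR
both = df[(df["A"] > 1) & (df["B"] < -1)]        # logical AND
neither = df[~((df["A"] > 1) | (df["B"] < -1))]  # logical NOT of the OR
```

The plain Python keywords `and`, `or`, and `not` do not work element-wise on a Series, which is why the bitwise operators are used here.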

I'd be very interested to have some kind of query API to make these kinds of things easier.


There are at least a few approaches to shortening the syntax for this in Pandas until it gets a full query API down the road (perhaps I'll try to join the GitHub project and do this if time permits and no one else has already started).

One method to shorten the syntax a little is given below:

# Evaluate the condition row by row; inds is a boolean Series.
inds = df.apply(lambda x: x["A"] > 10 and x["B"] < 5, axis=1)
print(df[inds].to_string())

To fully solve this, one would need to build something like the SQL SELECT and WHERE clauses into Pandas. This is not trivial at all, but one stab that I think might work is to use the operator module from Python's standard library. This allows you to treat things like greater-than as functions instead of symbols. So you could do the following:

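from functools import reduce

def pandas_select(dataframe, select_dict):
    # select_dict maps a column name to an (operator function, value) pair.
    # A row is kept only if every constraint holds (logical AND).
    inds = dataframe.apply(lambda x: reduce(lambda v1, v2: v1 and v2,
                           [elem[0](x[key], elem[1])
                            for key, elem in select_dict.items()]), axis=1)
    return dataframe[inds]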

Then a test example like yours would be to do the following:

import operator

# Keep rows where A > 10 and B < 5.
select_dict = {
    "A": (operator.gt, 10),
    "B": (operator.lt, 5),
}

print(pandas_select(df, select_dict).to_string())

You can shorten the syntax even further by either building in more arguments to pandas_select to handle the different common logical operators automatically, or by importing them into the namespace with shorter names.

Note that the pandas_select function above only works with logical-and chains of constraints. You'd have to modify it to get different logical behavior, or express the condition using not and De Morgan's laws.
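One possible way to get other logical behavior is to build a boolean mask per constraint and fold the masks together with a caller-supplied combiner. The sketch below is an illustration only: `pandas_select_masks` and its `combine` parameter are hypothetical names, not part of Pandas or of the function above, and the sample frame is made up.

```python
import operator
from functools import reduce

import pandas as pd


def pandas_select_masks(dataframe, select_dict, combine=operator.and_):
    """Hypothetical variant of pandas_select.

    Builds one boolean Series per column constraint and folds them
    together with `combine` (operator.and_ or operator.or_).
    Assumes select_dict is non-empty.
    """
    masks = [op(dataframe[col], value) for col, (op, value) in select_dict.items()]
    return dataframe[reduce(combine, masks)]


# Made-up data for illustration.
df = pd.DataFrame({"A": [12, 3, 15, 1], "B": [7, 2, 1, 9]})
select_dict = {"A": (operator.gt, 10), "B": (operator.lt, 5)}

and_rows = pandas_select_masks(df, select_dict)               # A > 10 and B < 5
or_rows = pandas_select_masks(df, select_dict, operator.or_)  # A > 10 or B < 5
```

Because each constraint is applied to a whole column at once, this version also avoids the row-by-row `apply` call.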


A query feature has been added to Pandas since this question was asked and answered. An example is given below.

Given this sample data frame:

import numpy as np
import pandas as pd

periods = 8
dates = pd.date_range('20170101', periods=periods)
rand_df = pd.DataFrame(np.random.randn(periods, 4), index=dates,
                       columns=list('ABCD'))

The following query syntax lets you combine multiple filters, much like a "WHERE" clause in a SQL SELECT statement.

rand_df.query("A < 0 or B < 0")
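As a rough sketch of how this relates to the boolean-mask approach, the snippet below runs the same filter both ways and checks that the results agree. The `threshold` variable and the seeded random generator are assumptions added for illustration; inside a query string, a name prefixed with `@` refers to a Python variable in the calling scope.

```python
import numpy as np
import pandas as pd

# Seeded generator so the illustration is reproducible (an assumption,
# the example above uses unseeded np.random.randn).
rng = np.random.default_rng(0)
dates = pd.date_range("20170101", periods=8)
rand_df = pd.DataFrame(rng.standard_normal((8, 4)), index=dates, columns=list("ABCD"))

threshold = 0  # hypothetical cutoff, referenced as @threshold in the query

# Query-string form.
by_query = rand_df.query("A < @threshold or B < @threshold")

# Equivalent boolean-mask form.
by_mask = rand_df[(rand_df["A"] < threshold) | (rand_df["B"] < threshold)]

same_result = by_query.equals(by_mask)
```

Both forms select the same rows; the query string is mainly a readability convenience.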

See the Pandas documentation for additional details.