Select columns using pandas dataframe.query()

The documentation on DataFrame.query() is very terse (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html), and I was also unable to find any examples of column projections by web search.

So I tried simply providing the column names, which gave a syntax error. Likewise for typing select followed by the column names. So, how do I do this?

asked Jun 18 '17 by WestCoastProjects


4 Answers

After playing around with this for a while and reading through the source code for DataFrame.query, I can't figure out a way to do it.

If it's not impossible, apparently it's at least strongly discouraged. When this question came up on github, prolific Pandas dev/maintainer jreback suggested using df.eval() for selecting columns and df.query() for filtering on rows.
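
As a rough sketch of that suggested division of labor (the frame, seed, and column names here are invented for illustration, not taken from the thread):

import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])  # hypothetical frame

rows = df.query('A > 0')   # .query() evaluates a boolean expression and filters rows
col_a = df.eval('A')       # .eval() evaluates a column expression; a single column comes back as a Series
total = df.eval('A + B')   # arithmetic over columns also returns a Series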


UPDATE:

javadba points out that the return value of eval is not a DataFrame. For example, to flesh out jreback's example a bit more...

df.eval('A')

returns a Pandas Series, but

df.eval(['A', 'B'])

does not return a DataFrame; it returns a list (of Pandas Series).

So ultimately it seems the best way to keep the flexibility to select on both rows and columns is to use loc/iloc, e.g.

df.loc[0:4, ['A', 'C']]

Output:

          A         C
0 -0.497163 -0.046484
1  1.331614  0.741711
2  1.046903 -2.511548
3  0.314644 -0.526187
4 -0.061883 -0.615978
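
For reference, the positional counterpart with iloc would look like the following sketch (assuming the columns are ['A', 'B', 'C', 'D'] as in jreback's original example, so positions 0 and 2 correspond to A and C):

df.iloc[0:5, [0, 2]]   # rows 0-4 by position, columns A and C by position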
answered Nov 01 '22 by Max Power


Just a simpler example solution (using get):

My goal:

I want the lat and lon columns out of the result of the query.

My table details:

df_city.columns

Index(['name', 'city_id', 'lat', 'lon', 'CountryName', 'ContinentName'], dtype='object')

# All columns
city_continent = df_city.get(df_city['ContinentName']=='Oceania')

# Only lat and lon
city_continent[['lat', 'lon']]
  lat lon
113883    -19.12753   -169.84623
113884    -19.11667   -169.90000
113885    -19.10000   -169.91667
113886    -46.33333   168.85000
113887    -46.36667   168.55000
...   ... ...
347956    -23.14083   113.77630
347957    -31.48023   131.84242
347958    -28.29967   153.30142
347959    -35.60358   138.10548
347960    -35.02852   117.83416
3712 rows × 2 columns
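
The same result can also be written as one chained selection with loc (a sketch assuming the df_city frame above):

# Boolean mask on rows and column projection in a single .loc call
df_city.loc[df_city['ContinentName'] == 'Oceania', ['lat', 'lon']]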
answered Nov 01 '22 by Arindam Roychowdhury


DataFrame.query is more like the WHERE clause of a SQL statement than the SELECT part.

import pandas as pd
import numpy as np
np.random.seed(123)
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

To select a column or columns you can use the following:

df['A'] or df.loc[:,'A']

or

df[['A','B']] or df.loc[:,['A','B']]

To use the .query method you do something like

df.query('A > B'), which returns all the rows where the value in column A is greater than the value in column B:

                   A         B         C         D
2000-01-03  1.265936 -0.866740 -0.678886 -0.094709
2000-01-04  1.491390 -0.638902 -0.443982 -0.434351
2000-01-05  2.205930  2.186786  1.004054  0.386186
2000-01-08 -0.140069 -0.861755 -0.255619 -2.798589

This is, in my opinion, more readable than the equivalent boolean index selection:

df[df['A'] > df['B']]
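
If you also want to project columns after the row filter, a plain column selection can be chained onto the query result (a sketch using the frame above):

# Filter rows with query, then select columns the usual way
df.query('A > B')[['A', 'B']]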
answered Nov 01 '22 by Scott Boston


How about

df_new = df.query('col1==1 & col2=="x" ')[['col1', 'col3']]

This filters rows where col1 equals 1 and col2 equals "x", and returns only col1 and col3.

But you need a row filter; otherwise this approach doesn't work.

For selecting columns only, it is better to use .loc or .iloc.
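
For example, a columns-only selection on the same hypothetical frame would be:

# No row filter needed: take all rows, project to two columns
df.loc[:, ['col1', 'col3']]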

answered Nov 01 '22 by gonkan