The documentation on dataframe.query()
is very terse http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html . I was also unable to find examples of projections by web search.
So I tried simply providing the column names: that gave a syntax error. Likewise for typing select
and then the column names. So .. how to do this?
After playing around with this for a while and reading through the source code for DataFrame.query
, I can't figure out a way to do it.
If it's not impossible, apparently it's at least strongly discouraged. When this question came up on github, prolific Pandas dev/maintainer jreback suggested using df.eval()
for selecting columns and df.query()
for filtering on rows.
UPDATE:
javadba points out that the return value of eval
is not a dataframe. For example, to flesh out jreback's example a bit more...
df.eval('A')
returns a Pandas Series, but
df.eval(['A', 'B'])
does not return at DataFrame, it returns a list (of Pandas Series).
So it seems ultimately the best way to maintain flexibility to filter on rows and columns is to use iloc
/loc
, e.g.
df.loc[0:4, ['A', 'C']]
output
A C
0 -0.497163 -0.046484
1 1.331614 0.741711
2 1.046903 -2.511548
3 0.314644 -0.526187
4 -0.061883 -0.615978
get
):My goal:
I want the lat and lon columns out of the result of the query.
My table details:
df_city.columns
Index(['name', 'city_id', 'lat', 'lon', 'CountryName', 'ContinentName'], dtype='object')
# All columns
city_continent = df_city.get(df_city['ContinentName']=='Oceania')
# Only lat and lon
city_continent[['lat', 'lon']]
lat lon 113883 -19.12753 -169.84623 113884 -19.11667 -169.90000 113885 -19.10000 -169.91667 113886 -46.33333 168.85000 113887 -46.36667 168.55000 ... ... ... 347956 -23.14083 113.77630 347957 -31.48023 131.84242 347958 -28.29967 153.30142 347959 -35.60358 138.10548 347960 -35.02852 117.83416 3712 rows × 2 columns
Dataframe.query
is more like the where clause in a SQL statement than the select part.
import pandas as pd
import numpy as np
np.random.seed(123)
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
To select a column or columns you can use the following:
df['A'] or df.loc[:,'A']
or
df[['A','B']] or df.loc[:,['A','B']]
To use the .query
method you do something like
df.query('A > B')
which would return all the rows where the value in column A is greater than the value in column b.
A B C D
2000-01-03 1.265936 -0.866740 -0.678886 -0.094709
2000-01-04 1.491390 -0.638902 -0.443982 -0.434351
2000-01-05 2.205930 2.186786 1.004054 0.386186
2000-01-08 -0.140069 -0.861755 -0.255619 -2.798589
Which is more readable in my opinion that boolean index selection with
df[df['A'] > df['B']]
How about
df_new = df.query('col1==1 & col2=="x" ')[['col1', 'col3']]
Would filter rows where col1 equals 1 and col2 equals "X" and return only columns 1 and 3.
but you would need to filter for rows otherwise it doesn't work.
for filtering columns only better use .loc
or .iloc
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With