I have a pandas DataFrame called df.
How can I do something like the following?
df.query("select * from df")
Thank you.
For those who know R, there is a library called sqldf where you can execute SQL code in R. My question is basically: is there some library like sqldf in Python?
pandasql works on both pandas DataFrames and Series. Its sqldf function queries DataFrames and takes two inputs: the SQL query string and the environment, passed via globals() or locals().
pandasql allows you to query pandas DataFrames using SQL syntax, similarly to sqldf in R. It seeks to provide a more familiar way of manipulating and cleaning data for people new to Python or pandas.
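As a minimal sketch of that two-argument call (the people DataFrame and its columns are made up purely for illustration):
import pandas as pd
from pandasql import sqldf

people = pd.DataFrame({"name": ["Ann", "Bob"], "age": [34, 29]})

# sqldf takes the SQL string plus an environment dict so it can find the DataFrame by name
result = sqldf("SELECT name FROM people WHERE age > 30", locals())
print(result)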
This is not what pandas.query is supposed to do. You can look at the package pandasql (similar to sqldf in R):
import numpy as np
import pandas as pd
import pandasql as ps

df = pd.DataFrame([[1234, 'Customer A', '123 Street', np.nan],
                   [1234, 'Customer A', np.nan, '333 Street'],
                   [1233, 'Customer B', '444 Street', '333 Street'],
                   [1233, 'Customer B', '444 Street', '666 Street']],
                  columns=['ID', 'Customer', 'Billing Address', 'Shipping Address'])
q1 = """SELECT ID FROM df """
print(ps.sqldf(q1, locals()))
ID
0 1234
1 1234
2 1233
3 1233
Update 2020-07-10
With an updated pandasql you no longer need to pass the environment explicitly:
ps.sqldf("select * from df")
After some time of using this I realised the easiest way is to just do
from pandasql import sqldf
output = sqldf("select * from df")
It works like a charm, where df is a pandas DataFrame.
You can install pandasql from PyPI: https://pypi.org/project/pandasql/
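The install is a one-liner:
pip install pandasql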
A much better solution is to use duckdb. It is much faster than pandasql's sqldf because it does not have to load the entire dataset into SQLite and then back into pandas.
pip install duckdb
import pandas as pd
import duckdb
test_df = pd.DataFrame.from_dict({"i":[1, 2, 3, 4], "j":["one", "two", "three", "four"]})
duckdb.query("SELECT * FROM test_df where i>2").df() # returns a result dataframe
Performance improvement over pandasql, using the NYC yellow-cab trip data (~120 MB of CSV) as test data:
nyc = pd.read_csv('https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv',low_memory=False)
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
pysqldf("SELECT * FROM nyc where trip_distance>10")
# wall time 16.1s
duckdb.query("SELECT * FROM nyc where trip_distance>10").df()
# wall time 183ms
That is a speed improvement of roughly 100x.
This article gives good details and claims a 1000x improvement over pandasql: https://duckdb.org/2021/05/14/sql-on-pandas.html
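If you prefer not to rely on duckdb picking DataFrames out of the calling scope, you can also register them on a connection explicitly. A minimal sketch, reusing the test_df example from above (the in-memory connection and view name are just illustrative):
import pandas as pd
import duckdb

test_df = pd.DataFrame({"i": [1, 2, 3, 4], "j": ["one", "two", "three", "four"]})

# open an in-memory database and expose the DataFrame under an explicit name
con = duckdb.connect()
con.register("test_df", test_df)

# run SQL against the registered view and pull the result back as a DataFrame
result = con.execute("SELECT * FROM test_df WHERE i > 2").df()
print(result)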
You can use DataFrame.query(condition) to return the subset of the DataFrame matching condition, like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list('ABC'))
df
A B C
0 0 1 2
1 3 4 5
2 6 7 8
df.query('C < 6')
A B C
0 0 1 2
1 3 4 5
df.query('2*B <= C')
A B C
0 0 1 2
df.query('A % 2 == 0')
A B C
0 0 1 2
2 6 7 8
This has basically the same effect as an SQL statement, except that the SELECT * FROM df WHERE part is implied.
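query can also reference local Python variables with @ and, since pandas 0.25, column names that are not valid identifiers with backticks. A small sketch (the threshold variable and the 'total price' column are made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list('ABC'))
threshold = 4

# @threshold pulls the Python variable into the query expression
print(df.query('B > @threshold'))

# backticks allow column names containing spaces
df2 = pd.DataFrame({'total price': [10, 25, 7]})
print(df2.query('`total price` > 9'))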