
Executing an SQL query over a pandas dataset

I have a pandas data set, called 'df'.

How can I do something like the following?

df.query("select * from df")

Thank you.

For those who know R, there is a library called sqldf that lets you execute SQL code in R. My question is basically: is there a library like sqldf in Python?

Miguel Santos asked Aug 24 '17 15:08



4 Answers

This is not what DataFrame.query is meant to do. Take a look at the pandasql package (similar to sqldf in R):

import numpy as np
import pandas as pd
import pandasql as ps

df = pd.DataFrame(
    [[1234, 'Customer A', '123 Street', np.nan],
     [1234, 'Customer A', np.nan, '333 Street'],
     [1233, 'Customer B', '444 Street', '333 Street'],
     [1233, 'Customer B', '444 Street', '666 Street']],
    columns=['ID', 'Customer', 'Billing Address', 'Shipping Address'])

q1 = """SELECT ID FROM df """

print(ps.sqldf(q1, locals()))

     ID
0  1234
1  1234
2  1233
3  1233

Update 2020-07-10

With a recent version of pandasql, the second argument can be omitted:

ps.sqldf("select * from df")
BENY answered Oct 20 '22 11:10


After using this for some time, I realised the easiest way is to just do:

from pandasql import sqldf

output = sqldf("select * from df")

This works like a charm, where df is a pandas DataFrame. You can install pandasql from https://pypi.org/project/pandasql/

Miguel Santos answered Oct 20 '22 09:10


A much better solution is to use duckdb. It is much faster than pandasql's sqldf because it does not have to load the entire dataset into SQLite and then load the result back into pandas.

pip install duckdb

import pandas as pd
import duckdb

test_df = pd.DataFrame.from_dict({"i": [1, 2, 3, 4], "j": ["one", "two", "three", "four"]})

duckdb.query("SELECT * FROM test_df where i>2").df()  # returns a result dataframe

Performance comparison with pandasql, using the NYC yellow cab trip data (about 120 MB of CSV):

nyc = pd.read_csv('https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv',low_memory=False)
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
pysqldf("SELECT * FROM nyc where trip_distance>10")
# wall time 16.1s
duckdb.query("SELECT * FROM nyc where trip_distance>10").df()
# wall time 183ms

That is a speedup of roughly 90x (16.1 s versus 183 ms).

This article gives good details and claims 1000x improvement over pandasql: https://duckdb.org/2021/05/14/sql-on-pandas.html
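To make the "load into SQLite and load back" round trip concrete, here is a minimal sketch of that pattern using only pandas and the standard-library sqlite3 module. It is an illustration of the general idea, not pandasql's actual implementation, and the helper name sql_via_sqlite is made up for this example.

```python
import sqlite3

import pandas as pd


def sql_via_sqlite(query, **frames):
    """Run `query` against DataFrames by copying them into in-memory SQLite.

    Hypothetical helper for illustration: each keyword argument becomes a
    table named after the keyword.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # Step 1: copy every DataFrame into SQLite (the expensive part).
        for name, frame in frames.items():
            frame.to_sql(name, conn, index=False)
        # Step 2: run the query and copy the result back into pandas.
        return pd.read_sql_query(query, conn)
    finally:
        conn.close()


test_df = pd.DataFrame({"i": [1, 2, 3, 4], "j": ["one", "two", "three", "four"]})
result = sql_via_sqlite("SELECT * FROM test_df WHERE i > 2", test_df=test_df)
print(result)
```

Both copies touch every row of the input, which is presumably where most of the time goes on larger data such as the taxi file above.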

Leo Liu answered Oct 20 '22 11:10


You can use DataFrame.query(condition) to return a subset of the data frame matching condition like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3,3), columns=list('ABC'))
df
   A  B  C
0  0  1  2
1  3  4  5
2  6  7  8

df.query('C < 6')
   A  B  C
0  0  1  2
1  3  4  5


df.query('2*B <= C')
   A  B  C
0  0  1  2


df.query('A % 2 == 0')
   A  B  C
0  0  1  2
2  6  7  8

This is basically the same effect as an SQL statement, except the SELECT * FROM df WHERE is implied.
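As a rough sketch of how this maps onto SQL, the snippet below pairs two SELECT statements with pandas expressions on the same 3x3 frame. The SQL in the comments is only there for comparison and is never executed; the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list("ABC"))

# SELECT * FROM df WHERE C < 6
by_query = df.query("C < 6")
by_mask = df[df["C"] < 6]  # the same filter written as a boolean mask

# SELECT A, C FROM df WHERE A % 2 = 0 AND C > 2
# query() handles the WHERE part; column selection is a separate step.
subset = df.query("A % 2 == 0 and C > 2")[["A", "C"]]

print(by_query)
print(subset)
```

Anything beyond row filtering (joins, GROUP BY, ORDER BY) has to be written with other pandas methods, which is where the SQL-based options in the other answers come in.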

user1717828 answered Oct 20 '22 09:10