Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas Vs SQL Speed

Tags:

sql

sqlite

pandas

I'm hearing different views on when one should use Pandas vs when to use SQL.

I tried to do the following in Pandas on 19,150,869 rows of data:

for idx, row in df.iterrows():
    tmp = int((int(row['M']) / PeriodGranularity))+1
    row['TimeSlot'] = str(row["D"]+1) + "-" + str(row["H"]) + "-" + str(tmp)

And found it was taking so long I had to abort after 20 minutes.

I performed the following in SQLLite:

Select strftime('%w',PlayedTimestamp)+1 as D,strftime('%H',PlayedTimestamp) as H,strftime('%M',PlayedTimestamp) as M,cast(strftime('%M',PlayedTimestamp) / 15+1 as int) as TimeSlot from tblMain

and found it took 4 seconds ("19150869 rows returned in 2445ms").

Note: For the Pandas code I ran this in the step before it to get the data from the db:

sqlStr = "Select strftime('%w',PlayedTimestamp)+1 as D,strftime('%H',PlayedTimestamp) as H,strftime('%M',PlayedTimestamp) as M from tblMain"
df = pd.read_sql_query(sqlStr, con)

Is it my coding that's at fault here or is it generally accepted that for certain tasks SQL is a lot faster?

like image 778
user1761806 Avatar asked Jun 22 '17 09:06

user1761806


People also ask

Is Panda faster than SQL?

This main difference can mean that the two tools are separate, however, you can also perform several of the same functions in each respective tool, for example, you can create new features from existing columns in pandas, perhaps easier and faster than in SQL.

How much faster is SQL than Python?

Using the python and SQL code seen below, I used the smaller dataset to first test the transformations. Python and SQL completed the task in 591 and 40.9 seconds respectively. This means that SQL was able to provide a speed-up of roughly 14.5X!

Which is better pandas or SQL?

In Pandas, one can simply divide features much easier when compared to SQL. The aforementioned code clearly states that how to divide the two separate columns and assigning those values to the latest column. In this case, one can do the feature creation task on the entire dataset.

Is pandas read SQL slow?

Reading SQL queries into Pandas dataframes is a common task, and one that can be very slow. Depending on the database being used, this may be hard to get around, but for those of us using Postgres we can speed this up considerably using the COPY command.


1 Answers

It seems you can use vectorize solution (PeriodGranularity is some variable):

df['TimeSlot'] = (df["D"]+1).astype(str) + "-" + 
                  df["H"].astype(str) + "-" + 
                 ((df['M'].astype(int) / PeriodGranularity).astype(int)+1).astype(str)

And for parse datetime to str use strftime.

DataFrame.iterrowsis really slow - check this.

First some comaprison of code for users coming from SQL background.

Comapring 2 technologies is really hard and I am not sure if some nice answer in SO (too broad reasons), but I find this.

like image 52
jezrael Avatar answered Oct 03 '22 17:10

jezrael