I'm hearing different views on when one should use Pandas vs when to use SQL.
I tried to do the following in Pandas on 19,150,869 rows of data:
for idx, row in df.iterrows():
    tmp = int(int(row['M']) / PeriodGranularity) + 1
    row['TimeSlot'] = str(row["D"] + 1) + "-" + str(row["H"]) + "-" + str(tmp)
And found it was taking so long I had to abort after 20 minutes.
I performed the following in SQLite:
Select strftime('%w',PlayedTimestamp)+1 as D,strftime('%H',PlayedTimestamp) as H,strftime('%M',PlayedTimestamp) as M,cast(strftime('%M',PlayedTimestamp) / 15+1 as int) as TimeSlot from tblMain
and found it took only a few seconds ("19150869 rows returned in 2445ms").
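For reference, the query can be reproduced against an in-memory SQLite database from Python (the table name and column come from the question; the sample timestamp is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tblMain (PlayedTimestamp TEXT)")
# 2017-05-07 was a Sunday, so strftime('%w') yields '0' and D becomes 1
con.execute("INSERT INTO tblMain VALUES ('2017-05-07 09:32:00')")
rows = con.execute(
    "SELECT strftime('%w',PlayedTimestamp)+1 AS D,"
    " strftime('%H',PlayedTimestamp) AS H,"
    " strftime('%M',PlayedTimestamp) AS M,"
    " CAST(strftime('%M',PlayedTimestamp) / 15 + 1 AS int) AS TimeSlot"
    " FROM tblMain"
).fetchall()
```

Note that SQLite's integer division of the minute string by 15 buckets each hour into four time slots.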
Note: For the Pandas code I ran this in the step before it to get the data from the db:
sqlStr = "Select strftime('%w',PlayedTimestamp)+1 as D,strftime('%H',PlayedTimestamp) as H,strftime('%M',PlayedTimestamp) as M from tblMain"
df = pd.read_sql_query(sqlStr, con)
Is it my coding that's at fault here or is it generally accepted that for certain tasks SQL is a lot faster?
Although this main difference might suggest the two tools are separate, they overlap in many functions. For example, you can create new features from existing columns in pandas, often more easily and faster than in SQL.
Using the Python and SQL code seen below, I used the smaller dataset to first test the transformations. Python and SQL completed the task in 591 and 40.9 seconds respectively. This means that SQL was able to provide a speed-up of roughly 14.5X!
In pandas, dividing one feature by another is much easier than in SQL. The aforementioned code shows how to divide two columns and assign the result to a new column, applying the operation to the entire dataset at once.
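As a minimal illustration of that kind of column-wise feature creation (the column names here are made up, not from the original dataset):

```python
import pandas as pd

df = pd.DataFrame({"distance": [10.0, 30.0], "duration": [2.0, 5.0]})
# A new feature computed from two existing columns, over the whole frame at once
df["speed"] = df["distance"] / df["duration"]
```

The division is vectorized, so no explicit loop over rows is needed.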
Reading SQL queries into Pandas dataframes is a common task, and one that can be very slow. Depending on the database being used, this may be hard to get around, but for those of us using Postgres we can speed this up considerably using the COPY command.
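A sketch of that COPY-based approach, assuming psycopg2 as the driver (the function name and query are illustrative, not a fixed API):

```python
import io

import pandas as pd


def read_query_via_copy(conn, query):
    """Stream a query's result as CSV through Postgres COPY, then parse with pandas.

    Expects `conn` to be a psycopg2 connection (its cursors provide copy_expert).
    """
    buf = io.StringIO()
    copy_sql = "COPY ({}) TO STDOUT WITH CSV HEADER".format(query)
    with conn.cursor() as cur:
        cur.copy_expert(copy_sql, buf)  # server streams CSV straight into buf
    buf.seek(0)
    return pd.read_csv(buf)
```

This avoids the per-row overhead of the DB-API fetch path that pd.read_sql_query goes through.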
It seems you can use a vectorized solution (PeriodGranularity is some variable):

df['TimeSlot'] = (df["D"] + 1).astype(str) + "-" + \
                 df["H"].astype(str) + "-" + \
                 ((df['M'].astype(int) / PeriodGranularity).astype(int) + 1).astype(str)

And to convert a datetime to str, use strftime.

DataFrame.iterrows is really slow - check this.
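A runnable sketch of the vectorized version above, on a tiny made-up frame (the sample values and PeriodGranularity = 15 are assumptions):

```python
import pandas as pd

PeriodGranularity = 15
df = pd.DataFrame({"D": [0, 6], "H": ["09", "23"], "M": ["32", "07"]})

# Whole-column string concatenation; no Python-level loop over rows
df["TimeSlot"] = (df["D"] + 1).astype(str) + "-" + \
                 df["H"].astype(str) + "-" + \
                 ((df["M"].astype(int) / PeriodGranularity).astype(int) + 1).astype(str)
```

Each operation runs over the full column at once, which is why this scales to millions of rows where iterrows does not.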
First, some comparison of code for users coming from a SQL background. Comparing two technologies is really hard, and I am not sure a good answer is possible on SO (the reasons are too broad), but I found this.