All - I hope you'll be able to help, as it's one of those tasks I know I've almost cracked from the various postings on here and online, but haven't quite got working.
Essentially, I have the following data in a database that is returned to a pandas DataFrame via psql.read_sql(sql, cnxn):
+------------------------------------+
| StartTime StartLevel |
+------------------------------------+
| 0 2015-02-16 00:00:00 480.000 |
| 1 2015-02-16 00:30:00 480.000 |
| 2 2015-02-16 00:34:00 390.000 |
| 3 2015-02-16 01:00:00 390.000 |
| 4 2015-02-16 01:30:00 390.000 |
| 5 2015-02-16 02:00:00 480.000 |
| 6 2015-02-16 02:17:00 420.000 |
+------------------------------------+
StartTime datetime64[ns]
StartLevel float64
dtype: object
I simply want to end up with a minute-by-minute interpolation of the above data.
I've also created a datetime series at minute frequency, but for the life of me I can't work out how to "map" my table onto it and then interpolate, or how I could resample the StartTime to minute granularity and then interpolate the missing data.
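For reference, the resample route mentioned above can be sketched like this (the DataFrame literal simply mimics the table shown earlier, standing in for the database result):

```python
import pandas as pd

# Recreate the table above as a DataFrame
df = pd.DataFrame({
    "StartTime": pd.to_datetime([
        "2015-02-16 00:00:00", "2015-02-16 00:30:00", "2015-02-16 00:34:00",
        "2015-02-16 01:00:00", "2015-02-16 01:30:00", "2015-02-16 02:00:00",
        "2015-02-16 02:17:00",
    ]),
    "StartLevel": [480.0, 480.0, 390.0, 390.0, 390.0, 480.0, 420.0],
})

# Resample onto a 1-minute grid, then fill the empty bins with a
# time-weighted linear interpolation.
minute_df = (
    df.set_index("StartTime")
      .resample("1min")
      .mean()
      .interpolate(method="time")
)
print(minute_df.head())
```

This yields one row per minute between the first and last StartTime, with StartLevel interpolated in proportion to elapsed time.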
Any assistance would be greatly appreciated (and I am certain I am going to kick myself when I find out the solution!) - Many thanks
UPDATE
Following the suggestions below, the code is as follows:
import datetime
import numpy as np
import pandas as pd
import pyodbc
import pandas.io.sql as psql
cnxn = pyodbc.connect('DSN=MySQL;DATABASE=db;UID=uid;PWD=pwd')
cursor = cnxn.cursor()
sql = """
SELECT
StartTime,StartLevel
FROM
aa.bb
where cc = 'dd'
and StartTime < '2015-02-16 02:30:00'
order by StartTime asc"""
old_df = psql.read_sql(sql, cnxn)
num_minutes = 120
base = datetime.datetime(2015, 2, 16, 0, 0, 0)  # leading-zero literals like 02 are a SyntaxError in Python 3
date_list = [base + datetime.timedelta(minutes=x) for x in range(0, num_minutes)]
# set num_minutes for whatever is the correct number of minutes you require
new_data = [dict(StartTime=d, fake_val=np.nan) for d in date_list]
new_df = pd.DataFrame(new_data)
# merge the minute-level frame with the database frame on StartTime, then
# interpolate the gaps (a plain column assignment such as
# new_df['StartLevel'] = old_df['StartLevel'] pairs rows by position, not
# by time, which mis-aligns the levels)
final_df = new_df.merge(old_df, how='outer', on='StartTime').sort_values(by='StartTime')
final_df['StartLevel'] = final_df['StartLevel'].interpolate()
the output from final_df at the prompt is:
+-----------------------------------------------+
| StartTime            fake_val   StartLevel    |
+-----------------------------------------------+
| 0 2015-02-16 00:00:00   NaN        480        |
| 1 2015-02-16 00:01:00   NaN        480        |
| 2 2015-02-16 00:02:00   NaN        480        |
| 3 2015-02-16 00:03:00   NaN        480        |
| 4 2015-02-16 00:04:00   NaN        480        |
| 5 2015-02-16 00:05:00   NaN        480        |
| 6 2015-02-16 00:06:00   NaN        480        |
+-----------------------------------------------+
I'm quite certain this is not the most pythonic answer, so I welcome comments to improve it, but I believe you can do something like this.
First create all the datetime objects you want values for
num_minutes = 120
base = datetime.datetime(2015, 2, 16, 0, 0, 0)  # leading-zero literals like 02 are a SyntaxError in Python 3
date_list = [base + datetime.timedelta(minutes=x) for x in range(0, num_minutes)]
# set num_minutes for whatever is the correct number of minutes you require
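As an aside (not part of the original suggestion), pandas can build the same list of minute timestamps in a single call:

```python
import pandas as pd

# 120 consecutive minute timestamps starting at midnight on 2015-02-16
date_list = pd.date_range("2015-02-16 00:00:00", periods=120, freq="1min")
print(date_list[:3])
```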
Then create a "fake" dataframe with those index values
new_data = [dict(StartTime=d, fake_val=np.nan) for d in date_list]
new_df = pd.DataFrame(new_data)
EDIT: Corrected response
Now we want to merge the two dataframes into one (and sort by the date):
final_df = new_df.merge(old_df, how='outer', on='StartTime').sort_values(by='StartTime')
final_df will now be sorted by StartTime and contain the right StartLevel values where you had data and NaN where you didn't. Then you can call interpolate.
EDIT: Interpolate is not called inplace by default, so you either need to set that flag or save off the result
final_df = final_df.interpolate()
or
final_df.interpolate(inplace=True)
Obviously the fake_val column can be thrown out once you've merged in the good data. The purpose of creating that dataframe is to have one indexed with all the values you want (this is where I'm sure there is a more pythonic answer)
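Putting the steps above together as one runnable sketch (using a hand-built old_df in place of the database result, and interpolating just the StartLevel column):

```python
import datetime

import numpy as np
import pandas as pd

# Stand-in for the frame returned by psql.read_sql
old_df = pd.DataFrame({
    "StartTime": pd.to_datetime([
        "2015-02-16 00:00:00", "2015-02-16 00:30:00", "2015-02-16 00:34:00",
    ]),
    "StartLevel": [480.0, 480.0, 390.0],
})

# Minute-level "fake" frame covering the same window
base = datetime.datetime(2015, 2, 16, 0, 0, 0)
date_list = [base + datetime.timedelta(minutes=x) for x in range(35)]
new_df = pd.DataFrame([dict(StartTime=d, fake_val=np.nan) for d in date_list])

# Outer-merge on StartTime, sort, interpolate the level,
# then throw out the helper column
final_df = new_df.merge(old_df, how="outer", on="StartTime").sort_values(by="StartTime")
final_df["StartLevel"] = final_df["StartLevel"].interpolate()
final_df = final_df.drop(columns="fake_val")
print(final_df.tail())
```

Between 00:30 (480) and 00:34 (390) the minute rows come out as 457.5, 435.0 and 412.5, which is the linear fill the question asks for.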
Full documentation for interpolate can be found here