Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Interpolating missing values in Python

All - I hope you'll be able to help as it's one of those tasks where I know I've almost cracked from the various postings on here and online, but haven't quite got it to work.

Essentially, I have the following data in a database that is returned to a Pandas object via psql.read_sql(sql, cnxn)

+------------------------------------+
|              StartTime  StartLevel |
+------------------------------------+
| 0  2015-02-16 00:00:00     480.000 |
| 1  2015-02-16 00:30:00     480.000 |
| 2  2015-02-16 00:34:00     390.000 |
| 3  2015-02-16 01:00:00     390.000 |
| 4  2015-02-16 01:30:00     390.000 |
| 5  2015-02-16 02:00:00     480.000 |
| 6  2015-02-16 02:17:00     420.000 |
+------------------------------------+

StartTime     datetime64[ns]
StartLevel           float64
dtype: object

I simply want to end up with a minute-by-minute interpolation of the above data.

I've also created a datetime series at minute frequency but for the life of me I can't work out to "map" my table onto this and then interpolate or how I could resample the StartTime to minute granularity and then interpolate the missing data.

Any assistance would be greatly appreciated (and I am certain I am going to kick myself when I find out the solution!) - Many thanks

UPDATE

Following the suggestions below, the code is as follows:

import datetime
import numpy as np
import pandas as pd
import pyodbc
import pandas.io.sql as psql


cnxn = pyodbc.connect('DSN=MySQL;DATABASE=db;UID=uid;PWD=pwd')
cursor = cnxn.cursor()
sql = """
    SELECT
    StartTime,StartLevel
FROM
    aa.bb
    where cc = 'dd'
    and StartTime < '2015-02-16 02:30:00'
    order by StartTime asc"""

old_df = psql.read_sql(sql, cnxn)


num_minutes = 120
base = datetime.datetime(2015, 02, 16, 00, 00, 00)
date_list = [base + datetime.timedelta(minutes=x) for x in range(0, num_minutes)]
# set num_minutes for whatever is the correct number of minutes you require
new_data = [dict(StartTime=d, fake_val=np.NaN) for d in date_list]
new_df = pd.DataFrame(new_data)
new_df['StartLevel'] = old_df['StartLevel']
new_df.interpolate(inplace=True)

the output from new_df at the prompt is:

+-----------------------------------------------+
|              StartTime  fake_val  StartLevel  |
+-----------------------------------------------+
| 0   2015-02-16 00:00:00       NaN         480 |
| 1   2015-02-16 00:01:00       NaN         480 |
| 2   2015-02-16 00:02:00       NaN         390 |
| 3   2015-02-16 00:03:00       NaN         390 |
| 4   2015-02-16 00:04:00       NaN         390 |
| 5   2015-02-16 00:05:00       NaN         480 |
| 6   2015-02-16 00:06:00       NaN         480 |
+-----------------------------------------------+
like image 230
Patrick A Avatar asked Apr 10 '26 10:04

Patrick A


1 Answers

I'm quite certain this is not the most pythonic answer so I welcome comments to improve it but I believe you can do something like this

First create all the datetime objects you want values for

num_minutes = 120
base = datetime.datetime(2015, 02, 16, 00, 00, 00)
date_list = [base + datetime.timedelta(minutes=x) for x in range(0, num_minutes)]
# set num_minutes for whatever is the correct number of minutes you require

Then create a "fake" dataframe with those index values

new_data = [dict(StartTime=d, fake_val=np.NaN) for d in date_list]
new_df = pd.DataFrame(new_data)

EDIT: Corrected reponse

Now we want to merge the two dataframes into one (and sort by the date):

final_df = new_df.merge(df, how='outer', on='date').sort(columns='date')

final_df will now be sorted by date and contain the right values for StartLevel when you had data and NaN when you didn't have data for it. Then you can call interpolate

EDIT: Interpolate is not called inplace by default, so you either need to set that flag or save off the result

final_df = final_df.interpolate()

or

final_df.interpolate(inplace=True)

Obviously the fake_val column can be thrown out once you've merged in the good data. The purpose of creating that dataframe is to have one indexed with all the values you want (this is where I'm sure there is a more pythonic answer)

Full documentation for interpolate can be found here

like image 193
sedavidw Avatar answered Apr 11 '26 23:04

sedavidw