Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Gap Filling OHLCV (Open High Low Close Volume) in TimescaleDB

I have some OHLCV data stored in TimescaleDB which contains missing data during certain time ranges. This data needs to be resampled to a different time period (i.e. 1 day) and contain contiguous, ordered time buckets.

TimescaleDB provides the function time_bucket_gapfill to do this. My current query is:

SELECT 
    time_bucket_gapfill(
        '1 day', 
        "timestamp",
        '2017-07-25 00:00', 
        '2018-01-01 00:00'
    ) as date,
    FIRST(open, "timestamp") as open,
    MAX(high) as high,
    MIN(low) as low,
    LAST(close, "timestamp") as close,
    SUM(volume) as volume
FROM ohlcv
WHERE "timestamp" > '2017-07-25'
GROUP BY date ORDER BY date ASC LIMIT 10

which gives as the results

date                    open        high        low         close       volume
2017-07-25 00:00:00+00                  
2017-07-26 00:00:00+00                  
2017-07-27 00:00:00+00  0.00992     0.010184    0.009679    0.010039    65553.5299999999
2017-07-28 00:00:00+00  0.00999     0.010059    0.009225    0.009248    43049.93
2017-07-29 00:00:00+00  
2017-07-30 00:00:00+00  0.009518    0.0098      0.009286    0.009457    40510.0599999999

...

Question: It looks like only the date column has been gapfilled. By modifying the SQL statement, is it possible to also gap-fill the columns open, high, low, close and volume such that we obtain the results:

date                    open        high        low         close       volume
2017-07-25 00:00:00+00  0           0           0           0           0               
2017-07-26 00:00:00+00  0           0           0           0           0               
2017-07-27 00:00:00+00  0.00992     0.010184    0.009679    0.010039    65553.5299999999
2017-07-28 00:00:00+00  0.00999     0.010059    0.009225    0.009248    43049.93
2017-07-29 00:00:00+00  0.009248    0.009248    0.009248    0.009248    0   
2017-07-30 00:00:00+00  0.009518    0.0098      0.009286    0.009457    40510.0599999999

...

Or is it recommended to perform this data inputation after receiving the query result, such as in Python/Nodejs?


Example of how it can be done using Python/pandas

Prefer to perform this gapfilling/inputation using TimescaleDB instead of using my Nodejs app because... using Nodejs to do this will be much slower and I do not want to introduce Python into the app just to do this processing

import pandas as pd

# Building the test dataset simulating missing values after time_bucket
data = [
    (pd.Timestamp('2020-01-01'), None, None, None, None, None),
    (pd.Timestamp('2020-01-02'), 100, 110, 90, 95, 3),
    (pd.Timestamp('2020-01-03'), None, None, None, None, None),
    (pd.Timestamp('2020-01-04'), 98, 150, 100, 100, 4),
]
df = pd.DataFrame(data, columns=['date', 'open' , 'high', 'low', 'close', 'volume']).set_index('date')

#              open   high    low  close  volume
# date                                          
# 2020-01-01    NaN    NaN    NaN    NaN     NaN
# 2020-01-02  100.0  110.0   90.0   95.0     3.0
# 2020-01-03    NaN    NaN    NaN    NaN     NaN
# 2020-01-04   98.0  150.0  100.0  100.0     4.0


# Perform gap filling
df.close = df.close.fillna(method='ffill')
df.volume = df.volume.fillna(0)                 # fill missing volume with 0
df['open'] = df['open'].fillna(df['close'])     # fill missing open by forward-filling close
df['high'] = df['high'].fillna(df['close'])     # fill missing high by forward-filling close
df['low'] = df['low'].fillna(df['close'])       # fill missing low by forward-filling close
df = df.fillna(0)                               # fill missing OHLC with 0 if no previous values available

#               open   high    low  close  volume
# date                                          
# 2020-01-01    0.0    0.0    0.0    0.0     0.0
# 2020-01-02  100.0  110.0   90.0   95.0     3.0
# 2020-01-03   95.0   95.0   95.0   95.0     0.0
# 2020-01-04   98.0  150.0  100.0  100.0     4.0
like image 561
Nyxynyx Avatar asked Feb 17 '20 01:02

Nyxynyx


2 Answers

SELECT "tickerId",
       "ts",
       coalesce("open", "close")  "open",
       coalesce("high", "close")  "high",
       coalesce("low", "close")   "low",
       coalesce("close", "close") "close",
       coalesce("volume", 0)      "volume",
       coalesce("count", 0)       "count"

FROM (
     SELECT "tickerId",
            time_bucket_gapfill('1 hour', at)   "ts",
            first(price, "eId")                 "open",
            MAX(price)                          "high",
            MIN(price)                          "low",
            locf(last(price, "eId"))            "close",
            SUM(volume)                         "volume",
            COUNT(1)                            "count"
     FROM "PublicTrades"
     WHERE at >= date_trunc('day', now() - INTERVAL '1 year')
       AND at < NOW()
     GROUP BY "tickerId", "ts"
     ORDER BY "tickerId", "ts" DESC
     LIMIT 100
 ) AS P

Notice: eId is Exchange Public Trade ID

like image 150
Anton Korneychuk Avatar answered Oct 20 '22 03:10

Anton Korneychuk


You need to specify in each column how to perform the gapfilling. My guess is that you probably want to use locf. See:

https://docs.timescale.com/latest/api#time_bucket_gapfill https://docs.timescale.com/latest/api#locf

like image 42
Mike Freedman Avatar answered Oct 20 '22 03:10

Mike Freedman