Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merging different granularity time series in influxdb

I want to store trades as well as best ask/bid data, where the latter updates much more rapidly than the former, in InfluxDB.

I want to, if possible, use a schema that allows me to query: "for each trade on market X, find the best ask/bid on market Y whose timestamp is <= the timestamp of the trade".

(I'll use any version of Influx.)

For example, trades might look like this:

   Time      Price     Volume   Direction   Market
00:01.000     100        5          1       foo-bar
00:03.000     99         50         0       bar-baz       
00:03.050     99         25         0       foo-bar
00:04.000     101        15         1       bar-baz

And tick data might look more like this:

   Time       Ask        Bid     Market
00:00.763     100         99    bar-baz
00:01.010     101         99    foo-bar
00:01.012     101         98    bar-baz
00:01.012     101         99    foo-bar
00:01:238     100         99    bar-baz
...
00:03:021     101         98    bar-baz

I would want to be able to somehow join each trade for some market, e.g. foo-bar, with only the most recent ask/bid data point on some other market, e.g. bar-baz, and get a result like:

   Time      Trade Price    Ask     Bid
00:01.000        100        100      99
00:03.050        99         101      98

Such that I could compute the difference between the trade price on market foo-bar and the most recently quoted ask or bid on market bar-baz.

Right now, I store trades in one time series and ask/bid data points in another and merge them on the client side, with logic along the lines of:

function merge(trades, quotes, data_points)
  next_trade, more_trades = first(trades), rest(trades)

  quotes = drop-while (quote.timestamp < next_trade.timestamp) quotes
  data_point = join(next_trade, first(quotes))

  if more_trades
    return merge(more_trades, quotes, data_points + data_point)
    return data_points + data_point

The problem is that the client has to discard tons of ask/bid data points because they update so frequently, and only the most recent update before the trade is relevant.

There are tens of markets whose most recent ask/bid I might want to compare a trade with, otherwise I'd simply store the most recent ask/bid in the same series as the trades.

Is it possible to do what I want to do with Influx, or with another time series database? An alternative solution that produces lower quality results is to group the ask/bid data by some time interval, say 250ms, and take the last from each interval, to at least impose an upper bound on the amount of quotes the client has to drop before finding the one that's closest to the next trade.

like image 927
John Dorian Avatar asked Nov 08 '19 00:11

John Dorian


1 Answers

NB. Just a clarification on InfluxDB terminology. You're probably storing trade and tick data in different measurements(analogous to a table). Series is a subdivision withing a measurement based on tag values. e.g

   Time       Ask        Bid     Market
00:00.763     100         99    bar-baz

is one series

   Time       Ask        Bid     Market
00:01.010     101         99    foo-bar

is another series(assuming you are storing Market name/id as a tag and not a field)

Answer

  1. InfluxQL https://docs.influxdata.com/influxdb/v1.7/query_language/spec/ - I can't think of a way to achieve what you need with InfluxQL (Influx Query Language) as it does not support joins. Perhaps what you could do on the client side is instead of requesting all tick data for a period and discarding most of it, make a request per trade and market to get exactly the (the most recent with respect to the trade) ask/bid datapoint that you need. Something like:
function merge(trades, market)
  points = <empty list>
  for next_trade in trades
    quote = db.query("select last(ask), last(bid) from tick_data where time<=next_trade.timestamp and Market=market and time>next_trade.timestamp - 1m")
    // or to get a list per market with one query
    // quote_per_market = db.query("select last(ask), last(bid) from tick_data where time<=next_trade.timestamp group by Market")
    points = points + join(next_trade, quote)
  return points

Of course you'd have the overhead of querying the database more frequently but depending on the number of trades and your resource constraints it may be more efficient. NB. A potential pitfall here is that ask and bid retrieved this way are not retrieved as a pair but independently and while they are returned as a pair it could happen that they have different timestamps. If for some timestamp for some reason you only have an ask or a bid price you might run into this problem. However, as long as you write them in pairs and have no missing data it should be ok.

  1. Flux https://www.influxdata.com/products/flux/ - Flux is a more sophisticated query language that is part of Influxdb 1.7 and 2 that allows you to do joins and operations across different measurements. I can't give you any examples yet but it's worth having a look at.

  2. Other (relational) Times Series DBs that you could have a look at that would also allow you to do joins are CrateDB https://crate.io/ or Postgres + TimescaleDB https://www.timescale.com/products

like image 73
Nikolay Manolov Avatar answered Sep 18 '22 17:09

Nikolay Manolov