
What is timeseries data cardinality?

I've seen a few places which give a definition of time-series cardinality similar to:

Assume you have 1000 IoT devices in 20 locations, they're running one of 5 firmware versions, and report input from 5 types of sensor per device. The cardinality of this set is 500,000 (1000 x 20 x 5 x 5). This can quickly get unmanageable in some cases, as even adding and tracking a new firmware version for the devices would increase the set to 600,000 (1000 x 20 x 6 x 5).

https://questdb.io/blog/2021/06/16/high-cardinality-time-series-data-performance/#what-is-high-cardinality-data

or

https://blog.timescale.com/blog/what-is-high-cardinality-how-do-time-series-databases-influxdb-timescaledb-compare/

I feel this is a very inflated definition. For example, if you have a set of 10 rows and each row is for a different device, from a different location, with a different firmware and a different sensor, this definition balloons the cardinality to 10 x 10 x 10 x 10 = 10,000. And it's only 10 rows!
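The discrepancy described above is easy to reproduce. Here is a minimal sketch of the 10-row example (the tag names and values are illustrative, not from any real dataset), comparing the multiplicative estimate against the number of distinct tag combinations actually present:

```python
# A sketch of the question's 10-row example: every row has a unique
# device, location, firmware and sensor (all names are hypothetical).
from math import prod

rows = [
    {"device": f"d{i}", "location": f"loc{i}", "firmware": f"fw{i}", "sensor": f"s{i}"}
    for i in range(10)
]

# Multiplicative estimate: product of the number of unique values per tag.
estimate = prod(
    len({r[tag] for r in rows})
    for tag in ("device", "location", "firmware", "sensor")
)

# Actual cardinality: number of distinct tag combinations seen in the data.
actual = len({tuple(sorted(r.items())) for r in rows})

print(estimate)  # 10 * 10 * 10 * 10 = 10000
print(actual)    # 10
```

So the multiplicative figure is an upper bound over all conceivable combinations, while the number of series actually stored can never exceed the number of distinct tag combinations observed.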

Can timeseries dataset cardinality exceed total number of rows of the dataset?

asked Jun 25 '21 16:06 by Alex des Pelagos


2 Answers

For time series it is common to estimate the cardinality as the number of all possible combinations of unique tag/label values across the measurements. The estimate helps to understand how many different time series may potentially be stored in the database during its lifetime, i.e., not just at the current state. Note that the estimate assumes independence between labels, which normally does not hold. This definition of series cardinality in InfluxDB discusses this aspect and is an interesting read in addition to the links in the question.

It is good to understand the possible cardinality of a time-series dataset in advance, since some time-series databases don't handle high cardinalities well. For example, see this article on dealing with the high-cardinality issue in InfluxDB.

Other time-series databases, e.g., TimescaleDB, have no issues handling high cardinalities, since they give labels no special treatment. Understanding cardinality is still useful when creating indexes: higher cardinality makes indexes more useful, but they occupy more space.

answered Oct 29 '22 09:10 by k_rus


Time series cardinality is the number of unique time series actually stored in the database. That's it!

Let's start with the basics. A time series contains a series of (timestamp, value) pairs ordered by timestamp. Each time series has a name (in the InfluxDB line protocol the name is constructed from the measurement plus the field name). Additionally, a time series can have a set of key=value tags (they are named labels in some systems such as Prometheus). Every field in an InfluxDB line protocol line shares the same set of tags defined in that line. A time series is uniquely identified by its name plus its set of tags. For example, temperature{city="Paris",country="France"} and temperature{city="Marseille",country="France"} are different time series, since they contain different values for the tag city.
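The identity rule above can be sketched as a small helper that builds a canonical series key from a name plus a tag set (the `series_key` function is hypothetical, but the output format follows the Prometheus-style notation used here):

```python
# A series is identified by its name plus its tags; sorting the tags
# makes the key canonical regardless of the order tags arrive in.
def series_key(name, tags):
    labels = ",".join(f'{k}="{v}"' for k, v in sorted(tags.items()))
    return f"{name}{{{labels}}}"

a = series_key("temperature", {"country": "France", "city": "Paris"})
b = series_key("temperature", {"country": "France", "city": "Marseille"})

print(a)       # temperature{city="Paris",country="France"}
print(a != b)  # True: a different value for the tag city means a different series
```

Counting the distinct keys produced this way over all ingested samples gives exactly the cardinality as defined at the top of this answer.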

Let's calculate the maximum possible cardinality for time series with temperature name given the following restrictions:

  • The number of cities in the world is 10000
  • The number of countries in the world is 250

Then the maximum possible cardinality would be 10000*250 = 2.5 million. But this calculation is incorrect, since each city belongs to exactly one country. So the maximum possible cardinality is limited by the number of cities, i.e., 10000. In practice the cardinality is usually even lower, since it is limited by the cities actually stored in the database.
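The arithmetic above can be written out explicitly (the figures are the ones assumed in this answer):

```python
# Naive estimate under the independence assumption vs. the real upper bound.
num_cities = 10_000   # assumed number of cities in the world
num_countries = 250   # assumed number of countries in the world

naive_estimate = num_cities * num_countries  # treats city and country as independent tags
true_upper_bound = num_cities                # each city belongs to exactly one country

print(naive_estimate)    # 2500000
print(true_upper_bound)  # 10000
```

This is the independence assumption from the first answer in action: whenever one tag functionally determines another, the product of the tag domains overstates the real cardinality.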

There are two types of time series cardinality:

  • The number of active time series, i.e., time series with recently ingested samples.
  • The total number of time series stored in the database.
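The distinction between the two counts can be sketched as follows; the series keys and the one-hour "active" window are assumptions for illustration, since each database defines its own activity window:

```python
# Hypothetical mapping: series key -> timestamp (seconds) of its most recent sample.
now = 10_000.0  # a fixed "current time", for reproducibility

last_sample = {
    'temperature{city="Paris"}':     now - 30,     # received a sample 30 s ago
    'temperature{city="Marseille"}': now - 7_200,  # last sample 2 h ago
}

ACTIVE_WINDOW = 3_600  # series with a sample in the last hour count as active

active_series = [k for k, ts in last_sample.items() if now - ts <= ACTIVE_WINDOW]
total_series = len(last_sample)

print(len(active_series), total_series)  # 1 2
```

Both numbers matter: the active count drives memory usage in some databases, while the total count drives index and disk usage.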

Some time series databases consume memory proportional to the total number of time series (for example, InfluxDB). Others consume memory only for active time series (for example, VictoriaMetrics). There are also databases which consume no additional memory per new time series (for example, TimescaleDB or ClickHouse). All these databases come with different tradeoffs in performance characteristics and resource usage (CPU, disk, RAM), so it is recommended to evaluate them on a particular use case before selecting the best one for the given workload.

answered Oct 29 '22 08:10 by valyala