Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data Store Design for NxN Data Aggregation

I am trying to come up with a theoretical solution to an NxN problem for data aggregation and storage. As an example I have a huge amount of data that comes in via a stream. The stream sends the data in points. Each point has 5 dimensions:

  1. Location
  2. Date
  3. Time
  4. Name
  5. Statistics

This data then needs to be aggregated and stored to allow another user to come along and query the data for both location and time. The user should be able to query like the following (pseudo-code):

Show me aggregated statistics for Location 1,2,3,4,....N between Dates 01/01/2011 and 01/03/2011 between times 11am and 4pm

Unfortunately due to the scale of the data it is not possible to aggregate all this data from the points on the fly and so aggregation prior to this needs to be done. As you can see though there are multiple dimensions that the data could be aggregated on.

They can query for any number of days or locations and so finding all the combinations would require huge pre-aggregation:

  • Record for Locations 1 Today
  • Record for Locations 1,2 Today
  • Record for Locations 1,3 Today
  • Record for Locations 1,2,3 Today
  • etc... up to N

Preprocessing all of these combinations prior to querying could result in an amount of precessing that is not viable. If we have 200 different locations then we have 2^200 combinations which would be nearly impossible to precompute in any reasonable amount of time.

I did think about creating records on 1 dimension and then merging could be done on the fly when requested, but this would also take time at scale.

Questions:

  1. How should I go about choosing the right dimension and/or combination of dimensions given that the user is as likely to query on all dimensions?
  2. Are there any case studies I could refer to, books I could read or anything else you can think of that would help?

Thank you for your time.

EDIT 1

When I say aggregating the data together I mean combining the statistics and name (dimensions 4 & 5) for the other dimensions. So for example if I request data for Locations 1,2,3,4..N then I must merge the statistics and counts of name together for those N Locations before serving it up to the user.

Similarly if I request the data for dates 01/01/2015 - 01/12/2015 then I must aggregate all data between those periods (by adding summing name/statistics).

Finally If I ask for data between dates 01/01/2015 - 01/12/2015 for Locations 1,2,3,4..N then I must aggregate all data between those dates for all those locations.

For the sake of this example lets say that going through statistics requires some sort of nested loop and does not scale well especially on the fly.

like image 307
gimg1 Avatar asked Sep 27 '22 15:09

gimg1


1 Answers

Try a time-series database!

From your description it seems that your data is a time-series dataset. The user seems to be mostly concerned about the time when querying and after selecting a time frame, the user will refine the results by additional conditions.

With this in mind, I suggest you to try a time-series database like InfluxDB or OpenTSD. For example, Influx provides a query language that is capable of handling queries like the following, which comes quite close to what you are trying to achieve:

SELECT count(location) FROM events
WHERE time > '2013-08-12 22:32:01.232' AND time < '2013-08-13'
GROUP BY time(10m);

I am not sure what you mean by scale, but the time-series DBs have been designed to be fast for lots of data points. I'd suggest to definitely give them a try before rolling your own solution!

like image 93
Fabian Keller Avatar answered Sep 30 '22 06:09

Fabian Keller