Django database planning - time series data

I would like some advice on how best to organize my Django models/database tables to hold the data in my web app.

I'm designing a site that will hold a user's telemetry data from a racing sim game. A desktop companion app will sample the game data every 0.1 seconds for a variety of information (car, track, speed, gas, brake, clutch, rpm, etc.). For example, in a 2 minute race, each of those variables will hold 1200 data points (10 samples a second * 120 seconds).

The important thing here is that this list can be as many as 20 variables, and could potentially grow in the future. So 1200 * the number of variables is the amount of data for an individual race session. If a single user submits 100 sessions, and there are 100 users... the amount of data adds up very quickly.

The app will then ship all this data for a race session off to the website's database. The data MUST be transferred between game and website via a CSV file, so structurally I am limited to what CSV can do. The website will then let you choose a race session/lap and plot this information on separate time series graphs (one per variable), and, importantly, let you plot your session against somebody else's to see where the differences lie.
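Since CSV is flat, one natural layout is a row per 0.1 s sample with a time column. A minimal sketch of such a file and a parser for it (the column names here are illustrative assumptions, not the game's actual format):

```python
import csv
import io

# Hypothetical CSV layout: one row per 0.1 s sample, one column per variable.
sample_csv = """time,speed,gas,brake,rpm
0.0,70,100,0,5200
0.1,72,100,0,5350
0.2,74,95,0,5500
"""

def parse_session(text):
    """Parse a session CSV into a list of per-sample dicts with numeric values."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({k: float(v) for k, v in row.items()})
    return rows

samples = parse_session(sample_csv)
print(len(samples))         # 3 samples in this toy session
print(samples[1]["speed"])  # 72.0
```

A longer lap simply means more rows in the same file, which maps directly onto the row-per-sample table discussed in the answer below.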

My question here is how do you structure such a database to hold this much information?

The simplest structure I have in mind is a separate table for each race track, where each row/entry is a race session on that track and the fields are the variables above.

The problem I have is:

1) Most of the variables in the list above are time series data, not individual values (e.g. speed might look like: 70, 72, 74, 77, 72, 71, 65, where the values are samples spaced 0.1 seconds apart over the course of the entire lap). How do you store this type of information in a table/field?

2) The length of each variable in the list will always be the same within a single race session (if your lap took 1 min 35 s, then all your variables will only capture data for that length of time), but since I want to be able to compare different laps with each other, session times will differ from lap to lap. In other words, however I store the time series data for those variables, it must be variable in size.

Any thoughts would be appreciated.

Simon asked Dec 30 '14 12:12

1 Answer

One thing that may help you with HUGE tables is partitioning. Judging by the postgresql tag on your question, take a look here: http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html

But for a start I would go with one simple table, supported by a reasonable set of indexes. From what I understand, each data entry in the table will be identified by race session id, player id and a time indicator. Those columns should be covered with indexes according to your querying requirements.

As for your two questions:

1) You store those values as simple integers. Remember to set proper data types for those columns. For example, if you are 100% sure that some values will be very small, you can use the smallint data type. More on integer data types here: http://www.postgresql.org/docs/9.3/static/datatype-numeric.html#DATATYPE-INT

2) That won't be a problem if every sample of every variable list is a separate row in the table. You will be able to insert as many rows as you like, so sessions of different lengths simply produce different numbers of rows.
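A quick sketch of why variable lap lengths are no problem in a row-per-sample design (using SQLite in-memory purely for illustration; the answer assumes PostgreSQL, but the idea is identical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE race_telemetry (
    user_id INTEGER, race_id INTEGER, time INTEGER,
    gas INTEGER, speed INTEGER)""")

# Session 1: a short lap (3 samples); session 2: a longer lap (5 samples).
for t in range(3):
    conn.execute("INSERT INTO race_telemetry VALUES (1, 1, ?, 100, 70)", (t,))
for t in range(5):
    conn.execute("INSERT INTO race_telemetry VALUES (1, 2, ?, 90, 65)", (t,))

# Each session simply holds however many rows its lap produced.
counts = conn.execute(
    "SELECT race_id, COUNT(*) FROM race_telemetry GROUP BY race_id ORDER BY race_id"
).fetchall()
print(counts)  # [(1, 3), (2, 5)]
```

Nothing in the schema fixes a session's length; comparing two laps of different durations is just comparing two row sets of different sizes.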

So, to sum things up, I would start with a VERY simple single-table schema. From the Django perspective it would look something like this:

class RaceTelemetryData(models.Model):
    user = models.ForeignKey(..., db_index=True)
    race = models.ForeignKey(YourRaceModel, db_index=True)
    time = models.IntegerField()
    gas = models.IntegerField()
    speed = models.SmallIntegerField()
    # and so on...

Additionally, you should create an index (manually) on the (user_id, race_id, time) columns, so that looking up data about one race session (and sorting it) is quick.
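In plain SQL, that composite index is a single CREATE INDEX statement (demonstrated here against SQLite for brevity; the multicolumn CREATE INDEX syntax is the same in PostgreSQL, and the index name is an arbitrary choice):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE race_telemetry (
    user_id INTEGER, race_id INTEGER, time INTEGER, speed INTEGER)""")

# Composite index covering the (user_id, race_id, time) lookup-and-sort pattern.
conn.execute(
    "CREATE INDEX idx_session_time ON race_telemetry (user_id, race_id, time)"
)

names = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
print(names)  # ['idx_session_time']
```

Column order matters: leading with user_id and race_id lets the database narrow down to one session first, then read the rows already sorted by time.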

In the future, if you'll find the performance of this single table too slow, you'll be able to experiment with additional indexes, or partitioning. PostgreSQL is quite flexible in modifying existing database structures, so you shouldn't have many problems with it.

If you decide to add a new variable to the collection, you will simply need to add a new column to the table.

EDIT:

In the end you end up with one table that has at least these columns:

- user_id - to specify which user's data this row is about
- race_id - to specify which race this row is about
- time - to identify the correct order in which to present the data

This way, when you want information on Joe's 5th race, you look up rows that have user_id = 'Joe_ID' and race_id = 5, then sort them by the time column.
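That lookup, sketched as SQL (again against an in-memory SQLite database for the demo; a Django queryset such as RaceTelemetryData.objects.filter(user=..., race=...).order_by('time') would generate an equivalent query):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE race_telemetry (
    user_id INTEGER, race_id INTEGER, time INTEGER, speed INTEGER)""")

# Insert Joe's (user_id = 1) 5th race out of order, plus another user's row,
# to show that the WHERE clause filters and ORDER BY restores sample order.
conn.executemany(
    "INSERT INTO race_telemetry VALUES (?, ?, ?, ?)",
    [(1, 5, 2, 74), (1, 5, 0, 70), (1, 5, 1, 72), (2, 5, 0, 99)],
)

speeds = [row[0] for row in conn.execute(
    "SELECT speed FROM race_telemetry "
    "WHERE user_id = ? AND race_id = ? ORDER BY time", (1, 5))]
print(speeds)  # [70, 72, 74]
```

Plotting one session against another is then just running this query twice (once per user/race pair) and overlaying the two ordered series.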

Maciek answered Oct 19 '22 14:10