
Time-based data analysis with Python

I've got a project where physical sensors send data to the server. Data is sent irregularly (whenever something activates a sensor), but not less often than every 20 minutes. On the server the data is stored in a PostgreSQL database.

Data structure looks like:

Sensor Table
    sensor name - string
    sensor serial no - string
    sensor type - foreign key to types table

Sensor Data Table
    sensor - foreign key
    timestamp
    value 1 - boolean
    value 2 - boolean
    value 3 - integer
    value 4 - float
    ...
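
For reference, here is a rough Django sketch of that schema (the field and model names are just illustrative, not the actual project code):

    # Hypothetical Django models mirroring the schema above.
    from django.db import models

    class SensorType(models.Model):
        name = models.CharField(max_length=100)

    class Sensor(models.Model):
        name = models.CharField(max_length=100)
        serial_no = models.CharField(max_length=100)
        sensor_type = models.ForeignKey(SensorType, on_delete=models.CASCADE)

    class SensorData(models.Model):
        sensor = models.ForeignKey(Sensor, on_delete=models.CASCADE)
        timestamp = models.DateTimeField(db_index=True)
        value1 = models.BooleanField(default=False)
        value2 = models.BooleanField(default=False)
        value3 = models.IntegerField(null=True, blank=True)
        value4 = models.FloatField(null=True, blank=True)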

The expected load is no more than 100 requests/second in total. Data records in the database should be persisted for 90 days, and in some cases even longer (not only 2 weeks as I thought earlier). So the total amount of records would be no more than 120 960 000 per 14 days. This is a "safe" estimate; in reality it might be 10 times less (10 requests/second, about 12 096 000 records).
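
For clarity, the arithmetic behind those figures:

    # "Safe" estimate: 100 requests/second sustained over 14 days
    100 * 86400 * 14   # = 120 960 000 records
    # More realistic case: 10 requests/second
    10 * 86400 * 14    # = 12 096 000 records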

I need to do some analysis on the data, like:

  1. Do something when a new record comes in and its "value 2" is true
  2. Do something when sensor X's "value 2" has been true for longer than some declared time (50 minutes, 1 hour, or more; it varies)
  3. Do something when sensor X's total "true" time for "value 2" within 24 hours exceeds some declared time
  4. Do something when sensor X's "value 3" is true for longer than some declared time and no other sensor of type XYZ was active in that period ...

The "declared time" above is greater than or equal to 1 second.

The whole server part is developed in Django (and django-rest-framework to gather data).

The question is how to do such data analysis efficiently, given that the data and the elapsed time periods need to be monitored in real time, or close to it (within about 1 second), in order to trigger the desired actions.

My thoughts:

  1. Run a process that queries the database every second for records that meet the criteria and calls the specific actions (a single pass would probably take more than 1 second); a rough sketch of such a polling loop follows this list.

  2. Run several separate processes (eventlet?), one per analysis type, each querying the database every second and firing the specific actions.

  3. Run one process per sensor that continuously reports to its subscribers: "I've been true on 'value 2' for longer than x seconds", etc. The process is reset when new data for that sensor arrives. A publish-subscribe solution like ZeroMQ might be used here?

  4. Use some other/faster solution

    • MongoDB - the problem might be that MongoDB's data files are not compacted after records are removed at the end of the retention period.
    • Hadoop - isn't it too big and too complex for this class of problems?
    • Pandas with some HDF5 storage - the question is whether it can handle the analysis I've described above, and how it copes with continuous writes into the files. Then again, it might work on top of Mongo too.
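
As an illustration of thought #1, a rough single-process polling loop might look like this; it reuses the hypothetical Sensor model and value2_true_longer_than() helper sketched above, and trigger_action() is just a stand-in for the real action:

    # Rough sketch of thought #1: one process re-runs all checks every second.
    import time
    from datetime import timedelta

    def trigger_action(sensor):
        # Stand-in for whatever the real action is.
        print('value 2 has been true too long on %s' % sensor.name)

    def run_all_checks():
        for sensor in Sensor.objects.all():
            if value2_true_longer_than(sensor, timedelta(minutes=50)):
                trigger_action(sensor)

    def poll_forever(interval=1.0):
        while True:
            started = time.time()
            run_all_checks()
            # Sleep whatever is left of the 1-second budget; if the checks take
            # longer than a second this loop simply falls behind.
            time.sleep(max(0.0, interval - (time.time() - started)))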

Hints?

Update.

Currently the solution that seems simple and effective to me is:

  1. after data arrives for sensor A, run all tests, and
  2. store the test results in some "tests" table (or Redis) in a way that says:
    • today at 1:15 pm run action "sensor open longer than"
    • today at 1:30 pm run action "sensor open longer than in 24h period" ...
  3. continuously scan the above "tests" table and, when it is 1:15 pm today, run the desired action, etc.
  4. when a new signal arrives for sensor A, run all tests again and also reset the data in the "tests" table.

This would require me to run the tests every time a request arrives for a specific sensor, but on the other hand the only thing I have to scan every second is the "tests" table.
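
If the "tests" table ends up in Redis, a sorted set keyed by due time makes that 1-second scan cheap. A rough sketch with redis-py (the key name and payload format are made up, and the zadd mapping form needs redis-py 3.x):

    # The "tests" table as a Redis sorted set: each pending action is a member
    # whose score is the UNIX timestamp at which it becomes due.
    import json
    import time
    import redis

    r = redis.StrictRedis(decode_responses=True)
    QUEUE = 'pending_actions'   # made-up key name

    def schedule(sensor_id, action, due_ts):
        member = json.dumps({'sensor': sensor_id, 'action': action})
        r.zadd(QUEUE, {member: due_ts})

    def cancel_for_sensor(sensor_id):
        # Called when new data for the sensor arrives and the tests are re-run.
        for member in r.zrange(QUEUE, 0, -1):
            if json.loads(member)['sensor'] == sensor_id:
                r.zrem(QUEUE, member)

    def scan_once(run_action):
        # The 1-second scan: pop everything that is due and run it.
        now = time.time()
        for member in r.zrangebyscore(QUEUE, '-inf', now):
            r.zrem(QUEUE, member)
            run_action(json.loads(member))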

Update 2

I've discovered PyTables (http://www.pytables.org/moin/PyTables); it looks quite well suited to my use case as a data store.
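
For reference, a minimal PyTables sketch of how the readings could be laid out in an HDF5 file (PyTables 3.x names; the column layout just mirrors the schema above and is only an assumption):

    # Minimal PyTables sketch for storing sensor readings in HDF5.
    import tables

    class Reading(tables.IsDescription):
        sensor_id = tables.Int32Col()
        timestamp = tables.Float64Col()   # UNIX timestamp
        value1 = tables.BoolCol()
        value2 = tables.BoolCol()
        value3 = tables.Int32Col()
        value4 = tables.Float64Col()

    h5 = tables.open_file('sensors.h5', mode='a')
    readings = h5.create_table('/', 'readings', Reading)

    # Append one reading.
    row = readings.row
    row['sensor_id'] = 1
    row['timestamp'] = 1366023600.0
    row['value2'] = True
    row.append()
    readings.flush()
    h5.close()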

Asked Apr 15 '13 by eXt

1 Answer

My first stab at this would be to create multi-column indexes on the "Sensor Data Table", along the lines of:

sensor->timestamp->value1 //Index 1
sensor->timestamp->value2 //Index 2
sensor->timestamp->value3 //Index 3
sensor->timestamp->value4 //Index 4

See if your SQL queries are fast enough. You could run them via eventlets or cron; from a performance perspective it doesn't matter which you use, as long as the query itself is fast enough, since it is the most likely bottleneck.
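
If you want Django to create those composite indexes for you, something like this should do it (hypothetical, extending the SensorData model from the question; newer Django versions would use Meta.indexes instead of index_together):

    from django.db import models

    class SensorData(models.Model):
        sensor = models.ForeignKey('Sensor', on_delete=models.CASCADE)
        timestamp = models.DateTimeField()
        value1 = models.BooleanField(default=False)
        value2 = models.BooleanField(default=False)
        value3 = models.IntegerField(null=True)
        value4 = models.FloatField(null=True)

        class Meta:
            # One multi-column B-tree index per tuple.
            index_together = [
                ('sensor', 'timestamp', 'value1'),
                ('sensor', 'timestamp', 'value2'),
                ('sensor', 'timestamp', 'value3'),
                ('sensor', 'timestamp', 'value4'),
            ]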

Another suggestion is to try MySQL MEMORY tables, or the PostgreSQL equivalent (in-memory tables in PostgreSQL).

Yet another suggestion is to try Redis. You can store "Sensor Data" as a collection of sorted sets: one sorted set per sensor id and value field, with the data sorted by timestamp.

 ZADD sensor_id:value1 timestamp value
 ZADD sensor_id:value2 timestamp value

Redis will require some application logic to accumulate the data, but it will be very fast if it all fits in RAM.
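
A rough redis-py version of that layout (the key names are mine, and the zadd mapping form needs redis-py 3.x):

    # One sorted set per sensor and value field, scored by UNIX timestamp,
    # so time-range queries are cheap.
    import time
    import redis

    r = redis.StrictRedis(decode_responses=True)

    def record_value2(sensor_id, value2):
        ts = time.time()
        # Members must be unique, so embed the timestamp in the member itself.
        member = '%f:%d' % (ts, 1 if value2 else 0)
        r.zadd('sensor:%s:value2' % sensor_id, {member: ts})

    def readings_last_24h(sensor_id):
        since = time.time() - 24 * 3600
        return r.zrangebyscore('sensor:%s:value2' % sensor_id, since, '+inf')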

Re: MongoDB. You can get good performance as long as your queryable data plus indexes fit in RAM and there aren't too many write locks. That said, it's an administrative (and coding) burden to run two heavyweight databases that provide overlapping features. Given that, compaction is not really an issue: you can create TTL indexes on the sensor data and Mongo will delete older data in a background thread, so the file size stays roughly constant after a while.
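
For reference, a TTL index in PyMongo looks roughly like this (database and collection names are assumptions, and "timestamp" must be stored as a BSON date):

    # Documents whose "timestamp" is older than 90 days get deleted by
    # MongoDB's background TTL thread.
    from pymongo import MongoClient

    client = MongoClient()
    client.sensors.sensor_data.create_index(
        'timestamp', expireAfterSeconds=90 * 24 * 3600)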

Hope this helps

Answered by Adil