We have a large and growing dataset of experimental data taken from around 30,000 subjects. For each subject, there are several recordings of data. Within each recording, there is a collection of several time series of physiological data, each about 90 seconds long and sampled at 250 Hz. I should note that any given instance of a time series is never extended; instead, additional recordings are added to the dataset. The recordings also vary in length. Currently, the data for each recording is contained in its own flat file. These files are organized in a directory structure broken down hierarchically by version of the overall experiment, experiment location, date, and experiment terminal (in that order).
Most of our analysis is done in MATLAB and we plan to continue to use MATLAB extensively for further analysis. The situation as it stands was workable (if undesirable) when all researchers were co-located. We are now spread around the globe and I am investigating the best solution to make all of this data available from remote locations. I am well-versed in MySQL and SQL Server, and could easily come up with a way to structure this data within such a paradigm. I am, however, skeptical as to the efficiency of this approach. I would value any suggestions that might point me in the right direction. Should I be considering something different? Time series databases (though those seem to me to be tuned for extending existing time series)? Something else?
Analysis does not need to be done online, though the possibility of doing so would be a plus. For now, our typical use case would be to query for a specific subset of recordings and pull down the associated time series for local analysis. I appreciate any advice you might have!
In my research, I've found this paper, where they are storing and analyzing very similar signals. They've chosen MongoDB for the following reasons:
These are all attractive advantages to me, as well. Development looks dead simple, and the ability to easily augment existing documents with the results of analysis is clearly helpful (though I know this isn't exactly difficult to do in the systems with which I am already familiar).
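For instance, attaching a derived metric to an existing subject document would look something like this rough Python/pymongo sketch (the database, collection, and field names here are purely illustrative, not part of any existing schema):

# Rough sketch: augmenting an existing document with an analysis result.
# Assumes pymongo, a reachable mongod, and a hypothetical physio.subjects collection.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
subjects = client["physio"]["subjects"]

doc = subjects.find_one()
if doc is not None:
    # "analysis.mean_eda" is an illustrative field name, not part of the sketched schema.
    subjects.update_one({"_id": doc["_id"]}, {"$set": {"analysis.mean_eda": 148.73}})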
To be clear, I know that I could leave the data stored in flat files, and I know I could simply arrange for secure access to those flat files from MATLAB over the network. There are numerous reasons I want to store these data in a database instead.
The long and short of it is that I want to store these data in a database for myriad reasons (space, efficiency, and ease-of-access considerations, among many others).
I seem not to be sufficiently describing the nature of these data, so I will attempt to clarify. These recordings are certainly time series data, but not in the way many people think of time series: I am not continually capturing data to be appended to an existing series. I am really making multiple recordings, all with varying metadata, but of the same three signals. Each signal can be thought of as a vector of numbers, and the length of these vectors varies from recording to recording. In a traditional RDBMS, I might create one table for recording type A, one for B, etc., and treat each row as a data point in the time series. However, this does not work well, as recordings vary in length. Rather, I would prefer to have an entity that represents a person, and have that entity associated with the several recordings taken from that person. This is why I have considered MongoDB, as I can nest several arrays (of varying lengths) within one object in a collection.
As an example, here's what I sketched as a potential MongoDB BSON structure for a subject:
{
    "songs": {
        "order": [
            "R008",
            "R017",
            "T015"
        ],
        "times": [
            {
                "start": "2012-07-02T17:38:56.000Z",
                "finish": "2012-07-02T17:40:56.000Z",
                "duration": 119188.445
            },
            {
                "start": "2012-07-02T17:42:22.000Z",
                "finish": "2012-07-02T17:43:41.000Z",
                "duration": 79593.648
            },
            {
                "start": "2012-07-02T17:44:37.000Z",
                "finish": "2012-07-02T17:46:19.000Z",
                "duration": 102450.695
            }
        ]
    },
    "self_report": {
        "music_styles": {
            "none": false,
            "world": true
        },
        "songs": [
            {
                "engagement": 4,
                "positivity": 4,
                "activity": 3,
                "power": 4,
                "chills": 4,
                "like": 4,
                "familiarity": 4
            },
            {
                "engagement": 4,
                "positivity": 4,
                "activity": 3,
                "power": 4,
                "chills": 4,
                "like": 4,
                "familiarity": 3
            },
            {
                "engagement": 2,
                "positivity": 1,
                "activity": 2,
                "power": 2,
                "chills": 4,
                "like": 1,
                "familiarity": 1
            }
        ],
        "most_engaged": 1,
        "most_enjoyed": 1,
        "emotion_indices": [
            0.729994,
            0.471576,
            28.9082
        ]
    },
    "signals": {
        "test": {
            "timestamps": [0.010, 0.010, 0.021, ...],
            "eda": [149.200, 149.200, 149.200, ...],
            "pox": [86.957, 86.957, 86.957, ...]
        },
        "songs": [
            {
                "timestamps": [0.010, 0.010, 0.021, ...],
                "eda": [149.200, 149.200, 149.200, ...],
                "pox": [86.957, 86.957, 86.957, ...]
            },
            {
                "timestamps": [0.010, 0.010, 0.021, ...],
                "eda": [149.200, 149.200, 149.200, ...],
                "pox": [86.957, 86.957, 86.957, ...]
            },
            {
                "timestamps": [0.010, 0.010, 0.021, ...],
                "eda": [149.200, 149.200, 149.200, ...],
                "pox": [86.957, 86.957, 86.957, ...]
            }
        ]
    },
    "demographics": {
        "gender": "female",
        "dob": 1980,
        "nationality": "rest of the world",
        "musical_background": false,
        "musical_expertise": 1,
        "impairments": {
            "hearing": false,
            "visual": false
        }
    },
    "timestamps": {
        "start": "2012-07-02T17:37:47.000Z",
        "test": "2012-07-02T17:38:16.000Z",
        "end": "2012-07-02T17:46:56.000Z"
    }
}
Those signals are the time series.
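To make the intended usage concrete, here is a rough sketch (Python with pymongo; the database and collection names are illustrative) of pulling the signal vectors for one recording out of a document shaped like the one above for local analysis:

# Rough sketch: fetch just the signal block of one subject document shaped like the
# example above. Assumes pymongo and a hypothetical physio.subjects collection.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
subjects = client["physio"]["subjects"]

# A projection keeps the transfer small: only the signals are pulled down.
doc = subjects.find_one({}, {"signals": 1, "_id": 0})

# Each entry of signals.songs is one recording; its vectors can differ in length
# from recording to recording.
first_song = doc["signals"]["songs"][0]
timestamps = first_song["timestamps"]
eda = first_song["eda"]
pox = first_song["pox"]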
Quite often, people come to NoSQL databases having heard that there is no schema and life is good. However, IMHO that is the wrong notion.
When dealing with NoSQL, you have to think in terms of "aggregates". Typically, an aggregate is an entity that can be operated on as a single unit. In your case, one possible (but not very efficient) way would be to model a user and his/her data as a single aggregate. This would keep your user aggregate data-centre / shard agnostic. But if the data keeps growing, loading a user will also load all of the related data and become a memory hog (Mongo as such is a bit greedy on memory).
Another option would be to store each recording as its own aggregate, "linked" back to the user with an id - this can be a synthetic key that you create, such as a GUID. Even though this superficially looks like a join, it is just a "look up by property", since there is no real referential integrity here. This may be the approach I would take if files are going to be added constantly.
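As a minimal sketch of that second layout (Python/pymongo; the database, collection, and field names are illustrative):

# Sketch of the "each recording is its own aggregate" layout.
# Assumes pymongo and a reachable mongod; all names here are illustrative.
import uuid
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["physio"]

subject_id = str(uuid.uuid4())  # synthetic key, e.g. a GUID
db.subjects.insert_one({"_id": subject_id, "demographics": {"gender": "female"}})

# Each recording is a separate document that points back to its subject.
db.recordings.insert_one({
    "subject_id": subject_id,
    "recorded_at": datetime(2012, 7, 2, 17, 38, 56),
    "timestamps": [0.010, 0.010, 0.021],
    "eda": [149.200, 149.200, 149.200],
    "pox": [86.957, 86.957, 86.957],
})

# The "join" is just a lookup by property:
for rec in db.recordings.find({"subject_id": subject_id}):
    print(len(rec["eda"]))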
The place where MongoDB shines is that you can do ad hoc queries by a property in the document (create an index for that property if you don't want to lose hair later down the road). You will not go wrong choosing Mongo for time series data storage. You can, for example, extract the data that matches an id within a date range without doing any major stunts.
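For example (again with illustrative names, assuming the recordings collection sketched above, with recorded_at stored as a datetime):

# Ad hoc query sketch: index the lookup properties, then filter by subject and date range.
from datetime import datetime
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
recordings = client["physio"]["recordings"]

# Index the properties you query on, or you will lose hair later.
recordings.create_index([("subject_id", ASCENDING), ("recorded_at", ASCENDING)])

cursor = recordings.find({
    "subject_id": "some-subject-guid",  # placeholder id
    "recorded_at": {"$gte": datetime(2012, 7, 1), "$lt": datetime(2012, 8, 1)},
})
for rec in cursor:
    print(rec["recorded_at"], len(rec["eda"]))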
Please do ensure that you have replica sets no matter which approach you take, and choose your sharding approach diligently early on - sharding later is no fun.
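As a rough illustration of settling the sharding question early (this assumes a sharded cluster is already running and you are connected through mongos; the database, collection, and key names are illustrative):

# Rough sketch: shard the recordings collection by a hashed subject id.
# Assumes an existing sharded cluster; the connection must go through mongos.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # mongos, not a plain mongod

client.admin.command("enableSharding", "physio")
client.admin.command("shardCollection", "physio.recordings",
                     key={"subject_id": "hashed"})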