For one of my projects, I have to enter a big-ish collection of events into a database for later processing and I am trying to decide which DBMS would be best for my purpose.
I have:
About 400,000,000 discrete events at the moment
About 600 GB of data that will be stored in the DB
These events come in a variety of formats, but I estimate the count of individual attributes to be about 5,000. Most events only contain values for about 100 attributes each. The attribute values are to be treated as arbitrary strings and, in some cases, integers.
The events will eventually be consolidated into a single time series. While they do have some internal structure, there are no references to other events, which - I believe - means that I don't need an object DB or some ORM system.
My requirements:
Open source license - I may have to tweak it a bit.
The ability to scale out to multiple servers, although only one system will be used at first.
Fast queries - updates are not that critical.
Mature drivers/bindings for C/C++, Java and Python, preferably with a license that plays well with others - I'd rather not commit myself to anything because of a technical decision. I think most DB drivers are fine in this respect, but it should be mentioned anyway.
Availability for Linux.
It would be nice, but not necessary, if it were also available for Windows.
My ideal DB for this would allow me to retrieve all the events from a specified time period with a single query.
What I have found/considered so far:
PostgreSQL with an increased page size can apparently have up to 6,000 columns in each table. If my estimate of the attribute count is not off, it might do.
MySQL seems to have a limit of 4,000 columns per table. I could use multiple tables with a bit of SQL-fu, but I'd rather not.
MongoDB is what I am currently leaning towards. It would allow me to preserve the internal structure of the events while still being able to query them, and its API seems quite straightforward. I have no idea how well it does performance-wise, though - at least on a single server. (A rough sketch of what I have in mind is below, after this list.)
OpenTSDB and its metric collection framework sound interesting. I could use a single time series for each attribute (which might help with some of my processing), have the attribute value as a tag, and additionally tag the entries to associate them with a specific event. It probably has a steeper learning curve than the three above, both from an administrator and an application programmer point of view. No idea about its performance.
Use HBase directly. This might fit my requirements better than OpenTSDB, although - judging from my past experience with Hadoop - the administration overhead is probably still higher than for the first three options.
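To make the MongoDB option above concrete, here is a rough sketch of what I have in mind using pymongo (the connection URL, database/collection names and field names are placeholders I made up, not anything final):

    from datetime import datetime, timezone
    from pymongo import MongoClient

    # Placeholder connection and names - adjust to taste.
    client = MongoClient("mongodb://localhost:27017")
    events = client["eventdb"]["events"]

    # Each event keeps its internal structure; only the ~100 attributes
    # it actually has are stored, out of the ~5,000 possible ones.
    events.insert_one({
        "timestamp": datetime(2012, 6, 1, 12, 0, tzinfo=timezone.utc),
        "attributes": {"attr_17": "some string", "attr_42": 7},
    })

    # An index on the timestamp should keep time-range queries fast.
    events.create_index("timestamp")

    # "All events from a specified time period with a single query":
    start = datetime(2012, 6, 1, tzinfo=timezone.utc)
    end = datetime(2012, 6, 2, tzinfo=timezone.utc)
    for event in events.find({"timestamp": {"$gte": start, "$lt": end}}):
        print(event["_id"], event["timestamp"])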
There are probably other databases that could do it, so feel free to let me know - I would appreciate any suggestion or comment that might help me with this.
PS: I only have minimal experience as a DB administrator, so I apologise for any misconceptions.
Using tables with thousands of columns is madness, especially when most of them are empty for any given event, as you said.
You should first look into converting your data structure from this:
table_1
-------
event_id
attribute_1
attribute_2
[...]
attribute_5000
into something like this:
table_1       event_values      attributes
-------       ------------      ----------
event_id      event_id          attribute_id
              attribute_id      attribute_type
              attribute_value
which can be used with any RDBMS (your only constraints then would be the total database size and performance).
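As a minimal, self-contained illustration of that layout, here is a sketch using Python's built-in sqlite3 module (purely as a stand-in for whatever RDBMS you pick; note that I have added a timestamp column to table_1 to cover your time-range requirement, which the diagram above does not show):

    import sqlite3

    con = sqlite3.connect(":memory:")  # demo only; use your real RDBMS
    cur = con.cursor()

    # The three tables from the diagram above.
    cur.executescript("""
        CREATE TABLE table_1 (
            event_id  INTEGER PRIMARY KEY,
            timestamp INTEGER NOT NULL      -- my addition, for range queries
        );
        CREATE TABLE attributes (
            attribute_id   INTEGER PRIMARY KEY,
            attribute_type TEXT NOT NULL    -- e.g. 'string' or 'integer'
        );
        CREATE TABLE event_values (
            event_id        INTEGER REFERENCES table_1(event_id),
            attribute_id    INTEGER REFERENCES attributes(attribute_id),
            attribute_value TEXT
        );
        CREATE INDEX idx_event_time ON table_1(timestamp);
    """)

    # One row per (event, attribute) pair: only the ~100 attributes an
    # event actually has are stored, not 5,000 mostly-empty columns.
    cur.execute("INSERT INTO table_1 VALUES (1, 1338552000)")
    cur.execute("INSERT INTO attributes VALUES (17, 'string')")
    cur.execute("INSERT INTO event_values VALUES (1, 17, 'some value')")

    # All events (with their attribute values) in a time period, one query:
    cur.execute("""
        SELECT e.event_id, e.timestamp, v.attribute_id, v.attribute_value
        FROM table_1 e
        JOIN event_values v ON v.event_id = e.event_id
        WHERE e.timestamp BETWEEN 1338508800 AND 1338595200
    """)
    print(cur.fetchall())

The attributes table then acts as a dictionary for your ~5,000 attribute names and types, so a new attribute becomes a new row instead of a schema change.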