
Database choice: High-write, low-read

I'm building a component for recording historical data. Initially I expect it to do about 30 writes/second, and less than 1 read/second.

The data will never be modified; only new data will be added. Reads will mostly be of freshly written records.

The demand is likely to increase rapidly; I expect around 80 writes/second within a year.

I could choose to distribute my component and use a conventional database such as MySQL, or I could go with a distributed database such as MongoDB. Either way, I'd like the database to handle writes very well.

The database must be free. Open source would be a plus :-)

Note: A record is plain text of variable size, typically 50 to 500 words.
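For concreteness, here is a rough sketch of the kind of append-only table I have in mind (MySQL syntax; the table and column names are made up for illustration):

    -- Hypothetical append-only table for the history records described above
    CREATE TABLE history (
      id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      created_at DATETIME        NOT NULL,
      body       TEXT            NOT NULL,  -- 50 to 500 words of plain text
      KEY idx_created_at (created_at)
    ) ENGINE=InnoDB;

    -- Writes are append-only, roughly 30/second today:
    INSERT INTO history (created_at, body) VALUES (NOW(), 'record text ...');

    -- The rare reads touch recent records, by key or by recency:
    SELECT body FROM history WHERE id = 12345;
    SELECT body FROM history ORDER BY created_at DESC LIMIT 100;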

asked Jul 12 '11 by mikabytes

1 Answer

Your question can be solved a few different ways, so let's break it down and look at the individual requirements you've laid out:

  1. Writes - It sounds like the bulk of what you're doing is append-only writes at a relatively low volume (80 writes/second). Just about any product on the market with a reasonable storage backend is going to be able to handle this. You're looking at 50-500 "words" of data being saved. I'm not sure what constitutes a word, but for the sake of argument let's assume a word averages 8 characters, so each record is some metadata (a key, a timestamp, whatever) plus 400-4,000 bytes of text. Barring implementation-specific details of different RDBMSes, that's still pretty normal: we're writing at most (including record overhead) about 4,100 bytes per record. That maxes out at 328,000 bytes per second or, as I like to put it, not a lot of writing.

  2. Deletes - You also need the ability to delete your data. There's not a lot I can say about that. Deletes are deletes.

  3. Reading - Here's where things get tricky. You mention that reads are mostly by primary key and are done on fresh data. I'm not sure exactly what either of those means in your setup, but I don't think it matters: if you're doing key-only lookups (e.g. "I want record 8675309"), then life is good and you can use just about anything.

  4. Joins - If you need the ability to write actual joins where the database handles them, you've written yourself out of the major non-relational database products.

  5. Data size/Data life - This is where things get fun. You've estimated your writes at 80/second, and I've guessed at 4,100 bytes per record, or 328,000 bytes per second. There are 86,400 seconds in a day, which gives us 28,339,200,000 bytes. Terrifying! That's 27,675,000 KB, 27,026 MB, or roughly 26 GB per day (the same arithmetic is spelled out in the query sketch after this list). Even if you're only keeping your data for 1 year, that's about 9,633 GB, or roughly 10 TB of data. You can lease 1 TB of storage from a cloud hosting provider for around $250 per month or buy it from a SAN vendor like EqualLogic for about $15,000.
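To make that arithmetic easy to check, here is the same estimate expressed as a single query (the 80 writes/second and ~4,100 bytes/record figures are the assumptions from items 1 and 5 above, not measurements):

    -- Back-of-the-envelope storage estimate
    SELECT
      80 * 4100                              AS bytes_per_second,  -- 328,000
      80 * 4100 * 86400                      AS bytes_per_day,     -- 28,339,200,000
      80 * 4100 * 86400 / POW(1024, 3)       AS gb_per_day,        -- ~26.4
      80 * 4100 * 86400 * 365 / POW(1024, 3) AS gb_per_year;       -- ~9,633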

Conclusion: I can only think of a few databases that couldn't handle this load. 10TB is getting a bit tricky and requires a bit of administration skill, and you might need to look at certain data lifecycle management techniques, but almost any RDBMS should be up to this task. Likewise, almost any non-relational/NoSQL database should be up to this task. In fact, almost any database of any sort should be up to the task.
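As one concrete example of those lifecycle techniques: if you land on MySQL, range partitioning by date turns "drop old history" into a cheap metadata operation instead of a massive DELETE. A sketch, with hypothetical table and partition names (note that MySQL requires the partitioning column to be part of the primary key):

    CREATE TABLE history_archive (
      id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
      created_at DATETIME        NOT NULL,
      body       TEXT            NOT NULL,
      PRIMARY KEY (id, created_at)   -- partitioning column must be in the PK
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(created_at)) (
      PARTITION p2011_07 VALUES LESS THAN (TO_DAYS('2011-08-01')),
      PARTITION p2011_08 VALUES LESS THAN (TO_DAYS('2011-09-01')),
      PARTITION pmax     VALUES LESS THAN MAXVALUE
    );

    -- Retiring a month of old data is then nearly instant:
    ALTER TABLE history_archive DROP PARTITION p2011_07;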

If you (or your team members) already have skills in a particular product, just stick with that. If there's a specific product that excels in your problem domain, use that.

This isn't the type of problem that requires any kind of distributed magical unicorn powder.

answered by Jeremiah Peschka