Database recommendations needed -> Columnar, Embedded (if possible)

EDIT: Based on the answers so far, I want to sharpen what I am zeroing in on: a database that allows in-memory writes (could be simple C# code) with options to persist to storage, so that the data can later be accessed from within R. Redis looks the most promising so far. I am also considering something along the lines of Lockfree++ or ZeroMQ, to avoid writing data concurrently to the database: instead, all data to be persisted would be sent over a message bus/other implementation, and a single "actor" would handle all write operations to an in-memory db or other solution (a minimal sketch of that single-writer idea follows below). Any more ideas besides Redis? (Some mentioned SQLite; I still need to test its performance.) Any other suggestions?
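To make that concrete, here is a minimal sketch of the single-writer pattern using .NET's BlockingCollection (the Sample struct, queue capacity, and persist callback are illustrative assumptions, not a finished design):

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Illustrative record type; the fields are assumptions.
struct Sample { public long Ticks; public double Value; public int SourceId; }

class SingleWriter
{
    // Bounded queue: producers block when it is full, which back-pressures
    // data generation instead of blowing past the 16 GB RAM budget.
    private readonly BlockingCollection<Sample> _queue =
        new BlockingCollection<Sample>(boundedCapacity: 1 << 20);

    // Producers on any thread call this; no locks needed on their side.
    public void Post(Sample s) { _queue.Add(s); }

    // Signal that producers are done so the writer loop can drain and exit.
    public void CompleteAdding() { _queue.CompleteAdding(); }

    // The single "actor": the only code path that ever touches storage.
    // 'persist' stands in for whatever backend is chosen (Redis, SQLite, flat files).
    public Task RunWriter(Action<Sample> persist)
    {
        return Task.Factory.StartNew(() =>
        {
            foreach (var s in _queue.GetConsumingEnumerable())
                persist(s);
        }, TaskCreationOptions.LongRunning);
    }
}

Because only one consumer drains the queue, every write to the store is serialized without locking around the store itself; the concurrency question moves from the database to a well-understood producer/consumer queue.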

I am searching for the ideal database structure/solution that meets most of my requirements below, but so far I have utterly failed to find one. Can you please help?

My tasks: I run a process in .Net 4.5 (C#) that generates (mostly) value types which I want to use for further analysis in other applications, so I would like to either keep them in memory or persist them to disk. More below. The data is generated within different tasks/threads, so a row-based data format does not match the situation well (data generated in different threads is generated at different times and is thus not aligned). Hence I thought a columnar data structure may be suitable, but please correct me if I am wrong.

Example:

Task/Thread #1 generates the following data at given time stamps

datetime.ticks / value of output data

1000000001 233.23

1000000002 233.34

1000000006 234.23 ...

Task/Thread #2 generates the following data at given time stamps

datetime.ticks / value of output data

1000000002 33.32

1000000005 34.34

1000000015 54.32 ...

I do not need to align the time stamps at .Net run-time; I am first and foremost after preserving the data, to be processed within R or Python at a later point.

My requirements:

  • Fast writes, fast writes, fast writes: it can happen that I generate 100,000 to 1,000,000 data points per second and need to persist (worst case) or retain the data in memory. It's OK to run the writes on their own thread, so this process can lag behind the data-generation process, but the limitation is 16 GB RAM (64-bit code); more below.

  • Preference is for a columnar db format, as it lends itself well to how I want to query the data later, but I am open to any other structure if it makes sense given the examples above (document/key-value is also OK if all other requirements are met, especially write speed).

  • An API that can be referenced from within .Net. Example: HDF5 may be considered capable by some, but I find its .Net port horrible. Something that supports .Net a little better would be a plus, but if all other requirements are met I can deal with something similar to the HDF5 .Net port.

  • Concurrent writes if possible: as described earlier, I want to write data concurrently from different tasks/threads.

  • I am constrained to 16 GB memory (the .Net process runs in 64-bit), so I am probably looking for something that is not purely in-memory, as I may sometimes generate more data than that. Something in-memory that persists at times, or a pure persistence model, is probably preferable.

  • Preference for embedded, but if the server in a client/server solution can run as a Windows service, that is no issue.

  • In terms of data access, I have a strong preference for a db solution for which interfaces from R and Python already exist, because I want to use the pandas library within Python for time-series alignments and other analysis, and run analyses within R.

  • If the API/library additionally supports SQL/SQL-like/LINQ-style queries, that would be terrific, but generally I just need the absolute bare bones, such as loading columnar data between a start and end date (given the "key"/index is in such a format), because I analyze and run queries within R/Python.

  • If it comes with a management console or data visualizer that would be a plus but not a must.

  • Should be open source or priced within "reach"; no, KDB does not qualify in that regard ;-)

OK, here is what I have so far, and again, it's all I've got, because most db solutions simply fail already on the write-performance requirement:

  • Infobright and Db4o. I like what I have read so far, but I admit I have not looked into any performance stats.
  • Something done myself. I can easily store value types in binary format and index the data by datetime.ticks; I would just need to somehow write scripts to load/deserialize the data in Python/R (a minimal sketch of the writer side follows this list). But it would be a massive task if I wanted to add concurrency, a query engine, and other goodies. Thus I am looking for something already out there.
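For illustration, a minimal sketch of that roll-your-own writer: fixed-width 16-byte records, one file per task/thread (the layout is an assumption; whatever reader you script in R/Python just has to agree on it):

using System;
using System.IO;

// Appends (ticks, value) pairs as fixed 16-byte little-endian records;
// the fixed width makes the files trivially seekable by record index.
class ColumnAppender : IDisposable
{
    private readonly BinaryWriter _w;

    public ColumnAppender(string path)
    {
        // 1 MB buffer so each Append is a memory copy, not a syscall.
        var fs = new FileStream(path, FileMode.Append, FileAccess.Write,
                                FileShare.Read, 1 << 20);
        _w = new BinaryWriter(fs);
    }

    public void Append(long ticks, double value)
    {
        _w.Write(ticks);   // 8 bytes
        _w.Write(value);   // 8 bytes
    }

    public void Dispose() { _w.Close(); }
}

On the reading side, numpy's fromfile with a structured dtype, or R's readBin, can load such a file without any parsing, which keeps the "load columnar data between start and end date" case simple.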
asked Nov 05 '12 by Matt



1 Answer

I can't comment -- low rep (I'm new here) -- so you get a full answer instead...

First, are you sure you need a database at all? If fast write speed and portability to R are your biggest concerns, have you considered a flat-file mechanism? According to your comments you're willing to batch writes out, but you need persistence; if those were my requirements I'd write a straight-to-disk buffering system that was lightning fast, then build a separate task that periodically took the disk files and moved them into a data store for R, and that's only if R reading the flat files wasn't sufficient in the first place.

If you can do alignment after the fact, then you could write the threads to separate files in your main parallel loop, cutting each file off every so often, and leave the alignment and database loading to the subprocess.

So (in crappy pseudo_code), build a thread process that you'd call with BackgroundWorker or some such, and include a threadname string uniquely identifying each worker and thus each filestream (task/thread):

file_name = threadname + '0001.csv' // or something
open(file_name for writing)
while(generating_data) {
    generate_data()
    while (buffer_not_full and very_busy) {
        write_data_to_buffer
        generate_data()
    }
    flush_buffer_to_disk(file_name)
    if(file is big enough or enough time has passed or we're not too busy) {
        close(file_name)
        move(file_name to bob's folder)
        increment file_name
        open(file_name for writing)
    }
}
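For concreteness, one way that loop might look as actual C# (a sketch only: the roll threshold, file-name pattern, and hand-off folder are placeholders to tune, not part of the original recipe):

using System.Globalization;
using System.IO;

class ThreadFileWriter
{
    private const long RollBytes = 64L << 20;   // roll at ~64 MB; tune for the workload
    private readonly string _threadName;
    private readonly string _doneDir;           // stand-in for "bob's folder"
    private int _fileIndex;
    private StreamWriter _writer;
    private string _currentPath;

    public ThreadFileWriter(string threadName, string doneDir)
    {
        _threadName = threadName;
        _doneDir = doneDir;
        OpenNext();
    }

    private void OpenNext()
    {
        _currentPath = _threadName + _fileIndex.ToString("D4") + ".csv";
        _writer = new StreamWriter(_currentPath);   // buffered; fully flushed on roll
    }

    // Called only from the owning thread, so no locking is needed.
    public void Write(long ticks, double value)
    {
        _writer.WriteLine("{0},{1}", ticks,
                          value.ToString("R", CultureInfo.InvariantCulture));
        if (_writer.BaseStream.Position >= RollBytes)
            Roll();
    }

    // Hand the finished file to the separate loader process, start a new one.
    private void Roll()
    {
        _writer.Dispose();                          // flush and close
        File.Move(_currentPath,
                  Path.Combine(_doneDir, Path.GetFileName(_currentPath)));
        _fileIndex++;
        OpenNext();
    }
}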

Efficient and speedy file I/O and buffering is a straightforward and common problem. Nothing is going to be faster than this. Then you can just write another process to do the database loads and not sweat the performance there:

while(file_name in list of files in bob's folder sorted by date for good measure)
{
    read bob's file
    load bob's file to database
    align dates, make pretty
}

And I wouldn't write that part in C#, I'd batch script it and use the database's native loader which is going to be as fast as anything you can build from scratch.

You'll have to make sure the two loops don't interfere much if you're running on the same hardware. That is, run the task threads at a higher priority, or build in some mutex or performance limiters so that the database load doesn't hog resources while the threads are running. I'd definitely segregate the database server and hardware so that file I/O to the flat files isn't compromised.

FIFO queues would work if you're on Unix, but you're not. :-)

Also, hardware is going to have more of a performance impact for you than the database engine, I'd imagine. If you're on a budget I'm guessing you're on COTS hardware, so springing for a solid state drive may up performance fairly cheaply. As I said, separating the DB storage from the flat file storage would help, and the CPU/RAM for R, the Database, and your Threads should all be segregated ideally.

What I'm saying is that choice of DB vendor probably isn't your biggest issue, unless you have a lot of money to spend. You'll be hardware bound most of the time otherwise. Database tuning is an art, and while you can eke out minor performance gains at the top end, having a good database administrator will keep most databases in the same ballpark for performance. I'd look at what R and Python support well and that you're comfortable with. If you think in columnar fashion then look at R and C#'s support for Cassandra (my vote), Hana, Lucid, HBase, Infobright, Vertica and others, and pick one based on price and support. For traditional databases on a single commodity machine, I haven't seen anything that MySQL can't handle.

answered by Chipmonkey