I was thinking of using a database like MongoDB or RavenDB to store a lot of stock tick data, and wanted to know whether this would be viable compared to a standard relational database such as SQL Server.

The data would not really be relational and would be a couple of huge tables. I was also thinking that I could sum/min/max rows of data by minute/hour/day/week/month etc. for even faster calculations.

Example data: 500 symbols * 60 min * 60 sec * 300 days... (per record we store: date, open, high, low, close, volume, openint, all decimal/float)

So what do you guys think?


Are document databases good for storing large amounts of Stock Tick data? [closed]

2 Answers

Since this question was asked in 2010, several database engines have been released or have developed features that specifically handle time series such as stock tick data:

InfluxDB - see my other answer
Cassandra

With MongoDB or other document-oriented databases, if you target performance, the advice is to contort your schema to organize ticks in an object keyed by seconds (or an object of minutes, each minute being another object with 60 seconds). With a specialized time series database, you can query data simply with

SELECT open, close FROM market_data
WHERE symbol = 'AAPL' AND time > '2016-09-14' AND time < '2016-09-21'
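For contrast, here is a minimal sketch of the document-database layout described above: one document per symbol and hour, with minutes and seconds as nested keys. It is plain Python with no database driver, and the function name `bucket_ticks` and the field names `hour` and `minutes` are assumptions made for illustration, not part of any MongoDB API.

```python
from datetime import datetime


def bucket_ticks(symbol, ticks):
    """Group (timestamp, price) ticks into one document per symbol and hour.

    Each document nests minutes, and each minute nests seconds, so a whole
    hour of ticks sits in a single document. Field names are hypothetical.
    """
    docs = {}
    for ts, price in ticks:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        doc = docs.setdefault(hour, {"symbol": symbol, "hour": hour, "minutes": {}})
        # Keys are strings because field names in a document are strings.
        seconds = doc["minutes"].setdefault(str(ts.minute), {})
        seconds[str(ts.second)] = price
    return list(docs.values())


ticks = [
    (datetime(2016, 9, 14, 9, 30, 0), 111.5),
    (datetime(2016, 9, 14, 9, 30, 1), 111.6),
    (datetime(2016, 9, 14, 9, 31, 0), 111.4),
    (datetime(2016, 9, 14, 10, 0, 0), 112.0),
]
docs = bucket_ticks("AAPL", ticks)
```

The application has to build and unpack this nesting itself, which is the "contortion" that a time series database avoids.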

"I was also thinking that I could sum/min/max rows of data by minute/hour/day/week/month etc for even faster calculations."

With InfluxDB, this is very straightforward. Here's how to get the daily minimums and maximums:

SELECT MIN("close"), MAX("close") FROM "market_data" WHERE symbol = 'AAPL'
GROUP BY time(1d)

You can group by time intervals which can be in microseconds (u), seconds (s), minutes (m), hours (h), days (d) or weeks (w).
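To make the roll-up idea concrete outside of any particular database, here is a rough sketch in plain Python that floors each tick to the start of its interval and keeps open/high/low/close/volume per bucket. The function name `rollup` and the tuple layout are assumptions for illustration; it only shows the kind of work `GROUP BY time(...)` does for you and is not how InfluxDB is implemented.

```python
from datetime import datetime, timedelta


def rollup(ticks, interval):
    """Aggregate (timestamp, price, volume) ticks into fixed-width bars.

    Returns {bucket_start: {"open", "high", "low", "close", "volume"}}.
    Ticks are assumed to be sorted by timestamp.
    """
    epoch = datetime(1970, 1, 1)
    bars = {}
    for ts, price, volume in ticks:
        # Floor the timestamp to the start of its interval.
        start = ts - (ts - epoch) % interval
        bar = bars.get(start)
        if bar is None:
            bars[start] = {"open": price, "high": price, "low": price,
                           "close": price, "volume": volume}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price  # last tick seen so far in this bucket
            bar["volume"] += volume
    return bars


ticks = [
    (datetime(2016, 9, 14, 9, 30, 0), 111.5, 100),
    (datetime(2016, 9, 14, 9, 30, 30), 111.9, 50),
    (datetime(2016, 9, 14, 9, 31, 10), 111.2, 70),
    (datetime(2016, 9, 15, 9, 30, 0), 113.0, 20),
]
minute_bars = rollup(ticks, timedelta(minutes=1))
daily_bars = rollup(ticks, timedelta(days=1))
```

The same function serves every granularity by changing `interval`, which is why pre-computing minute, hour and day bars is cheap once the ticks are stored in time order.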

TL;DR

Time-series databases are better choices than document-oriented databases for storing and querying large amounts of stock tick data.

answered Oct 27 '22 01:10

Dan Dascalescu

The answer here will depend on scope.

MongoDB is a great way to get the data "in" and it's really fast at querying individual pieces. It's also nice as it is built to scale horizontally.

However, what you'll have to remember is that all of your significant "queries" are actually going to result from "batch job output".

As an example, Gilt Groupe has created a system called Hummingbird that they use for real-time analytics on their web site. Presentation here. They're basically dynamically rendering pages based on collected performance data in tight intervals (15 minutes).

In their case, they have a simple cycle: post data to mongo -> run map-reduce -> push data to the web servers for real-time optimization -> rinse / repeat.
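For readers new to the pattern, the map-reduce step of that cycle can be sketched in plain Python. This is an illustration only: MongoDB's own map-reduce takes JavaScript functions, and the names `map_tick`, `reduce_values` and `map_reduce` as well as the tick fields are made up for this example.

```python
from collections import defaultdict
from datetime import datetime


def map_tick(tick):
    """Emit one (key, value) pair per tick; the key is (symbol, minute)."""
    minute = tick["time"].replace(second=0, microsecond=0)
    price = tick["price"]
    value = {"min": price, "max": price, "sum": price, "count": 1}
    return (tick["symbol"], minute), value


def reduce_values(values):
    """Merge partial aggregates for one key.

    The output has the same shape as each input, so already-reduced
    results can be fed back in and reduced again.
    """
    return {
        "min": min(v["min"] for v in values),
        "max": max(v["max"] for v in values),
        "sum": sum(v["sum"] for v in values),
        "count": sum(v["count"] for v in values),
    }


def map_reduce(ticks):
    grouped = defaultdict(list)
    for tick in ticks:
        key, value = map_tick(tick)
        grouped[key].append(value)
    return {key: reduce_values(values) for key, values in grouped.items()}


ticks = [
    {"symbol": "AAPL", "time": datetime(2016, 9, 14, 9, 30, 5), "price": 10.0},
    {"symbol": "AAPL", "time": datetime(2016, 9, 14, 9, 30, 40), "price": 12.0},
    {"symbol": "AAPL", "time": datetime(2016, 9, 14, 9, 31, 0), "price": 11.0},
]
result = map_reduce(ticks)
```

Because the reduce step accepts its own output, each new batch of ticks can be folded into the previous result instead of recomputing everything, which is what makes the "rinse / repeat" loop practical.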

This is honestly pretty close to what you probably want to do. However, there are some limitations here:

1. Map-reduce is new to many people. If you're familiar with SQL, you'll have to accept the learning curve of map-reduce.
2. If you're pumping in lots of data, your map-reduces are going to be slower on those boxes. You'll probably want to look at slaving / replica pairs if response times are a big deal.

On the other hand, you'll run into different variants of these problems with SQL.

Of course there are some benefits here:

1. Horizontal scalability. If you have lots of boxes then you can shard them and get somewhat linear performance increases on Map/Reduce jobs (that's how they work). Building such a "cluster" with SQL databases is a lot more costly.
2. Speed. As with point #1, you can add RAM horizontally to keep up the speed.
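The scaling claim rests on each shard being able to aggregate its own slice independently, with only the small partial results merged at the end. A toy sketch of that idea, run in a single process: the shard function `shard_for` and the choice of total volume per symbol as the aggregate are assumptions for illustration, not how MongoDB assigns shard keys.

```python
from collections import Counter


def shard_for(symbol, shard_count):
    """Pick a shard from the symbol so that a symbol always lands on one shard."""
    return sum(symbol.encode()) % shard_count


def partial_volume(ticks):
    """Work done independently on each shard: total volume per symbol."""
    totals = Counter()
    for symbol, volume in ticks:
        totals[symbol] += volume
    return totals


def sharded_volume(ticks, shard_count):
    shards = [[] for _ in range(shard_count)]
    for symbol, volume in ticks:
        shards[shard_for(symbol, shard_count)].append((symbol, volume))
    # Each partial could run on its own machine; here they run in sequence.
    partials = [partial_volume(shard) for shard in shards]
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return merged


ticks = [("AAPL", 100), ("MSFT", 40), ("AAPL", 50), ("GOOG", 10)]
totals = sharded_volume(ticks, shard_count=3)
```

The merged result matches what a single machine would compute over all ticks, while the per-shard work shrinks as shards are added.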

As mentioned by others though, you're going to lose access to ETL and other common analysis tools. You'll definitely be on the hook to write a lot of your own analysis tools.

answered Oct 27 '22 01:10

Gates VP



Tags:

database

mongodb

document

stocks

ravendb

dvkwong
