I was thinking of using a database like mongodb or ravendb to store a lot of stock tick data and wanted to know if this would be viable compared to a standard relational such as Sql Server.
The data would not really be relational and would be a couple of huge tables. I was also thinking that I could sum/min/max rows of data by minute/hour/day/week/month etc for even faster calculations.
Example data: 500 symbols * 60 min * 60sec * 300 days... (per record we store: date, open, high,low,close, volume, openint - all decimal/float)
So what do you guys think?
If you had no concerns about storage costs (managed MongoDB Atlas is a bit pricey), MongoDB is a clear winner for storing end-of-day OHLC data. It has the fastest reads and very good writes and multi-record appends.
You can save tick data to a flat file for all the software cares, but that would be really slow to access later. Column-oriented - all elements in a field are stored contiguously for better caching. Binary - all elements are ready for immediate use; no lexical casting required.
Since when this question was asked in 2010, several database engines were released or have developed features that specifically handle time series such as stock tick data:
With MongoDB or other document-oriented databases, if you target performance, the advices is to contort your schema to organize ticks in an object keyed by seconds (or an object of minutes, each minute being another object with 60 seconds). With a specialized time series database, you can query data simply with
SELECT open, close FROM market_data
WHERE symbol = 'AAPL' AND time > '2016-09-14' AND time < '2016-09-21'
I was also thinking that I could sum/min/max rows of data by minute/hour/day/week/month etc for even faster calculations.
With InfluxDB, this is very straightforward. Here's how to get the daily minimums and maximums:
SELECT MIN("close"), MAX("close") FROM "market_data" WHERE WHERE symbol = 'AAPL'
GROUP BY time(1d)
You can group by time intervals which can be in microseconds (u
), seconds (s
), minutes (m
), hours (h
), days (d
) or weeks (w
).
Time-series databases are better choices than document-oriented databases for storing and querying large amounts of stock tick data.
The answer here will depend on scope.
MongoDB is great way to get the data "in" and it's really fast at querying individual pieces. It's also nice as it is built to scale horizontally.
However, what you'll have to remember is that all of your significant "queries" are actually going to result from "batch job output".
As an example, Gilt Groupe has created a system called Hummingbird that they use for real-time analytics on their web site. Presentation here. They're basically dynamically rendering pages based on collected performance data in tight intervals (15 minutes).
In their case, they have a simple cycle: post data to mongo -> run map-reduce -> push data to webs for real-time optimization -> rinse / repeat.
This is honestly pretty close to what you probably want to do. However, there are some limitations here:
On the other hand, you'll run into different variants of these problems with SQL.
Of course there are some benefits here:
As mentioned by others though, you're going to lose access to ETL and other common analysis tools. You'll definitely be on the hook to write a lot of your own analysis tools.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With