Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Are document databases good for storing large amounts of Stock Tick data? [closed]

I was thinking of using a database like mongodb or ravendb to store a lot of stock tick data and wanted to know if this would be viable compared to a standard relational such as Sql Server.

The data would not really be relational and would be a couple of huge tables. I was also thinking that I could sum/min/max rows of data by minute/hour/day/week/month etc for even faster calculations.

Example data: 500 symbols * 60 min * 60sec * 300 days... (per record we store: date, open, high,low,close, volume, openint - all decimal/float)

So what do you guys think?

like image 242
dvkwong Avatar asked Jul 08 '10 20:07

dvkwong


People also ask

Which type of database is recommended for storing financial data?

If you had no concerns about storage costs (managed MongoDB Atlas is a bit pricey), MongoDB is a clear winner for storing end-of-day OHLC data. It has the fastest reads and very good writes and multi-record appends.

Where is tick data stored?

You can save tick data to a flat file for all the software cares, but that would be really slow to access later. Column-oriented - all elements in a field are stored contiguously for better caching. Binary - all elements are ready for immediate use; no lexical casting required.


2 Answers

Since when this question was asked in 2010, several database engines were released or have developed features that specifically handle time series such as stock tick data:

  • InfluxDB - see my other answer
  • Cassandra

With MongoDB or other document-oriented databases, if you target performance, the advices is to contort your schema to organize ticks in an object keyed by seconds (or an object of minutes, each minute being another object with 60 seconds). With a specialized time series database, you can query data simply with

SELECT open, close FROM market_data
WHERE symbol = 'AAPL' AND time > '2016-09-14' AND time < '2016-09-21'

I was also thinking that I could sum/min/max rows of data by minute/hour/day/week/month etc for even faster calculations.

With InfluxDB, this is very straightforward. Here's how to get the daily minimums and maximums:

SELECT MIN("close"), MAX("close") FROM "market_data" WHERE WHERE symbol = 'AAPL'
GROUP BY time(1d)

You can group by time intervals which can be in microseconds (u), seconds (s), minutes (m), hours (h), days (d) or weeks (w).

TL;DR

Time-series databases are better choices than document-oriented databases for storing and querying large amounts of stock tick data.

like image 108
Dan Dascalescu Avatar answered Oct 27 '22 01:10

Dan Dascalescu


The answer here will depend on scope.

MongoDB is great way to get the data "in" and it's really fast at querying individual pieces. It's also nice as it is built to scale horizontally.

However, what you'll have to remember is that all of your significant "queries" are actually going to result from "batch job output".

As an example, Gilt Groupe has created a system called Hummingbird that they use for real-time analytics on their web site. Presentation here. They're basically dynamically rendering pages based on collected performance data in tight intervals (15 minutes).

In their case, they have a simple cycle: post data to mongo -> run map-reduce -> push data to webs for real-time optimization -> rinse / repeat.

This is honestly pretty close to what you probably want to do. However, there are some limitations here:

  1. Map-reduce is new to many people. If you're familiar with SQL, you'll have to accept the learning curve of Map-reduce.
  2. If you're pumping in lots of data, your map-reduces are going to be slower on those boxes. You'll probably want to look at slaving / replica pairs if response times are a big deal.

On the other hand, you'll run into different variants of these problems with SQL.

Of course there are some benefits here:

  1. Horizontal scalability. If you have lots of boxes then you can shard them and get somewhat linear performance increases on Map/Reduce jobs (that's how they work). Building such a "cluster" with SQL databases is lot more costly and expensive.
  2. Really fast speed and as with point #1, you get the ability to add RAM horizontally to keep up the speed.

As mentioned by others though, you're going to lose access to ETL and other common analysis tools. You'll definitely be on the hook to write a lot of your own analysis tools.

like image 31
Gates VP Avatar answered Oct 27 '22 01:10

Gates VP