Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Database design question

I accumulated a quite a lot of data in a raw form (csv and binary) - 4GB per day for a few months to be precise.

I decided to join the civilized world and use database to access the data and I wondered what would be the correct layout; the format is quite simple: a few rows for every time tick (bid, ask, timestamp, etc.) x up to 0.5Million/day x hundreds of financial instruments x monthes of data.

There is a MySQL server with MYISAM (which I understood would be the correct engine for this type of usage) running on commodity harware (2 x 1GB RAID 0 SATA, core 2 @ 2.7GHz)

What would be correct layout of the database? How should the tables/indices look like? What are the general recommendations with this scenario? What would you predict set me pitfalls along the way?

Edit: my common usage will be simple queries to extract time series information for a specific date and instruments, e.g.

SELECT (ask + bid) / 2
  WHERE instrument='GOOG'
  AND date = '01-06-2008'
  ORDER BY timeStamp;

Edit: I tried to stuff all my data in one table indexed by the timeStamp but it was way too slow - therefore I reckoned it would take a more elaborate scheme.

like image 259
Dani Avatar asked Dec 13 '22 05:12

Dani


2 Answers

You don't really say what your background is and how much you know about programming and database design. It sounds like you should do some reading. Conceptually though your design is fairly simple. Your description identifies a mere two entities:

  • Financial instrument; and
  • Quote.

So you need to then identify the attributes.

Financial instrument:

  • Security code;
  • Market;
  • etc.

Quote:

  • Timestamp;
  • Financial instrument;
  • Bid price; and
  • Ask price.

The reference to the financial instrument is what's called a foreign key. Each table also needs a primary key, probably just an auto-increment field.

Conceptually fairly simple.

CREATE TABLE instrument (
  id BIGINT NOT NULL AUTO_INCREMENT,
  code CHAR(4),
  company_name VARCHAR(100),
  PRIMARY KEY (id)
);

CREATE TABLE quote (
  id BIGINT NOT NULL AUTO_INCREMENT,
  intrument_id BIGINT NOT NULL,
  dt DATETIME NOT NULL,
  bid NUMERIC(8,3),
  ask NUMERIC(8,3),
  PRIMARY KEY (id)
)

CREATE INDEX instrument_idx1 ON instrument (code);

CREATE INDEX quote_idx1 ON quote (instrument_id, dt);

SELECT (bid + ask) / 2
FROM instrument i
JOIN quote q ON i.id = q.instrument_id
WHERE i.code = 'GOOG'
AND q.dt >= '01-06-2008' AND q.dt < '02-06-2008'

If your dataset is sufficiently large you might want to include (bid + ask) / 2 in the table so you don't have to calculate on the fly.

Ok, so that's the normalized view. After this you may need to start making performance optimizations. Consider this question about storing billions of rows in MySQL. Partitioning is a feature of MySQL 5.1+ (fairly new).

But another question to ask yourself is this: do you need to store all this data? The reason I ask this is that I used to be working in online broking and we only stored all the trades for a very limited window and trades would be a smaller set of data than quotes, which you seem to want.

Storing billions of rows of data is a serious problem and one you really need serious help to solve.

like image 132
cletus Avatar answered Dec 17 '22 23:12

cletus


What you need to do is to read up on database normalization. If you find that article too much, you should simply skim through a 3rd normal form tutorial.

like image 37
csl Avatar answered Dec 17 '22 23:12

csl