
Creating Data "Tiles" for Massive Amounts of Price Data

I'm designing a JavaScript chart that I want to be awesome and I need your help.

Let's say I have 5 million records with price data like this:

open  high  low  close  volume  timeStart      timeEnd        timeDuration
10    20    5    15     500     1391895920860  1391895920920  60

So, I have 5 million records describing the price of something at the one minute resolution.

Now, I need to show a chart to the user representing all 5 million of these records. By default the chart is going to show 2 years of data at the one week interval. Two years of data at a one week interval is just 104 records. So, in order to generate my 104 weekly records I need to process all 5 million of my minute resolution records and group them into weeks.

Let's say the user zooms in and wants to see data grouped for every three days instead of the one week interval. I'm going to need to generate three day records which group the data from my one minute resolution data.

In fact, for each interval that I am going to allow the user to zoom into I need to summarize my minute resolution data ahead of time. To save load on the database, I am going to generate CSV "tiles" that the client can quickly pull down. Each tile will have 1,000 records in it.
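To make the aggregation step concrete, here is a minimal sketch (the function names and exact field handling are my own illustration, assuming records sorted by `timeStart` with the fields from the table above) of grouping 1-minute records into a coarser interval and chunking the result into 1,000-record tiles:

```javascript
// Group sorted 1-minute OHLCV records into fixed-width interval buckets.
function aggregate(records, intervalMs) {
  const buckets = new Map();
  for (const r of records) {
    const key = Math.floor(r.timeStart / intervalMs) * intervalMs;
    let b = buckets.get(key);
    if (!b) {
      b = { open: r.open, high: r.high, low: r.low, close: r.close,
            volume: 0, timeStart: key, timeEnd: key + intervalMs };
      buckets.set(key, b);
    }
    b.high = Math.max(b.high, r.high);
    b.low = Math.min(b.low, r.low);
    b.close = r.close; // last record in the bucket wins
    b.volume += r.volume;
  }
  return [...buckets.values()];
}

// Chunk aggregated rows into tiles of (at most) 1,000 records each.
function toTiles(rows, size = 1000) {
  const tiles = [];
  for (let i = 0; i < rows.length; i += size) tiles.push(rows.slice(i, i + size));
  return tiles;
}
```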

Here are the resolutions ("zoom levels") I will support and the number of tiles I will need to generate ahead of time for two years of price data at the one minute resolution:

1 minute - 1440 tiles (estimated to be about 40 MB of data)
3 minute - 480 tiles
5 minute - 288 tiles
15 minute - 96 tiles
30 minute - 48 tiles
1 hour - 24 tiles
2 hours - 12 tiles
4 hours - 6 tiles
6 hours - 4 tiles
12 hours - 2 tiles
1 day - 1 tile
3 days - 1 tile
1 week - 1 tile
1 month - 1 tile
1 year - 1 tile

The idea here is that when the user first hits the chart, all I need to do is pull down the one week tile and it will have all 104 of my records. When they zoom into a section of the one week tile I will pull down the 3 day tile.

So here is where I need help. If the user zooms into the 12 hour tile from the 1 day zoom level, I need to intelligently know which of the 12 hour tiles I should grab. I think I need to have a naming scheme for my tiles that will allow the client to easily identify sub tiles for any parent tile.

Can anyone help me with a naming scheme for these tiles? I specifically need:

  1. A file naming scheme for all of my tiles I will generate
  2. An algorithm the chart will use to easily identify the correct child tile to zoom into or parent tile to zoom out to

Would love to hear your thoughts and feedback!

Kirk Ouimet asked Feb 08 '14 22:02




2 Answers

This depends on almost everything, and you don't say much about your server side.

But you must balance what you compute ahead of time against what you compute on demand: precomputing takes space and saves time, computing on demand takes time and saves space. Do you have fast I/O (HDD/SSD/RAID)? Do you have plenty of space? Do you have powerful CPUs? Does your server have good bandwidth? Why should each tile have exactly 1,000 records? What are the most popular zoom levels?

First of all, you must start from somewhere. Your smallest unit of data is the minute, so you must keep detail at the minute level; this relieves you from holding milliseconds. From my experience, a file named for the range of ids it covers is essential, like a 23211239-23242175.csv file. That "tile" is your starting point. The id here is milliseconds / 1000 / 60, which yields minutes. I see your data has a time range, and that is a small problem: every record should fall entirely inside one timeframe, so in practice you should care only about timeEnd and put the record in that timeframe.
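A quick sketch of that id scheme (function names are mine, purely illustrative):

```javascript
// A record's minute id is its millisecond timestamp / 1000 / 60.
const minuteId = (ms) => Math.floor(ms / 1000 / 60);

// A tile file is named after the minute-id range it covers.
function tileName(firstMs, lastMs) {
  return `${minuteId(firstMs)}-${minuteId(lastMs)}.csv`;
}
```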

For good I/O, check that your files are close to a multiple of the cluster size of your disks (for Windows I think it is 4 KB). The cluster size is the amount the disk returns in one I/O. Say you have two CSVs of 6 KB each: it takes 2 I/Os to bring each file into memory, so 4 I/Os for the 2 x 6 KB files. But if you split the same 12 KB into three files of 4 KB each, it takes only 3 I/Os to load them. So check that these tiles really are close to the cluster size.

See if you can get rid of some fields, like timeStart, timeEnd and timeDuration, because your filename will describe them. Try to precompute as much data as you can, to save processing time. Given your zoom levels, each level can be computed from the previous one. Create subfolders that step through the levels. You don't have to compute every zoom level; try to skip some to save space. Then, based on requests, you will find what users ask for most and can create a level on demand if it does not exist. Have folders like:

Path/1m/xxxxxxxx-xxxxxxx.csv
Path/30m/xxxxxxx-xxxxxx.csv
Path/6h/xxxxx-xxxxx.csv
Path/w/xx-xx.csv

Be sure you only have the 1-minute files at the start and then create the others. I bet that when you zoom into a level, for example 2014, it will show 2014's months, so you need at most X results for each chart. Your tiles therefore won't necessarily have the same length at each zoom level/folder; it depends on what they describe. Then it will be easy to find anything: just check the range the filename describes and return it. Popular resolutions can be cached in memory for a period of time to speed up responses.
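The range lookup described above could be sketched like this (a hypothetical helper, assuming the "start-end.csv" naming from the folder layout):

```javascript
// Given a minute id and the filenames in one resolution folder,
// return the file whose "start-end.csv" range contains the id.
function findTile(files, id) {
  return files.find((f) => {
    const [start, end] = f.replace('.csv', '').split('-').map(Number);
    return id >= start && id <= end;
  });
}
```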

On the client side, try to keep as much data as you can; don't ask the server again for data you have already fetched. The problem with caching files in RAM is that a file can become outdated, so be careful. If you are viewing results from past years (year being the most generic resolution level in this application), you know that newly arriving records won't affect those old tiles, so you can cache them safely.
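One way to sketch that caching rule (names are illustrative, not from any library): tiles whose range lies entirely in the past are immutable and can be kept forever, while the tile covering "now" gets a short TTL.

```javascript
const cache = new Map();

// fetchFn stands in for whatever actually downloads a tile.
function getTile(path, rangeEndMs, fetchFn, ttlMs = 60000) {
  const hit = cache.get(path);
  const now = Date.now();
  // A tile fully in the past never changes; a live tile expires after ttlMs.
  if (hit && (rangeEndMs < now || now - hit.at < ttlMs)) return hit.data;
  const data = fetchFn(path);
  cache.set(path, { data, at: now });
  return data;
}
```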

This problem does not lie only in client-side programming. To make a robust and efficient chart app you need solid backend logic behind it. I had to mention this even though I don't see any server-side tags (RoR, Node.js, PHP) on this question...

Theofilos Mouratidis answered Sep 24 '22 14:09


I would model the data in layers, one layer per resolution. Each layer is independent, both in naming and in function, but every layer uses the same naming convention for system stability.

If you are going to use JavaScript, I would follow the JavaScript date conventions where applicable:

http://www.ecma-international.org/ecma-262/5.1/#sec-15.9

Non-standard "resolutions" (e.g. a 6-hour division) are easier to understand by applying a simple model: ordinary division.

  • addition: the context, i.e. the "bigger" unit part of the date.
  • dividend: the parts of the element.
  • divisor: the size of the divisions.
  • remainder: the index within the resolution.

Date used in the examples: 11 Feb 2014 08:14:22.

Solution A

Only one bigger element is divided. For example, at the 6-hour resolution only one day is divided; the rest of the days are the "addition", or context.

layer 1:

Layer1/2014.csv

layer 7:

Layer7/2014/41/1.csv

*41 is the day of the year.
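A sketch of how that layer-7 path could be computed (my own helper; the example suggests both the day of year and the 6-hour index are zero-based, since 11 Feb 2014 maps to day 41 and 08:14 to division 1; UTC is assumed here):

```javascript
function layer7Path(date) {
  const year = date.getUTCFullYear();
  // Zero-based day of year: 11 Feb -> 41.
  const dayOfYear = Math.floor((date - Date.UTC(year, 0, 1)) / 86400000);
  // Zero-based 6-hour division: 08:14 -> 1.
  const division = Math.floor(date.getUTCHours() / 6);
  return `Layer7/${year}/${dayOfYear}/${division}.csv`;
}
```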

Solution B

Split by timestamp range.

layer 7:

Layer7/1414908000-1414929600.csv
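
Solution B could be sketched like this (an illustrative helper; the filename in the example spans 21,600 seconds, i.e. 6 hours, so the start is rounded down to a multiple of the resolution width in Unix seconds):

```javascript
function rangeName(ms, widthSec) {
  const start = Math.floor(ms / 1000 / widthSec) * widthSec;
  return `Layer7/${start}-${start + widthSec}.csv`;
}
```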

Mardie answered Sep 24 '22 14:09