
Interpolating Large Datasets On the Fly


I have a large data set of about 0.5 million records representing the USD/GBP exchange rate over the course of a given day.

I have an application that needs to graph this data, or perhaps a subset of it. For obvious reasons I do not want to plot 0.5 million points on my graph.

What I need is a smaller data set (100 points or so) that represents the given data as accurately as possible. Does anyone know of any interesting and performant ways to achieve this?

Cheers, Karl

Asked Mar 25 '10 by Karl


2 Answers

There are several statistical methods for reducing a large dataset to a smaller, easier-to-visualize one. It's not clear from your question what summary statistic you want. I've assumed that you want to see how the exchange rate changes as a function of time, but perhaps you are interested in how often the exchange rate goes above a certain value, or some other statistic I'm not considering.

Summarizing a trend over time

Here is an example using the lowess method in R (from the documentation on scatter plot smoothing):

> library(graphics)
# print out the first 10 rows of the cars dataset
> cars[1:10,]
   speed dist
1      4    2
2      4   10
3      7    4
4      7   22
5      8   16
6      9   10
7     10   18
8     10   26
9     10   34
10    11   17

# plot the original data
> plot(cars, main = "lowess(cars)")
# fit a lowess-smoothed line to the points
> lines(lowess(cars), col = 2)
# add a finer-grained lowess-smoothed line (smaller smoother span)
> lines(lowess(cars, f = .2), col = 3)

The parameter f is the smoother span: it controls how tightly the regression follows your data. Choose it thoughtfully, since you want a curve that fits the data accurately without overfitting. Rather than speed and distance, you would plot the exchange rate against time.
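
For the exchange-rate case, a minimal sketch might look like the following, assuming your data lives in a data frame called rates with a POSIXct timestamp column ts and a numeric column rate (those names are placeholders, not from the question):

# smooth the rate as a function of time (lowess wants numeric x values)
> smoothed <- lowess(as.numeric(rates$ts), rates$rate, f = 0.1)
# plot the raw data and overlay the smoothed curve
> plot(rates$ts, rates$rate, pch = ".", xlab = "time", ylab = "USD/GBP")
> lines(as.POSIXct(smoothed$x, origin = "1970-01-01"), smoothed$y, col = 2)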

It's also straightforward to access the results of the smoothing. Here's how to do that:

> data = lowess( cars$speed, cars$dist )
> data
$x
 [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15 15 16 16 17 17 17 18 18 18 18 19 19
[38] 19 20 20 20 20 20 22 23 24 24 24 24 25

$y
 [1]  4.965459  4.965459 13.124495 13.124495 15.858633 18.579691 21.280313 21.280313 21.280313 24.129277 24.129277
[12] 27.119549 27.119549 27.119549 27.119549 30.027276 30.027276 30.027276 30.027276 32.962506 32.962506 32.962506
[23] 32.962506 36.757728 36.757728 36.757728 40.435075 40.435075 43.463492 43.463492 43.463492 46.885479 46.885479
[34] 46.885479 46.885479 50.793152 50.793152 50.793152 56.491224 56.491224 56.491224 56.491224 56.491224 67.585824
[45] 73.079695 78.643164 78.643164 78.643164 78.643164 84.328698

The object you get back contains entries named x and y: x holds the (sorted) x values you passed in, and y holds the corresponding smoothed values. In this case, x represents speed and y the smoothed dist.
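
If you only need around 100 points to hand to your graphing layer, one simple follow-up (a sketch, using the data object from above) is to thin the smoothed output:

# keep roughly 100 evenly spaced points from the smoothed curve
> idx <- round(seq(1, length(data$x), length.out = min(100, length(data$x))))
> thinned <- data.frame(x = data$x[idx], y = data$y[idx])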

Answered by James Thompson


One thought is to let the DBMS compress the data for you with an appropriate query, along the lines of having it take a median for each time range. A pseudo-query:

SELECT truncate_to_hour(rate_ts), median(rate) FROM exchange_rates 
WHERE rate_ts >= start_ts AND rate_ts <= end_ts
GROUP BY truncate_to_hour(rate_ts)
ORDER BY truncate_to_hour(rate_ts)

Here truncate_to_hour stands in for whatever is appropriate to your DBMS. You could take a similar approach with a function that segments time into unique blocks (such as rounding to the nearest 5-minute interval), or substitute another aggregate function for median if that better suits your data. Depending on the complexity of the time-segmenting procedure and how well your DBMS optimizes it, it may be more efficient to run the query against a temporary table that already holds the segmented time value.
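
If you would rather do the same kind of aggregation client-side, here is a rough R sketch of the idea (again assuming a hypothetical rates data frame with a POSIXct ts column and a numeric rate column):

# bucket the timestamps into hourly blocks and take the median rate per block
> rates$bucket <- cut(rates$ts, breaks = "1 hour")
> by_hour <- aggregate(rate ~ bucket, data = rates, FUN = median)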

Answered by M. Jessup