
How to identify an optimal subsample from a data set with missing values in MATLAB

Tags:

matlab

I would like to identify the largest possible contiguous subsample of a large data set. My data set consists of roughly 15,000 financial time series of up to 360 periods in length. I have imported the data into MATLAB as a 360 by 15,000 numerical matrix.

[Image: illustration of the data matrix, https://i.stack.imgur.com/qOnzk.png]

This matrix contains a lot of NaNs due to some of the financial data not being available for the entire period. In the illustration, NaN entries are shown in dark blue, and non-NaN entries appear in light blue. It is these light blue non-NaN entries which I would like to ideally combine into an optimal subsample.

I would like to find the largest possible contiguous block of data that is contained in my matrix, while ensuring that my matrix contains a sufficient number of periods.

In a first step I would like to sort my matrix from left to right in descending order by the number of non-NaN entries in each column, that is, I would like to sort by the vector obtained by entering sum(~isnan(data),1).

In a second step I would like to find the sub-array of my data matrix that is at least 72 entries along the first dimension and is otherwise as large as possible, measured by the total number of entries.

What is the best way to implement this?


asked Feb 01 '16 01:02

Constantin

1 Answer

A big warning (may or may not apply depending on context)

As Oleg mentioned, when an observation is missing from a financial time series, it's often missing for a reason: e.g. the entity went bankrupt, the entity was delisted, or the instrument did not trade (i.e. it was illiquid). Constructing a sample without NaNs is likely equivalent to constructing a sample where none of these events occur!

For example, if this were hedge fund return data, selecting a sample without NaNs would exclude funds that blew up and ceased trading. Excluding imploded funds would bias estimates of expected returns upwards and estimates of variance or covariance downwards.
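
As a rough illustration of that upward bias, here is a toy panel with made-up returns (the numbers and the variable names `R`, `mean_survivors` and `mean_all` are purely hypothetical, not taken from any real data). Fund 3 takes a large loss and then stops reporting, so dropping every column that contains a NaN also drops its losses:

```matlab
% Toy panel: rows are periods, columns are funds (hypothetical numbers).
% Fund 3 loses 50% in period 3 and stops reporting afterwards (NaN).
R = [ 0.02  0.01  0.03;
      0.01  0.02 -0.10;
      0.03 -0.01 -0.50;
      0.02  0.01   NaN;
      0.01  0.02   NaN];

% Average return using only funds with no missing periods
complete = all(~isnan(R), 1);
mean_survivors = mean(mean(R(:, complete)));

% Average return using every observed return, including the failed fund
observed = ~isnan(R);
R0 = R;
R0(~observed) = 0;
mean_all = sum(R0(:)) / sum(observed(:));

fprintf('survivors only: %.4f, all observed: %.4f\n', mean_survivors, mean_all);
```

In this toy case the NaN-free sample reports a positive average return while the full set of observed returns averages out negative, which is the direction of bias described above.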

Picking a sample period with the fewest time series with NaNs would also exclude periods like the 2008 financial crisis, which may or may not make sense. Excluding 2008 could lead to underestimating how haywire things could get (though including it could lead to overestimating the probability of certain rare events).

Some things to do:

1. Pick a sample period as long as possible, but be aware of the limitations.
2. Do your best to handle survivorship bias: e.g. if NaNs represent delisting events, try to get some kind of delisting return.
3. You almost certainly will have an unbalanced panel with missing observations, and your algorithm will have to deal with that.
4. Another general finance / panel data point: selecting a sample at some time point t and then following it into the future is perfectly OK. But selecting a sample based upon what happens during or after the sample period can be incredibly misleading.
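
The last two points can be sketched roughly as follows. This is only an illustration on a small hypothetical panel `P` with an assumed selection date `t0`: series are chosen using only what is observed at `t0`, later NaNs are kept rather than filtered out, and the column means are computed from whatever observations exist.

```matlab
% Hypothetical 6-by-4 panel; NaN marks a missing observation.
P = [ 1   NaN  3   NaN;
      2   5    4   NaN;
      3   6    NaN 7;
      4   7    NaN 8;
      NaN 8    NaN 9;
      NaN 9    NaN 10];

t0 = 2;                           % selection date (an assumption for this sketch)
selected = ~isnan(P(t0, :));      % uses only information available at t0
panel = P(t0:end, selected);      % unbalanced panel: later NaNs are kept

% NaN-aware column means, written out by hand so no toolbox is needed
obs = ~isnan(panel);
vals = panel;
vals(~obs) = 0;
col_mean = sum(vals, 1) ./ sum(obs, 1);
```

Series 4 is left out because it does not exist yet at `t0`, while series 1 and 3 stay in the sample even though they drop out later.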

Code that does what you asked:

This should do what you asked and be quite fast. Be aware of the problems described above, though, unless whether an observation is missing is random and orthogonal to what you care about.

The input is a T-by-n matrix X:

[T, n] = size(X);     % T time periods (rows) and n time series (columns);
                      % 360 and 15000 for the data in the question
T_subsample = 72;     % desired length of sample (i.e. rows of newX)

% number of possible starting points for series of length T_subsample
nancount_periods = T - T_subsample + 1;   

nancount = zeros(n, nancount_periods, 'int32'); % will hold a count of NaNs

X_isnan = int32(isnan(X));

nancount(:,1) = sum(X_isnan(1:T_subsample, :), 1)';  % NaN counts in the first window

% We need to obtain a count of nans in T_subsample sized window for each
% possible time period
j = 1;
for i=T_subsample + 1:T   
    % One pass: add new period in the window and subtract period no longer in the window 
    nancount(:,j+1) = nancount(:,j) + X_isnan(i,:)' - X_isnan(j,:)';
    j = j + 1;
end

indicator = nancount==0;  % indicator(k, s) is true if series k has no NaNs
                          % in the window starting at period s

% number of NaN-free series of length T_subsample by starting period
max_subsample_size_by_starting_period = sum(indicator, 1);
max_subsample_size = max(max_subsample_size_by_starting_period);

% find the best starting period
starting_period = find(max_subsample_size_by_starting_period==max_subsample_size, 1);
ending_period   = starting_period + T_subsample - 1;

columns_mask = indicator(:,starting_period);
columns      = find(columns_mask);   %holds the column ids we are using

newX = X(starting_period:ending_period, columns_mask);
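
The code above fixes the window at exactly T_subsample rows. The question also asks for a block of at least 72 rows that is otherwise as large as possible, so here is a hedged sketch of one way to extend the same idea: try every window length from the minimum up to T and keep the window with the most entries (rows times NaN-free columns). It swaps the sliding update for cumulative sums. The toy matrix, `T_min`, and the `best_*` names are assumptions for illustration, and ties are simply resolved in favour of the first window found.

```matlab
% Small hypothetical example; replace X and T_min with the real data and 72.
X = [ 1   NaN  1   1;
      1   1    1   1;
      1   1    1   NaN;
      1   1    1   NaN;
      NaN 1    1   NaN];
T_min = 2;                        % minimum number of rows in the block

[T, n] = size(X);
% C(k+1, j) holds the number of NaNs in rows 1..k of column j
C = [zeros(1, n); cumsum(double(isnan(X)), 1)];

best_area = 0;
best_start = 0;
best_len = 0;
for len = T_min:T
    % counts(s, j) = number of NaNs in rows s..s+len-1 of column j
    counts = C(len+1:end, :) - C(1:end-len, :);
    % best starting period for this window length
    [ncols, s] = max(sum(counts == 0, 2));
    if len * ncols > best_area
        best_area = len * ncols;
        best_start = s;
        best_len = len;
    end
end

rows = best_start:(best_start + best_len - 1);
columns_mask = all(~isnan(X(rows, :)), 1);
newX = X(rows, columns_mask);     % largest NaN-free block found
```

The loop does work proportional to n times the number of (length, start) pairs, so on a 360 by 15,000 matrix it should be noticeably slower than the single-length version, though still manageable. The earlier warning applies even more strongly here, since a longer window demands more of each surviving series.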

answered Oct 05 '22 20:10

Matthew Gunn
