What are some good approaches to predicting the completion time of a long process?

tl;dr: I want to predict file copy completion. What are good methods given the start time and the current progress?

Firstly, I am aware that this is not at all a simple problem, and that predicting the future is difficult to do well. For context, I'm trying to predict the completion of a long file copy.

Current Approach:

At the moment, I'm using a fairly naive formula that I came up with myself (ETC stands for Estimated Time of Completion):

ETC = currTime + elapsedTime * (totalSize - sizeDone) / sizeDone

This works on the assumption that the remaining data will be copied at the average speed seen so far, which may or may not be a realistic assumption (I'm dealing with tape archives here). A rough sketch of this calculation is given after the list below.

  • PRO: The ETC will change gradually, and becomes more and more accurate as the process nears completion.
  • CON: It doesn't react well to unexpected events, like the file copy becoming stuck or speeding up quickly.
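To make the first approach concrete, here is a minimal Python sketch of that formula; the function and argument names are placeholders of my own choosing, not anything from an existing library:

    import time

    def eta_overall_average(start_time, size_done, total_size):
        """ETC from the average copy speed since the copy started.

        Assumes size_done > 0; returns an absolute timestamp (seconds since the epoch).
        """
        now = time.time()
        elapsed = now - start_time
        remaining = total_size - size_done
        # Average time per byte so far, applied to the bytes still to be copied.
        return now + elapsed * remaining / size_done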

Another idea:

The next idea I had was to keep a record of the progress for the last n seconds (or minutes, given that these archives are supposed to take hours), and just do something like:

ETC = currTime + currAvg * (totalSize - sizeDone)

(where currAvg is the average time per unit of data over that window, so that the units match the first formula; a sketch follows the list below)

This is kind of the opposite of the first method in that:

  • PRO: If the speed changes quickly, the ETC will update quickly to reflect the current state of affairs.
  • CON: The ETC may jump around a lot if the speed is inconsistent.
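A possible Python sketch of this sliding-window estimate, again using names I've invented purely for illustration:

    import time
    from collections import deque

    class SlidingWindowEta:
        """ETC from the average copy speed over the last `window_seconds` seconds."""

        def __init__(self, window_seconds=60):
            self.window_seconds = window_seconds
            self.samples = deque()  # (timestamp, size_done) pairs

        def update(self, size_done, total_size):
            now = time.time()
            self.samples.append((now, size_done))
            # Drop samples that have fallen out of the window.
            while now - self.samples[0][0] > self.window_seconds:
                self.samples.popleft()
            oldest_time, oldest_size = self.samples[0]
            if size_done <= oldest_size or now <= oldest_time:
                return None  # no measurable progress within the window yet
            seconds_per_byte = (now - oldest_time) / (size_done - oldest_size)
            return now + seconds_per_byte * (total_size - size_done)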

Finally

I'm reminded of the control engineering subjects I did at uni, where the objective is essentially to try to get a system that reacts quickly to sudden changes, but isn't unstable and crazy.

With that said, the other option I could think of would be to calculate the average of both of the above, perhaps with some kind of weighting (a rough sketch of such a blend follows the list below):

  • Weight the first method more if the copy has a fairly consistent long-term average speed, even if it jumps around a bit locally.
  • Weight the second method more if the copy speed is unpredictable, and is likely to do things like speed up/slow down for long periods, or stop altogether for long periods.
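As an illustration only, a fixed-weight blend of the two estimates could be as simple as the following; choosing (or adapting) the weight is exactly the open question:

    def eta_blend(eta_long_term, eta_recent, weight_recent=0.5):
        """Weighted average of the two ETC estimates.

        weight_recent near 1.0 favours the reactive sliding-window estimate,
        near 0.0 favours the stable long-term average.
        """
        if eta_recent is None:
            return eta_long_term
        return (1.0 - weight_recent) * eta_long_term + weight_recent * eta_recent

The weight itself could also be varied over time, for example based on how much the recent speed fluctuates.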

What I am really asking for is:

  • Any alternative approaches to the two I have given.
  • If and how you would combine several different methods to get a final prediction.
asked Oct 06 '11 by Cam Jackson



1 Answer

If you feel that the accuracy of the prediction is important, the way to go about building a predictive model is as follows:

  1. collect some real-world measurements;
  2. split them into three disjoint sets: training, validation and test;
  3. come up with some predictive models (you already have two plus a mix) and fit them using the training set;
  4. check predictive performance of the models on the validation set and pick the one that performs best;
  5. use the test set to assess the out-of-sample prediction error of the chosen model.
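As a sketch of what steps 2–5 might look like in Python (the data layout — a list of recorded runs, each with progress snapshots and a known completion time — is an assumption of mine, not something specified in the answer):

    import random

    def split_runs(runs, seed=0):
        """Split recorded copy runs into training, validation and test sets (50/25/25)."""
        rng = random.Random(seed)
        runs = list(runs)
        rng.shuffle(runs)
        n = len(runs)
        return runs[: n // 2], runs[n // 2 : 3 * n // 4], runs[3 * n // 4 :]

    def mean_abs_error(predict_etc, runs):
        """Average absolute ETC error (in seconds) of a candidate model over some runs.

        `predict_etc(snapshot)` and the `snapshots` / `completion_time` attributes are
        hypothetical; substitute whatever representation your measurements use.
        """
        errors = [
            abs(predict_etc(snapshot) - run.completion_time)
            for run in runs
            for snapshot in run.snapshots
        ]
        return sum(errors) / len(errors)

Each candidate model would be fitted on the training runs, compared on the validation runs, and the winner scored once on the held-out test runs.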

I'd hazard a guess that a linear combination of your current model and the "average over the last n seconds" would perform pretty well for the problem at hand. The optimal weights for the linear combination can be fitted using linear regression (a one-liner in R).
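The answer mentions a one-liner in R; an equivalent sketch in Python with NumPy (assuming you have collected, for each training snapshot, the two models' predicted "seconds remaining" and the true value) might be:

    import numpy as np

    def fit_blend_weights(pred_long_term, pred_recent, true_remaining):
        """Least-squares weights for combining two 'seconds remaining' predictions.

        All three arguments are 1-D arrays gathered from the training runs.
        Returns (w1, w2) such that w1*pred_long_term + w2*pred_recent
        approximates true_remaining.
        """
        X = np.column_stack([pred_long_term, pred_recent])
        weights, *_ = np.linalg.lstsq(X, np.asarray(true_remaining), rcond=None)
        return weights

At run time the combined ETC is then currTime plus the weighted sum of the two "seconds remaining" predictions.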

An excellent resource for studying statistical learning methods is The Elements of Statistical Learning by Hastie, Tibshirani and Friedman. I can't recommend that book highly enough.

Lastly, your second idea (average over the last n seconds) attempts to measure the instantaneous speed. A more robust technique for this might be to use the Kalman filter, whose purpose is exactly this:

Its purpose is to use measurements observed over time, containing noise (random variations) and other inaccuracies, and produce values that tend to be closer to the true values of the measurements and their associated calculated values.

The principal advantage of using the Kalman filter rather than a fixed n-second sliding window is that it's adaptive: it will automatically use a longer averaging window when measurements jump around a lot than when they're stable.
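For reference, a one-dimensional Kalman filter tracking the copy speed can be written in a handful of lines; the process and measurement noise values below are arbitrary tuning parameters, not anything prescribed by the answer:

    class SpeedKalmanFilter:
        """1-D Kalman filter estimating copy speed (bytes/second) from noisy samples.

        Model: the true speed drifts as a random walk (process noise variance q),
        and each measured speed is the true speed plus observation noise (variance r).
        """

        def __init__(self, initial_speed, q=1e3, r=1e6):
            self.speed = initial_speed  # current speed estimate
            self.p = r                  # variance of the estimate
            self.q = q                  # process noise variance
            self.r = r                  # measurement noise variance

        def update(self, measured_speed):
            # Predict: speed assumed roughly constant, so only the uncertainty grows.
            self.p += self.q
            # Correct: blend the prediction with the new measurement.
            gain = self.p / (self.p + self.r)
            self.speed += gain * (measured_speed - self.speed)
            self.p *= 1.0 - gain
            return self.speed

The filtered speed then plugs into the same kind of formula as before, e.g. ETC = currTime + (totalSize - sizeDone) / filteredSpeed.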

answered Sep 19 '22 by NPE