tl;dr: I want to predict file copy completion. What are good methods given the start time and the current progress? Firstly, I am aware that this is not at all a simple problem, and that predicting the future is difficult to do well. For context, I'm trying to predict the completion of a long file copy. Current Approach: At the moment, I'm using a fairly naive formula that I came up with myself: (ETC stands for Estimated Time of Completion) <pre class="prettyprint"><code>ETC = currTime + elapsedTime * (totalSize - sizeDone) / sizeDone </code></pre> This works on the assumption that the remaining files to be copied will do so at the average copy speed thus far, which may or may not be a realistic assumption (dealing with tape archives here). <ul> <li> PRO: The ETC will change gradually, and becomes more and more accurate as the process nears completion.</li> <li> CON: It doesn't react well to unexpected events, like the file copy becoming stuck or speeding up quickly.</li> </ul> Another idea: The next idea I had was to keep a record of the progress for the last n seconds (or minutes, given that these archives are supposed to take hours), and just do something like: <pre class="prettyprint"><code>ETC = currTime + currAvg * (totalSize - sizeDone) </code></pre> This is kind of the opposite of the first method in that: <ul> <li> PRO: If the speed changes quickly, the ETC will update quickly to reflect the current state of affairs. </li> <li> CON: The ETC may jump around a lot if the speed is inconsistent.</li> </ul> Finally I'm reminded of the control engineering subjects I did at uni, where the objective is essentially to try to get a system that reacts quickly to sudden changes, but isn't unstable and crazy. With that said, the other option I could think of would be to calculate the average of both of the above, perhaps with some kind of weighting: <ul> <li>Weight the first method more if the copy has a fairly consistent long-term average speed, even if it jumps around a bit locally.</li> <li>Weight the second method more if the copy speed is unpredictable, and is likely to do things like speed up/slow down for long periods, or stop altogether for long periods.</li> </ul> What I am really asking for is: <ul> <li>Any alternative approaches to the two I have given.</li> <li>If and how you would combine several different methods to get a final prediction.</li> </ul>

If you feel that the accuracy of prediction is important, the way to go about about building a predictive model is as follows: <ol> <li>collect some real-world measurements;</li> <li>split them into three disjoint sets: training, validation and test;</li> <li>come up with some predictive models (you already have two plus a mix) and fit them using the training set;</li> <li>check predictive performance of the models on the validation set and pick the one that performs best;</li> <li>use the test set to assess the out-of-sample prediction error of the chosen model.</li> </ol> I'd hazard a guess that a linear combination of your current model and the "average over the last n seconds" would perform pretty well for the problem at hand. The optimal weights for the linear combination can be fitted using linear regression (a one-liner in R). An excellent resource for studying statistical learning methods is <a href="http://www-stat.stanford.edu/~tibs/ElemStatLearn/">The Elements of Statistical Learning</a> by Hastie, Tibshirani and Friedman. I can't recommend that book highly enough. Lastly, your second idea (average over the last n seconds) attempts to measure the instantaneous speed. A more robust technique for this might be to use the Kalman filter, whose purpose is exactly this: <blockquote> Its purpose is to use measurements observed over time, containing noise (random variations) and other inaccuracies, and produce values that tend to be closer to the true values of the measurements and their associated calculated values. </blockquote> The principal advantage of using the Kalman filter rather than a fixed n-second sliding window is that it's adaptive: it will automatically use a longer averaging window when measurements jump around a lot than when they're stable.

What are some good approaches to predicting the completion time of a long process?

Tags:

language-agnostic

algorithm

time

file-copying

prediction

tl;dr: I want to predict file copy completion. What are good methods given the start time and the current progress?

Firstly, I am aware that this is not at all a simple problem, and that predicting the future is difficult to do well. For context, I'm trying to predict the completion of a long file copy.

Current Approach:

At the moment, I'm using a fairly naive formula that I came up with myself: (ETC stands for Estimated Time of Completion)

ETC = currTime + elapsedTime * (totalSize - sizeDone) / sizeDone

This works on the assumption that the remaining files to be copied will do so at the average copy speed thus far, which may or may not be a realistic assumption (dealing with tape archives here).

PRO: The ETC will change gradually, and becomes more and more accurate as the process nears completion.
CON: It doesn't react well to unexpected events, like the file copy becoming stuck or speeding up quickly.

Another idea:

The next idea I had was to keep a record of the progress for the last n seconds (or minutes, given that these archives are supposed to take hours), and just do something like:

ETC = currTime + currAvg * (totalSize - sizeDone)

This is kind of the opposite of the first method in that:

PRO: If the speed changes quickly, the ETC will update quickly to reflect the current state of affairs.
CON: The ETC may jump around a lot if the speed is inconsistent.

Finally

I'm reminded of the control engineering subjects I did at uni, where the objective is essentially to try to get a system that reacts quickly to sudden changes, but isn't unstable and crazy.

With that said, the other option I could think of would be to calculate the average of both of the above, perhaps with some kind of weighting:

Weight the first method more if the copy has a fairly consistent long-term average speed, even if it jumps around a bit locally.
Weight the second method more if the copy speed is unpredictable, and is likely to do things like speed up/slow down for long periods, or stop altogether for long periods.

What I am really asking for is:

Any alternative approaches to the two I have given.
If and how you would combine several different methods to get a final prediction.

939

asked Oct 06 '11 07:10

Cam Jackson

1 Answers

If you feel that the accuracy of prediction is important, the way to go about about building a predictive model is as follows:

collect some real-world measurements;
split them into three disjoint sets: training, validation and test;
come up with some predictive models (you already have two plus a mix) and fit them using the training set;
check predictive performance of the models on the validation set and pick the one that performs best;
use the test set to assess the out-of-sample prediction error of the chosen model.

I'd hazard a guess that a linear combination of your current model and the "average over the last n seconds" would perform pretty well for the problem at hand. The optimal weights for the linear combination can be fitted using linear regression (a one-liner in R).

An excellent resource for studying statistical learning methods is The Elements of Statistical Learning by Hastie, Tibshirani and Friedman. I can't recommend that book highly enough.

Lastly, your second idea (average over the last n seconds) attempts to measure the instantaneous speed. A more robust technique for this might be to use the Kalman filter, whose purpose is exactly this:

Its purpose is to use measurements observed over time, containing noise (random variations) and other inaccuracies, and produce values that tend to be closer to the true values of the measurements and their associated calculated values.

The principal advantage of using the Kalman filter rather than a fixed n-second sliding window is that it's adaptive: it will automatically use a longer averaging window when measurements jump around a lot than when they're stable.

answered Sep 19 '22 06:09

NPE

Related questions
                            
                                Minimum unique array sum
                            
                                Algorithm for finding all of the shared substrings of any length between 2 strings, and then counting occurrences in string 2?
                            
                                How to generalize my algorithm to detect if one string is a rotation of another
                            
                                I don't understand how the time complexity for this algorithm is calculated
                            
                                How do I guarantee that a DAG stays acyclic after insertion of a node?
                            
                                Algorithm putting point into square with maximal minimum distance
                            
                                What is the point of IDA* vs A* algorithm
                            
                                Flattening intersecting timespans
                            
                                Space requirements of a merge-sort
                            
                                How can I find words in matrix of letters
                            
                                An algorithm to sort a list of values into n groups so that the sum of each group is as close as possible
                            
                                Minimum of sum of absolute values
                            
                                Compute the area covered by cards randomly put on a table
                            
                                given a word, print its index, words can be increased accordingly
                            
                                How to find divisor to maximise remainder?
                            
                                Algorithm to classify a list of products? Take 2
                            
                                How to compute optimal paths for traveling salesman bitonic tour?
                            
                                A few sorting questions
                            
                                Algorithm for simplifying 3d surface?
                            
                                Search matrix for all rectangles of given dimensions (select blocks of seats)

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With