Thanks to user3125280, D.W. and Evgeny Kluev the question is updated.
I have a list of webpages and I must download them frequently, each webpage got a different download frequency. Based on this frequency we group the webpages in 5 groups:
Items in group 1 are downloaded once per 1 hour
items in group 2 once per 2 hours
items in group 3 once per 4 hours
items in group 4 once per 12 hours
items in group 5 once per 24 hours
This means, we must download all the group 1 webpages in 1 hour, all the group 2 in 2 hours etc.
I am trying to make an algorithm. As input, I have:
a) DATA_ARR
= one array with 5 numbers. Each number represents the number of items in this group.
b) TIME_ARR
= one array with 5 numbers (1, 2, 4, 12, 24) representing how often the items will be downloaded.
b) X
= the total number of webpages to download per hour. This is calculated using items_in_group/download_frequently and rounded upwards. If we have 15 items in group 5, and 3 items in group 4, this will be 15/24 + 3/12 = 0.875 and rounded is 1.
Every hour my program must download at max X
sites. I expect the algorithm to output something like:
Hour 1: A1 B0 C4 D5
Hour 2: A2 B1 C2 D2
...
A1 = 2nd item of 1st group
C0 = 1st item of 3rd group
My algorithm must be as efficient as possible. This means:
a) the pattern must be extendable to at least 200+ hours
b) no need to create a repeatable pattern
c) spaces are needed when possible in order to use the absolute minimum bandwidth
d) never ever download an item more often than the update frequency, no exceptions
Example:
group 1: 0 items | once per 1 hour
group 2: 3 items | once per 2 hours
group 3: 4 items | once per 4 hours
group 4: 0 items | once per 12 hours
group 5: 0 items | once per 24 hours
We calculate the number of items we can take per hour: 3/2+4/4 = 2.5. We round this upwards and it's 3.
Using pencil and paper, we can found the following solution:
Hour 1: B0 C0 B1
Hour 2: B2 C1 c2
Hour 3: B0 C3 B1
Hour 4: B2
Hour 5: B0 C0 B1
Hour 6: B2 C1 c2
Hour 7: B0 C3 B1
Hour 8: B2
Hour 9: B0 C0 B1
Hour 10: B2 C1 c2
Hour 11: B0 C3 B1
Hour 12: B2
Hour 13: B0 C0 B1
Hour 14: B2 C1 c2
and continue the above.
We take C0
, C1
C2
, and C3
once every 4 hours. We also take B0
, B1
and B2
once every 2 hours.
Question: Please, explain to me, how to design an algorithm able to download the items, while using the absolute minimum number of downloads? Brute force is NOT a solution and the algorithm must be efficient CPU wise because the number of elements can be huge.
You may read the answer posted here: https://cs.stackexchange.com/a/19422/12497 as well as the answer posted bellow by user3125280.
Disadvantages of Round Robin AlgorithmLow slicing time reduces processor output. Spends more time on context switching. Performance depends on time quantum. Processes don't have priorities.
Six types of process scheduling algorithms are: First Come First Serve (FCFS), 2) Shortest-Job-First (SJF) Scheduling, 3) Shortest Remaining Time, 4) Priority Scheduling, 5) Round Robin Scheduling, 6) Multilevel Queue Scheduling.
Disadvantages of SJF Used for long-term scheduling in a batch system. Can't implement this algorithm for CPU scheduling for the short term as we can't predict the length of the upcoming CPU burst. Cause extremely long turnaround times or starvation. Knowledge about the runtime length of a process is necessary.
You problem is a typical scheduling problem. These kinds of problems are well studied in computer science so there is a huge array of literature to consult.
The code is kind of like Deficit round robin, but with a few simplifications. First, we feed the queues ourself by adding to the data_to_process
variable. Secondly, the queues just iterate through a list of values.
One difference is that this solution will get the optimal value you want, barring mathematical error.
Rough sketch: have not compiled (c++11) unix based, to spec code
#include <iostream>
#include <vector>
#include <numeric>
#include <unistd.h>
//#include <cmath> //for ceil
#define TIME_SCALE ((double)60.0) //1 for realtime speed
//Assuming you are not refreshing ints in the real case
template<typename T>
struct queue
{
const std::vector<T> data; //this will be filled with numbers
int position;
double refresh_rate; //must be refreshed ever ~ hours
double data_rate; //this many refreshes per hour
double credit; //amount of refreshes owed
queue(std::initializer_list<T> v, int r ) :
data(v), position(0), refresh_rate(r), credit(0) {
data_rate = data.size() / (double) refresh_rate;
}
int getNext() {
return data[position++ % data.size()];
}
};
double time_passed(){
static double total;
//if(total < 20){ //stop early
usleep(60000000 / TIME_SCALE); //sleep for a minute
total += 1.0 / 60.0; //add a minute
std::cout << "Time: " << total << std::endl;
return 1.0; //change to 1.0 / 60.0 for real time speed
//} else return 0;
}
int main()
{
//keep a list of the queues
std::vector<queue<int> > queues{
{{1, 2, 3}, 2},
{{1, 2, 3, 4}, 3}};
double total_data_rate = 0;
for(auto q : queues) total_data_rate += q.data_rate;
double data_to_process = 0; //how many refreshes we have to do
int queue_number = 0; //which queue we are processing
auto current_queue = &queues[0];
while(1) {
data_to_process += time_passed() * total_data_rate;
//data_to_process = ceil(data_to_process) //optional
while(data_to_process >= 1){
//data_to_process >= 0 will make the the scheduler more
//eager in the first time period (ie. everything will updated correctly
//in the first period and and following periods
if(current_queue->credit >= 1){
//don't change here though, since credit determines the weighting only,
//not how many refreshes are made
//refresh(current_queue.getNext();
std::cout << "From queue " << queue_number << " refreshed " <<
current_queue->getNext() << std::endl;
current_queue->credit -= 1;
data_to_process -= 1;
} else {
queue_number = (queue_number + 1) % queues.size();
current_queue = &queues[queue_number];
current_queue->credit += current_queue->data_rate;
}
}
}
return 0;
}
The example should now compile on gcc with --std=c++11 and give you what you want.
and here is test case output: (for non-time scaled earlier code)
Time: 0
From queue 1 refreshed 1
From queue 0 refreshed 1
From queue 1 refreshed 2
Time: 1
From queue 0 refreshed 2
From queue 0 refreshed 3
From queue 1 refreshed 3
Time: 2
From queue 0 refreshed 1
From queue 1 refreshed 4
From queue 1 refreshed 1
Time: 3
From queue 0 refreshed 2
From queue 0 refreshed 3
From queue 1 refreshed 2
Time: 4
From queue 0 refreshed 1
From queue 1 refreshed 3
From queue 0 refreshed 2
Time: 5
From queue 0 refreshed 3
From queue 1 refreshed 4
From queue 1 refreshed 1
As an extension, to answer the repeating pattern problem by allowing this scheduler to complete only the first lcm(update_rate * lcm(...refresh rates...), ceil(update_rate)) steps, and then repeating the pattern.
ALSO: this will, indeed, be unsolvable sometimes because of the requirement on hour boundaries. When I use your unsolvable example, and modify time_passed to return 0.1, the schedule is solved with updates every 1.1 hours (just not at the hour boundaries!).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With