I am trying to design a generic job scheduler to expand my architectural knowledge and my ability to think about system design questions in interviews. So far, what I have come up with is below. Can you point out what I should work on to be comprehensive in my approach to tackling this type of problem?
I have read a lot of resources online but need some specific guidance in moving forward.
Design a generic job scheduler for company X (which is one of the big technology firms of today).
Use cases
Create/Read/Update/Delete Jobs
Investigate jobs that have been run in the past (type of job, time spent, details)
Constraints
How many jobs will be run on the system per sec?
= (# jobs/day due to users + # jobs/day due to machines) / (24 * 3600 seconds/day)
= (1M * 0.5 + 1M / 50 * 20) / 86 400
≈ 900 000 / 86 400 ≈ 10 jobs/sec (round up to ~12 jobs/sec for headroom)
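As a quick sanity check on the arithmetic above, here is the same estimate as a throwaway Python snippet (the user and machine counts are the assumptions already stated, not new data):

```python
# Back-of-envelope traffic estimate, using the assumptions stated above.
SECONDS_PER_DAY = 24 * 3600

user_jobs_per_day = 1_000_000 * 0.5         # 1M users, ~0.5 jobs per user per day
machine_jobs_per_day = 1_000_000 / 50 * 20  # 1 machine per 50 users, ~20 jobs per machine per day

jobs_per_sec = (user_jobs_per_day + machine_jobs_per_day) / SECONDS_PER_DAY
print(f"{jobs_per_sec:.1f} jobs/sec")       # ~10.4 jobs/sec, rounded up to ~12 for headroom
```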
How much data will the system need to store?
Reasoning: I am only storing the job execution details; the actual work (script execution) is done on other machines. The data collected includes end time, success/failure status, etc. These are all likely just text, maybe with graphing for illustration purposes. I will be storing the data of all jobs executed in the system via the job scheduler (i.e. over the past 10 years).
= (size of the page where job details are set up + size of the data collected about the job) * # jobs/day * 365 days * 10 years
= 1 MB * ~1 000 000 jobs/day (~12 jobs/sec * 86 400 seconds/day) * 365 * 10
≈ 3 650 000 000 MB
≈ 3 650 000 GB
≈ 3 650 TB ≈ 3.6 PB
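The storage figure can be checked the same way; the 1 MB-per-job size is the assumption stated above:

```python
# Back-of-envelope storage estimate, assuming ~1 MB stored per job (setup page + collected data).
MB_PER_JOB = 1
JOBS_PER_DAY = 1_000_000            # ~12 jobs/sec rounded, from the traffic estimate above

total_mb = MB_PER_JOB * JOBS_PER_DAY * 365 * 10
print(f"{total_mb / 1e9:.2f} PB")   # ~3.65 PB over 10 years
```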
Abstract Design
Based on the information above, the request rate is modest, but at a few petabytes the data will need to be spread across multiple machines. I would break up the design into the following:
Application layer: serves the requests, shows UI details.
Data storage layer: acts like a big hash table, storing key-value mappings (the key would be the jobs organized by the dateTime they were run; the values would hold the details of those jobs). This enables easy search of historical and/or scheduled jobs.
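To make the key-value idea concrete, here is a minimal sketch of what one stored record and its key might look like; the field names are illustrative assumptions, not part of the design above:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class JobRecord:
    job_id: str
    run_at: datetime        # jobs are organized by the dateTime they were run
    state: str              # "pending" | "success" | "failed" | "terminated"
    duration_seconds: float
    details: str            # free-text execution details, logs, etc.

# Key-value layout: key = (run dateTime, job id), value = the record.
# A range scan over run_at answers "all jobs run between X and Y".
store: dict[tuple[datetime, str], JobRecord] = {}
```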
The bottlenecks:
Traffic: 12 jobs/sec is not too challenging. If this spikes, we can use a load balancer to distribute the jobs to different servers for execution.
Data: At ~3.6 PB, we need a hash table that can be easily queried for fast access to the jobs that have been executed in the application.
Scaling the abstract design
The nature of this job scheduler is that each job possesses one of a few states: Pending, Failed, Success, Terminated. It holds no business logic and returns little data.
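Those states map naturally onto a small enum; the transition rules below are an assumption (a pending job ends in exactly one terminal state), not something stated above:

```python
from enum import Enum

class JobState(Enum):
    PENDING = "pending"
    SUCCESS = "success"
    FAILED = "failed"
    TERMINATED = "terminated"

# Assumed transitions: only a pending job may move, and only to a terminal state.
ALLOWED = {
    JobState.PENDING: {JobState.SUCCESS, JobState.FAILED, JobState.TERMINATED},
    JobState.SUCCESS: set(),
    JobState.FAILED: set(),
    JobState.TERMINATED: set(),
}

def transition(current: JobState, new: JobState) -> JobState:
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new
```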
For handling the traffic, we can have an application server that handles 12 requests/sec and a backup in case this one fails. In the future, we can use a load balancer to reduce the number of requests going to each server (assuming more than one server is in production). The advantages would be fewer requests per server, higher availability (if one server fails), and better handling of spiky traffic.
For data storage, to store ~3.6 PB of data we will need a number of machines to hold it in a database. We can use a NoSQL or a SQL database. Given that the latter has more widespread use and community support (which helps when troubleshooting) and is used by large firms at the moment, I would choose MySQL.
As the data grows, I would adopt the following strategies to handle it:
1) Create a unique index on the hash key
2) Scale the MySQL database vertically by adding more memory
3) Partition the data by sharding (see the sketch after this list)
4) Use master-slave replication, with master-master replication between the masters, to ensure redundancy of the data
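As a minimal illustration of item 3, this is one way shard selection could work if records are hashed by job id; the shard count and the modulo scheme are assumptions for the sketch, not part of the design above:

```python
import hashlib

NUM_SHARDS = 16  # assumed; would really be sized from the ~3.6 PB estimate and per-node capacity

def shard_for(job_id: str) -> int:
    """Deterministically map a job id to a shard using a stable hash."""
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# The application layer routes reads/writes for a job to its shard,
# where each shard is its own MySQL instance (plus replicas).
print(shard_for("job-2024-01-01-000123"))
```

Plain modulo sharding makes adding shards painful later; consistent hashing, or range-based sharding on the run dateTime, would be worth considering if the shard count is expected to grow.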
Conclusion
Hence, this would be my design of the components of a job scheduler.
Most large-scale job schedulers consider aspects not covered in your document.
Some of the key issues are: (in no particular order)
I'm sure there's a heap more - try looking through the docs on slurm or grid-engine for more ideas.
Further things to consider:
Much of what you describe has been implemented by different frameworks for scheduling jobs and executing them. One that I know of is Quartz. While I would implement a few things differently than Quartz does, it is well documented and will give you many ideas about jobs and the obstacles they usually face.
The approach you are describing is good, but I would eliminate domain-specific concerns (such as parallel processing, sharding, scaling) from it. If jobs are going to be run on different machines, that is because the concrete case (e.g. jobs running for a financial bank) cannot fit on one machine. I don't think that you, as the developer of the job engine, should be worried about that. The reason is that you are developing a framework, not a productized application.
If you are going to introduce sharding for the job engine itself, I think you are over-estimating the complexity of the job engine. There will not be much contention on the job execution (framework) portion itself. However, a concrete implementation, such as banking software jobs, might need to work on the same data but different sets of it, and then you have sharding. So, in short, it is outside of your scope to introduce scaling mechanisms.
And one more thing: I don't see a parallel between job execution and message buses, so I am not commenting in that direction.
I would suggest you look into a message bus for this job. Or, if you are looking to learn the architecture that such a bus would allow, have a look at NServiceBus.
If you are using a bus, you can easily throttle your queue. It might slow your processing down, which means you will need to look into concurrency.
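As a rough sketch of what throttling plus concurrent consumers can look like (the worker count and bounded queue are assumptions for illustration, not tied to any particular bus):

```python
import queue
import threading

MAX_WORKERS = 4                                       # assumed throttle: at most 4 jobs in flight
jobs: "queue.Queue[str]" = queue.Queue(maxsize=100)   # bounded queue applies back-pressure

def run_job(job_id: str) -> None:
    print(f"running {job_id}")                        # placeholder for real job execution

def worker() -> None:
    while True:
        job_id = jobs.get()                           # blocks until work is available
        try:
            run_job(job_id)
        finally:
            jobs.task_done()

for _ in range(MAX_WORKERS):
    threading.Thread(target=worker, daemon=True).start()

for i in range(10):
    jobs.put(f"job-{i}")                              # put() blocks when the queue is full
jobs.join()                                           # wait until every queued job is processed
```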
It's often presumed that writing such a service is easy. It is not.
Some other things to think about..
What happens when a message fails? Does it get lost? Do you roll back? How do you scale your architecture? Can you add new clients/consumers easily?
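One common answer to the "what happens when a message fails" question is retry plus a dead-letter queue, so failures are neither lost nor retried forever; a minimal sketch, with the retry budget and names being assumptions:

```python
import queue

MAX_RETRIES = 3                                        # assumed retry budget
dead_letter: "queue.Queue[tuple[str, Exception]]" = queue.Queue()

def run_job(job_id: str) -> None:
    raise RuntimeError("simulated failure")            # placeholder so the sketch runs

def handle_with_retries(job_id: str) -> None:
    """Retry a failing job a few times, then park it for inspection instead of losing it."""
    for _ in range(MAX_RETRIES):
        try:
            run_job(job_id)
            return
        except Exception as exc:
            last_error = exc
    dead_letter.put((job_id, last_error))              # an operator or tool can replay it later

handle_with_retries("job-42")
print(dead_letter.qsize())                             # 1
```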