I'm working on a Mesos framework to run some jobs, and it seems like a great opportunity to learn about building a highly available system. To that end, I'm doing some reading on distributed systems, and I made the mistake of visiting Wikipedia.
The passage in question is talking about a principle of HA engineering:
Reliable crossover. In multithreaded systems, the crossover point itself tends to become a single point of failure. High availability engineering must provide for reliable crossover.
My Google-fu teaches me three things:
1) audio crossover devices split a single input into multiple outputs
2) genetic algorithms use crossover to combine solutions
3) buzzwordy white papers all copied from this Wikipedia article :/
My question: What does a 'crossover point' mean in this context, and why is it a single point of failure?
High availability refers to systems that offer a high level of operational performance and quality over a relevant time period. In computing, the term availability describes the period of time during which a service is available, as well as the time a system needs to respond to a request made by a user.
For instance, a system that guarantees 99% availability over one year can have up to 3.65 days of downtime (1% of 365 days). These values are calculated from several factors, including both scheduled and unscheduled maintenance periods and the time needed to recover from a possible system failure.
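The downtime figure is just the unavailable fraction multiplied by the length of the period. A minimal Python sketch, assuming a 365-day year (leap days ignored); the helper name `max_downtime_days` and its `period_days` parameter are hypothetical:

```python
def max_downtime_days(availability: float, period_days: float = 365.0) -> float:
    """Return the maximum downtime, in days, allowed by an availability target.

    availability is a fraction between 0 and 1 (e.g. 0.99 for 99%).
    period_days defaults to a 365-day year (an assumption; leap years are ignored).
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    # Downtime is the unavailable fraction of the period.
    return (1.0 - availability) * period_days


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        days = max_downtime_days(target)
        print(f"{target:.2%} -> {days:.4f} days ({days * 24:.2f} hours)")
```

With these assumptions, 99% allows 3.65 days per year and 99.9% allows 0.365 days (8.76 hours).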
Reliable crossover in this context means:
The ability to switch from node X (which has failed in some way) to node Y without losing data.
Non-reliable HA-database example:
Copy the database every 5 minutes to a passive node. => Here you can lose up to 5 minutes of data.
=> Here the copy action is the single point of failure.
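To make the data-loss window concrete, here is a minimal in-memory Python sketch of the periodic-copy setup. The `Primary` class, the `failover` function, and counting the copy interval in writes instead of minutes are all hypothetical simplifications, not a real database API:

```python
class Primary:
    """Hypothetical primary node that copies its data to a passive node periodically."""

    def __init__(self, snapshot_every: int) -> None:
        self.rows: list[str] = []
        self.passive_rows: list[str] = []  # the passive node's last received copy
        self.snapshot_every = snapshot_every  # stands in for "every 5 minutes"
        self._writes_since_copy = 0

    def insert(self, row: str) -> str:
        self.rows.append(row)
        self._writes_since_copy += 1
        if self._writes_since_copy >= self.snapshot_every:
            # Periodic copy: the passive node only sees data up to this point.
            self.passive_rows = list(self.rows)
            self._writes_since_copy = 0
        # Acknowledged to the client even if the passive node lacks the row.
        return "executed OK"


def failover(primary: Primary) -> list[str]:
    """The primary dies; the passive node takes over with whatever it last received."""
    return primary.passive_rows


if __name__ == "__main__":
    db = Primary(snapshot_every=5)
    for i in range(12):
        db.insert(f"row-{i}")
    survivors = failover(db)
    lost = [r for r in db.rows if r not in survivors]
    print("lost after failover:", lost)
```

Every write made after the last copy was acknowledged to the client but never reached the passive node, so it disappears when the crossover happens.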
Reliable HA-database example:
Setting up data replication where (for example) your insert statement only returns "executed OK" once the transaction has been copied to the secondary server.
(Yes, data replication is more complex than this; this is a simplified example in the context of the question.)
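A minimal Python sketch of that synchronous variant, assuming a single secondary and in-memory nodes. `Node`, `replicate`, and `insert` are hypothetical names; real replication protocols also deal with timeouts, retries, and split-brain:

```python
class Node:
    """Hypothetical in-memory database node."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.rows: list[str] = []
        self.alive = True

    def replicate(self, row: str) -> bool:
        """Apply a row sent by the primary; returning True is the acknowledgement."""
        if not self.alive:
            return False
        self.rows.append(row)
        return True


def insert(primary: Node, secondary: Node, row: str) -> str:
    """Report success only after the secondary has acknowledged the row."""
    primary.rows.append(row)  # write locally first
    if not secondary.replicate(row):
        primary.rows.pop()  # undo: without the copy, the write is not committed
        raise RuntimeError("secondary did not acknowledge; write not committed")
    return "executed OK"


if __name__ == "__main__":
    primary, secondary = Node("X"), Node("Y")
    for i in range(12):
        insert(primary, secondary, f"row-{i}")
    primary.alive = False  # node X breaks
    # Crossover to node Y: every acknowledged row is already there.
    print(secondary.rows == primary.rows)
```

The trade-off is that a write now fails (or, in real systems, typically blocks) while the secondary is unreachable; that is the price of never acknowledging data that the crossover target does not have.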