Availability



System availability refers to the percentage of time a system is operational and accessible. 




Strategies for Improving Availability


1. Redundancy: Duplication of critical components to avoid single points of failure.

2. Failover Mechanisms: Automatic switching to backup systems in case of failure.

3. Load Balancing: Distributes traffic to ensure smooth operation under heavy load.

4. Data Replication: Ensures data is accessible from multiple locations.

5. Monitoring and Alerts: Proactive system health tracking to prevent downtime.



Redundancy


High availability is often achieved by adding redundancy: when one instance fails, others take over. Redundancy involves having backup components that can take over when primary components fail, and it can be applied to various parts of a system, such as servers, databases, network connections, or power supplies.


In a web server cluster, multiple servers can serve the same application, and if one server fails, user requests can be automatically routed to another functioning server.


In data storage, data can be replicated across multiple servers or data centers to ensure data availability even in the event of hardware failures.



Techniques:

  • Server Redundancy: Deploying multiple servers to handle requests, ensuring that if one server fails, others can continue to provide service.
  • Database Redundancy: Creating a replica database that can take over if the primary database fails.
  • Geographic Redundancy: Distributing resources across multiple geographic locations to mitigate the impact of regional failures.


Let's explore common architectures with different forms of redundancy and their tradeoffs.


Hot-Cold


In the hot-cold architecture, there is a primary instance that handles all reads and writes from clients, as well as a backup instance. Clients interact only with the primary instance and are unaware of the backup. The primary instance continuously synchronizes data to the backup instance. If the primary fails, manual intervention is required to switch clients over to the backup instance.






This architecture is straightforward but has some downsides.

Cons:

  • The backup instance represents wasted resources, since it is idle most of the time.
  • If the primary fails, there is potential for data loss depending on the last synchronization time. When recovering from the backup, manual reconciliation of the current state is required to determine what data may be missing. This means clients need to tolerate potential data loss and resend missing information.
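The hot-cold pattern can be sketched in a few lines of Python. This is a toy model, not a production design: the class names are illustrative, and a real system would synchronize periodically over the network (which is exactly where the data-loss window comes from).

```python
class Instance:
    """A toy storage node holding a key-value dict."""
    def __init__(self):
        self.data = {}
        self.alive = True

class HotCold:
    """Hot-cold: clients talk only to the primary; the backup sits idle."""
    def __init__(self):
        self.primary = Instance()
        self.backup = Instance()

    def write(self, key, value):
        if not self.primary.alive:
            raise RuntimeError("primary down: manual failover required")
        self.primary.data[key] = value
        # Real systems sync periodically, not per write -> data-loss window.
        self.backup.data[key] = value

    def read(self, key):
        if not self.primary.alive:
            raise RuntimeError("primary down: manual failover required")
        return self.primary.data[key]

    def manual_failover(self):
        # An operator promotes the backup and repoints clients to it.
        self.primary = self.backup
```

Note that `manual_failover` must be invoked by a human (or an out-of-band runbook); until it runs, every request fails even though a healthy backup exists.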


Hot-Warm


The hot-cold architecture wastes resources since the backup instance is underutilized. The hot-warm architecture addresses this by allowing clients to read from the secondary/backup instance. If the primary fails, clients can still read from the secondary, albeit with reduced capacity.




Since reads are allowed from the secondary, data consistency between the primary and secondary becomes crucial. Even when the primary instance is functioning normally, reads can return stale data, since requests go to both instances.


For this reason, the hot-warm architecture is more suitable for read-heavy workloads like news sites and blogs. The tradeoff is potential stale reads even during normal operation, in exchange for more efficient resource utilization.
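A minimal sketch of the hot-warm read path, with asynchronous replication modeled as an explicit backlog (all names are illustrative). Reading the secondary before the backlog drains is exactly the stale read described above.

```python
class HotWarm:
    """Hot-warm: writes go to the primary; reads may hit the secondary,
    which lags behind by an async replication step."""
    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self._pending = []   # writes not yet shipped to the secondary

    def write(self, key, value):
        self.primary[key] = value
        self._pending.append((key, value))

    def replicate(self):
        # Apply the replication backlog; until this runs, the secondary
        # serves stale (or missing) data for recent writes.
        for key, value in self._pending:
            self.secondary[key] = value
        self._pending.clear()

    def read(self, key, from_secondary=False):
        store = self.secondary if from_secondary else self.primary
        return store.get(key)
```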


Hot-Hot


In the hot-hot architecture, both instances act as primaries and can handle reads and writes. This provides flexibility, but it also means writes can occur to both instances, requiring bidirectional state replication. This can lead to data conflicts if certain data needs sequential ordering.

The hot-hot architecture works best when replication needs are minimal, usually involving temporary data like user sessions and activities. Use caution with data requiring strong consistency guarantees.
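To make the conflict problem concrete, here is a toy sketch of two active primaries that resolve concurrent writes with last-write-wins, using a shared logical clock. The clock and merge logic are illustrative assumptions; real systems use vector clocks, CRDTs, or application-level resolution.

```python
import itertools

class HotHotNode:
    """One of two active primaries; each write is stamped with a
    logical clock so conflicts can be resolved last-write-wins."""
    _clock = itertools.count()   # shared stamp source (toy stand-in)

    def __init__(self):
        self.data = {}   # key -> (stamp, value)

    def write(self, key, value):
        self.data[key] = (next(HotHotNode._clock), value)

    def merge_from(self, other):
        # Bidirectional replication: keep the newer stamp per key.
        # Last-write-wins silently discards the losing write, which is
        # exactly why hot-hot is risky for data needing strict ordering.
        for key, entry in other.data.items():
            if key not in self.data or entry[0] > self.data[key][0]:
                self.data[key] = entry
```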




In modern systems dealing with large amounts of data, two instances are often insufficient. A common scaling approach is to partition (shard) the data across multiple hot-cold pairs, with each pair owning a subset of the data. If a hot instance fails, its requests fail over to the cold backup containing the same subset of data.
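A sketch of this sharded hot-cold layout, assuming hash-based key placement (the class and shard count are illustrative). Failover stays local to the affected shard: only its cold copy is consulted.

```python
import hashlib

class ShardedHotCold:
    """Shard keys across N hot-cold pairs; each pair owns a subset."""
    def __init__(self, num_shards=4):
        self.hot = [dict() for _ in range(num_shards)]
        self.cold = [dict() for _ in range(num_shards)]
        self.hot_alive = [True] * num_shards
        self.n = num_shards

    def _shard(self, key):
        # Stable hash so the same key always maps to the same pair.
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % self.n

    def write(self, key, value):
        i = self._shard(key)
        self.hot[i][key] = value
        self.cold[i][key] = value   # replicate within the pair

    def read(self, key):
        i = self._shard(key)
        # Fail over to the cold copy of the *same* shard only.
        store = self.hot[i] if self.hot_alive[i] else self.cold[i]
        return store.get(key)
```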



Fault Tolerance


Fault tolerance enables a system to continue operating even when components fail. Strategies for fault tolerance include:

  • Self-healing Systems: Systems can be designed to detect issues and automatically recover. For instance, if a service within a cluster crashes, an orchestrator can detect the failure and automatically restart the service on a healthy server. This proactive approach to system management reduces the need for manual intervention and minimizes downtime.
  • Graceful Degradation: If a system can’t function at full capacity, it should degrade gracefully, providing some level of service instead of a complete outage. For instance, if a video streaming service experiences issues, it might lower the video quality rather than abruptly cutting off the stream, allowing users to continue their viewing experience.
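The video-quality example above can be expressed as a small bitrate-selection function. The ladder values are illustrative; real players use adaptive bitrate ladders negotiated with the CDN.

```python
def choose_bitrate(available_kbps, ladder=(4000, 2000, 1000, 400)):
    """Graceful degradation for a video stream: pick the highest
    bitrate the current bandwidth supports instead of failing."""
    for bitrate in ladder:
        if available_kbps >= bitrate:
            return bitrate
    # Worst case: lowest quality, but the stream keeps playing.
    return ladder[-1]
```

The key property is the final fallback: even under severe bandwidth pressure the function returns *some* playable bitrate rather than raising an error, which is the essence of degrading gracefully.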


Failover Mechanisms


Failover mechanisms automatically switch to a redundant system when a failure is detected.





Techniques:


  • Active-Passive Failover: A primary active component is backed by a passive standby component that takes over upon failure.
  • Active-Active Failover: All components are active and share the load. If one fails, the remaining components continue to handle the load seamlessly.
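Active-passive failover is usually driven by missed heartbeats. The sketch below uses a tick-based monitor with an assumed threshold of three missed beats; real deployments tune the interval and threshold to balance detection speed against false failovers.

```python
class ActivePassive:
    """Active-passive failover driven by missed heartbeats."""
    def __init__(self, max_missed=3):
        self.active = "A"
        self.standby = "B"
        self.missed = 0
        self.max_missed = max_missed

    def on_heartbeat(self):
        # The active node checked in; reset the miss counter.
        self.missed = 0

    def on_tick(self):
        # Called by the monitor every interval with no heartbeat seen.
        self.missed += 1
        if self.missed >= self.max_missed:
            # Promote the standby and demote the (presumed dead) active.
            self.active, self.standby = self.standby, self.active
            self.missed = 0
```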


Data Replication


Data replication involves copying data from one location to another to ensure that data is available even if one location fails.




Techniques:


  • Synchronous Replication: Data is replicated in real-time to ensure consistency across locations.
  • Asynchronous Replication: Data is replicated with a delay, which can be more efficient but may result in slight data inconsistencies.
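The synchronous/asynchronous tradeoff can be shown side by side in a toy store. A synchronous write is acknowledged only once the replica has it; an asynchronous write acknowledges immediately and ships later, so a crash before the backlog drains loses acknowledged data.

```python
class ReplicatedStore:
    """Toy contrast of synchronous vs asynchronous replication."""
    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.backlog = []   # async writes not yet on the replica

    def write_sync(self, key, value):
        # Synchronous: replica is updated before we acknowledge, so a
        # failover never loses an acknowledged write (at latency cost).
        self.primary[key] = value
        self.replica[key] = value

    def write_async(self, key, value):
        # Asynchronous: acknowledge immediately, replicate later.
        self.primary[key] = value
        self.backlog.append((key, value))

    def drain(self):
        # Background replication step applying the backlog.
        for key, value in self.backlog:
            self.replica[key] = value
        self.backlog.clear()
```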


Monitoring & Alerts

Continuous health monitoring involves checking the status of system components to detect failures early and trigger alerts for immediate action.


Techniques:


  • Heartbeat Signals: Regular signals sent between components to check their status.
  • Health Checks: Automated scripts or tools that perform regular health checks on components.
  • Alerting Systems: Tools like PagerDuty or OpsGenie that notify administrators of detected issues.
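A heartbeat monitor of the kind described above can be sketched as a map from component to last-seen time; anything silent longer than the timeout is flagged for alerting. Timestamps are injectable here for clarity, which is an implementation convenience, not a requirement.

```python
import time

class HeartbeatMonitor:
    """Tracks last-seen heartbeat per component and flags any that
    have been silent longer than `timeout` seconds."""
    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.last_seen = {}

    def beat(self, component, now=None):
        # Record a heartbeat (callers may inject `now` for testing).
        self.last_seen[component] = time.time() if now is None else now

    def unhealthy(self, now=None):
        # Components whose last heartbeat is older than the timeout;
        # an alerting tool would page on a non-empty result.
        now = time.time() if now is None else now
        return [c for c, t in self.last_seen.items()
                if now - t > self.timeout]
```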
