How can you avoid single points of failure in virtual infrastructure?

When a data center has only physical servers, a server failure typically affects only one workload. A virtual host runs multiple workloads, so a single host failure can take down many applications at once. Most enterprises that use server virtualization rely on technologies such as failover clustering or replication to handle hypervisor-level failures. However, these technologies go only part of the way toward protecting virtual workloads, because a cluster on its own is often insufficient. Even a high availability cluster of virtual hosts can still fail: if any component of the virtual infrastructure remains a single point of failure, an outage can still occur.

Although it is possible to eliminate every conceivable single point of failure, doing so requires a very large budget. In most cases, companies must identify the potential risks and then assess the likelihood that each risk will turn into an actual problem. The money should then go toward the biggest risks. This also raises the question of what counts as a potential single point of failure. The true risk of failure can vary widely, depending on which vendor's product is used and how the virtual infrastructure is deployed. Some risks are related to hardware and some are related to software. Hardware-related risks can involve any piece of hardware in the virtual infrastructure. Take power as an example: many virtual hosts are configured with redundant power modules. When one power module fails, the second takes over without downtime, so the host server does not suffer the consequences of the failure.

A virtual host is usually connected to a UPS and can fall back on a generator in the event of a power failure. However, if all servers depend on the same generator when main power fails, that generator is itself a potential single point of failure. This is where risk assessment comes in. Several things must go wrong before a backup generator failure affects the entire virtual infrastructure: main power must be interrupted first, and the generator must then fail as well. In most cases it is not necessary to plan for the failure of the backup generator, because the likelihood that it becomes the single point of failure that takes down the infrastructure is very small.
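The reasoning above can be illustrated with a rough calculation. This is only a sketch: it assumes the two events are independent, and the probabilities are hypothetical placeholders, not measured values.

```python
def joint_outage_probability(p_main_outage: float, p_generator_fails: float) -> float:
    """Probability that main power is lost AND the backup generator fails.

    Assumes the two events are independent, which is a simplification.
    """
    for p in (p_main_outage, p_generator_fails):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
    return p_main_outage * p_generator_fails


# Hypothetical yearly probabilities, chosen only for illustration.
p_main = 0.05  # main power is interrupted
p_gen = 0.02   # generator fails to carry the load when it is needed

p_total = joint_outage_probability(p_main, p_gen)
print(f"Chance that both fail together: {p_total:.4f}")
```

Because both events have to happen together, the combined probability is much smaller than either one alone, which is why the backup generator usually ranks low in a risk assessment.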

As mentioned earlier, although it is possible to eliminate every single point of failure, the cost is very high. Imagine a scenario in which different groups of servers have separate backup generators. Even this does not necessarily eliminate every potential single point of failure. If the fuel for these generators comes from the same source and that fuel happens to be contaminated with water, then the fuel supply becomes the single point of failure. Note that, here too, several other things must go wrong before such a failure occurs.

In a clustered environment, shared storage is a more common single point of failure. Cluster storage is usually configured with redundant disks, but if redundancy stops there, the array itself, the switches, and the cables can each still fail. On the software side, an infrastructure server can become a single point of failure if it is not deployed redundantly. For example, suppose an enterprise intends to deploy System Center Virtual Machine Manager (SCVMM) as a tool for managing Hyper-V. SCVMM can become a single point of failure unless it is deployed on a highly available virtual machine. Similarly, the SQL Server database that SCVMM relies on can also be a single point of failure unless the database is redundant as well. Other potential single points of failure include DNS servers, domain controllers, DHCP servers, backup servers, and Internet gateways.
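One simple way to look for the software-side gaps described above is to list each infrastructure role together with the instances that provide it and flag any role served by fewer than two. The sketch below assumes a hand-maintained inventory; the instance names (vmm-01, sql-01, and so on) are hypothetical, and counting instances does not detect shared dependencies such as two DNS servers running on the same host.

```python
from typing import Dict, List


def find_single_points_of_failure(inventory: Dict[str, List[str]]) -> List[str]:
    """Return the roles served by fewer than two distinct instances, sorted by name."""
    return sorted(
        role for role, instances in inventory.items() if len(set(instances)) < 2
    )


# Hypothetical inventory: role -> instances currently providing that role.
inventory = {
    "SCVMM": ["vmm-01"],
    "SQL Server (SCVMM database)": ["sql-01", "sql-02"],
    "DNS": ["dns-01", "dns-02"],
    "Domain controller": ["dc-01", "dc-02"],
    "DHCP": ["dhcp-01"],
    "Backup server": ["backup-01"],
    "Internet gateway": ["gw-01", "gw-02"],
}

spofs = find_single_points_of_failure(inventory)
for role in spofs:
    print(f"Single point of failure: {role}")
```

In this made-up inventory the roles with only one instance are the backup server, DHCP, and SCVMM; whether each one is worth fixing is a separate risk question.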

For most enterprises, eliminating every possible single point of failure is not realistic. A better strategy is to identify the single points of failure and then evaluate the risk level of each one.
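A minimal sketch of that strategy follows. It assumes a common likelihood-times-impact scoring scheme on a 1 to 5 scale, which is an illustrative choice rather than something the article prescribes, and the scores themselves are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: int  # 1 (rare) to 5 (frequent), hypothetical scale
    impact: int      # 1 (minor) to 5 (entire virtual infrastructure down)

    @property
    def score(self) -> int:
        # Simple likelihood x impact product; other weightings are possible.
        return self.likelihood * self.impact


def rank_risks(risks: List[Risk]) -> List[Risk]:
    """Return the risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical scores for single points of failure discussed in this article.
risks = [
    Risk("Backup generator", likelihood=1, impact=5),
    Risk("Shared storage array", likelihood=2, impact=5),
    Risk("Contaminated generator fuel", likelihood=1, impact=4),
    Risk("SCVMM on a single virtual machine", likelihood=3, impact=3),
]

ranked = rank_risks(risks)
for risk in ranked:
    print(f"{risk.score:>2}  {risk.name}")
```

The items at the top of the ranking are the ones to spend on first; the low-scoring items can often be accepted as residual risk.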

