This section describes settings for fault tolerance of user data and management functions in Multi-AZ configuration.
The following table shows the combinations of settings that must be specified to create a plus-1 redundant configuration. Combinations other than those shown in the table are not allowed.
If a failure occurs in the storage cluster while the write back mode with cache protection is disabled, data on the snapshot volume might be lost. When the write back mode with cache protection is enabled, data on the snapshot volume is protected.
However, even when the write back mode with cache protection is enabled, the data on the snapshot volume is not protected if the number of failures exceeds the redundancy of the storage controller.
| Configuration | User data protection method | Redundancy of the storage controller¹ | Number of cluster master nodes | Number of fault domains |
|---|---|---|---|---|
| 1. Plus-1 redundant configuration² | Mirroring Duplication | OneRedundantStorageNode (degree = 2) | 3 nodes | 3 |

1. Do not explicitly specify the degree of redundancy of the storage controller, because it is automatically determined depending on the user data protection method.
2. In the following cases, only one failure is assumed to have occurred, irrespective of the number of failures:
   - One or more drive failures occurred on a faulty storage node.
   - Drive failures occurred on a single storage node.
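For illustration only, the allowed combination can be captured as a small lookup. This is a hypothetical sketch; the dictionary keys and the `is_allowed` helper are not part of the VSP One SDS Block API.

```python
# Hypothetical sketch (not a product API): the only allowed Multi-AZ
# combination of fault-tolerance settings, taken from the table above.
ALLOWED_MULTI_AZ_COMBINATION = {
    "user_data_protection": "Mirroring Duplication",
    # Determined automatically from the protection method; never specified explicitly.
    "storage_controller_redundancy": "OneRedundantStorageNode (degree = 2)",
    "cluster_master_nodes": 3,
    "fault_domains": 3,
}

def is_allowed(settings: dict) -> bool:
    """Return True only when the settings match the combination in the table."""
    return settings == ALLOWED_MULTI_AZ_COMBINATION
```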
The following subsections provide an overview, features, and notes for each of these settings.
User data protection methods
VSP One SDS Block supports Mirroring as a way to protect user data. Mirroring is a data protection method that stores a copy of user data on another storage node.
To configure Mirroring, select Duplication. Note, however, that this selection might place restrictions on the combinations of functions that can be used.
- User data and its copies are stored on two different storage nodes for redundancy.
- At least three storage nodes (including one tiebreaker node) are required.
- You can use a maximum of 40% to 48% of the physical capacity. However, if the rebuild capacity policy (rebuildCapacityPolicy) is set to "Fixed" (default), you can use a maximum of 40% to 48% of the physical capacity excluding the rebuild capacity on each storage node. For details about the rebuild capacity, see Rebuild capacity of a storage pool in the VSP One SDS Block System Administrator Operation Guide. (A worked capacity calculation is sketched at the end of this subsection.)
- The allowable number of defective storage nodes or drives is 1. This number is the sum of the number of defective storage nodes and the number of defective drives. However, the failures are counted as one failure in the following cases:
  - One or more drive failures occurred on a faulty storage node.
  - Drive failures occurred on a single storage node.

  However, two or more failures can be allowed except in the following cases:
  - Condition 1: Storage node or drive failures occur on both storage nodes that redundant storage controllers belong to. For details about storage controllers, see Redundancy of the storage controller.

    CAUTION: Failures might not be allowed even when Condition 1 is not met in the following cases:
    - After adding storage nodes, until drive data relocation is completed
  - Condition 2: Failures occur on two or more cluster master nodes. For details about cluster master nodes, see Redundancy of the cluster master node.
Volumes cannot be created on the tiebreaker node because the tiebreaker node has no drives.
For how to design the capacity for the Mirroring Duplication method, see Capacity design (for Mirroring) in the VSP One SDS Block System Administrator Operation Guide.
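The usable-capacity rule above can be shown with a small, illustrative calculation. The figures and the function name below are examples only; the actual rebuild capacity comes from the storage pool settings described in the System Administrator Operation Guide.

```python
def usable_capacity_range_tib(physical_tib: float,
                              rebuild_capacity_tib: float = 0.0,
                              policy: str = "Fixed") -> tuple:
    """Illustrative sketch: usable capacity is 40% to 48% of the physical
    capacity; with the default "Fixed" rebuild capacity policy, the rebuild
    capacity on each storage node is excluded first."""
    base = physical_tib - rebuild_capacity_tib if policy == "Fixed" else physical_tib
    return (0.40 * base, 0.48 * base)

# Example: 100 TiB of physical capacity with 10 TiB reserved as rebuild capacity
low, high = usable_capacity_range_tib(100.0, 10.0)
print(f"Usable capacity: {low:.1f} to {high:.1f} TiB")   # 36.0 to 43.2 TiB
```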
Redundancy of the storage controller
See OneRedundantStorageNode (degree = 2) in Settings for fault tolerance of user data and management functions (Single-AZ configuration).
It is not possible to configure redundancy for the tiebreaker node because it has no storage controller.
Redundancy of the cluster master node
Storage nodes are classified into cluster master nodes and cluster worker nodes. Cluster master nodes are further classified into primary and secondary nodes. Only one cluster master node (primary) in a storage cluster manages and controls the entire storage cluster. If a failure occurs on the cluster master node (primary), one of the cluster master nodes (secondary) becomes the primary node, so that the entire storage cluster can continue to operate.
Among the selected storage nodes, which one becomes the primary node (and which become secondary nodes) is determined automatically within the storage cluster.
The system can continue to operate if a maximum of one cluster master node becomes faulty.
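The failover behavior described above can be summarized in a few lines. This is a conceptual sketch only, not product code; the node names and functions are hypothetical.

```python
# Conceptual sketch only (not product code) of the roles described above:
# one cluster master node (primary), two cluster master nodes (secondary),
# and continued operation as long as at most one of them is faulty.
masters = {"node-1": "primary", "node-2": "secondary", "node-3": "secondary"}

def cluster_can_operate(faulty: set) -> bool:
    """At most one faulty cluster master node is tolerated."""
    return sum(1 for node in faulty if node in masters) <= 1

def fail_over(faulty: set) -> dict:
    """If the primary fails, one healthy secondary takes over as primary."""
    roles = {n: ("faulty" if n in faulty else r) for n, r in masters.items()}
    if "primary" not in roles.values():
        for node, role in roles.items():
            if role == "secondary":
                roles[node] = "primary"
                break
    return roles

print(cluster_can_operate({"node-1"}))  # True: a single failure is tolerated
print(fail_over({"node-1"}))            # node-2 becomes the new primary
```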
Fault domain
A fault domain is a group of storage nodes located in a single Availability Zone. The number of fault domains is three. By allocating a different Availability Zone to each fault domain and deploying data, parity data, and storage controllers across different fault domains so that user data is protected, operation can continue even if a hardware failure (such as a failure in the AWS data center) occurs in one fault domain, provided that the other two fault domains are running normally.
- Only the plus-1 redundant configuration can be used. The allowable number of storage node or drive failures is as follows (a sketch at the end of this subsection restates these rules):
  - If all storage node or drive failures are within the same fault domain, the failures can be allowed even if all storage nodes or drives in that fault domain fail.
  - If storage node or drive failures occur in different fault domains, the failures can be allowed except in the following cases:
    - Condition 1: Storage node or drive failures occur on both storage nodes that redundant storage controllers belong to.
    - Condition 2: Failures occur on two or more cluster master nodes.
- The number of storage nodes (including one tiebreaker node) that can be configured is: 3, 5, 7, 9, 11, 13, 15, 17, or 19.
- Storage nodes can be added in units of 2.
- A tiebreaker node cannot be added.
- Storage nodes and the tiebreaker node cannot be removed.
- The same number of storage nodes is deployed in each of two fault domains. Also, one cluster master node is deployed in each fault domain (for a total of three such nodes, including one tiebreaker node).
- When each fault domain in a multiple-fault-domain configuration has a large capacity, recovery from a failure of the entire Availability Zone to which a fault domain belongs takes a long time, because storage nodes are recovered one by one. For specific recovery procedures and recovery times, see Performing maintenance recovery for storage nodes in the VSP One SDS Block System Administrator Operation Guide.
- Communication between Availability Zones occurs because of user data redundancy. Also, when communication between a storage cluster and a controller node or compute node spans multiple Availability Zones, communication between those Availability Zones occurs. For details about communication pricing, see the AWS website.
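A minimal sketch of the failure-allowance rules and node distribution above, under a hypothetical data model: the node fields and function names are ours, not how the product evaluates failures, and each storage node is simplified to membership in a single storage-controller pair.

```python
from dataclasses import dataclass

@dataclass
class FailedNode:
    """Hypothetical model of a storage node that failed or has failed drives."""
    name: str
    fault_domain: str        # the Availability Zone the node belongs to
    is_cluster_master: bool  # one cluster master node is deployed per fault domain
    controller_pair: str     # ID of the redundant storage-controller pair

def failures_allowed(failed: list) -> bool:
    """Restates the Multi-AZ (plus-1) failure-allowance rules in this section."""
    # Failures confined to a single fault domain are allowed, even if every
    # storage node or drive in that fault domain fails.
    if len({f.fault_domain for f in failed}) <= 1:
        return True
    # Condition 1: both storage nodes that a pair of redundant storage
    # controllers belongs to have failed.
    pairs = [f.controller_pair for f in failed]
    if any(pairs.count(p) >= 2 for p in set(pairs)):
        return False
    # Condition 2: failures occur on two or more cluster master nodes.
    if sum(f.is_cluster_master for f in failed) >= 2:
        return False
    return True

def node_layout(total_nodes: int) -> dict:
    """Hypothetical helper: storage nodes are split evenly across two fault
    domains; the third fault domain holds only the tiebreaker node."""
    assert total_nodes in range(3, 20, 2), "allowed totals: 3, 5, 7, ... 19"
    per_domain = (total_nodes - 1) // 2
    return {"fault-domain-1": per_domain, "fault-domain-2": per_domain,
            "fault-domain-3 (tiebreaker)": 1}

# One storage node failure in each of two fault domains: allowed, because the
# failed nodes share neither a storage-controller pair nor two master roles.
print(failures_allowed([
    FailedNode("sn-01", "az-a", is_cluster_master=True,  controller_pair="p1"),
    FailedNode("sn-04", "az-b", is_cluster_master=False, controller_pair="p2"),
]))                   # True
print(node_layout(9)) # {'fault-domain-1': 4, 'fault-domain-2': 4, 'fault-domain-3 (tiebreaker)': 1}
```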
Spread placement group
- A group of EC2 instances, each deployed on different hardware in the AWS data center.
- Every seven nodes are defined as one spread placement group within an Availability Zone. (A small illustration follows this list.)
- For details about the allowable number of failures for storage nodes or drives, see Fault domain described earlier.
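As a small illustration of the grouping rule above: the function is hypothetical (not an AWS or product API) and assumes that a remaining group of fewer than seven nodes also forms its own spread placement group.

```python
import math

def spread_placement_group_count(nodes_in_az: int, group_size: int = 7) -> int:
    """Every seven nodes within an Availability Zone form one spread placement group."""
    return math.ceil(nodes_in_az / group_size)

print(spread_placement_group_count(9))  # 2 groups: one with 7 nodes, one with 2 nodes
```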