Replace the faulty drive with another drive.
Notes on the drive auto-recovery function
The bare metal model provides a drive auto-recovery function that automatically recovers drives from failures caused by drive response delays. When the operating conditions for the drive auto-recovery function are met, you do not need to replace the drive; wait until drive auto recovery completes.
For details about how to determine whether to wait for drive auto recovery or to replace a drive, see Action to be taken when "Alerting" is shown in the drive health status in the VSP One SDS Block Troubleshooting Guide.
Drive auto recovery does not work in the following cases. In these cases, perform drive replacement as described in this section.
- When the drive failure is not caused by drive response delays
When the drive failure is caused by drive response delays, event log KARS05012-E is output.
- When the rebuild capacity allocation status is other than "Sufficient" and Rebuild is inoperable
However, even if the rebuild capacity allocation status is other than "Sufficient", Rebuild might be operable for some faulty drives. In such a case, drive auto recovery also operates after Rebuild completes. For details about the rebuild capacity allocation status and how to verify it, see Managing storage pools.
- When the allocated rebuildable capacity is not sufficient
- When the drive failure occurred during drive auto recovery
- When the drive cannot be returned to a state in which the metadata redundancy for cache protection is not degraded
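The cases above amount to a simple decision: if any of them applies, auto recovery will not run and the drive must be replaced manually. The following is a minimal sketch of that decision; the parameter names are illustrative, not product API fields, and it collapses the two rebuild-capacity cases into one check.

```python
def needs_manual_replacement(caused_by_response_delay: bool,
                             rebuild_capacity_sufficient: bool,
                             failed_during_auto_recovery: bool,
                             metadata_redundancy_recoverable: bool) -> bool:
    """Return True when drive auto recovery will not work and the drive
    must be replaced as described in this section."""
    if not caused_by_response_delay:
        # No KARS05012-E event log: the failure is not a response delay.
        return True
    if not rebuild_capacity_sufficient:
        # Covers both "allocation status other than Sufficient with Rebuild
        # inoperable" and "allocated rebuildable capacity not sufficient".
        return True
    if failed_during_auto_recovery:
        # The drive failed again while auto recovery was in progress.
        return True
    if not metadata_redundancy_recoverable:
        # Metadata redundancy for cache protection would stay degraded.
        return True
    return False
```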
The drive auto-recovery function is always enabled and cannot be
disabled.
If you want to replace a drive that is subject to frequent response delays (for example, when drive failure and drive auto recovery repeatedly occur due to drive response delays), see Action to be taken to replace a drive that is subject to frequent delays in the VSP One SDS Block Troubleshooting Guide, and then take action.
Note:
Data on drives for which drive auto recovery was performed has already been rebuilt onto other drives. In this case, the recovered drive's capacity is reserved as free space for Rebuild, and is not accessed unless an event that changes the drive data location (such as Rebuild or storage pool expansion) occurs.
1. Verify the ID of the faulty drive to be removed and the ID of the storage node that has the faulty drive.
Also record the WWID of the faulty drive to be removed. The WWID is used to identify the faulty drive when you remove it from the server.
Run either of the following commands with "Blockage" specified
for the query parameter "status."
REST API: GET /v1/objects/drives
CLI: drive_list
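Step 1 can be scripted against the REST API, for example. The sketch below assumes a GET /v1/objects/drives response shaped like {"data": [...]} with id, storageNodeId, wwid, and status fields; the exact field names are assumptions, so verify them against your cluster's actual response.

```python
def find_blocked_drives(drive_list_response: dict) -> list[dict]:
    """Extract the ID, storage node ID, and WWID of each drive whose
    status is "Blockage" from a drive-list response."""
    return [
        {"id": d["id"], "storageNodeId": d["storageNodeId"], "wwid": d["wwid"]}
        for d in drive_list_response.get("data", [])
        if d.get("status") == "Blockage"
    ]

# Illustrative response with invented values:
sample = {"data": [
    {"id": "d1", "status": "Normal",   "wwid": "50000000aaaa0001", "storageNodeId": "n1"},
    {"id": "d2", "status": "Blockage", "wwid": "50000000aaaa0002", "storageNodeId": "n2"},
]}
blocked = find_blocked_drives(sample)  # only the drive with status "Blockage"
```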
2. Verify the status of the storage node containing the faulty drive.
Run either of the following commands with the storage node ID
containing the faulty drive specified.
REST API: GET /v1/objects/storage-nodes/<id>
CLI: storage_node_show
Go to the next step when the status of the storage node is
"Ready" or "RemovalFailed."
3. Turn on the locator LED of the drive to be removed.
Run either of the following commands with "TurnOn" specified for
the operationType parameter (operation_type in the case of CLI).
REST API: POST /v1/objects/drives/<id>/actions/control-locator-led/invoke
CLI: drive_control_locator_led
Verify the job ID that is displayed after the command is run.
CAUTION:
If a storage node failure occurs during a drive replacement operation, the locator LED on/off status shown by the REST API, CLI, or VSP One SDS Block Administrator might differ from the on/off status of the locator LED on the physical drive. The status shown by the REST API, CLI, or VSP One SDS Block Administrator is updated and corrected after the storage node recovers from the failure.
Note:
If the configuration differs from those described in the VSP One SDS Block Hardware Compatibility Reference, locator LED operation might not be available. In this case, confirm the drive location by performing the procedure described in the Note in step 5.
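Building the request for the locator-LED action can look like the following sketch. The base URL and authentication are omitted and must be added for your cluster; the "TurnOff" value is an assumption, since this section only shows "TurnOn".

```python
import json

def locator_led_request(drive_id: str, turn_on: bool = True) -> tuple[str, str]:
    """Build the URL path and JSON body for
    POST /v1/objects/drives/<id>/actions/control-locator-led/invoke."""
    path = f"/v1/objects/drives/{drive_id}/actions/control-locator-led/invoke"
    # "TurnOn" is the value shown in this procedure; "TurnOff" is assumed.
    body = json.dumps({"operationType": "TurnOn" if turn_on else "TurnOff"})
    return path, body
```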
4. Verify the state of the job.
Run either of the following commands with the job ID
specified.
REST API: GET /v1/objects/jobs/<jobID>
CLI: job_show
If the job state is "Succeeded", the job is completed.
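Steps such as this one (and step 11 later), which wait for a job to reach "Succeeded", can be wrapped in a small polling helper. In this sketch, `fetch_job` stands in for GET /v1/objects/jobs/&lt;jobID&gt;, and the "Failed" terminal-state name is an assumption; only "Succeeded" appears in this procedure.

```python
import time
from typing import Callable

def wait_for_job(fetch_job: Callable[[], dict],
                 interval_s: float = 5.0, timeout_s: float = 600.0) -> dict:
    """Poll a job until its state is "Succeeded", then return the job object.
    Raises if the job fails (assumed state name) or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job()
        state = job.get("state")
        if state == "Succeeded":
            return job
        if state == "Failed":  # assumed failure-state name, not confirmed
            raise RuntimeError(f"job failed: {job}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {state!r} after {timeout_s}s")
        time.sleep(interval_s)
```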
5. On the server, find the drive whose locator LED is lit and confirm its mounting position.
Then, remove the faulty drive from the server.
For details, see the documentation of your server vendor.
CAUTION:
- If a failure occurs during a drive replacement operation, the locator LED might be turned off. In such a case, resume from step 3.
- If you interrupt a drive replacement operation and perform a maintenance operation that requires the storage node to be restarted, the locator LED might be turned off. In such a case, resume from step 3.
Note:
- If the locator LED is not lit, confirm the mounting position of the drive to be removed as follows: find the drive whose WWID, recorded in step 1, matches the WWN or EUI value recorded at the time of expansion, and confirm the location in which the drive associated with that WWN or EUI is installed.
- If the value recorded at the time of drive addition was a WWN, the last 1 to 3 digits of the right-side 16-digit part of the WWID recorded in step 1 might differ.
6. Perform the steps from Inserting drives (Bare metal) in Adding drives through step 5 of Expanding storage pools.
Note that the drive you add must be a new drive, not the failed drive that you removed in step 5.
CAUTION:
If you are replacing multiple drives at the same time, perform steps 1 through 6 (the physical drive removal and expansion operations) one drive at a time. When you have completed step 6 for all drives, perform step 7 and the subsequent steps.
7. Verify the state of the write back mode with cache protection.
REST API: GET /v1/objects/storage
CLI: storage_show
Take the following action according to the state of the write
back mode with cache protection (writeBackModeWithCacheProtection).
- If the state is "Disabled" or "Enabling", go to step 9.
- If the state is "Enabled" or "Disabling", go to the next step.
8. See Confirming metadata redundancy for cache protection to verify that metadata redundancy for cache protection is not degraded.
If the redundancy is not degraded, go to the next step.
If the redundancy is degraded, wait until it recovers. If event log KARS06596-E is output, take action according to the event log, and then perform step 8 again.
Note:
If the storage node is blocked, the metadata redundancy for cache protection does not recover unless the storage node is recovered by a maintenance operation. Recover the blocked storage node first by performing a maintenance operation.
9. See Verifying Rebuild status and determine whether Rebuild is being performed or whether an error has occurred during Rebuild.
If Rebuild is not being performed and no error has occurred, go to the next step.
If Rebuild is being performed or an error has occurred during Rebuild, take appropriate action (see Verifying Rebuild status).
Note:
Before proceeding to the next step, obtain a list of the
drives and verify that the target faulty drive exists. If the target
faulty drive does not exist, go to step 13. For how to verify the target
faulty drive, see step 1.
10. Remove the faulty drive.
Run either of the following commands with the faulty drive ID
obtained in step 1 specified.
REST API: POST /v1/objects/drives/<id>/actions/remove/invoke
CLI: drive_remove
Verify the job ID that is displayed after the command is run.
11. Verify the state of the job.
Run either of the following commands with the job ID
specified.
REST API: GET /v1/objects/jobs/<jobId>
CLI: job_show
If the job state is "Succeeded", the job is completed.
12. Obtain a list of drives and verify that the target faulty drive has been removed.
After step 10, removal of the drive might take approximately one
minute.
REST API: GET /v1/objects/drives
CLI: drive_list
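Because removal can take about a minute after step 10, the check in this step can be retried, for example as in the following sketch. `list_drives` stands in for GET /v1/objects/drives, and the {"data": [...]} response shape with an `id` field is an assumption.

```python
import time
from typing import Callable

def wait_until_removed(list_drives: Callable[[], dict], drive_id: str,
                       interval_s: float = 10.0, timeout_s: float = 120.0) -> bool:
    """Return True once drive_id no longer appears in the drive list,
    or False if it is still present after timeout_s seconds."""
    deadline = time.monotonic() + timeout_s
    while True:
        ids = {d.get("id") for d in list_drives().get("data", [])}
        if drive_id not in ids:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```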
13. Back up the configuration information.
Perform this step by referring to Backing up the configuration information (Bare metal).
If you continue operations with other procedures, you must back
up the configuration information after you have completed all operations.