HCP for cloud scale includes predefined Grafana dashboards.
A dashboard contains one or more rows of panels that present metrics in charts and graphs. Dashboards collect and show metrics such as S3 bucket and object data, system health information, and encryption activity.
The color of a panel indicates the health of the components that are represented in the panel. To see the colors used to represent health status, see Dashboard conventions.
The metrics that are shown in a dashboard reflect the reporting period selected in the time range list in the dashboard toolbar.
- Dashboard 1: System Health
- The System Health dashboard is the primary source of information about the overall status of the system and is designed to be the first place to check for a high-level assessment of system health. This dashboard consists of a single row with multiple panels.
You can click a panel name to view additional information about that panel. The following list describes some of the panels provided. Additional information about overall system health, such as service uptime status, metadata space used and available, and the percentage of used metadata capacity, is presented in other panels in the dashboard.
- Database Partitions Per Node
- Shows the status of the number of metadata partitions per node. If no node has more than 1000 partitions, this panel is green with a status of Normal.
- If any node has more than 1000 partitions, the panel color changes and a message is provided as described in the following table. The risk of data unavailability increases as the number of partitions per node increases. Contact Hitachi Vantara to add nodes to distribute the partitions.

| Panel color | Number of partitions | Message |
| --- | --- | --- |
| Yellow | 1000-1500 | Partition count is high. More nodes are required soon. |
| Light red | 1501-2500 | IMPORTANT: Partition count is very high. Add more nodes. |
| Red | Over 2500 | WARNING: The partition count is extremely high. Add more nodes immediately. |

- Database Partition Protection
- Shows the database partition protection status.
- Healthy
- Shows that all partitions are fully protected and have copies on three nodes in a cluster.
- Degraded
- Shows that one or more partitions have a 2x protection level because they are copied on only two nodes in a cluster. Degraded partition protection does not affect the function of HCP for cloud scale, but it might result in performance degradation and requires attention to restore full protection.
- Failure Detected - Unprotected
- Shows that one or more partitions have no protection because each of them is on only one node. Unprotected partitions require immediate attention because S3 applications are likely to experience errors for PUT object requests.
- Overprotected
- Shows that one or more partitions have 4x protection; that is, the partitions are copied on four nodes in a cluster. This situation might occur because a node containing a partition was initially unavailable and then recovered.
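The thresholds and states for the two database panels above can be summarized in a short Python sketch. It is illustrative only and is not how HCP for cloud scale computes panel status. The function names, inputs, and return strings are hypothetical. A count of exactly 1000 partitions is treated as Normal, following the prose ("more than 1000") rather than the table's 1000-1500 band. The precedence used when partitions are in different protection states (most severe state wins) is an assumption.

```python
from typing import Iterable


def partition_panel_color(partitions_per_node: Iterable[int]) -> str:
    """Return the panel color for the node with the most partitions."""
    worst = max(partitions_per_node, default=0)
    if worst > 2500:
        return "red"        # WARNING: add more nodes immediately
    if worst > 1500:
        return "light red"  # IMPORTANT: add more nodes
    if worst > 1000:
        return "yellow"     # more nodes are required soon
    return "green"          # Normal


def protection_status(copies_per_partition: Iterable[int]) -> str:
    """Map the number of node copies of each partition to a panel status.

    Assumption: when partitions are in different states, the most severe
    state is reported. The source text does not specify this.
    """
    copies = list(copies_per_partition)
    if any(c <= 1 for c in copies):   # on only one node
        return "Failure Detected - Unprotected"
    if any(c == 2 for c in copies):   # 2x protection
        return "Degraded"
    if any(c >= 4 for c in copies):   # 4x protection
        return "Overprotected"
    return "Healthy"                  # every partition is on three nodes
```

For example, `partition_panel_color([400, 1200, 900])` returns `"yellow"` because one node exceeds 1000 partitions, and `protection_status([3, 3, 2])` returns `"Degraded"`.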
- DLS: Delete Backend Objects policy
- Checks whether the Data Lifecycle service is examining objects for the DELETE_BACKEND_OBJECTS policy.
- If the status of this panel is HEALTHY, no action is required. If the status is UNHEALTHY or IDLE, no activity for this policy was detected in the last 24 hours.
- If the panel status is UNHEALTHY or IDLE for longer than 24 hours and the backlog of undeleted objects is high in the Undeleted objects backlog panel, check the metrics in the Data Lifecycle: DELETE_BACKEND_OBJECTS policy row in Dashboard 6: Services Health. If the values in the graphs shown in the row are trending up, it is an indication that object deletion is no longer idle and the status of this panel should return to HEALTHY within 4 hours. If the graphs are flat, repair the Data Lifecycle service as described in Repairing services.
- If object deletion remains idle, contact Hitachi Vantara for assistance.
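The troubleshooting steps for this panel amount to a small decision procedure, sketched below in Python. This is a reading aid under stated assumptions, not a product API: the function name, parameters, and returned strings are hypothetical, and the "keep monitoring" outcome for cases the text does not cover (less than 24 hours in the status, or a low backlog) is an assumption.

```python
def dls_next_step(status: str, hours_in_status: float,
                  backlog_high: bool, graphs_trending_up: bool) -> str:
    """Suggest the next step for the DLS: Delete Backend Objects policy panel.

    status             -- panel status: HEALTHY, UNHEALTHY, or IDLE
    hours_in_status    -- how long the panel has shown that status
    backlog_high       -- whether the Undeleted objects backlog panel is high
    graphs_trending_up -- whether the DELETE_BACKEND_OBJECTS graphs in
                          Dashboard 6: Services Health are trending up
    """
    if status == "HEALTHY":
        return "no action required"
    if status not in ("UNHEALTHY", "IDLE"):
        raise ValueError(f"unexpected panel status: {status!r}")
    # Assumption: the text only prescribes action after 24 hours with a
    # high backlog; anything short of that is treated as "keep monitoring".
    if hours_in_status <= 24 or not backlog_high:
        return "keep monitoring"
    if graphs_trending_up:
        return "wait: panel should return to HEALTHY within 4 hours"
    return "repair the Data Lifecycle service; if still idle, contact Hitachi Vantara"
```

For example, a panel that has been IDLE for 30 hours with a high backlog and flat graphs yields the repair step.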
- S3 I/O Balance
- Shows the status of the S3 I/O balance among S3 Gateway nodes. S3 I/O should be distributed evenly among these nodes. Investigate any value showing greater than a 15 percent imbalance.
- Metadata Partition Balance
- Shows the status of the database metadata partition balance among Metadata Gateway nodes. Partitions should be distributed evenly among these nodes. Investigate any node showing greater than a 15 percent imbalance.
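The text does not say how the balance panels compute imbalance. As one plausible reading, the Python sketch below measures each node's deviation from the mean across nodes, as a percentage, and reports nodes that exceed the 15 percent guideline. The formula, function name, and node names are assumptions for illustration only.

```python
from typing import Dict


def imbalanced_nodes(load_by_node: Dict[str, float],
                     threshold_pct: float = 15.0) -> Dict[str, float]:
    """Return nodes whose load deviates from the mean by more than threshold_pct.

    load_by_node maps a node name to its S3 I/O or partition count.
    Assumption: imbalance is the absolute deviation from the mean, as a
    percentage of the mean. The product may define it differently.
    """
    if not load_by_node:
        return {}
    mean = sum(load_by_node.values()) / len(load_by_node)
    if mean == 0:
        return {}  # no load anywhere, so nothing to compare
    deviations = {
        node: abs(value - mean) / mean * 100
        for node, value in load_by_node.items()
    }
    return {node: round(dev, 1)
            for node, dev in deviations.items() if dev > threshold_pct}
```

With loads of 100, 100, and 130 on three hypothetical nodes, the mean is 110, so only the third node (about 18.2 percent above the mean) is reported.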
- Undeleted objects backlog
- Shows the status of undeleted objects in the backend storage. If the status of this panel is Normal, no action is required.
- If the panel contains one of the following messages, action is required:
- Unavailable - Check Management API section for Storage Components in Object Storage Management UI
- The Management API configuration for one or more HCP S Series Node storage components is incorrect. To resolve this issue, see Panel messages.
- Delete backlog is over 100 million objects and parts
- or
- Very high delete backlog (> 1 billion objects and parts)
- There is a backlog of undeleted backend storage objects. The backlog is caused by one of the following factors:
- The DELETE_BACKEND_OBJECTS policy is underscaled
- The Data Lifecycle service DELETE_BACKEND_OBJECTS policy might be underscaled across HCP for cloud scale instances and unable to keep up with the number of objects that require deletion.
If the Data Lifecycle service is underscaled, the performance of the service policies might degrade. For information about how to recognize and fix underscaled services, see Avoiding service underscaling.
- The value in the DLS: Delete Backend Objects policy panel is UNHEALTHY or IDLE for an extended time
- This panel corresponds with the DLS: Delete Backend Objects policy panel. See DLS: Delete Backend Objects policy.
- Bucket Expiration Lifecycle Considerations
- Shows the status Not Required if no action is required, or the message Consider setting Expiration Lifecycle policy if monitored buckets contain objects that require deletion. This scenario occurs when S3 applications do not use the version ID to delete objects.
It is a best practice to set the Expiration Lifecycle policy for buckets listed in the Bucket Expiration Lifecycle policy considerations row in Dashboard 4: Buckets Stats and Activity.
To set the Expiration Lifecycle policy for buckets, see the Hitachi Content Platform for Cloud Scale S3 Console Guide.
- Dashboard 2: System Overview
- The System Overview dashboard shows information about the system capacity, objects, and data movement. This dashboard contains the following rows:
- System Overview
- Shows total system and object capacity, capacity and objects per bucket, total count for client and HCP S Series Node objects, and other capacity and object metrics.
- Recently Moved Data
- Shows the volume of data written and read.
- Object Count Information
- Shows the HCP for cloud scale object count, objects to be deleted, and the object recovery queue. Also shows HCP S Series Node object count and object count breakdown.
- System Capacity Information
- Shows the HCP for cloud scale and HCP S Series Node used capacity, as well as the HCP S Series Node used capacity breakdown.
- Dashboard 3: S3 Activity
- The S3 Activity dashboard shows information about overall S3 request activity as well as information about specific S3 requests. Use this dashboard to monitor the performance of your S3 operations. This dashboard contains the following rows:
- S3 Distribution
- Shows metrics for S3 I/O and operations distribution among nodes.
- System Overall S3 Activity
- Shows various S3 operation activities such as S3 throughput, input/output operations per second (IOPS), and latency.
- System S3 Ingest Activity
- Shows metrics for the data ingested, including throughput and latency metrics.
- System S3 Data GET Activity
- Shows metrics for GET requests, including throughput and latency metrics.
- System S3 Object Delete Activity
- Shows metrics for deleted objects, such as the rate of deletion, and the backlog for undeleted objects.
- S3 Failed Request Details
- Shows metrics for failed S3 requests, including the number of failures by failure type.
- Dashboard 4: Buckets Stats and Activity
- The Bucket Stats and Activity dashboard shows information about the buckets in the system, such as used capacity, number of objects, and average object size. This dashboard contains the following rows:
- Bucket Stats
- Shows metrics for buckets including the total number of buckets, capacity and other statistics per bucket, and the bucket with the maximum GET and PUT throughput.
- Bucket I/O Activity
- Shows metrics for GET and PUT throughput and IOPS by bucket.
- Bucket Expiration Lifecycle Policy Considerations
- Shows metrics for buckets that should be considered for application of the Expiration Lifecycle policy to ensure that objects are deleted based on specified criteria. It is a best practice to set the Expiration Lifecycle policy for buckets listed in this row.
- This row corresponds to the Bucket Expiration Lifecycle Considerations panel in Dashboard 1: System Health.
- Dashboard 5: System Capacity
- The System Capacity dashboard shows capacity information for different components of the system such as the amount of HCP for cloud scale capacity that has been used, and the HCP S Series Node capacity of the system. This dashboard contains the following rows:
- System capacity and object count
- Shows metrics for HCP for cloud scale and HCP S Series Node, including object count, capacity used, and a breakdown of undetected objects.
- Metadata Capacity
- Shows metrics for metadata capacity and space usage by HCP for cloud scale node.
- S-node Detailed Capacity Reports
- Shows metrics for HCP S Series Node objects, including capacity available, used, and total.
- Dashboard 6: Services Health
- The Services Health dashboard is designed for use by Hitachi Vantara Support. This dashboard is described in Services Health dashboard.
- Dashboard 7: Encryption Status
- The Encryption Status dashboard provides information about encryption activity on the system, such as the percentage of objects in the system that are encrypted; the number of client objects on which encryption has started, has completed, or has conflicts; and the data encryption key (DEK) encryption count and rate. This dashboard contains the following rows:
- System Rekey Activity
- Shows metrics for rekey activity at the system level, including the number of rekey events initiated by an administrator.
- Rekey Activity by Node
- Shows metrics for rekey activity at the node level, including the DEK examination and re-encryption rates.
- S3 System Activity
- Shows metrics for rekey S3 operations, including IOPS, latency, and throughput by operation.