By using metrics in formulas, you can derive useful information about the behavior and performance of an HCP for cloud scale system.
Available capacity
The following expression graphs the total capacity of the storage component store54.company.com over time. Information is returned for HCP S Series Node storage components only. The output includes the label store, which identifies the storage component by domain name. The system collects data every five minutes.
storage_total_capacity_bytes{store="store54.company.com"}
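As a variation, the following expression sums total capacity across all HCP S Series Node storage components rather than a single one. This is a sketch based on the metric shown above; it assumes each storage component reports its own storage_total_capacity_bytes series.

```promql
sum(storage_total_capacity_bytes)
```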
The following expression graphs the used capacity of all HCP S Series Node storage components in the system over time. (This is similar to the information displayed on the Storage page.) Information is returned only if all storage components in the system are HCP S Series nodes. The output includes the label aggregate. The system collects data every five minutes.
storage_used_capacity_bytes{store="aggregate"}
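Combining the two metrics gives the percentage of capacity in use. This is a sketch; it assumes storage_total_capacity_bytes also exposes the aggregate label shown above.

```promql
100 * storage_used_capacity_bytes{store="aggregate"} / storage_total_capacity_bytes{store="aggregate"}
```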
Growth of active-object count
The following expression graphs the count of active objects (metadata_clientobject_active_count) over time. (This is similar to the graph displayed on the Storage page.) You can use this formula to determine the growth in the number of active objects.
sum(metadata_clientobject_active_count)
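To graph growth directly rather than the raw count, a sketch using the standard PromQL offset modifier subtracts the count from one day earlier; the one-day window is an arbitrary choice you can adjust.

```promql
sum(metadata_clientobject_active_count) - sum(metadata_clientobject_active_count offset 1d)
```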
Monitoring deletion activities
The metric lifecycle_policy_deleted_backend_objects_count gives the total number of backend objects, including object versions, deleted by the policy DELETE_BACKEND_OBJECTS. You can graph this metric over time to monitor the rate of object deletion. In addition, the following expression graphs the count of deletion activities by the policy.
sum(lifecycle_policy_completed{policy="DELETE_BACKEND_OBJECTS"})
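To see the pace of deletion rather than the running total, a sketch of a per-second deletion rate follows; it assumes lifecycle_policy_deleted_backend_objects_count behaves as a monotonically increasing counter, which rate() requires.

```promql
sum(rate(lifecycle_policy_deleted_backend_objects_count[5m]))
```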
Sum of update queues
The following expression graphs the size of all update queues. You can use this formula to determine whether the system is keeping up with internal events that are processed asynchronously in response to S3 activity. If this graph increases over time, you might want to increase capacity.
sum(update_queue_size)
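If the overall sum is growing, a per-queue breakdown can show which queue is falling behind. This sketch assumes the metric carries a queue-identifying label; the label name queue here is hypothetical and should be replaced with the label your system actually exposes.

```promql
sum by (queue) (update_queue_size)
```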
Changes in S3 put requests over time
The following expression graphs the per-second rate of S3 put requests, summed across all nodes, calculated over one-minute windows. If you remove the label selector {operation="S3PutObjectOperation"}, the expression graphs all S3 requests.
sum(rate(http_s3_servlet_operations_total{operation="S3PutObjectOperation"}[1m]))
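As described above, removing the label selector yields the rate of all S3 requests:

```promql
sum(rate(http_s3_servlet_operations_total[1m]))
```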
Request time service levels
The following expression divides the per-second rate of requests that completed in 10 ms or less (the async_duq_latency_seconds_bucket counter with le="0.01") by the per-second rate of all requests (async_duq_latency_seconds_count), for the operation getWork, and graphs the result over time. You can use this formula to determine the fraction of requests completed within a given amount of time.
sum(rate(async_duq_latency_seconds_bucket{op="getWork",le="0.01"}[1m])) / sum(rate(async_duq_latency_seconds_count{op="getWork"}[1m]))
Here is a sample graph of data from a lightly loaded system:
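To read the result as a percentage rather than a fraction, a sketch that scales the same ratio by 100:

```promql
100 * sum(rate(async_duq_latency_seconds_bucket{op="getWork",le="0.01"}[1m])) / sum(rate(async_duq_latency_seconds_count{op="getWork"}[1m]))
```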
Request time quantile estimates
The following expression estimates the 90th percentile of request latency (async_duq_latency_seconds_bucket), in seconds, for the operation getWork. You can use this formula to estimate the amount of time within which a given percentage of requests complete.
histogram_quantile(0.9, sum(rate(async_duq_latency_seconds_bucket{op="getWork"}[1m])) by (le))
Here is a sample graph of data from a lightly loaded system:
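To focus on tail latency, the same formula works with a higher quantile; this sketch estimates the 99th percentile:

```promql
histogram_quantile(0.99, sum(rate(async_duq_latency_seconds_bucket{op="getWork"}[1m])) by (le))
```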