If services are underscaled for the product's workload, responsiveness and performance can suffer. Scaling up services or adding nodes can alleviate these problems.
A service is underscaled when it is running fewer service instances than its workload requires.
Symptoms of underscaling
The main symptom of underscaling is that performance degrades. You can identify underscaling by monitoring certain metrics.
The Data Lifecycle service manages, among other things, backend garbage collection. An object is deleted in a multistep process: first the object metadata is replaced by a tombstone marker, then the object itself is deleted from backend storage, and finally, according to your retention policy, the tombstone is deleted as well. On a system used for rapid creation and deletion of objects, sometimes called a ring-buffer use case, underscaling can manifest as storage components using more capacity than expected because cleanup is falling behind.
You can monitor Data Lifecycle performance using the metric lifecycle_policy_concurrency, or its rate per minute (rate(lifecycle_policy_concurrency[1m])), which shows how many objects are being processed concurrently per lifecycle type. The metric should be zero when no lifecycle policies are running. If it continues to increase over time, the service might be underscaled.
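The "continues to increase over time" check above can be automated against scraped samples of the metric. The following is a minimal sketch, assuming you already have a list of recent lifecycle_policy_concurrency values; the helper name, sample values, and 50% threshold are illustrative, not part of the product.

```python
# Hypothetical helper: flag a sustained rise in lifecycle_policy_concurrency.
# The metric name comes from the documentation; the samples and the
# 1.5x threshold are illustrative assumptions.

def concurrency_trending_up(samples: list[float]) -> bool:
    """Compare the mean of the newer half of the samples against the
    older half; a clearly higher newer mean suggests the Data Lifecycle
    service is falling behind and may be underscaled."""
    if len(samples) < 4:
        return False  # not enough data to judge a trend
    mid = len(samples) // 2
    older = sum(samples[:mid]) / mid
    newer = sum(samples[mid:]) / (len(samples) - mid)
    return newer > older * 1.5  # arbitrary 50% rise threshold

# Healthy: concurrency hovers near zero between policy runs.
print(concurrency_trending_up([0, 0, 1, 0, 0, 1]))    # False
# Backlog building up across successive scrapes.
print(concurrency_trending_up([2, 3, 5, 8, 12, 18]))  # True
```

In practice you would feed this from your monitoring system's query API rather than hard-coded samples, and tune the threshold to your workload's normal policy-run pattern.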
The S3 Gateway service is designed to handle a set number of concurrent requests. If your workload exceeds that capacity, S3 requests can be delayed. You can monitor S3 Gateway performance using the metric http_s3_servlet_operations_total, or its rate per minute (rate(http_s3_servlet_operations_total[1m])), which shows how many operations are being completed.
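http_s3_servlet_operations_total is a monotonically increasing counter, so the useful number is its rate, which is what the rate() expression above computes. As a sketch of the arithmetic, assuming two scrapes of the counter taken a known number of seconds apart (the function name and sample values are hypothetical):

```python
# Hypothetical sketch: per-second rate from two scrapes of a monotonic
# counter such as http_s3_servlet_operations_total. PromQL's rate() also
# handles counter resets; this minimal version mimics that for a single
# reset (e.g. a service restart zeroing the counter).

def counter_rate(v0: float, v1: float, seconds: float) -> float:
    """Per-second rate between two counter samples taken `seconds` apart."""
    delta = v1 - v0
    if delta < 0:      # counter reset: assume it restarted from zero
        delta = v1
    return delta / seconds

# 6,000 operations completed over a 60-second window -> 100 ops/s.
print(counter_rate(10_000, 16_000, 60))  # 100.0
```

Comparing this rate against the gateway's expected per-instance capacity tells you how close the service is to saturation.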
The Metadata Gateway service manages metadata partitions. If partition counts become very high, as can happen on systems storing large numbers of objects or objects deleted after long retention periods, performance can degrade. You can monitor partitions using the metric mcs_partitions_per_instance.
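Since mcs_partitions_per_instance is reported per instance, a simple check is to compare each instance's count against a threshold and flag the outliers. A minimal sketch, assuming you have the per-instance values in hand; the instance names and the threshold are illustrative, not documented limits:

```python
# Hypothetical sketch: flag Metadata Gateway instances whose partition
# count (mcs_partitions_per_instance) exceeds a chosen threshold.
# Names, counts, and the threshold are illustrative assumptions.

def overloaded_instances(partitions: dict[str, int], threshold: int) -> list[str]:
    """Return, sorted by name, the instances carrying more partitions
    than `threshold`."""
    return sorted(name for name, count in partitions.items() if count > threshold)

counts = {"node-a": 420, "node-b": 180, "node-c": 510}
print(overloaded_instances(counts, threshold=400))  # ['node-a', 'node-c']
```

If several instances consistently exceed the threshold, that points to scaling up the Metadata Gateway service rather than rebalancing alone.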
Fixing underscaled services
Solutions for an underscaled service include:
- Distributing the service onto more nodes (physical instances)
- Installing additional nodes and distributing service instances onto them
For example, scaling up to two S3 Gateway service instances doubles the capacity for S3 request processing. Scaling up the Metadata Gateway service lets the system load-balance partitions from heavily burdened nodes to unburdened nodes and smooths performance. Scaling up the Data Lifecycle service provides additional processing capacity for object lifecycle management and speeds the cleanup of deleted objects.
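The partition load-balancing effect described above can be illustrated with a toy rebalance: after a new instance is added, partitions are spread so every instance carries roughly the same share. This is an illustrative sketch, not the product's actual balancing algorithm, and the node names and counts are made up:

```python
# Illustrative sketch (not the product's actual algorithm): evenly
# redistribute a fixed total of partitions across all Metadata Gateway
# instances, showing why adding an instance relieves burdened nodes.

def rebalance(counts: dict[str, int]) -> dict[str, int]:
    """Return target partition counts with the total spread evenly."""
    total = sum(counts.values())
    base, extra = divmod(total, len(counts))
    # Give `extra` instances one additional partition so the total is preserved.
    return {name: base + (1 if i < extra else 0)
            for i, name in enumerate(sorted(counts))}

# Two busy nodes plus a freshly added, empty "node-c": rebalancing
# evens the load across all three.
print(rebalance({"node-a": 600, "node-b": 600, "node-c": 0}))
# {'node-a': 400, 'node-b': 400, 'node-c': 400}
```

The same intuition applies to the S3 Gateway and Data Lifecycle services: each added instance contributes another unit of request-processing or cleanup capacity.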