Best practices for system sizing and scaling

Content Platform for Cloud Scale Administration Guide

Version: 2.6.x
Part Number: MK-HCPCS008-10

Using master instances

These best practices apply specifically to scaling product services across master instances (nodes). If you do not plan to scale product services across master instances, you can skip this section.

Running product services on master instances can maximize the use of all physical resources in the site. However, because master instances are critical to the orchestration and management of HCP for cloud scale, it is important to avoid anything that might degrade them. The following are best practices for master instances.

First, do not over-provision master instances:

  • CPU/RAM: HCP for cloud scale can partition the usage of these resources, but an instance can still exhaust them.
  • Network: HCP for cloud scale cannot control usage of network resources.
  • Disk space: HCP for cloud scale cannot control usage of disk space. One service might consume all the free space and thus affect other services. This is most likely to happen with the Metadata Gateway and Message Queue services, but all stateful services consume disk space.

Second, consider the impact of running product services on master instances:

  • When scaling product services across master instances, take into account the resources (CPU, memory, and so on) already consumed there by system services.
  • If product services run into an issue, resolution could require restarting services (including system services) or even restarting the host (operating system). This can affect system services and thus the operation of the entire site.
  • When designing a solution, consider hardware failures and other causes of outages. If you plan to use master instances for product services, ensure there are enough instances (nodes) that the cluster can still manage both existing objects and the anticipated object throughput (additions and deletions) if one or even two hosts fail.

Resource impact of stateful and stateless services

You should consider the differing impacts of stateful and stateless services on resource consumption when planning how to scale system and worker service instances across your HCP for cloud scale system.

A stateful service is a service that permanently saves data to disk. Any data stored by a stateful service is critical to the operation of the system and the integrity of customer data, so it is protected by keeping three identical copies. Each copy of the data resides on a different instance (node), and each instance of a stateful service runs on a different physical instance.

Stateful services are also persistent. A persistent service runs on a specific instance (node) that you designate. If an instance of a persistent service fails, HCP for cloud scale restarts the instance on the same node. (In other words, HCP for cloud scale does not automatically bring up a new stateful service instance on a different node.)

Because every stateful service is also persistent, a failure or even a planned outage of an instance (node) affects the copies of data held by all stateful service instances that were running on that instance (node). For more information, see Service failure recovery and Scaling Metadata Gateway instances.

Also, stateful services typically require more computing power to process, store, and read the data on disk securely, efficiently, and with high performance.

A stateless service is a service that does not save data to disk. Stateless services are usually also floating. A floating service can run on any node assigned to a pool of instances associated with the floating service. If an instance of a floating service fails, HCP for cloud scale restarts the instance on any node in the instance pool. Therefore, stateless floating service instances have less resource impact, and are typically easier to manage and recover, than stateful persistent service instances.

The impact of stateful and stateless services is as follows:

  • The number of floating (stateless) service instances affects the speed of operations. If there are too few of them, operations slow down.
  • The number of persistent (stateful) service instances affects both the speed of operations and availability. If there are too few of them, operations slow down and can fail entirely.
  • Not every stateful service is critical to application-facing operations (for example, S3 operations).
  • Not every stateful service is resource intensive.
  • Not every stateless service is lightweight in its resource usage.

The following services can be resource intensive and should be scaled across worker and master instances carefully:

  • Data Lifecycle
  • Metadata Gateway
  • Message Queue
  • Mirror-In-Policy
  • Mirror-Out-Policy
  • S3 Notifications
  • S3 Gateway

Sizing and scaling models

Consider the following use cases:

  • Ring buffer: Data has a short life cycle (a few weeks) and will be deleted after a certain period of time.
  • Synchronization: Buckets are configured to synchronize (mirror) data out to or in from external services.
  • Performance sensitive: The planned usage must achieve a specific level of performance (an SLA).

If your planned usage of the HCP for cloud scale system matches one of these use cases, size and scale it as follows:

  • The minimum cluster size is six instances (nodes).
  • With fewer than eight instances (nodes), do not scale resource-intensive services across more than one master node.
  • With eight or more instances (nodes), do not scale resource-intensive services across any master nodes.

If, however, your planned usage of the HCP for cloud scale system does not match any of these use cases, size and scale it as follows:

  • The minimum cluster size is four instances (nodes).
  • With four or five instances (nodes), do not scale resource-intensive services across more than two master nodes.
  • With six to 11 instances (nodes), do not scale resource-intensive services across more than one master node.
  • With 12 or more instances (nodes), do not scale resource-intensive services across any master nodes.
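The rules above can be summarized as a small decision helper. This is an illustrative sketch only; the function name and return convention are not part of the product, and the cutoffs simply restate the guidance in the two lists:

```python
def max_master_nodes_for_heavy_services(nodes: int, demanding_use_case: bool) -> int:
    """Return how many master nodes may host resource-intensive services.

    demanding_use_case: True if the planned usage matches the ring buffer,
    synchronization, or performance-sensitive use case.
    Raises ValueError if the cluster is below the minimum size.
    """
    if demanding_use_case:
        if nodes < 6:
            raise ValueError("minimum cluster size for these use cases is six nodes")
        return 1 if nodes < 8 else 0  # fewer than eight: at most one master node
    if nodes < 4:
        raise ValueError("minimum cluster size is four nodes")
    if nodes <= 5:
        return 2   # four or five nodes: at most two master nodes
    if nodes <= 11:
        return 1   # six to 11 nodes: at most one master node
    return 0       # 12 or more nodes: no master nodes
```

For example, a six-node cluster serving a performance-sensitive workload should place resource-intensive services on at most one master node, while a 12-node general-purpose cluster should keep them off the master nodes entirely.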

Scaling guidance for partition counts

You should monitor the partition count per node on your HCP for cloud scale system and add nodes proactively to keep the count below its limit. Information about partition health, including the partition count per node, is displayed on the Grafana dashboards.

On the HCP for cloud scale reference hardware (DS-120), keep the partition count under 1500 partitions per node. Beyond this limit, system performance may degrade.
Important: When the partition count reaches 1500 partitions per node, system stability may be affected.
Add nodes to reduce the partition count per node and keep it under 1500.

For other scenarios, such as nodes running on different hardware or on VMs, scale the partition limit up or down according to hardware capability; the most significant factor is core count (40 on the DS-120).
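As a rough estimate, the DS-120 baseline can be scaled linearly with core count. The linear model below is an assumption for illustration, not a product formula; validate any derived limit against your own workload:

```python
def estimated_partition_limit(cores: int,
                              reference_cores: int = 40,
                              reference_limit: int = 1500) -> int:
    """Estimate a per-node partition limit by scaling the DS-120 baseline
    (1500 partitions on 40 cores) linearly with the node's core count."""
    return reference_limit * cores // reference_cores
```

For example, under this assumption a 20-core node would be limited to roughly 750 partitions, and an 80-core node to roughly 3000.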

In addition to being available on the Grafana dashboards, the partition count per node is exposed as the Prometheus metric mcs_partitions_per_instance.
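Because the metric is exposed to Prometheus, it can also drive proactive alerting. The following rule is a sketch: the group and alert names, the early-warning threshold of 1400 (chosen below the 1500 limit to leave time to add nodes), and the `for` duration are illustrative choices, not product defaults:

```yaml
groups:
  - name: hcp-cs-partitions
    rules:
      - alert: PartitionCountNearLimit
        # mcs_partitions_per_instance is the per-node partition count
        expr: mcs_partitions_per_instance > 1400
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: >-
            Partition count on {{ $labels.instance }} is approaching the
            1500-per-node limit; plan to add nodes.
```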