Demystifying Horizontal Scaling: Your Guide to Kubernetes HPA

In the dynamic world of Kubernetes, ensuring your applications can handle fluctuating traffic is paramount. One of the key mechanisms to achieve this is horizontal scaling. But what is horizontal scaling in the context of Kubernetes, and how can you leverage it effectively? This article delves deep into the concept of horizontal scaling, specifically focusing on Kubernetes Horizontal Pod Autoscaling (HPA), a powerful tool designed to automatically adjust your application’s resources to meet demand.

Horizontal scaling, in essence, is about scaling out your application by increasing the number of instances (or Pods in Kubernetes terminology) that are running. Imagine a website experiencing a sudden surge in visitors. With horizontal scaling, instead of trying to make your existing servers handle more load (which would be vertical scaling), you add more servers to distribute the traffic. In Kubernetes, this translates to deploying more Pods to manage the increased workload.

This approach contrasts with vertical scaling, where you would increase the resources (CPU, memory) allocated to the existing Pods. While vertical scaling has its place, horizontal scaling is often favored for its ability to provide greater elasticity, fault tolerance, and scalability in distributed systems like Kubernetes.

If the load on your application decreases, horizontal scaling also works in reverse. When the number of Pods exceeds the configured minimum and the demand lessens, the HorizontalPodAutoscaler instructs Kubernetes to scale back down, optimizing resource utilization and cost efficiency.

It’s crucial to understand that horizontal pod autoscaling is not a universal solution. It is not applicable to Kubernetes objects that are inherently not scalable, such as DaemonSets, which are designed to run a single Pod on each node.

The magic behind horizontal pod autoscaling in Kubernetes lies in the HorizontalPodAutoscaler (HPA), a Kubernetes API resource and controller. This controller operates within the Kubernetes control plane, continuously monitoring metrics like CPU utilization, memory utilization, or custom metrics you define. Based on these metrics, the HPA controller automatically adjusts the desired scale of your target workload, be it a Deployment, StatefulSet, or other scalable resource.
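
To make this concrete, here is a minimal sketch of an HPA manifest, assuming a hypothetical Deployment named my-app; it asks the controller to keep average CPU utilization near 50% across 2 to 10 replicas:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app            # hypothetical Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50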

To illustrate the practical application of HPA, Kubernetes provides a comprehensive walkthrough example. This guide offers hands-on experience in setting up and utilizing horizontal pod autoscaling for your applications.

Unpacking the Mechanics: How HorizontalPodAutoscaler Works

To grasp what horizontal autoscaling is in practice, it’s essential to understand the inner workings of the HorizontalPodAutoscaler. Think of the HPA as a smart traffic manager that dynamically adjusts the number of application instances based on real-time demand.

graph BT
    hpa[Horizontal Pod Autoscaler] --> scale[Scale]
    subgraph rc[RC / Deployment]
        scale
    end
    scale --- pod1[Pod 1]
    scale --- pod2[Pod 2]
    scale --- pod3[Pod N]
    classDef hpa fill:#D5A6BD,stroke:#1E1E1D,stroke-width:1px,color:#1E1E1D;
    classDef rc fill:#F9CB9C,stroke:#1E1E1D,stroke-width:1px,color:#1E1E1D;
    classDef scale fill:#B6D7A8,stroke:#1E1E1D,stroke-width:1px,color:#1E1E1D;
    classDef pod fill:#9FC5E8,stroke:#1E1E1D,stroke-width:1px,color:#1E1E1D;
    class hpa hpa;
    class rc rc;
    class scale scale;
    class pod1,pod2,pod3 pod

The HorizontalPodAutoscaler operates as a control loop, periodically checking and adjusting the scale of your application. This interval, by default 15 seconds, is configurable via the --horizontal-pod-autoscaler-sync-period parameter of the kube-controller-manager.

In each cycle, the controller manager queries resource utilization metrics against the specifications defined in each HorizontalPodAutoscaler configuration. It identifies the target resource (defined by scaleTargetRef), selects the relevant Pods using the target resource’s .spec.selector labels, and retrieves metrics from either the resource metrics API (for per-pod resource metrics) or the custom metrics API (for other metrics).

Let’s break down metric retrieval:

  • Per-pod resource metrics (like CPU): The controller fetches metrics for each Pod from the resource metrics API. If a target utilization is set, it calculates utilization as a percentage of the resource request defined for containers within each Pod. If a target raw value is specified, it uses the raw metric values directly. The controller then averages these utilization or raw values across all targeted Pods to derive a ratio for scaling.

    It’s crucial to note that if resource requests are not set for some containers in a Pod, CPU utilization for that Pod remains undefined, and the autoscaler will disregard this metric for that Pod. The algorithm details section below provides a more in-depth explanation of the autoscaling algorithm.

  • Per-pod custom metrics: The process mirrors per-pod resource metrics, but operates with raw values instead of utilization percentages.

  • Object and external metrics: For these metrics, a single metric is fetched, describing the object. This metric is compared to the target value to generate a scaling ratio. In the autoscaling/v2 API, this value can optionally be divided by the number of Pods before comparison (an example Object metric source is sketched below).
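
As an illustration of an Object metric source, here is a hedged sketch assuming a custom metrics adapter exposes a requests-per-second metric for a hypothetical Ingress named main-route:

type: Object
object:
  metric:
    name: requests-per-second   # hypothetical metric from a custom metrics adapter
  describedObject:
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    name: main-route            # hypothetical Ingress name
  target:
    type: Value
    value: 2k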

Typically, HorizontalPodAutoscalers are configured to fetch metrics from aggregated APIs such as metrics.k8s.io, custom.metrics.k8s.io, and external.metrics.k8s.io. The metrics.k8s.io API is commonly provided by Metrics Server, an add-on that needs to be deployed separately. For detailed information on resource metrics, refer to the Metrics Server documentation.

For insights into the stability and support of these APIs, refer to the Support for metrics APIs section.

The HorizontalPodAutoscaler controller interacts with workload resources (Deployments, StatefulSets, etc.) that support scaling. These resources expose a scale subresource, an interface enabling dynamic adjustment of replicas and examination of their current states. For a broader understanding of Kubernetes API subresources, consult the Kubernetes API Concepts documentation.
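
For instance, you can read a workload’s scale subresource through the raw API or adjust it with kubectl scale (the Deployment name here is hypothetical):

# Read the scale subresource of a Deployment directly:
kubectl get --raw /apis/apps/v1/namespaces/default/deployments/my-app/scale

# Set the replica count through the same subresource:
kubectl scale deployment/my-app --replicas=3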

Delving into the Algorithm: The Scaling Logic

To truly understand what horizontal autoscaling means algorithmically, let’s examine the core logic of the HorizontalPodAutoscaler controller. At its heart, the HPA operates on a simple ratio: the current metric value compared to the desired metric value.

The fundamental formula guiding the desired replica count is:

$$\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil$$

For example, if the current CPU utilization (currentMetricValue) is 200m and the desired utilization (desiredMetricValue) is 100m, the number of replicas will double because \( 200.0 \div 100.0 = 2.0 \). Conversely, if the current value is 50m, the replica count will halve as \( 50.0 \div 100.0 = 0.5 \). The control plane is designed to avoid unnecessary scaling actions when the ratio is close to 1.0, within a configurable tolerance (default is 0.1).
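
As a worked example under the default tolerance, assume 3 current replicas and a desired value of 100m. A current value of 105m gives a ratio of 1.05; since \( |1.05 - 1.0| \le 0.1 \), no scaling occurs. A current value of 130m, however, yields:

$$\text{desiredReplicas} = \left\lceil 3 \times \frac{130.0}{100.0} \right\rceil = \lceil 3.9 \rceil = 4$$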

When targetAverageValue or targetAverageUtilization is specified, currentMetricValue is calculated by averaging the metric across all Pods within the HPA’s target.

Before finalizing scaling decisions, the controller considers several factors: missing metrics and Pod readiness. Pods with deletion timestamps (undergoing shutdown) are ignored, and failed Pods are discarded.

Pods lacking metrics are temporarily set aside. These Pods will be factored in later to refine the scaling calculation.

Specifically for CPU-based scaling, Pods that are not yet ready (still initializing or potentially unhealthy) or have outdated metric data from before becoming ready are also set aside.

Due to technical limitations in precisely determining the exact moment a Pod becomes ready, the HPA controller uses configurable time windows. A Pod is considered “not yet ready” if it is unready and transitioned to ready within a short window after startup (configurable with --horizontal-pod-autoscaler-initial-readiness-delay, default 30 seconds). Once a Pod has become ready, any transition to ready is treated as the first readiness if it occurred within a longer window after startup (configured by --horizontal-pod-autoscaler-cpu-initialization-period, default 5 minutes).

The base scale ratio (\( \text{currentMetricValue} \div \text{desiredMetricValue} \)) is then computed using the remaining, non-discarded, non-set-aside Pods.

If any metrics were missing, the controller recalculates the average more conservatively. For scale-down scenarios, it assumes missing-metric Pods are consuming 100% of the desired value. For scale-up, it assumes 0% consumption. This conservative approach dampens potential scaling fluctuations.

Similarly, if not-yet-ready Pods are present and a scale-up would occur without considering them, the controller conservatively assumes these Pods consume 0% of the desired metric, further mitigating scale-up magnitude.

After incorporating not-yet-ready Pods and missing metrics, the usage ratio is recalculated. If this new ratio reverses the scaling direction or falls within the tolerance, no scaling action is taken. Otherwise, the new ratio determines the adjustment to the number of Pods.
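
To see how this plays out with assumed numbers: suppose four Pods are targeted, the desired average is 100m, three Pods report 120m each, and one is missing metrics. The base ratio from the reporting Pods alone is \( 120.0 \div 100.0 = 1.2 \), indicating a scale-up. Recomputing conservatively with the missing Pod assumed at 0% of the desired value gives:

$$\frac{(120 + 120 + 120 + 0)/4}{100} = 0.9$$

Because the refined ratio reverses the scaling direction, no scaling action is taken.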

It’s important to note that the original average utilization value, before considering not-yet-ready Pods or missing metrics, is reported in the HorizontalPodAutoscaler status, even when the refined usage ratio is used for scaling decisions.

When multiple metrics are defined in an HPA, this calculation is performed for each metric. The highest desired replica count among all metrics is then selected. If any metric cannot be translated into a desired replica count (e.g., due to API errors) and a scale-down is indicated by the retrievable metrics, scaling is skipped entirely. However, the HPA can still scale up if other metrics suggest a desiredReplicas value exceeding the current count.

Finally, just before scaling the target, the scale recommendation is recorded. The controller considers all recommendations within a configurable window, selecting the highest recommendation from that window. This window, configured using --horizontal-pod-autoscaler-downscale-stabilization (default 5 minutes), ensures gradual scale-downs, smoothing out rapid metric fluctuations.

API Object: Defining Your Horizontal Autoscaler

The Horizontal Pod Autoscaler is a Kubernetes API resource residing within the autoscaling API group. The stable version is found in the autoscaling/v2 API, which includes support for memory and custom metrics. New fields introduced in autoscaling/v2 are preserved as annotations when used with autoscaling/v1.

When creating a HorizontalPodAutoscaler API object, ensure the specified name adheres to DNS subdomain name conventions. Detailed API object information can be found in the HorizontalPodAutoscaler Object documentation.

Navigating Workload Scale Stability

Managing replicas with HPA can sometimes lead to frequent fluctuations in the number of replicas, a phenomenon known as thrashing or flapping. This is due to the inherent dynamic nature of the metrics being evaluated. It’s analogous to the concept of hysteresis in control systems.

Autoscaling in Rolling Updates

Kubernetes facilitates rolling updates for Deployments. In such scenarios, the Deployment controller manages the underlying ReplicaSets. When autoscaling a Deployment, the HPA is bound to the Deployment and controls its replicas field. The Deployment controller then distributes these replicas across the ReplicaSets during and after the rollout.

For StatefulSets with autoscaling, the StatefulSet directly manages its Pods, without an intermediary like ReplicaSets. Rolling updates in autoscaled StatefulSets are handled directly by the StatefulSet controller.

Leveraging Resource Metrics for Scaling

Any HPA target can be scaled based on the resource usage of its Pods. Defining resource requests like cpu and memory in the Pod specification is essential for utilization-based scaling. The HPA controller uses these requests to calculate resource utilization and trigger scaling actions.
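
For example, the Pod template in your workload might declare requests like the following sketch (the container name, image, and values are illustrative); without such requests, utilization for that Pod is undefined and the metric is disregarded:

spec:
  containers:
  - name: application
    image: registry.example.com/my-app:1.0   # hypothetical image
    resources:
      requests:
        cpu: 250m
        memory: 128Mi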

To utilize resource utilization-based scaling, define a metric source like this:

type: Resource
resource:
  name: cpu
  target:
    type: Utilization
    averageUtilization: 60

This configuration instructs the HPA controller to maintain an average CPU utilization of 60% across the Pods in the scaling target. Utilization is calculated as the ratio of current resource usage to the requested resources for the Pod. Refer to the Algorithm section for a detailed breakdown of utilization calculation and averaging.

Important Note:

Summing resource usages across all containers in a Pod can sometimes obscure individual container resource usage. This might result in situations where a single container is experiencing high utilization, but the overall Pod utilization remains within acceptable limits, preventing HPA from scaling out.

Container Resource Metrics: Granular Scaling Control

FEATURE STATE: Kubernetes v1.30 [stable] (enabled by default: true)

The HorizontalPodAutoscaler API extends its capabilities with container metric sources. This allows HPA to monitor resource usage of specific containers within Pods for scaling decisions. This granular control is particularly useful when you want to scale based on the resource consumption of the most critical containers, ignoring sidecar containers or other less relevant components.

If you update your target resource with a new Pod specification containing a different set of containers, remember to revise the HPA specification if the newly added container should also be considered for scaling. If the specified container in the metric source is absent or present only in a subset of Pods, those Pods are ignored, and the scaling recommendation is recalculated. See Algorithm for more details on calculation.

To scale based on container resources, define a metric source as follows:

type: ContainerResource
containerResource:
  name: cpu
  container: application
  target:
    type: Utilization
    averageUtilization: 60

In this example, the HPA controller scales the target to maintain an average CPU utilization of 60% for the application container across all Pods.

Important Note:

When renaming a container tracked by HPA, a specific update sequence is recommended to ensure continuous and effective scaling. Before updating the resource defining the container (e.g., Deployment), update the HPA to track both the old and new container names. This ensures uninterrupted scaling calculations during the update process.

Once the container name change is rolled out to the workload resource, remove the old container name from the HPA specification to finalize the update.
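
A hedged sketch of the intermediate state, assuming the container application is being renamed to application-v2: the HPA temporarily lists both names, and Pods lacking one of the containers are simply ignored for that entry:

metrics:
- type: ContainerResource
  containerResource:
    name: cpu
    container: application       # old container name, still on older Pods
    target:
      type: Utilization
      averageUtilization: 60
- type: ContainerResource
  containerResource:
    name: cpu
    container: application-v2    # hypothetical new container name
    target:
      type: Utilization
      averageUtilization: 60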

Scaling with Custom Metrics

FEATURE STATE: Kubernetes v1.23 [stable]

(previously available as beta in autoscaling/v2beta2 API)

Using the autoscaling/v2 API version, you can configure HPA to scale based on custom metrics, metrics not built into Kubernetes. The HPA controller retrieves these custom metrics from the Kubernetes API.
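
For example, a per-pod custom metric source might look like this sketch, assuming a metrics adapter exposes a packets-per-second metric for the targeted Pods:

type: Pods
pods:
  metric:
    name: packets-per-second    # hypothetical custom metric
  target:
    type: AverageValue
    averageValue: 1k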

Refer to Support for metrics APIs for API requirements.

Scaling Based on Multiple Metrics

FEATURE STATE: Kubernetes v1.23 [stable]

(previously available as beta in autoscaling/v2beta2 API)

The autoscaling/v2 API version allows you to define multiple metrics for HPA scaling. The controller evaluates each metric and proposes a scale based on each. HPA then applies the maximum recommended scale across all metrics, respecting the configured maximum replica limit.
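
Combining the earlier examples, a multi-metric specification might look like the following sketch (the custom metric is again hypothetical); the controller computes a desired replica count for each entry and applies the largest:

metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 60
- type: Pods
  pods:
    metric:
      name: packets-per-second   # hypothetical custom metric
    target:
      type: AverageValue
      averageValue: 1k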

Understanding Metrics API Support

By default, the HPA controller fetches metrics from a set of aggregated APIs. Cluster administrators must ensure these APIs are accessible for HPA to function correctly:

  • metrics.k8s.io (resource metrics), typically served by Metrics Server
  • custom.metrics.k8s.io (custom metrics), served by an adapter from your metrics solution vendor
  • external.metrics.k8s.io (external metrics), also served by a third-party adapter

For detailed information on these metrics paths and their differences, consult the design proposals for HPA V2, custom.metrics.k8s.io, and external.metrics.k8s.io.

For practical examples, see the walkthrough for custom metrics and the walkthrough for external metrics.

Configurable Scaling Behavior: Fine-tuning Autoscaling

FEATURE STATE: Kubernetes v1.23 [stable]

(previously available as beta in autoscaling/v2beta2 API)

The v2 HorizontalPodAutoscaler API introduces the behavior field (see the API reference), which lets you configure distinct scale-up and scale-down behaviors via its scaleUp and scaleDown settings.

This feature allows you to define a stabilization window to mitigate replica count flapping and scaling policies to control the rate of replica changes during scaling.

Scaling Policies: Controlling the Pace of Change

Scaling policies, defined within the behavior section, allow fine-grained control over scaling actions. When multiple policies are specified, the policy permitting the most significant change is selected by default. Here’s an example illustrating scale-down policies:

behavior:
  scaleDown:
    policies:
    - type: Pods
      value: 4
      periodSeconds: 60
    - type: Percent
      value: 10
      periodSeconds: 60

periodSeconds specifies the length of the past time window for which the policy must hold true; its maximum value is 1800 (30 minutes). The first policy (Pods) limits scale-down to 4 replicas per minute. The second policy (Percent) restricts scale-down to 10% of current replicas per minute.

By default, the policy allowing the largest change prevails. In this case, the Percent policy is only active when replicas exceed 40. Below 40 replicas, the Pods policy takes effect. For example, scaling down from 80 to 10 replicas initially removes 8 replicas (10% of 80). In the next iteration, with 72 replicas, 10% is 7.2, which rounds up to 8 replicas. This recalculation happens in each autoscaler loop. Once replicas drop below 40, the Pods policy limits reduction to 4 replicas at a time.

The selectPolicy field can modify policy selection. Setting it to Min chooses the policy with the smallest replica change. Disabled completely disables scaling in that direction.

Stabilization Window: Smoothing Out Fluctuations

The stabilization window is crucial for preventing replica count flapping due to metric volatility. The autoscaling algorithm uses this window to establish a prior desired state, avoiding unnecessary workload scale changes.

Example stabilization window for scaleDown:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300

When metrics indicate a scale-down, the algorithm considers desired states from the past 5 minutes and selects the highest value. This rolling maximum approach prevents frequent, short-lived Pod removals and recreations.

Default Behavior: Sensible Defaults for Ease of Use

Not all fields need to be specified for custom scaling. Unspecified fields inherit default values, mirroring the existing HPA algorithm behavior.

Default behavior configuration:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
    - type: Pods
      value: 4
      periodSeconds: 15
    selectPolicy: Max

For scale-down, the stabilization window is 300 seconds (or --horizontal-pod-autoscaler-downscale-stabilization flag value). A single policy allows removing 100% of replicas, enabling scaling down to the minimum. Scale-up has no stabilization window, triggering immediate scaling. Two policies allow adding up to 4 Pods or 100% of current replicas every 15 seconds until steady state is reached.

Example: Custom Downscale Stabilization Window

To set a 1-minute downscale stabilization window:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 60

Example: Limiting Scale Down Rate

To limit Pod removal to 10% per minute:

behavior:
  scaleDown:
    policies:
    - type: Percent
      value: 10
      periodSeconds: 60

To further restrict removal to a maximum of 5 Pods per minute and prioritize the smaller change:

behavior:
  scaleDown:
    policies:
    - type: Percent
      value: 10
      periodSeconds: 60
    - type: Pods
      value: 5
      periodSeconds: 60
    selectPolicy: Min

Example: Disabling Scale Down

To prevent downscaling entirely:

behavior:
  scaleDown:
    selectPolicy: Disabled

kubectl Support for HorizontalPodAutoscaler

HorizontalPodAutoscaler is fully supported by kubectl. Use kubectl create with a manifest to create new autoscalers, kubectl get hpa to list them, kubectl describe hpa for details, and kubectl delete hpa to remove them.

The kubectl autoscale command provides a convenient way to create HPA objects. For example, kubectl autoscale rs foo --min=2 --max=5 --cpu-percent=80 creates an autoscaler for ReplicaSet foo with 80% target CPU utilization and a replica range of 2 to 5.
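
A brief illustrative session, assuming a Deployment named my-app already exists:

# Create an HPA targeting 60% CPU utilization across 2 to 10 replicas:
kubectl autoscale deployment my-app --min=2 --max=10 --cpu-percent=60

# List and inspect the autoscaler:
kubectl get hpa
kubectl describe hpa my-app

# Remove it when no longer needed:
kubectl delete hpa my-app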

Implicit Maintenance-Mode Deactivation

You can implicitly deactivate HPA without modifying its configuration. Setting the target’s desired replica count to 0 while the HPA’s minimum replica count is above 0 will pause HPA adjustments (setting ScalingActive Condition to false) until reactivation via manual adjustment of the target’s or HPA’s replica counts.
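
For example, assuming an HPA with a minimum replica count above 0 is bound to a Deployment named my-app, the following pauses and later resumes autoscaling:

# Pause: scale the target to zero; the HPA sets ScalingActive to false
kubectl scale deployment/my-app --replicas=0

# Resume: restore a non-zero replica count; the HPA takes over again
kubectl scale deployment/my-app --replicas=1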

Migrating Deployments and StatefulSets to Horizontal Autoscaling

When enabling HPA, it’s recommended to remove spec.replicas from Deployment and StatefulSet manifests. Leaving it can cause conflicts and thrashing when applying updates via kubectl apply -f deployment.yaml, as Kubernetes will try to enforce the spec.replicas value, overriding HPA’s scaling decisions.

Removing spec.replicas might cause a one-time reduction of the Pod count to 1 (the default). To avoid this, follow these steps when modifying deployments:

  1. Run kubectl apply edit-last-applied deployment/<deployment_name>.
  2. In the editor, remove spec.replicas, then save and exit. No Pod count changes occur at this step.
  3. Remove spec.replicas from the manifest and commit the change to source control.
  4. From this point on, kubectl apply -f deployment.yaml works as expected without replica conflicts.

When using Server-Side Apply, follow the transferring ownership guidelines for this specific scenario.

What’s Next? Expanding Your Autoscaling Strategy

If you’re implementing autoscaling, consider complementing HPA with node autoscaling to ensure you have the right number of nodes to accommodate your scaled applications.

For further exploration of the HorizontalPodAutoscaler, consult the Kubernetes API reference and the walkthroughs referenced throughout this article.

This comprehensive guide has illuminated what horizontal scaling is in Kubernetes and how the Horizontal Pod Autoscaler empowers you to dynamically manage your application’s resources, ensuring optimal performance and efficiency. By understanding its mechanisms and configurations, you can effectively leverage HPA to build resilient and scalable applications in your Kubernetes environment.
