Autoscaling is only available for Kubernetes clusters.
This documentation is deprecated; please check here for its new home.
The Kubernetes Cluster Autoscaler observes the resource requests made by pods in the cluster, and:

- Adds nodes if pods are stuck in the Pending state due to a lack of CPU or memory.
- Removes nodes which have no pods running on them.
- Rebalances pods in the cluster to improve overall resource usage.
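For example, a pod like the following declares the CPU and memory requests that the autoscaler reacts to; if no node has room for these requests, the pod stays Pending and triggers a scale-up. This is a minimal sketch; the pod name, image, and request values are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod        # hypothetical name
spec:
  containers:
    - name: app
      image: nginx      # illustrative image
      resources:
        requests:
          cpu: "2"      # the autoscaler acts on declared requests,
          memory: 4Gi   # not on actual usage
```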
Autoscaling is not enabled by default on the cluster and requires one label to be specified during cluster creation:
```
$ openstack coe cluster create <name> --cluster-template <cluster-template> \
    --node-count 4 --merge-labels \
    --labels auto_scaling_enabled=true
```
If using Kubernetes v1.19 or later, you must also set a maximum node count for the default-worker node group. Check the node group autodiscovery section to see how the autoscaler is configured on Kubernetes v1.19 and later.
For Kubernetes versions before v1.19, the autoscaler is not able to scale individual node groups, so no extra node groups should be created if you are using autoscaling. In this case the minimum and maximum node counts default to 1 and `--node-count` respectively, if not set using these labels:

```
--labels min_node_count=3 --labels max_node_count=7
```
That's it! If you check your cluster autoscaler (CA) pod logs, you should see something like this:

```
$ kubectl -n kube-system logs -l app=cluster-autoscaler
I0621 09:31:00.801171 1 leaderelection.go:217] attempting to acquire leader lease kube-system/cluster-autoscaler...
I0621 09:31:00.877710 1 leaderelection.go:227] successfully acquired lease kube-system/cluster-autoscaler
I0621 09:31:02.962222 1 magnum_manager_heat.go:293] For stack ID 366c1341-7af9-46e5-9c5f-86c107d6f0b1, stack name is dtomasgu-ca-xckx7eo7kk3x
I0621 09:31:03.254374 1 magnum_manager_heat.go:310] Found nested kube_minions stack: name dtomasgu-ca-xckx7eo7kk3x-kube_minions-gxpdgkj4tzzn, ID 49d3e4a3-555f-4781-91c0-18f67c6cfdb0
```
## Node group autodiscovery
Kubernetes v1.19 and later clusters are able to scale multiple node groups and use node group autodiscovery, which is enabled by default. This means the autoscaler will scale any node group in the cluster that satisfies these two conditions:
- The node group has a `role` that the autoscaler is configured to look for.
- The node group has its maximum node count property set.
By default the autoscaler will scale any node group with the role "worker" to match the default-worker node group, but the maximum node count must also be set after the cluster has been created.
```
$ openstack coe nodegroup update <cluster-name> default-worker replace /max_node_count=7
```
This can be changed at any time and the autoscaler will use the new value. The minimum node count for a group can also be changed in the same way, with `/min_node_count=2`.
Setting the minimum node count to 0 is not supported yet.
To stop the autoscaler from scaling a particular node group, it is enough to simply unset the maximum node count.
```
$ openstack coe nodegroup update <cluster-name> default-worker remove /max_node_count
```
To configure the autoscaler to match node groups with other roles, the deployment must be edited.
```
$ kubectl -n kube-system edit deployment cluster-autoscaler
```
Look for the line with `--node-group-auto-discovery` and add any new roles, comma separated. Any node groups created with `--role new-role` are then able to be autoscaled, as long as they have a maximum node count set.
To see which node groups are being scaled by the autoscaler, check the output of this command:
```
$ kubectl -n kube-system describe cm cluster-autoscaler-status
```
It will report the status for the entire cluster, and then list each node group that is being autoscaled. Once the default-worker node group has had its maximum node count set, it will show up in the output like this:

```
....
NodeGroups:
  Name:       default-worker-95d772e5
  Health:     Healthy (ready=4 unready=0 notStarted=0 registered=4 cloudProviderTarget=4 (minSize=1, maxSize=7))
  ScaleUp:    NoActivity (ready=4 cloudProviderTarget=4)
  ScaleDown:  CandidatesPresent (candidates=2)
```
## Autoscaling GPU node groups
GPUs are a special case because the nodes have an additional resource type which can be requested by pods. Unlike CPU and memory, this resource is not set on the nodes immediately, as the GPU drivers must be installed first. To let the cluster autoscaler know that it should wait for the GPU resource to be initialised before trying to scale up again, an extra label has to be added to the node as soon as it is created.
This needs to be done by setting the `kubelet_options` label when creating the node group, like so:

```
$ openstack coe nodegroup create <cluster> <node-group-name> \
    --role gpu \
    --merge-labels \
    --labels kubelet_options="--feature-gates=RemoveSelfLink=false --node-labels=magnum.openstack.org/gpu=true" \
    --flavor g2.xlarge \
    --node-count 1 \
    --min-nodes 1 \
    --max-nodes 5
```
The `--feature-gates=RemoveSelfLink=false` parameter is the default value of `kubelet_options` for Kubernetes v1.20, and `--node-labels=magnum.openstack.org/gpu=true` has been added.
If using Kubernetes v1.19, the default feature gates are different.
When passing this value, due to how Magnum parses labels which contain commas, you need to pass at least one other `--labels` parameter so that Magnum does not split on the commas inside `kubelet_options`. Any `--labels` parameter, such as `--labels abc=xyz`, will work; the name and value are not important.
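Once the drivers are up and the GPU resource is advertised on the node, pods can request it. A minimal sketch, assuming the device plugin exposes the usual `nvidia.com/gpu` resource name; the pod name and image are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo        # hypothetical name
spec:
  containers:
    - name: cuda-app
      image: nvidia/cuda:12.2.0-base-ubuntu22.04  # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1  # GPUs are requested via limits; assumes the
                             # NVIDIA device plugin is running on the node
```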
The autoscaler is not aware of the resources available on OpenStack and so max_node_count should be set so that the cluster's resources will not exceed the quota limits of the OpenStack project.
The autoscaler will try to migrate pods if nodes are underutilised. By default, nodes with less than 50% utilisation are eligible for pod eviction, provided their pods fit on other nodes. To prevent pods from being evicted, use the following annotation on your pods:
```yaml
annotations:
  cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```
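In a full pod manifest, the annotation sits under `metadata` — a sketch, with a hypothetical pod name and an illustrative image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-pod     # hypothetical name
  annotations:
    # prevents the autoscaler from evicting this pod during scale-down
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
    - name: app
      image: nginx     # illustrative image
```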
To prevent a specific node from being removed even if it is empty, use the following annotation on the node:
```yaml
annotations:
  cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"
```
Pod priority is also considered by the autoscaler. Pods with a priority below a threshold (default -10) do not trigger a scale-up, and they are also ignored when the autoscaler considers nodes for removal. That is, a node that only has pods below the priority threshold is considered empty.
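Such expendable pods are created by assigning them a PriorityClass with a value below the cutoff. A sketch, with a hypothetical class name:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: expendable       # hypothetical name
value: -20               # below the default cutoff of -10, so pods using this
                         # class neither trigger scale-up nor block scale-down
globalDefault: false
description: "Best-effort workloads the autoscaler may ignore."
```

Pods opt in by setting `priorityClassName: expendable` in their spec.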
The Cluster Autoscaler is highly configurable. The responsiveness can be tuned by changing the parameters `new-pod-scale-up-delay` and `scale-down-unneeded-time`. To edit the Cluster Autoscaler deployment do:

```
$ kubectl -n kube-system edit deployment.apps/cluster-autoscaler
```
and add or modify your arguments under spec.template.spec.containers.command. A list of common arguments is given below, but you can check all the available Cluster Autoscaler arguments in the upstream FAQ.
Common CA arguments:

- `scan-interval`: how often the cluster is re-evaluated for scale up or down.
- `max-graceful-termination-sec`: maximum number of seconds the CA waits for pod termination when trying to scale down a node.
- `new-pod-scale-up-delay`: pods younger than this are not considered for scale-up (default 0 seconds).
- `scale-down-delay-after-add`: how long after a scale-up before scale-down evaluation resumes.
- `scale-down-unneeded-time`: how long a node should be unneeded before it is eligible for scale-down.
- `scale-down-utilization-threshold`: node utilisation level below which a node can be considered for scale-down (default 0.5, i.e. 50%).
- `expendable-pods-priority-cutoff`: pods with priority below the cutoff are expendable; they can be killed without any consideration during scale-down and they do not cause scale-up.
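As a sketch, the relevant part of the edited deployment might look like the fragment below. The container name and binary path are assumptions about the deployment manifest, and the values shown are illustrative, not recommendations:

```yaml
spec:
  template:
    spec:
      containers:
        - name: cluster-autoscaler        # assumed container name
          command:
            - ./cluster-autoscaler        # assumed binary path
            - --new-pod-scale-up-delay=30s          # illustrative values,
            - --scale-down-unneeded-time=15m        # not recommendations
            - --scale-down-utilization-threshold=0.4
```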