Kubernetes Scheduling Policies


This documentation is deprecated; please check here for its new home.

Scheduler configuration has changed in Kubernetes v1.19. For older clusters, skip to the section Configuring the kube-scheduler below.

On Kubernetes v1.19 or later

Kubernetes v1.19 supports configuring multiple scheduling policies with a single scheduler. We are using this to define a bin-packing scheduling policy in all v1.19 clusters by default.

To use this scheduling policy, specify the scheduler name bin-packing-scheduler in the Pod spec. For example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      schedulerName: bin-packing-scheduler
      containers:
      - name: nginx
        image: nginx:1.17.8
        resources:
          requests:
            cpu: 200m

The pods of this deployment will be scheduled onto the nodes which already have the highest resource utilisation. This optimises for cluster autoscaling and ensures efficient pod placement when mixing large and small pods in the same cluster.

If a scheduler name is not specified, the default spreading algorithm is used to distribute pods across all nodes.
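The difference between the two placement strategies can be sketched in a few lines of Python. This is an illustration only, not the scheduler's actual algorithm (which scores nodes with weighted priority functions), and the node capacities and usages below are hypothetical numbers:

```python
# Illustrative sketch: how bin-packing and spreading differ when choosing
# a node for a pod requesting `request` CPU cores. The node data is made up.

def pick_node(nodes, request, bin_packing):
    # Only nodes with enough free CPU are feasible candidates.
    feasible = [n for n in nodes if n["capacity"] - n["used"] >= request]
    if not feasible:
        return None
    # Bin-packing prefers the *most* utilised feasible node, so that lightly
    # used nodes stay empty and can be scaled down; spreading prefers the
    # *least* utilised node.
    utilisation = lambda n: n["used"] / n["capacity"]
    chosen = max(feasible, key=utilisation) if bin_packing else min(feasible, key=utilisation)
    return chosen["name"]

nodes = [
    {"name": "node-a", "capacity": 4.0, "used": 3.0},  # 75% utilised
    {"name": "node-b", "capacity": 4.0, "used": 1.0},  # 25% utilised
]

print(pick_node(nodes, request=0.5, bin_packing=True))   # node-a
print(pick_node(nodes, request=0.5, bin_packing=False))  # node-b
```

With bin-packing, node-b stays lightly loaded and remains a candidate for removal by the cluster autoscaler.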

Configuring the kube-scheduler

Note: These instructions are for Kubernetes v1.18 and earlier.

Setting non-default kube-scheduler policies requires some manual configuration on the master node.

$ ssh core@cluster-pj2n3k4nz5qi-master-0
$ sudo su
# vi /etc/kubernetes/scheduler

In this file, change the KUBE_SCHEDULER_ARGS line to add these two policy parameters:

KUBE_SCHEDULER_ARGS="--leader-elect=true --policy-configmap=scheduler-policy --policy-configmap-namespace=kube-system"

The scheduler policy must be provided in JSON format and stored in a ConfigMap with this name and namespace. A default policy can be taken from the Kubernetes examples, and an example including the ConfigMap is provided in the following section.

If the predicates, priorities, or any other sections of the policy configuration are not specified, they assume the default values. However, if the predicates section is given, then all desired predicates must be listed: it is not possible to add or remove individual predicates. The same applies to the priorities section.
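This replace-not-merge behaviour can be sketched as follows. The sketch is a simplification, and the default lists below are illustrative rather than the scheduler's full default set:

```python
# Sketch: a section present in the provided policy replaces the default
# list wholesale; an absent section falls back to the defaults. The entries
# below are illustrative, not the complete defaults.

DEFAULTS = {
    "predicates": ["PodFitsResources", "PodToleratesNodeTaints", "MatchNodeSelector"],
    "priorities": ["LeastRequestedPriority", "BalancedResourceAllocation"],
}

def effective_policy(provided):
    # No per-item merging happens: listing one predicate drops all others.
    return {section: provided.get(section, default)
            for section, default in DEFAULTS.items()}

# Giving only one predicate silently discards the other default predicates,
# while the unspecified priorities section keeps its defaults:
print(effective_policy({"predicates": ["PodFitsResources"]}))
```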

Once the ConfigMap has been created, the kube-scheduler must be restarted on the master node to pick up the new configuration.

# systemctl restart kube-scheduler.service

This step must be repeated if the ConfigMap is changed.

Packing policy

The default kube-scheduler priorities include the priority LeastRequestedPriority, which gives preference to scheduling on nodes which have less CPU/memory usage than others. This causes pods to be spread out evenly among all nodes.

For cluster autoscaling it is better to pack pods together onto the fewest possible nodes, so that the autoscaler can remove nodes which are not needed. To achieve this, LeastRequestedPriority can be swapped for MostRequestedPriority. An example scheduler policy with this change, giving MostRequestedPriority a higher weighting than the other priorities, is shown below:

$ cat scheduler-policy.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: scheduler-policy
  namespace: kube-system
data:
  policy.cfg: |
    {
      "kind": "Policy",
      "apiVersion": "v1",
      "priorities": [
        {"name": "MostRequestedPriority", "weight": 100},
        {"name": "ServiceSpreadingPriority", "weight": 1},
        {"name": "EqualPriority", "weight": 1},
        {"name": "ImageLocalityPriority", "weight": 1},
        {"name": "SelectorSpreadPriority", "weight": 1},
        {"name": "InterPodAffinityPriority", "weight": 1},
        {"name": "LeastRequestedPriority", "weight": 1},
        {"name": "BalancedResourceAllocation", "weight": 1},
        {"name": "NodePreferAvoidPodsPriority", "weight": 1},
        {"name": "NodeAffinityPriority", "weight": 1},
        {"name": "TaintTolerationPriority", "weight": 1}
      ]
    }

$ kubectl create -f scheduler-policy.yaml
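The effect of swapping the two priorities can be sketched numerically. The sketch below uses the documented 0-10 scoring formulas for these two priority functions (scores are then multiplied by the policy weight); the node numbers are hypothetical:

```python
# Sketch of the two priority functions' per-node scores on a 0-10 scale.

def least_requested_score(requested, capacity):
    # Favours nodes with more free resources -> pods spread out.
    return (capacity - requested) / capacity * 10

def most_requested_score(requested, capacity):
    # Favours nodes that are already busy -> pods pack together.
    return requested / capacity * 10

# A node at 75% CPU utilisation versus one at 25% (4-core nodes):
for used in (3.0, 1.0):
    print(least_requested_score(used, 4.0), most_requested_score(used, 4.0))
# LeastRequestedPriority prefers the 25% node (7.5 vs 2.5), while
# MostRequestedPriority prefers the 75% node (7.5 vs 2.5).
```

With the weight of 100 in the policy above, MostRequestedPriority dominates the other priorities, so the packing preference decides placement in almost all cases.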

Last update: June 1, 2022