Kubernetes Scheduling Policies


This documentation is deprecated; please check here for its new home.

Scheduler configuration has changed in Kubernetes v1.19. For older clusters, skip to the section Configuring the kube-scheduler below.

On Kubernetes v1.19 or later

Kubernetes v1.19 supports configuring multiple scheduling policies with a single scheduler. We are using this to define a bin-packing scheduling policy in all v1.19 clusters by default.

To use this scheduling policy, specify the scheduler name bin-packing-scheduler in the Pod spec. For example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      schedulerName: bin-packing-scheduler
      containers:
      - name: nginx
        image: nginx:1.17.8
        resources:
          requests:
            cpu: 200m

The pods of this deployment will be scheduled onto the nodes which already have the highest resource utilisation. This optimises for cluster autoscaling and ensures efficient pod placement when mixing large and small pods in the same cluster.

If a scheduler name is not specified, the default spreading algorithm is used to distribute pods across all nodes.
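The difference between the two placement strategies can be sketched in a few lines of Python. This is an illustration only, not the scheduler's actual algorithm (which scores nodes with weighted priority functions), and the node capacities and usages below are hypothetical numbers:

```python
# Illustrative sketch: how bin-packing and spreading differ when choosing
# a node for a pod requesting `request` CPU cores. The node data is made up.

def pick_node(nodes, request, bin_packing):
    # Only nodes with enough free CPU are feasible candidates.
    feasible = [n for n in nodes if n["capacity"] - n["used"] >= request]
    if not feasible:
        return None
    # Bin-packing prefers the *most* utilised feasible node, so that lightly
    # used nodes stay empty and can be scaled down; spreading prefers the
    # *least* utilised node.
    utilisation = lambda n: n["used"] / n["capacity"]
    chosen = max(feasible, key=utilisation) if bin_packing else min(feasible, key=utilisation)
    return chosen["name"]

nodes = [
    {"name": "node-a", "capacity": 4.0, "used": 3.0},  # 75% utilised
    {"name": "node-b", "capacity": 4.0, "used": 1.0},  # 25% utilised
]

print(pick_node(nodes, request=0.5, bin_packing=True))   # node-a
print(pick_node(nodes, request=0.5, bin_packing=False))  # node-b
```

With bin-packing, node-b stays lightly loaded and remains a candidate for removal by the cluster autoscaler.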

Configuring the kube-scheduler

Note: These instructions are for Kubernetes v1.18 and earlier.

Setting non-default kube-scheduler policies requires some manual configuration on the master node.

$ ssh core@cluster-pj2n3k4nz5qi-master-0
$ sudo su
# vi /etc/kubernetes/scheduler

In this file, change the KUBE_SCHEDULER_ARGS line to add these two policy parameters:

KUBE_SCHEDULER_ARGS="--leader-elect=true --policy-configmap=scheduler-policy --policy-configmap-namespace=kube-system"

The scheduler policy must be provided in JSON format and stored in a ConfigMap with this name and namespace. A default policy can be taken from the Kubernetes examples, and an example including the ConfigMap is provided in the following section.

If the predicates, priorities, or any other sections of the policy configuration are not specified, they assume the default values. However, if the predicates section is given, then all desired predicates must be listed: it is not possible to add or remove individual predicates. The same applies to the priorities section.
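This replace-not-merge behaviour can be sketched as follows. The sketch is a simplification, and the default lists below are illustrative rather than the scheduler's full default set:

```python
# Sketch: a section present in the provided policy replaces the default
# list wholesale; an absent section falls back to the defaults. The entries
# below are illustrative, not the complete defaults.

DEFAULTS = {
    "predicates": ["PodFitsResources", "PodToleratesNodeTaints", "MatchNodeSelector"],
    "priorities": ["LeastRequestedPriority", "BalancedResourceAllocation"],
}

def effective_policy(provided):
    # No per-item merging happens: listing one predicate drops all others.
    return {section: provided.get(section, default)
            for section, default in DEFAULTS.items()}

# Giving only one predicate silently discards the other default predicates,
# while the unspecified priorities section keeps its defaults:
print(effective_policy({"predicates": ["PodFitsResources"]}))
```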

Once the ConfigMap has been created, the kube-scheduler must be restarted on the master node to pick up the new configuration.

# systemctl restart kube-scheduler.service

This step must be repeated if the ConfigMap is changed.

Packing policy

The default kube-scheduler priorities include the priority LeastRequestedPriority, which gives preference to scheduling on nodes which have less CPU/memory usage than others. This causes pods to be spread out evenly among all nodes.

For cluster autoscaling it is better to pack pods together onto the fewest possible nodes, so that the autoscaler can remove nodes which are not needed. To achieve this, LeastRequestedPriority can be swapped for MostRequestedPriority. An example scheduler policy with this change, giving MostRequestedPriority a higher weighting than the other priorities, is shown below:

$ cat scheduler-policy.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: scheduler-policy
  namespace: kube-system
data:
  policy.cfg: |
    {
      "kind": "Policy",
      "apiVersion": "v1",
      "priorities": [
        {"name": "MostRequestedPriority", "weight": 100},
        {"name": "ServiceSpreadingPriority", "weight": 1},
        {"name": "EqualPriority", "weight": 1},
        {"name": "ImageLocalityPriority", "weight": 1},
        {"name": "SelectorSpreadPriority", "weight": 1},
        {"name": "InterPodAffinityPriority", "weight": 1},
        {"name": "LeastRequestedPriority", "weight": 1},
        {"name": "BalancedResourceAllocation", "weight": 1},
        {"name": "NodePreferAvoidPodsPriority", "weight": 1},
        {"name": "NodeAffinityPriority", "weight": 1},
        {"name": "TaintTolerationPriority", "weight": 1}
      ]
    }

$ kubectl create -f scheduler-policy.yaml
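The effect of swapping the two priorities can be sketched numerically. The sketch below uses the documented 0-10 scoring formulas for these two priority functions (scores are then multiplied by the policy weight); the node numbers are hypothetical:

```python
# Sketch of the two priority functions' per-node scores on a 0-10 scale.

def least_requested_score(requested, capacity):
    # Favours nodes with more free resources -> pods spread out.
    return (capacity - requested) / capacity * 10

def most_requested_score(requested, capacity):
    # Favours nodes that are already busy -> pods pack together.
    return requested / capacity * 10

# A node at 75% CPU utilisation versus one at 25% (4-core nodes):
for used in (3.0, 1.0):
    print(least_requested_score(used, 4.0), most_requested_score(used, 4.0))
# LeastRequestedPriority prefers the 25% node (7.5 vs 2.5), while
# MostRequestedPriority prefers the 75% node (7.5 vs 2.5).
```

With the weight of 100 in the policy above, MostRequestedPriority dominates the other priorities, so the packing preference decides placement in almost all cases.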

Last update: June 1, 2022