Sandbox containers and AKS
Hello All!
In this article, I would like to talk about sandbox containers in the context of Azure Kubernetes Service.
We will start by introducing the concepts behind sandbox containers and why they may be needed. Then we'll look at the solutions currently available for sandbox containers in Kubernetes. And last, we'll get pragmatic and demonstrate how these sandbox container technologies are deployed and used in an AKS cluster.
1. Container sandboxing 101
Before understanding the need for sandbox containers, we need to step back a little and look at the basics of container architecture.
Remember, the idea behind containers is to have a lighter abstraction layer between the app and its host. In the virtualization world, we have the host, the hypervisor, and the virtual machine, which then hosts the app binary.
In the container world, the hypervisor disappears, along with the Virtual Machine layer and its OS.
The result is what everyone knows: it's much faster and more resource-efficient to remove those layers and depend only on the container engine and the kernel capabilities to isolate workloads.
However, because there are fewer layers, there is also less isolation. The kernel is shared, even if there is isolation by means of cgroups and namespaces.
Because there are cases where better isolation is needed, technical solutions were developed. Those solutions are usually based on one of these two approaches:
- Rule-based execution
- Machine level virtualization
Rule-based execution works by identifying which syscalls are made by the app, and by allowing only those calls from the container to the shared host kernel. Solutions using rule-based execution are seccomp, SELinux or AppArmor.
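Just to illustrate the idea in a Kubernetes context, here is a minimal sketch (the pod name is made up and we will not reuse it later): a pod that opts into the container runtime's default seccomp profile through its securityContext, which restricts the syscalls the container can make to the host kernel.
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-demo
spec:
  securityContext:
    # Apply the container runtime's default seccomp profile to all containers of the pod
    seccompProfile:
      type: RuntimeDefault
  containers:
  - image: nginx
    name: seccomp-demo
    resources: {}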
Machine-level virtualization is probably easier to reason about, because it provides isolation through a lightweight virtual machine. However, because it relies on this lightweight virtual machine, it does impact the performance of the application hosted in the sandboxed container.
In the machine-level virtualization category, we find katacontainer, which we will look at in the following parts.
To conclude with the sandbox container solutions, we should have a look at gVisor. gVisor acts as an intermediary between the application and the host kernel. It intercepts the syscalls from the application and passes only a limited set of calls to the host kernel.
Ok, that's all for the intro on sandbox containers. In the following sections, we will set rule-based execution aside and focus more specifically on katacontainer and gVisor, in an AKS context.
2. Preparing for sandbox containers in AKS
Let's go back to our cloud managed Kubernetes now. If we look at either gVisor or katacontainer, there are things to install on the Kubernetes nodes. Except that, well, we usually don't install anything on the nodes. Indeed, those nodes are actually instances from VM scale sets. By design, the scale sets are managed by the AKS control plane, and the instances' lifecycle is not managed by a human admin. Because of this, it is not really practical to install any binary on a node each time it is provisioned.
There's a way, though, to have a pod running on each node. For use cases like that, we can rely on the DaemonSet controller, which ensures that the pods described in the controller are always deployed on each node of our cluster.
If we consider an AKS cluster, we can use daemon sets along with taints on node pools to have pods running on each set of nodes in a node pool.
If we want to install gVisor (or any binary, for that matter) on all the nodes of a specific node pool, we could rely on a DaemonSet which would deploy a pod with an elevated container. If said container was configured to execute the installation of gVisor, then we could achieve our goal.
There are a few watch points here however:
- First, we need a pod with elevated access, in this case access to the node's local storage at least, so that we can install the sandbox runtime. There may be a contradiction here, don't you think?
- Second, we need to modify the configuration of nodes managed by an Azure service. So while the approach with the DaemonSet is typical of Kubernetes environments, and thus respects the best practices of Kubernetes node management, it is kind of a grey zone in terms of Azure support. Meaning it should work, but if it stops working, we're on our own because there won't be any support from Microsoft, or at least not that much.
There's an excellent article written by Daniel Neumann on his blog which details the configuration of said container and how to deploy it with a DaemonSet. We'll review that a bit in the next section of this article.
Another way to get sandbox containers, this time supported by Microsoft, is to rely on a sandbox technology available in the underlying OS of the node image. Luckily for us, this is the case for katacontainer and the node image based on Mariner. In this specific case, no installation on the nodes is required, and we can focus only on the Kubernetes part of creating our sandbox container.
If we dive a little deeper into our AKS architecture, we should remember that by default we have one node pool (a.k.a. the default node pool). This node pool is by default (for now at least) an Ubuntu-based node pool, so no katacontainer there. Also, because it's the default system node pool, it hosts all the pods required for AKS to work. It's better to leave it alone, so we'll use additional node pools.
Regarding gVisor, it can be any node pool, because this is a self-managed sandbox software install, so there is no underlying architecture requirement. Adding a node pool is not too difficult and can be done from the portal, the az cli or some terraform configuration.
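For instance, adding a dedicated gVisor node pool with the az cli could look like the command below. This is only a sketch: the pool name, taint and node count are illustrative choices that simply have to match the DaemonSet and pod manifests used later. Note that AKS automatically adds an agentpool=<pool name> label on the nodes, which we will use as a nodeSelector.
# Illustrative example: dedicated node pool for gVisor, tainted so that
# only workloads tolerating gvisor=true:NoSchedule land on it
az aks nodepool add \
  --cluster-name <AKS_Cluster_Name> \
  --resource-group <AKS_Resource_Group> \
  --name npgvisor \
  --node-count 2 \
  --node-vm-size <VM_Size> \
  --node-taints gvisor=true:NoSchedule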
For katacontainer, we do have a requirement to use a Mariner node pool.
It's important to remember that it is currently a preview, and as such it requires the feature to be registered with the az feature register command:
az feature register --namespace "Microsoft.ContainerService" --name "KataVMIsolationPreview"
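As usual with AKS preview features, the registration takes a few minutes, and the resource provider should be refreshed once the feature shows as Registered. A quick check could look like this:
# Check the registration state of the preview feature
az feature show --namespace "Microsoft.ContainerService" --name "KataVMIsolationPreview" --query properties.state

# Once it reports "Registered", refresh the resource provider
az provider register --namespace Microsoft.ContainerService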
Also, we have to set the --workload-runtime parameter to KataMshvVmIsolation so that the feature is activated on the node pool. Otherwise, the runtime class will not be available. More on that in the next part.
az aks nodepool add --cluster-name <AKS_Cluster_Name> --resource-group <AKS_Resource_Group> --name <Node_Pool_Name> --os-sku mariner --workload-runtime KataMshvVmIsolation --node-vm-size <VM_Size>
Interestingly enough, while there is a workload_runtime parameter in the terraform provider, it currently only supports OCIContainer or WasmWasi. So we are stuck with either the az cli or ARM (or bicep).
That's almost all for the node pool configuration. The last details are more in the Kubernetes plane, so we will have a look in the next part.
3. Running sandbox containers in AKS with gVisor
First things first, let's have a look at our nodes. With a custom-columns selection, we can get the taints and labels. Note that we're selecting a specific label here:
yumemaru@azure$ k get nodes -o custom-columns='NodeName:.metadata.name,LabelAgentPool:.metadata.labels.agentpool,NodeTaintsKey:.spec.taints[].key,NodeTaintsValue:.spec.taints[].value,NodeTaintsEffect:.spec.taints[].effect'
NodeName                               LabelAgentPool   NodeTaintsKey        NodeTaintsValue   NodeTaintsEffect
aks-aksnp0sbxcon-29325118-vmss00000p   aksnp0sbxcon     CriticalAddonsOnly   true              NoSchedule
aks-aksnp0sbxcon-29325118-vmss00000q   aksnp0sbxcon     CriticalAddonsOnly   true              NoSchedule
aks-aksnp0sbxcon-29325118-vmss00000r   aksnp0sbxcon     CriticalAddonsOnly   true              NoSchedule
aks-npgvisor-28185204-vmss00000b       npgvisor         gvisor               true              NoSchedule
aks-npgvisor-28185204-vmss00000c       npgvisor         gvisor               true              NoSchedule
aks-npkata-37278511-vmss000006         npkata           KataContainer        true              NoSchedule
aks-npkata-37278511-vmss000007         npkata           KataContainer        true              NoSchedule
aks-npkata-37278511-vmss000008         npkata           KataContainer        true              NoSchedule
yumemaru@azure$
With that, we can install gvisor with a daemonset:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gvisor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: gvisor
  template:
    metadata:
      labels:
        app: gvisor
    spec:
      hostPID: true
      restartPolicy: Always
      containers:
      - image: docker.io/yumemaru1979/gvisor:latest
        imagePullPolicy: Always
        name: gvisor
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        securityContext:
          privileged: true
          readOnlyRootFilesystem: true
        volumeMounts:
        - name: k8s-node
          mountPath: /k8s-node
      volumes:
      - name: k8s-node
        hostPath:
          path: /tmp/gvisor
      tolerations:
      - key: gvisor
        operator: Exists
        effect: NoSchedule
      nodeSelector:
        agentpool: npgvisor
The interesting part from a scheduling point of view here is the toleration, which matches the taint on our gvisor nodes (gvisor=true:NoSchedule). We are also using a nodeSelector to match the agentpool: npgvisor label, so that we are sure that the pods will only execute on this node pool.
For the image and what it does, really, I took the information from Daniel Neumann's blog, as mentioned earlier.
To summarize, installing gVisor requires adding the runsc binary (and its containerd shim) on the node, and modifying the containerd configuration accordingly. The gVisor documentation details this in the installation section.
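To illustrate, here is a rough sketch of the steps performed on the node, based on the gVisor installation documentation rather than on the exact content of the image (the real container relies on its elevated privileges, hostPID and nsenter, to run the equivalent commands in the host namespaces, and the binary and config paths below are assumptions):
# Copy the runsc binary and its containerd shim somewhere in the node's PATH
install -m 755 runsc containerd-shim-runsc-v1 /usr/local/bin/

# Register a runsc runtime handler in the containerd configuration
cat <<EOF >> /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"
EOF

# Restart containerd so that the new runtime handler is taken into account
systemctl restart containerd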
When the daemonset is scheduled, we should see something like that:
yumemaru@azure$ k get daemonsets.apps gvisor -n kube-system
NAME     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR        AGE
gvisor   2         2         2       2            2           agentpool=npgvisor   34h
Checking on the portal, we can also see which pod runs on which node.
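The same information is also available from the CLI by listing the DaemonSet pods with the wide output, which adds the node name column:
k get pods -n kube-system -l app=gvisor -o wide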
With that ready, we now need to add a runtime class:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
scheduling:
  nodeSelector:
    agentpool: npgvisor
And last, we can schedule pods on the node pool. To use the runtime class, and thus get a sandboxed container, we need to specify the runtimeClassName. We also specify the nodeSelector and the tolerations to ensure that the pod runs on the node pool that we want:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: gvisor
  name: gvisortest
  namespace: gvisordemo
spec:
  runtimeClassName: gvisor
  nodeSelector:
    agentpool: npgvisor
  tolerations:
  - key: gvisor
    operator: Exists
    effect: NoSchedule
  containers:
  - image: nginx
    name: gvisortest
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
For comparison purposes, we can also add a non-sandboxed pod:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: test
  name: test
  namespace: gvisordemo
spec:
  tolerations:
  - key: gvisor
    operator: Exists
    effect: NoSchedule
  containers:
  - image: nginx
    name: test
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
Checking the pod kernel version gives us two different versions:
yumemaru@azure$ k exec -n gvisordemo test -- uname -r
5.4.0-1103-azure
yumemaru@azure$ k exec -n gvisordemo gvisortest -- uname -r
4.4.0
Any other pod on a standard Ubuntu node should have the same kernel version:
yumemaru@azure$ k exec -n kube-system gvisor-nczx6 -- uname -r
5.4.0-1103-azure
And that's it for gVisor sandbox containers. Easy, once we have tackled the gVisor install, which is not so easy on AKS nodes. Let's move to katacontainer now.
4. Sandboxing with katacontainer
With katacontainer, it's much more managed. Since the Mariner nodes are already compatible with this sandbox technology, we just have to schedule a pod with the appropriate runtime class.
To be sure to get the appropriate one, we can use the following command:
yumemaru@azure$ k get runtimeclasses.node.k8s.io
NAME                     HANDLER   AGE
gvisor                   runsc     13s
kata-mshv-vm-isolation   kata      11d
runc                     runc      11d
The one that we want is kata-mshv-vm-isolation.
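Its full definition can be dumped with kubectl (the creation timestamp, uid and resourceVersion below are of course specific to my cluster):
k get runtimeclasses.node.k8s.io kata-mshv-vm-isolation -o yaml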
apiVersion: node.k8s.io/v1
handler: kata
kind: RuntimeClass
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"node.k8s.io/v1","handler":"kata","kind":"RuntimeClass","metadata":{"annotations":{},"labels":{"addonmanager.kubernetes.io/mode":"Reconcile","kubernetes.io/cluster-service":"true"},"name":"kata-mshv-vm-isolation"},"scheduling":{"nodeSelector":{"kubernetes.azure.com/kata-mshv-vm-isolation":"true"}}}
  creationTimestamp: "2023-03-18T21:15:46Z"
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: kata-mshv-vm-isolation
  resourceVersion: "554541"
  uid: a18f3303-4af5-4d6e-95f4-9e492c1c94a7
scheduling:
  nodeSelector:
    kubernetes.azure.com/kata-mshv-vm-isolation: "true"
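Note the nodeSelector embedded in the runtime class: at admission time it is merged with the pod's own node selection, so pods using this class can only land on nodes carrying the kubernetes.azure.com/kata-mshv-vm-isolation=true label. We can quickly check which nodes have it:
k get nodes -l kubernetes.azure.com/kata-mshv-vm-isolation=true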
Now we can prepare a pod to use this runtime:
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: katatest
  name: katatest
  namespace: katademo
spec:
  nodeSelector:
    agentpool: npkata
  runtimeClassName: kata-mshv-vm-isolation
  tolerations:
  - key: KataContainer
    operator: Exists
    effect: NoSchedule
  containers:
  - image: nginx
    name: katatest
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
And again, create another one with the default runc runtime class:
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: katatest
  name: test2
  namespace: katademo
spec:
  nodeSelector:
    agentpool: npkata
  tolerations:
  - key: KataContainer
    operator: Exists
    effect: NoSchedule
  containers:
  - image: nginx
    name: katatest
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
If we check the kernel of those two pods, we can see a different kernel version for the sandboxed pod, while all the other pods (on this node pool) share the same version:
yumemaru@azure$ k exec -n katademo katatest -- uname -r
5.15.48.1-9.cm2
yumemaru@azure$ k exec -n katademo test2 -- uname -r
5.15.92.mshv1-hvl1.m2
yumemaru@azure$ k exec -n katademo test3 -- uname -r
5.15.92.mshv1-hvl1.m2
And with that we are finished.
5. To conclude
So the good news is that sandbox containers are now officially supported, thanks to the katacontainer support on Mariner nodes. The manual install through a DaemonSet is still possible, but not supported from an Azure plane point of view.
Note that for now, there are some limitations with sandbox containers based on katacontainer, notably no support for CSI drivers.
That will be all for today. See you soon!