Seccomp in Kubernetes and AKS
Hi!
I’m currently attempting (with moderate success, hence the "attempting" 😽) the CKS certification. Among the multitude of topics to understand, there is one called Seccomp. So this article will not be about any groundbreaking technology, but rather a walkthrough on using Seccomp in Kubernetes and AKS. Hope you’ll enjoy it, and well, at least it will be a good cheat sheet for me 🤭
So the Agenda:
- A little introduction to Seccomp
- Trying to use Seccomp on a Kubernetes Cluster
- What about Seccomp on AKS
- Conclusion
1. A little introduction to Seccomp
1.1. A bit of history and prerequisites
From a Kubernetes standpoint, Seccomp, for Secure Computing mode, is a Linux feature that restricts the system calls a process is able to perform, from user space to the kernel.
If we dig a little bit more, we can find on Wikipedia or lwn.net that Seccomp is quite old. It was introduced around 2005, to secure the execution of untrusted programs in grid computing. Over time, it was more widely adopted, notably by Chrome to sandbox the execution of Adobe Flash, by Docker, and by many other programs such as Firefox, OpenSSH…
Back to our use case now. Using Seccomp is a means to ensure that programs running on Kubernetes, i.e. in pods, are isolated from the hosts and limited in the system calls they can make.
Obviously we need a recent enough kernel:
- Linux kernel ≥ 2.6.12, which introduced the basic seccomp strict mode.
- Linux kernel ≥ 3.5, which added the seccomp-BPF filter mode (the flexible one that allows whitelists/blacklists).
- Linux kernel ≥ 3.17, which introduced the dedicated seccomp() syscall (instead of only prctl()).
Modern Linux distributions meet these requirements. We can check the kernel version with the uname -r command:
vagrant@k8scilium1:~$ uname -r
6.14.0-28-generic
Additionally, the kernel should be compiled with the CONFIG_SECCOMP=y and CONFIG_SECCOMP_FILTER=y options.
vagrant@k8scilium1:~$ grep SECCOMP /boot/config-$(uname -r)
CONFIG_HAVE_ARCH_SECCOMP=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP=y
CONFIG_SECCOMP_FILTER=y
# CONFIG_SECCOMP_CACHE_DEBUG is not set
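Beyond the build-time options, we can also check at runtime whether a given process has a seccomp filter attached, through /proc. A quick check on the current shell:

```shell
# Show the seccomp state of the current process.
# Seccomp: 0 = disabled, 1 = strict mode, 2 = filter mode (seccomp-BPF)
grep '^Seccomp' /proc/self/status
```

A process running under a container runtime's default profile would typically report mode 2 here.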
And, last but not least, the container engine and Kubernetes should be recent enough to support Seccomp. That means Kubernetes 1.19 at least, which is already quite old.
vagrant@k8scilium1:~$ k version
Client Version: v1.32.8
Kustomize Version: v5.5.0
Server Version: v1.32.8
vagrant@k8scilium1:~$ k get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8scilium1 Ready control-plane 2d23h v1.32.8 192.168.56.17 <none> Ubuntu 24.04.2 LTS 6.8.0-64-generic containerd://1.7.27
Now, back to Seccomp in Kubernetes.
1.2. Seccomp in Kubernetes basics
Assuming that Seccomp can be used on a Kubernetes cluster, how can it be used to further secure the environment?
Well, the basics are not too difficult.
Leveraging the spec.securityContext.seccompProfile
section in a pod manifest, we can configure Seccomp to limit what a process in the pod can do.
Note: Seccomp is not the only thing that can be configured in the spec.securityContext
section, but we want to focus on Seccomp today.
In the seccompProfile section, we can set three values:
- Unconfined: enforces no restrictions. Let’s say that it is not our preferred configuration 😅.
- RuntimeDefault: enforces the container runtime’s default profile. More on that later.
- Localhost: probably the most interesting value, and the worst at the same time. It allows the use of a custom profile from the node’s filesystem.
About the container runtime’s default profile: it seems that whatever the runtime, i.e. Docker, containerd, or CRI-O, it is based on Docker’s default profile. We can find information about it in the Docker documentation.
It’s a whitelist-based profile, meaning that its default action is to block syscalls, except those that are specifically allowed. The doc includes a portion of the whitelisted syscalls, with explanations for most of them.
Digging a bit more, we can find on Docker’s github the default.json
that is used for the profile.
Copy/pasting the full profile is not really relevant, but just for information, the structure looks like this.
{
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": [
"SCMP_ARCH_X86_64",
"SCMP_ARCH_X86",
"SCMP_ARCH_X32"
],
"syscalls": [
{
"name": "accept",
"action": "SCMP_ACT_ALLOW",
"args": []
},
====truncated====
]
}
The default action "SCMP_ACT_ERRNO" blocks the unspecified syscalls, and the syscalls list lets us specify all the allowed syscalls.
This structure is also what is used for the custom profiles loaded with the Localhost
value.
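As a quick sanity exercise, we can pull the allowed syscall names out of a profile that follows this structure. A rough grep/sed pipeline is enough for the older one-"name"-per-entry layout shown above (for anything serious, a JSON-aware tool like jq is preferable); the mini-profile below is a made-up sample:

```shell
# Hypothetical mini-profile following the structure above.
cat > /tmp/profile-sample.json <<'EOF'
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    { "name": "accept", "action": "SCMP_ACT_ALLOW", "args": [] },
    { "name": "bind", "action": "SCMP_ACT_ALLOW", "args": [] }
  ]
}
EOF
# List the whitelisted syscall names.
grep -o '"name": "[^"]*"' /tmp/profile-sample.json | sed 's/.*"name": "\(.*\)"/\1/'
```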
Ok, enough with the concepts, let’s try some stuff ^^
2. Trying to use Seccomp on a Kubernetes cluster
2.1. Using the container runtime default profile
For this first hands-on part, I’ll use a single node kubeadm cluster. That’s because this way, I get full access to the node, which is not that easy with managed kubernetes such as AKS. But we’ll have a look at that in the next part.
For now, we’ll start with a sample pod, in which we’ll add the container runtime default profile in the securityContext section.
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels:
run: sample
name: sample
spec:
# Seccomp config
securityContext:
seccompProfile:
type: RuntimeDefault
# Seccomp config end
containers:
- image: nginx
name: sample
resources: {}
dnsPolicy: ClusterFirst
restartPolicy: Always
status: {}
With this specific container, i.e. nginx, it works fine. It’s considered a security best practice to implement this profile by default.
However, there are some cases where it’s either not secure enough, or too restrictive, depending on the app. In those cases, we need to switch to custom profiles.
2.2. Using custom seccomp profile
Relying on the Kubernetes documentation, we can find the following profiles.
audit.json, a profile that only logs syscalls:
{
"defaultAction": "SCMP_ACT_LOG"
}
violation.json, a profile that literally blocks all syscalls:
{
"defaultAction": "SCMP_ACT_ERRNO"
}
We can see that the defaultAction
differs in both profiles. We can get information from another page of the kubernetes documentation.
| Seccomp profile action | Description |
|---|---|
| SCMP_ACT_ERRNO | Return the specified error code. |
| SCMP_ACT_ALLOW | Allow the syscall to be executed. |
| SCMP_ACT_KILL_PROCESS | Kill the process. |
| SCMP_ACT_KILL_THREAD and SCMP_ACT_KILL | Kill only the thread. |
| SCMP_ACT_TRAP | Throw a SIGSYS signal. |
| SCMP_ACT_NOTIFY and SECCOMP_RET_USER_NOTIF | Notify the user space. |
| SCMP_ACT_TRACE | Notify a tracing process with the specified value. |
| SCMP_ACT_LOG | Allow the syscall to be executed after the action has been logged to syslog or auditd. |
Ok, let’s create another pod, this time with the violation.json profile:
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels:
run: sample
name: sample
spec:
# Seccomp config
securityContext:
seccompProfile:
type: Localhost
localhostProfile: profiles/violation.json
# Seccomp config end
containers:
- image: nginx
name: sample
resources: {}
dnsPolicy: ClusterFirst
restartPolicy: Always
status: {}
In this case we specify a path so that Kubernetes can find the file and load it. The path is evaluated relative to the kubelet’s seccomp root, which on a kubeadm cluster is /var/lib/kubelet/seccomp. To load custom profiles, we need to have our JSON files under this path.
vagrant@k8scilium1:~$ ls /var/lib/kubelet/seccomp/profiles/
audit.json finegrained.json violation.json
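For reference, getting a profile into place is just a matter of writing the JSON file under that directory. A sketch, using a scratch directory here instead of the real kubelet path:

```shell
# On a real node the target would be /var/lib/kubelet/seccomp/profiles
# (and writing there requires root); we use a scratch dir for the demo.
SECCOMP_ROOT=/tmp/seccomp-demo
mkdir -p "$SECCOMP_ROOT/profiles"
cat > "$SECCOMP_ROOT/profiles/audit.json" <<'EOF'
{
  "defaultAction": "SCMP_ACT_LOG"
}
EOF
ls "$SECCOMP_ROOT/profiles"
```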
Note that if we create a pod referencing a profile that does not exist, it will naturally fail, as we can see below:
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels:
run: sample
name: sample
spec:
# Seccomp config
securityContext:
seccompProfile:
type: Localhost
localhostProfile: profiles/thisprofiledoesnotexist.json
# Seccomp config end
containers:
- image: nginx
name: sample
resources: {}
dnsPolicy: ClusterFirst
restartPolicy: Always
status: {}
vagrant@k8scilium1:~$ k get pod sample-profilenotexisting
NAME READY STATUS RESTARTS AGE
sample-profilenotexisting 0/1 CreateContainerError 0 10m
vagrant@k8scilium1:~$ k get pod sample-profilenotexisting -o json | jq .status.containerStatuses[0].state
{
"waiting": {
"message": "failed to create containerd container: cannot load seccomp profile \"/var/lib/kubelet/seccomp/profiles/thisprofiledoesnotexist.json\": open /var/lib/kubelet/seccomp/profiles/thisprofiledoesnotexist.json: no such file or directory",
"reason": "CreateContainerError"
}
}
What about the sample pod with the violation profile? This time, we expect the pod to fail because we block all the syscalls.
vagrant@k8scilium1:~$ k get pod sample-violationprofile
NAME READY STATUS RESTARTS AGE
sample-violationprofile 0/1 RunContainerError 2 (8s ago) 26s
vagrant@k8scilium1:~$ k get pod sample-violationprofile -o json | jq .status.containerStatuses
[
{
"containerID": "containerd://1c307e53296a09559e7eec632c47af3d27d8e3a46b073b0675d2313a156e58c6",
"image": "docker.io/library/nginx:latest",
"imageID": "docker.io/library/nginx@sha256:33e0bbc7ca9ecf108140af6288c7c9d1ecc77548cbfd3952fd8466a75edefe57",
"lastState": {
"terminated": {
"containerID": "containerd://1c307e53296a09559e7eec632c47af3d27d8e3a46b073b0675d2313a156e58c6",
"exitCode": 128,
"finishedAt": "2025-08-22T15:28:10Z",
"message": "failed to start containerd task \"1c307e53296a09559e7eec632c47af3d27d8e3a46b073b0675d2313a156e58c6\": cannot start a stopped process: unknown",
"reason": "StartError",
"startedAt": "1970-01-01T00:00:00Z"
}
},
"name": "sample-violationprofile",
"ready": false,
"restartCount": 3,
"started": false,
"state": {
"waiting": {
"message": "back-off 40s restarting failed container=sample-violationprofile pod=sample-violationprofile_default(ec188dd8-4dfd-4bb7-8eb6-b88f9910b843)",
"reason": "CrashLoopBackOff"
}
},
"volumeMounts": [
{
"mountPath": "/var/run/secrets/kubernetes.io/serviceaccount",
"name": "kube-api-access-dp9qb",
"readOnly": true,
"recursiveReadOnly": "Disabled"
}
]
}
]
This means that Seccomp works as expected, and blocks pods that would require too many syscalls. But how are Seccomp profiles defined?
Well, that’s a bit more complex. Let’s have a look.
2.3. Creating a custom Seccomp profile
Because a Seccomp profile is used to allow only the necessary, or at least the acceptable, syscalls, we need a way to find out which syscalls are used by the app.
One way to do this, among others, is the audit profile that we used earlier. Remember, this profile uses the SCMP_ACT_LOG action, which allows the syscall to be executed after logging it to syslog.
So if we create a pod with this profile, as defined in the Kubernetes documentation, we can then theoretically parse the syslog to find out which syscalls are used.
apiVersion: v1
kind: Pod
metadata:
name: audit-pod
labels:
app: audit-pod
spec:
securityContext:
seccompProfile:
type: Localhost
localhostProfile: profiles/audit.json
containers:
- name: test-container
image: hashicorp/http-echo:1.0
args:
- "-text=just made some syscalls!"
securityContext:
allowPrivilegeEscalation: false
---
Looking in the syslog for the pod-related logs, we can find this:
vagrant@k8scilium1:~$ cat /var/log/syslog |grep "http-echo"
2025-08-25T07:54:39.281965+00:00 k8scilium1 kernel: audit: type=1326 audit(1756108479.280:484): auid=4294967295 uid=65532 gid=65532 ses=4294967295 subj=cri-containerd.apparmor.d pid=4640 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=35 compat=0 ip=0x4685d7 code=0x7ffc0000
2025-08-25T07:54:39.282463+00:00 k8scilium1 kernel: audit: type=1326 audit(1756108479.281:485): auid=4294967295 uid=65532 gid=65532 ses=4294967295 subj=cri-containerd.apparmor.d pid=4640 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x468ba3 code=0x7ffc0000
2025-08-25T07:55:39.282778+00:00 k8scilium1 kernel: audit: type=1326 audit(1756108539.281:486): auid=4294967295 uid=65532 gid=65532 ses=4294967295 subj=cri-containerd.apparmor.d pid=4640 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=35 compat=0 ip=0x4685d7 code=0x7ffc0000
2025-08-25T07:55:39.283486+00:00 k8scilium1 kernel: audit: type=1326 audit(1756108539.282:487): auid=4294967295 uid=65532 gid=65532 ses=4294967295 subj=cri-containerd.apparmor.d pid=4640 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x468ba3 code=0x7ffc0000
2025-08-25T07:56:39.284482+00:00 k8scilium1 kernel: audit: type=1326 audit(1756108599.283:488): auid=4294967295 uid=65532 gid=65532 ses=4294967295 subj=cri-containerd.apparmor.d pid=4640 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=35 compat=0 ip=0x4685d7 code=0x7ffc0000
2025-08-25T07:56:39.284497+00:00 k8scilium1 kernel: audit: type=1326 audit(1756108599.283:489): auid=4294967295 uid=65532 gid=65532 ses=4294967295 subj=cri-containerd.apparmor.d pid=4640 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x468ba3 code=0x7ffc0000
2025-08-25T07:57:39.285585+00:00 k8scilium1 kernel: audit: type=1326 audit(1756108659.283:490): auid=4294967295 uid=65532 gid=65532 ses=4294967295 subj=cri-containerd.apparmor.d pid=4640 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=35 compat=0 ip=0x4685d7 code=0x7ffc0000
2025-08-25T07:57:39.285612+00:00 k8scilium1 kernel: audit: type=1326 audit(1756108659.283:491): auid=4294967295 uid=65532 gid=65532 ses=4294967295 subj=cri-containerd.apparmor.d pid=4640 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x468ba3 code=0x7ffc0000
2025-08-25T07:58:39.285689+00:00 k8scilium1 kernel: audit: type=1326 audit(1756108719.284:492): auid=4294967295 uid=65532 gid=65532 ses=4294967295 subj=cri-containerd.apparmor.d pid=4640 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=35 compat=0 ip=0x4685d7 code=0x7ffc0000
2025-08-25T07:58:39.285731+00:00 k8scilium1 kernel: audit: type=1326 audit(1756108719.284:493): auid=4294967295 uid=65532 gid=65532 ses=4294967295 subj=cri-containerd.apparmor.d pid=4640 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x468ba3 code=0x7ffc0000
2025-08-25T07:59:39.285663+00:00 k8scilium1 kernel: audit: type=1326 audit(1756108779.284:494): auid=4294967295 uid=65532 gid=65532 ses=4294967295 subj=cri-containerd.apparmor.d pid=4640 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=35 compat=0 ip=0x4685d7 code=0x7ffc0000
2025-08-25T07:59:39.285692+00:00 k8scilium1 kernel: audit: type=1326 audit(1756108779.284:495): auid=4294967295 uid=65532 gid=65532 ses=4294967295 subj=cri-containerd.apparmor.d pid=4640 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x468ba3 code=0x7ffc0000
2025-08-25T08:00:39.286492+00:00 k8scilium1 kernel: audit: type=1326 audit(1756108839.285:496): auid=4294967295 uid=65532 gid=65532 ses=4294967295 subj=cri-containerd.apparmor.d pid=4640 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=35 compat=0 ip=0x4685d7 code=0x7ffc0000
2025-08-25T08:00:39.286513+00:00 k8scilium1 kernel: audit: type=1326 audit(1756108839.285:497): auid=4294967295 uid=65532 gid=65532 ses=4294967295 subj=cri-containerd.apparmor.d pid=4640 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x468ba3 code=0x7ffc0000
2025-08-25T08:01:39.286584+00:00 k8scilium1 kernel: audit: type=1326 audit(1756108899.285:498): auid=4294967295 uid=65532 gid=65532 ses=4294967295 subj=cri-containerd.apparmor.d pid=4640 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=35 compat=0 ip=0x4685d7 code=0x7ffc0000
2025-08-25T08:01:39.287884+00:00 k8scilium1 kernel: audit: type=1326 audit(1756108899.286:499): auid=4294967295 uid=65532 gid=65532 ses=4294967295 subj=cri-containerd.apparmor.d pid=4640 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x468ba3 code=0x7ffc0000
2025-08-25T08:02:39.287542+00:00 k8scilium1 kernel: audit: type=1326 audit(1756108959.286:500): auid=4294967295 uid=65532 gid=65532 ses=4294967295 subj=cri-containerd.apparmor.d pid=4640 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=35 compat=0 ip=0x4685d7 code=0x7ffc0000
2025-08-25T08:02:39.288536+00:00 k8scilium1 kernel: audit: type=1326 audit(1756108959.287:501): auid=4294967295 uid=65532 gid=65532 ses=4294967295 subj=cri-containerd.apparmor.d pid=4640 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x468ba3 code=0x7ffc0000
2025-08-25T08:03:39.289216+00:00 k8scilium1 kernel: audit: type=1326 audit(1756109019.287:502): auid=4294967295 uid=65532 gid=65532 ses=4294967295 subj=cri-containerd.apparmor.d pid=4640 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=35 compat=0 ip=0x4685d7 code=0x7ffc0000
2025-08-25T08:03:39.289231+00:00 k8scilium1 kernel: audit: type=1326 audit(1756109019.287:503): auid=4294967295 uid=65532 gid=65532 ses=4294967295 subj=cri-containerd.apparmor.d pid=4640 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x468ba3 code=0x7ffc0000
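Rather than reading the lines one by one, the distinct syscall numbers can be pulled out of the log with a quick grep. Using a couple of trimmed sample lines in place of the real /var/log/syslog:

```shell
# Trimmed sample audit lines standing in for /var/log/syslog.
cat > /tmp/audit-sample.log <<'EOF'
kernel: audit: type=1326 comm="http-echo" exe="/http-echo" syscall=35 compat=0
kernel: audit: type=1326 comm="http-echo" exe="/http-echo" syscall=202 compat=0
kernel: audit: type=1326 comm="http-echo" exe="/http-echo" syscall=35 compat=0
EOF
# Distinct syscall numbers used by the process.
grep -o 'syscall=[0-9]*' /tmp/audit-sample.log | cut -d= -f2 | sort -nu
```

On the real node, the same pipeline against /var/log/syslog (filtered on comm="http-echo") yields the numbers discussed below.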
We can find references to syscalls 35 and 202. To find out which syscall corresponds to which number, we can look at the syscall table available in the Chromium documentation.

| Syscall number | Syscall name |
|---|---|
| 35 | nanosleep |
| 202 | futex |
Another way to identify the syscalls used is the strace tool.
For this to work, we need the PID associated with the container. We can get it through crictl.
vagrant@k8scilium1:~$ sudo crictl ps |grep audit-pod
842fd0188adce 04fa556e62bdd 2 hours ago Running test-container 0 6379e71292e6a audit-pod default
vagrant@k8scilium1:~$
vagrant@k8scilium1:~$ sudo crictl inspect 842fd0188adce |grep pid
"pid": 1
"pid": 4640,
"type": "pid"
Then use strace to get information on the syscalls:
vagrant@k8scilium1:~$ sudo strace -p 4640 -f
strace: Process 4640 attached with 7 threads
[pid 4658] futex(0xc0000da948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 4657] futex(0xc0000da548, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 4655] epoll_pwait(5, <unfinished ...>
[pid 4640] futex(0x86d3a8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 4656] futex(0x89b738, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 4653] restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 4654] futex(0x89b8c0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 4653] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 4653] nanosleep({tv_sec=0, tv_nsec=10000000}, NULL) = 0
[pid 4653] futex(0x86d760, FUTEX_WAIT_PRIVATE, 0, {tv_sec=60, tv_nsec=0}
We can see, as expected, that the recorded syscalls match what we found when parsing the syslog. We can also see some additional syscalls such as epoll_pwait.
So we can create a custom seccomp profile as below.
{
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": [
"SCMP_ARCH_X86_64",
"SCMP_ARCH_X86",
"SCMP_ARCH_X32"
],
"syscalls": [
{
"names": [
"futex",
"epoll_pwait",
"nanosleep"
],
"action": "SCMP_ACT_ALLOW"
}
]
}
Write a pod definition with the corresponding profile.
apiVersion: v1
kind: Pod
metadata:
name: audit-pod-custom
labels:
app: audit-pod-custom
spec:
securityContext:
seccompProfile:
type: Localhost
localhostProfile: profiles/custom.json
containers:
- name: test-container
image: hashicorp/http-echo:1.0
args:
- "-text=just made some syscalls!"
securityContext:
allowPrivilegeEscalation: false
Create the pod… and see it failing 😅
vagrant@k8scilium1:~$ k get pod audit-pod-custom
NAME READY STATUS RESTARTS AGE
audit-pod-custom 0/1 CrashLoopBackOff 5 (64s ago) 3m58s
vagrant@k8scilium1:~$ k describe pod audit-pod-custom
Name: audit-pod-custom
Namespace: default
Priority: 0
Service Account: default
Node: k8scilium1/192.168.56.17
Start Time: Mon, 25 Aug 2025 11:58:12 +0200
Labels: app=audit-pod-custom
Annotations: <none>
Status: Running
SeccompProfile: Localhost
LocalhostProfile: profiles/custom.json
IP: 100.64.0.117
IPs:
IP: 100.64.0.117
Containers:
test-container:
Container ID: containerd://01a40fc001d251ec4d4aeab63f72e401124efe85914fee427c6ff3feb2713af8
Image: hashicorp/http-echo:1.0
Image ID: docker.io/hashicorp/http-echo@sha256:fcb75f691c8b0414d670ae570240cbf95502cc18a9ba57e982ecac589760a186
Port: <none>
Host Port: <none>
Args:
-text=just made some syscalls!
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: StartError
Message: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: seek /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pode0fedf80_62ac_4559_a22f_b4b30e285fc9.slice/cri-containerd-01a40fc001d251ec4d4aeab63f72e401124efe85914fee427c6ff3feb2713af8.scope/cgroup.freeze: no such device: unknown
Exit Code: 128
Started: Thu, 01 Jan 1970 01:00:00 +0100
Finished: Mon, 25 Aug 2025 11:58:13 +0200
Ready: False
Restart Count: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4nwpt (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-4nwpt:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 14s default-scheduler Successfully assigned default/audit-pod-custom to k8scilium1
Warning Failed 13s kubelet Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: seek /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pode0fedf80_62ac_4559_a22f_b4b30e285fc9.slice/cri-containerd-test-container.scope/cgroup.freeze: no such device: unknown
Warning BackOff 12s kubelet Back-off restarting failed container test-container in pod audit-pod-custom_default(e0fedf80-62ac-4559-a22f-b4b30e285fc9)
Normal Pulled 1s (x3 over 14s) kubelet Container image "hashicorp/http-echo:1.0" already present on machine
Normal Created 1s (x3 over 14s) kubelet Created container: test-container
Warning Failed 0s (x2 over 13s) kubelet Error: failed to start containerd task "test-container": cannot start a stopped process: unknown
Referring to the Kubernetes documentation, we can see the proposed fine-grained profile is more like this. If you wonder why the strace analysis did not give us all the syscalls, welcome to the band 😆.
{
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": [
"SCMP_ARCH_X86_64",
"SCMP_ARCH_X86",
"SCMP_ARCH_X32"
],
"syscalls": [
{
"names": [
"accept4",
"epoll_wait",
"pselect6",
"futex",
"madvise",
"epoll_ctl",
"getsockname",
"setsockopt",
"vfork",
"mmap",
"read",
"write",
"close",
"arch_prctl",
"sched_getaffinity",
"munmap",
"brk",
"rt_sigaction",
"rt_sigprocmask",
"sigaltstack",
"gettid",
"clone",
"bind",
"socket",
"openat",
"readlinkat",
"exit_group",
"epoll_create1",
"listen",
"rt_sigreturn",
"sched_yield",
"clock_gettime",
"connect",
"dup2",
"epoll_pwait",
"execve",
"exit",
"fcntl",
"getpid",
"getuid",
"ioctl",
"mprotect",
"nanosleep",
"open",
"poll",
"recvfrom",
"sendto",
"set_tid_address",
"setitimer",
"writev",
"fstatfs",
"getdents64",
"pipe2",
"getrlimit"
],
"action": "SCMP_ACT_ALLOW"
}
]
}
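To see concretely what tracing alone missed, we can diff the strace-derived list against the documented one with comm. A sketch, using trimmed lists for illustration:

```shell
# Syscalls we observed with strace (from the session above), sorted for comm.
printf '%s\n' epoll_pwait futex nanosleep | sort > /tmp/traced.txt
# A trimmed subset of the documented fine-grained profile.
printf '%s\n' accept4 epoll_pwait execve futex nanosleep socket | sort > /tmp/documented.txt
# Lines only in the documented list, i.e. syscalls strace never showed us.
comm -13 /tmp/traced.txt /tmp/documented.txt
```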
A new pod with this profile would look like this
apiVersion: v1
kind: Pod
metadata:
name: audit-pod-custom
labels:
app: audit-pod-custom
spec:
securityContext:
seccompProfile:
type: Localhost
localhostProfile: profiles/custom.json
containers:
- name: test-container
image: hashicorp/http-echo:1.0
args:
- "-text=just made some syscalls!"
securityContext:
allowPrivilegeEscalation: false
Remember that the profile needs to be available under the seccomp path, which is /var/lib/kubelet/seccomp.
This time the pod runs. However, it showed us that it’s far from easy to get a full list of syscalls for a running container.
Last but not least for this part, it could be tempting to get a custom profile through Gen AI. I won’t go through the process of getting results out of ChatGPT. To summarize, we get a Gen-AI-generated profile that looks like this.
Note: I DID NOT test this profile, and I can only recommend verifying that everything is as expected and works fine before using it.
{
"defaultAction": "SCMP_ACT_ERRNO",
"archMap": [
{
"architecture": "SCMP_ARCH_X86_64",
"subArchitectures": [
"SCMP_ARCH_X86",
"SCMP_ARCH_X32"
]
}
],
"syscalls": [
{
"names": [
"accept",
"accept4",
"access",
"brk",
"close",
"dup",
"dup2",
"dup3",
"epoll_create1",
"epoll_ctl",
"epoll_pwait",
"eventfd2",
"execve",
"exit",
"exit_group",
"fstat",
"futex",
"getpid",
"getppid",
"getrandom",
"madvise",
"mmap",
"mprotect",
"munmap",
"nanosleep",
"openat",
"pipe2",
"pread64",
"prlimit64",
"read",
"recvfrom",
"recvmsg",
"rt_sigaction",
"rt_sigprocmask",
"rt_sigreturn",
"sendmsg",
"sendto",
"setitimer",
"setsockopt",
"sigaltstack",
"socket",
"stat",
"write",
"writev"
],
"action": "SCMP_ACT_ALLOW"
}
]
}
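In the spirit of the note above, a minimal pre-flight check for any generated profile is to verify that it is valid JSON and that it denies by default. A sketch against a stand-in file (the real check would target the generated profile itself):

```shell
# Stand-in for a generated profile.
cat > /tmp/genai-profile.json <<'EOF'
{ "defaultAction": "SCMP_ACT_ERRNO", "syscalls": [] }
EOF
# 1. Is it even valid JSON?
python3 -m json.tool /tmp/genai-profile.json > /dev/null && echo "valid JSON"
# 2. Does it deny by default?
grep -q '"defaultAction": "SCMP_ACT_ERRNO"' /tmp/genai-profile.json && echo "denies by default"
```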
But the most important thing, IMHO, are the sources mentioned to get this result:
- The Docker documentation, and all the references to the runtime default profile, which we already mentioned.
- The Kubernetes documentation, and specifically the fine-grained profile that is defined in the samples.
- The syscall2seccomp GitHub repository, which, funnily, mentions a lot of hours spent debugging why strace does not show all the required syscalls.
And last, a very interesting article on 4armed.com that details some exotic ways to upload custom profiles to a cluster. Since this will be helpful for configuring Seccomp on an AKS cluster, we’ll discuss it in the next part.
3. What about Seccomp on AKS
3.1. What we can do through AKS - the normal way
Up until now, we focused on the Seccomp configuration and how it works, but we did not consider an AKS cluster (or any managed Kubernetes cluster, for that matter) where we don’t get access to the nodes.
Out of the box, on an AKS cluster, what can we do?
As a reminder, AKS is composed of node pools, which are actually Virtual Machine Scale Sets, visible in the Azure portal but managed by AKS. This also means that we should not interact with those nodes directly in the portal, but only through the AKS API. The counterpart is also true, by the way: managing nodes from the Kubernetes control plane, while it works, may not persist across scaling events, for example.
However, AKS is still a Kubernetes cluster, and it follows the Kubernetes release rhythm quite closely. So, somehow, we have all the Seccomp features available that we’ve discussed.
There is a documentation page on the topic, but it is not detailed enough from my point of view.
Considering the following deployment, the default runtime container profile works fine on an AKS cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: demoapp
name: demoapp
namespace: demoapp
spec:
replicas: 3
selector:
matchLabels:
app: demoapp
strategy: {}
template:
metadata:
labels:
app: demoapp
spec:
securityContext:
seccompProfile:
type: RuntimeDefault
containers:
- image: nginx
name: nginx
resources: {}
status: {}
What else?
Well, we can consider node customization, described in the documentation and in this community article.
Looking deeper into the cluster and node pool objects, we can find in the ARM reference the seccompDefault parameter inside the kubelet configuration.
It gets interesting if we cross-reference the previously mentioned documentation:
- The Seccomp for AKS page mentions the option to use node customization to enable the default runtime container profile.
- The node customization reference provides samples on how to perform the customization.
First things first, we need some prerequisites. The KubeletDefaultSeccompProfilePreview feature should be registered on the AKS provider.
We can check this with the following command:
yumemaru@azure~$ az feature show --namespace "Microsoft.ContainerService" --name "KubeletDefaultSeccompProfilePreview"
{
"id": "/subscriptions/49816259-cb52-4fe5-8d6f-9358ad94332c/providers/Microsoft.Features/providers/Microsoft.ContainerService/features/KubeletDefaultSeccompProfilePreview",
"name": "Microsoft.ContainerService/KubeletDefaultSeccompProfilePreview",
"properties": {
"state": "Registered"
},
"type": "Microsoft.Features/providers/features"
}
If it shows Registered, everything is fine; if not, it must be registered first. Again, the AKS Seccomp doc details the required steps.
Once the feature is activated, the next step is the customization of the nodes. It works with the --kubelet-config argument and a JSON file which is read when creating either the cluster or the node pool.
We will test this with a node pool here. The command to create the node pool would look like this:
az aks nodepool add --name <nodepool_name> --cluster-name myAKSCluster --resource-group <resourcegroup_name> --kubelet-config ./linuxkubeletconfig.json
And the JSON file passed through the CLI looks like this:
{
"seccompDefault": "RuntimeDefault"
}
After provisioning the node pool, we can check its kubelet configuration, versus the other node pool(s) for which we did not specify the seccomp parameter.
yumemaru@azure~$ az aks nodepool list --cluster-name <aks_cluster_name> -g <aks_rg_name> -o json | jq '.[].name,.[].kubeletConfig.seccompDefault'
"aksnp0lab"
"npuser1"
"npseccomp"
null
null
"RuntimeDefault"
The AKS documentation mentions that afterwards we should connect to the node(s) and check the seccomp configuration, but the steps are not specified (I did say it was not detailed enough for me…). Instead, we will create two different pods, with node affinities. For this test, I added a specific label to my nodes through the az aks nodepool update command and the --labels argument. Specifically, I set the label seccompDefaultEnabled=true/false.
yumemaru@azure$ az aks nodepool update --cluster-name <aks_cluster_name> -g <aks_rg_name> --name aksnp0lab --labels seccompDefaultEnabled=false
The nodes’ labels can be displayed with a kubectl command:
yumemaru@azure~$ k get nodes -o json | jq .items[].metadata.labels
{
"agentpool": "aksnp0lab",
======truncated======
"kubernetes.io/hostname": "aks-aksnp0lab-61429621-vmss000000",
"kubernetes.io/os": "linux",
"node.kubernetes.io/instance-type": "standard_d2s_v4",
"seccompDefaultEnabled": "false",
"storageprofile": "managed",
"storagetier": "Premium_LRS",
"topology.disk.csi.azure.com/zone": "",
"topology.kubernetes.io/region": "francecentral",
"topology.kubernetes.io/zone": "0"
}
{
"agentpool": "aksnp0lab",
======truncated======
"kubernetes.io/hostname": "aks-aksnp0lab-61429621-vmss000001",
"kubernetes.io/os": "linux",
"node.kubernetes.io/instance-type": "standard_d2s_v4",
"seccompDefaultEnabled": "false",
"storageprofile": "managed",
"storagetier": "Premium_LRS",
"topology.disk.csi.azure.com/zone": "",
"topology.kubernetes.io/region": "francecentral",
"topology.kubernetes.io/zone": "0"
}
{
"agentpool": "aksnp0lab",
======truncated======
"kubernetes.io/hostname": "aks-aksnp0lab-61429621-vmss000002",
"kubernetes.io/os": "linux",
"node.kubernetes.io/instance-type": "standard_d2s_v4",
"seccompDefaultEnabled": "false",
"storageprofile": "managed",
"storagetier": "Premium_LRS",
"topology.disk.csi.azure.com/zone": "",
"topology.kubernetes.io/region": "francecentral",
"topology.kubernetes.io/zone": "0"
}
{
"agentpool": "npseccomp",
======truncated======
"kubernetes.io/hostname": "aks-npseccomp-10613327-vmss000001",
"kubernetes.io/os": "linux",
"node.kubernetes.io/instance-type": "Standard_D4ds_v5",
"seccompDefaultEnabled": "true",
"topology.disk.csi.azure.com/zone": "francecentral-2",
"topology.kubernetes.io/region": "francecentral",
"topology.kubernetes.io/zone": "francecentral-2"
}
{
"agentpool": "npuser1",
======truncated======
"kubernetes.io/hostname": "aks-npuser1-78178449-vmss000000",
"kubernetes.io/os": "linux",
"node.kubernetes.io/instance-type": "Standard_D2S_v4",
"seccompDefaultEnabled": "false",
"storageprofile": "managed",
"storagetier": "Premium_LRS",
"topology.disk.csi.azure.com/zone": "francecentral-2",
"topology.kubernetes.io/region": "francecentral",
"topology.kubernetes.io/zone": "francecentral-2"
}
{
"agentpool": "npuser1",
======truncated======
"kubernetes.io/hostname": "aks-npuser1-78178449-vmss000002",
"kubernetes.io/os": "linux",
"node.kubernetes.io/instance-type": "Standard_D2S_v4",
"seccompDefaultEnabled": "false",
"storageprofile": "managed",
"storagetier": "Premium_LRS",
"topology.disk.csi.azure.com/zone": "francecentral-1",
"topology.kubernetes.io/region": "francecentral",
"topology.kubernetes.io/zone": "francecentral-1"
}
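As a side note (this is just a convenience, not part of the original session), the full label dump can be condensed with a jq filter. The snippet below demonstrates the filter on an inline sample mimicking the `kubectl get nodes -o json` structure; in practice you would pipe the real command output into the same filter:

```shell
# Inline sample mimicking the structure of: kubectl get nodes -o json
cat > /tmp/nodes.json <<'EOF'
{
  "items": [
    {"metadata": {"labels": {"kubernetes.io/hostname": "aks-aksnp0lab-61429621-vmss000000", "seccompDefaultEnabled": "false"}}},
    {"metadata": {"labels": {"kubernetes.io/hostname": "aks-npseccomp-10613327-vmss000001", "seccompDefaultEnabled": "true"}}}
  ]
}
EOF
# One line per node: hostname, then the seccompDefaultEnabled label
jq -r '.items[].metadata.labels | "\(."kubernetes.io/hostname")\t\(.seccompDefaultEnabled)"' /tmp/nodes.json
```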
Finally, to test the differences, we deploy pods with the amicontained
image, a useful container image for Seccomp analysis that I found while browsing KodeKloud notes.
apiVersion: v1
kind: Pod
metadata:
labels:
run: amicontained
name: amicontained
spec:
# securityContext:
# seccompProfile:
# type: Unconfined #RuntimeDefault #
containers:
- args:
- amicontained
image: yumemaru1979/amicontained
name: amicontained
securityContext:
allowPrivilegeEscalation: false
---
apiVersion: v1
kind: Pod
metadata:
labels:
run: amicontained-nodeselector
name: amicontained-nodeselector
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: seccompDefaultEnabled
operator: In
values:
- "true"
# securityContext:
# seccompProfile:
# type: Unconfined #RuntimeDefault #
containers:
- args:
- amicontained
image: yumemaru1979/amicontained
name: amicontained-nodeselector
securityContext:
allowPrivilegeEscalation: false
Creating the pods, we can check that the expected nodes are used.
We can also see that, because we did not specify spec.securityContext.seccompProfile
, the configuration of a seccomp profile is not reflected in the pod configuration output.
yumemaru@azure~$ k get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
amicontained 1/1 Running 1 (3m35s ago) 3m36s 100.64.3.81 aks-npuser1-78178449-vmss000002 <none> <none>
amicontained-nodeselector 1/1 Running 0 12s 100.64.6.122 aks-npseccomp-10613327-vmss000001 <none> <none>
yumemaru@azure~$ k get pod -o json |jq .items[].spec.securityContext
{}
{}
However, when we check the logs of the pods, we can see the Seccomp value set either to disabled or filtering, depending on the node pool configuration. We can also see the difference in the number of blocked syscalls between the nodes.
yumemaru@azure~$ k logs amicontained
Container Runtime: not-found
Has Namespaces:
pid: true
user: false
AppArmor Profile: cri-containerd.apparmor.d (enforce)
Capabilities:
BOUNDING -> chown dac_override fowner fsetid kill setgid setuid setpcap net_bind_service net_raw sys_chroot mknod audit_write setfcap
Seccomp: disabled
Blocked Syscalls (21):
SYSLOG SETPGID SETSID VHANGUP PIVOT_ROOT ACCT SETTIMEOFDAY SWAPON REBOOT SETHOSTNAME SETDOMAINNAME INIT_MODULE DELETE_MODULE KEXEC_LOAD PERF_EVENT_OPEN FANOTIFY_INIT OPEN_BY_HANDLE_AT FINIT_MODULE KEXEC_FILE_LOAD BPF USERFAULTFD
Looking for Docker.sock
yumemaru@azure~$ k logs amicontained-nodeselector
Container Runtime: not-found
Has Namespaces:
pid: true
user: false
AppArmor Profile: cri-containerd.apparmor.d (enforce)
Capabilities:
BOUNDING -> chown dac_override fowner fsetid kill setgid setuid setpcap net_bind_service net_raw sys_chroot mknod audit_write setfcap
Seccomp: filtering
Blocked Syscalls (55):
MSGRCV SYSLOG SETPGID SETSID USELIB USTAT SYSFS VHANGUP PIVOT_ROOT _SYSCTL ACCT SETTIMEOFDAY MOUNT UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME IOPL IOPERM CREATE_MODULE INIT_MODULE DELETE_MODULE GET_KERNEL_SYMS QUERY_MODULE QUOTACTL NFSSERVCTL GETPMSG PUTPMSG AFS_SYSCALL TUXCALL SECURITY LOOKUP_DCOOKIE CLOCK_SETTIME VSERVER MBIND SET_MEMPOLICY GET_MEMPOLICY KEXEC_LOAD ADD_KEY REQUEST_KEY KEYCTL MIGRATE_PAGES UNSHARE MOVE_PAGES PERF_EVENT_OPEN FANOTIFY_INIT OPEN_BY_HANDLE_AT SETNS KCMP FINIT_MODULE KEXEC_FILE_LOAD BPF USERFAULTFD
Looking for Docker.sock
So we have access to the default seccomp profile, either through the pod configuration, or by setting the parameter on the node pools. The second option is quite interesting to enforce a default security baseline at the node level rather than the pod level.
Now what about the use of custom profiles?
3.2. What we can do on AKS - the exotic way
Currently, there are only 2 options for the seccomp configuration at the node level: either we don’t set anything and we get the unconfined profile, or we set the runtime default profile, which provides a minimal security baseline.
As we’ve seen in the previous part, the use of custom profiles requires access to the node, to add the profile definitions in a specific folder.
And that’s where the trouble starts, because we’re not really supposed to access the nodes, and it implies that we access each node every time a scale-out event occurs.
If we take a scenario with a node pool without autoscaling, we can use the kubectl node-shell
command, as specified in the documentation, to access the node and add the seccomp profile file in the /var/lib/kubelet/seccomp
folder.
yumemaru@azure~$ k node-shell aks-npseccomp-10613327-vmss000001
spawning "nsenter-j91j0e" on "aks-npseccomp-10613327-vmss000001"
If you don't see a command prompt, try pressing enter.
root@aks-npseccomp-10613327-vmss000001:/#
root@aks-npseccomp-10613327-vmss000001:/# mkdir -p /var/lib/kubelet/seccomp/profiles
root@aks-npseccomp-10613327-vmss000001:/# vim /var/lib/kubelet/seccomp/profiles/audit.json
{
"defaultAction": "SCMP_ACT_LOG"
}
Then we can create a pod using this profile, from another terminal not connected to the node.
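The manifest for this audit-pod is not shown in the session; a minimal sketch, reconstructed from the describe output that follows (the image, args, and profile path all come from there), could look like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: audit-pod
  labels:
    app: audit-pod
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      # Path relative to /var/lib/kubelet/seccomp on the node
      localhostProfile: profiles/audit.json
  containers:
  - name: test-container
    image: hashicorp/http-echo:1.0
    args:
    - "-text=just made some syscalls!"
```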
yumemaru@azure~$ k get pod audit-pod
NAME READY STATUS RESTARTS AGE
audit-pod 1/1 Running 0 7m24s
yumemaru@azure~$ k describe pod audit-pod
Name: audit-pod
Namespace: default
Priority: 0
Service Account: default
Node: aks-npseccomp-10613327-vmss000001/172.21.17.75
Start Time: Tue, 26 Aug 2025 15:43:31 +0200
Labels: app=audit-pod
Annotations: <none>
Status: Running
SeccompProfile: Localhost
LocalhostProfile: profiles/audit.json
IP: 100.64.6.129
IPs:
IP: 100.64.6.129
Containers:
test-container:
Container ID: containerd://c378a2d81cb1b4dc17ab06791114888317fc48fb7169bfd858c72f99b154d0a3
Image: hashicorp/http-echo:1.0
Image ID: docker.io/hashicorp/http-echo@sha256:fcb75f691c8b0414d670ae570240cbf95502cc18a9ba57e982ecac589760a186
Port: <none>
Host Port: <none>
Args:
-text=just made some syscalls!
State: Running
Started: Tue, 26 Aug 2025 15:43:34 +0200
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pnc8t (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-pnc8t:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 13s default-scheduler Successfully assigned default/audit-pod to aks-npseccomp-10613327-vmss000001
Normal Pulling 12s kubelet Pulling image "hashicorp/http-echo:1.0"
Normal Pulled 10s kubelet Successfully pulled image "hashicorp/http-echo:1.0" in 2.344s (2.344s including waiting). Image size: 4631705 bytes.
Normal Created 10s kubelet Created container: test-container
Normal Started 10s kubelet Started container test-container
Going back into the node’s shell, we can check the syslog for the syscall audit logs. On x86_64 (arch=c000003e in the log), syscall 35 is nanosleep and syscall 202 is futex.
root@aks-npseccomp-10613327-vmss000001:/# tail /var/log/syslog | grep 'http-echo'
Aug 26 13:45:34 aks-npseccomp-10613327-vmss000001 kernel: [18362.129539] audit: type=1326 audit(1756215934.835:338): auid=4294967295 uid=65532 gid=65532 ses=4294967295 subj=cri-containerd.apparmor.d pid=267486 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=35 compat=0 ip=0x4685d7 code=0x7ffc0000
Aug 26 13:45:34 aks-npseccomp-10613327-vmss000001 kernel: [18362.129613] audit: type=1326 audit(1756215934.835:339): auid=4294967295 uid=65532 gid=65532 ses=4294967295 subj=cri-containerd.apparmor.d pid=267486 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x468ba3 code=0x7ffc0000
root@aks-npseccomp-10613327-vmss000001:/#
And that works. But that’s not very practical. So how could we avoid manually creating the profiles on each node?
Time to refer back to 4armed.com. It leverages an init container to copy the profile onto the node.
From a custom seccomp profile file, we create a secret.
yumemaru@azure~$ k create secret generic customseccompprofilesecret --from-file <path_to_profile> --dry-run=client -o yaml > <path_to_yaml_file>
We then use this secret as a volume in a pod which also mounts the /var/lib/kubelet
folder from the host. The init container uses a busybox image and runs the mkdir -p /host/seccomp && cp /seccomp/*.json /host/seccomp/
command.
apiVersion: v1
data:
customprofile.json: ewogIC=====truncated=====BdCn0K
kind: Secret
metadata:
creationTimestamp: null
name: customseccompprofilesecret
---
apiVersion: v1
kind: Pod
metadata:
name: nginx
labels:
app: nginx
spec:
volumes:
- name: hostkubelet
hostPath:
path: /var/lib/kubelet
type: Directory
- name: seccomp-profiles
secret:
secretName: customseccompprofilesecret
- name: localvol
emptyDir: {}
initContainers:
- name: seccomp
image: busybox
volumeMounts:
- name: hostkubelet
mountPath: /host
- name: seccomp-profiles
mountPath: /seccomp
- name: localvol
mountPath: /local
command:
- "sh"
- "-c"
- "mkdir -p /host/seccomp && cp /seccomp/*.json /host/seccomp/; test -f /host/seccomp/customprofile.json && echo 'customprofile.json exists.' > /local/checkfile"
containers:
- name: web
image: nginx
securityContext:
seccompProfile:
type: Localhost
localhostProfile: customprofile.json
volumeMounts:
- name: localvol
mountPath: /local
livenessProbe:
exec:
command:
- cat
- /local/checkfile
resources: {}
We can check that everything works.
yumemaru@azure~$ k get pod nginx -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 0 3m31s 100.64.4.74 aks-npuser1-78178449-vmss000000 <none> <none>
yumemaru@azure~$ k describe pod nginx
Name: nginx
Namespace: default
Priority: 0
Service Account: default
Node: aks-npuser1-78178449-vmss000000/172.21.17.72
Start Time: Tue, 26 Aug 2025 16:34:15 +0200
Labels: app=nginx
Annotations: <none>
Status: Running
IP: 100.64.4.74
IPs:
IP: 100.64.4.74
Init Containers:
seccomp:
Container ID: containerd://0b04053cb97a138396b380d1e5bdbfb4d55c87939464fe16bef8399095078186
Image: busybox
Image ID: docker.io/library/busybox@sha256:ab33eacc8251e3807b85bb6dba570e4698c3998eca6f0fc2ccb60575a563ea74
Port: <none>
Host Port: <none>
Command:
sh
-c
mkdir -p /host/seccomp && cp /seccomp/*.json /host/seccomp/
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 26 Aug 2025 16:34:18 +0200
Finished: Tue, 26 Aug 2025 16:34:18 +0200
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/host from hostkubelet (rw)
/seccomp from seccomp-profiles (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ckk5p (ro)
Containers:
web:
Container ID: containerd://0844ee9181cbd7cff4ad5aa7db614df8a4c4d50a40795969cfc019d9d4a250c2
Image: nginx
Image ID: docker.io/library/nginx@sha256:33e0bbc7ca9ecf108140af6288c7c9d1ecc77548cbfd3952fd8466a75edefe57
Port: <none>
Host Port: <none>
SeccompProfile: Localhost
LocalhostProfile: customprofile.json
State: Running
Started: Tue, 26 Aug 2025 16:34:19 +0200
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ckk5p (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
hostkubelet:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet
HostPathType: Directory
seccomp-profiles:
Type: Secret (a volume populated by a Secret)
SecretName: customseccompprofilesecret
Optional: false
kube-api-access-ckk5p:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 34s default-scheduler Successfully assigned default/nginx to aks-npuser1-78178449-vmss000000
Normal Pulling 34s kubelet Pulling image "busybox"
Normal Pulled 31s kubelet Successfully pulled image "busybox" in 2.501s (2.501s including waiting). Image size: 2223685 bytes.
Normal Created 31s kubelet Created container: seccomp
Normal Started 31s kubelet Started container seccomp
Normal Pulling 31s kubelet Pulling image "nginx"
Normal Pulled 30s kubelet Successfully pulled image "nginx" in 737ms (737ms including waiting). Image size: 72324501 bytes.
Normal Created 30s kubelet Created container: web
Normal Started 30s kubelet Started container web
Connecting to the node, we can see that the profile is present as expected.
yumemaru@azure~$ k node-shell aks-npuser1-78178449-vmss000000
spawning "nsenter-oc2wnq" on "aks-npuser1-78178449-vmss000000"
If you don't see a command prompt, try pressing enter.
root@aks-npuser1-78178449-vmss000000:/# ls /var/lib/kubelet/seccomp/
customprofile.json
root@aks-npuser1-78178449-vmss000000:/# cat /var/lib/kubelet/seccomp/customprofile.json
{
"defaultAction": "SCMP_ACT_LOG",
"archMap": [
{
"architecture": "SCMP_ARCH_X86_64",
"subArchitectures": [
"SCMP_ARCH_X86",
"SCMP_ARCH_X32"
]
}
],
"syscalls": [
{
"names": [
"accept",
"accept4",
"access",
"brk",
"close",
"dup",
"dup2",
"dup3",
"epoll_create1",
"epoll_ctl",
"epoll_pwait",
"eventfd2",
"execve",
"exit",
"exit_group",
"fstat",
"futex",
"getpid",
"getppid",
"getrandom",
"madvise",
"mmap",
"mprotect",
"munmap",
"nanosleep",
"openat",
"pipe2",
"pread64",
"prlimit64",
"read",
"recvfrom",
"recvmsg",
"rt_sigaction",
"rt_sigprocmask",
"rt_sigreturn",
"sendmsg",
"sendto",
"setitimer",
"setsockopt",
"sigaltstack",
"socket",
"stat",
"write",
"writev"
],
"action": "SCMP_ACT_ALLOW"
}
]
}
That’s that.
Additionally, we could use the same concept but, instead of an init container, which impacts the pod start time, use a DaemonSet that copies the profile on each node. A YAML definition would look like this.
apiVersion: v1
data:
customprofileds.json: ewogIC=====truncated=====BdCn0K
kind: Secret
metadata:
name: anothercustomseccompprofilesecret
namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: customseccompprofileset
namespace: kube-system
spec:
selector:
matchLabels:
name: seccomp-ds
template:
metadata:
labels:
name: seccomp-ds
spec:
tolerations:
# these tolerations are to have the daemonset runnable on control plane nodes
# remove them if your control plane nodes should not run pods
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule
- key: node-role.kubernetes.io/master
operator: Exists
effect: NoSchedule
containers:
- name: copycustomseccomp
image: busybox
resources: {}
volumeMounts:
- name: hostkubelet
mountPath: /host
- name: seccomp-profiles
mountPath: /seccomp
- name: localvol
mountPath: /local
command:
- "sh"
- "-c"
- "mkdir -p /host/seccomp && cp /seccomp/*.json /host/seccomp/; test -f /host/seccomp/customprofileds.json && echo 'customprofileds.json exists.' > /local/checkfile; sleep 15"
terminationGracePeriodSeconds: 30
volumes:
- name: hostkubelet
hostPath:
path: /var/lib/kubelet
type: Directory
- name: seccomp-profiles
secret:
secretName: anothercustomseccompprofilesecret
- name: localvol
emptyDir: {}
---
apiVersion: v1
kind: Pod
metadata:
name: nginx-testds
labels:
    app: nginx-testds
spec:
volumes:
- name: hostkubelet
hostPath:
path: /var/lib/kubelet
type: Directory
containers:
- name: web
image: nginx
securityContext:
seccompProfile:
type: Localhost
localhostProfile: customprofileds.json
volumeMounts:
- name: hostkubelet
mountPath: /host
readOnly: true
readinessProbe:
exec:
command:
- "sh"
- "-c"
- test -f /host/seccomp/customprofileds.json && echo 'customprofileds.json exists.'
initialDelaySeconds: 20
periodSeconds: 5
resources: {}
One might question the relevance of mounting folders from the host to achieve this security requirement though 🤫
Note: While those methods work on a cluster without access to the nodes, the results would be the same with a self-managed cluster on which we can access the nodes.
To avoid that, we could use the dedicated operator to manage seccomp profiles.
The documentation can be found here. Because it deserves a more thorough study, we are only mentioning it today, and may come back to it another day.
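As a quick glimpse (a sketch only — the API group and version come from the security-profiles-operator project and may evolve, and the profile name is made up for illustration), a profile managed by the operator is a plain Kubernetes object rather than a file on the node:

```yaml
apiVersion: security-profiles-operator.x-k8s.io/v1beta1
kind: SeccompProfile
metadata:
  name: audit-profile        # hypothetical name
  namespace: default
spec:
  # Same semantics as the raw JSON profiles used earlier in this article
  defaultAction: SCMP_ACT_LOG
```

The operator takes care of distributing the profile to the nodes, which removes the need for the hostPath tricks above.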
Ok time to wrap up!
4. Summary
Soooooo!
It’s been an eventful journey right?
To summarize:
- Seccomp is a powerful (while not new) tool to give us a measure of security on Kubernetes
- As expected for a security feature, it implies that we know a bit about Kubernetes and the potential limitations of the environment.
- There is room for customisation, but one would say that the minimum requirement should be to at least enforce the runtime’s default profile.
- Because AKS also needs security, and people at Microsoft think about our needs, we found that there is a way to enforce this default profile in the Azure environment.
- Custom profiles are a pain, both to create and to manage, specifically on multi-node clusters, and even more on cloud-managed clusters. While we can find ways to push profiles to clusters, creating profiles is far from easy.
And that’s all!
I hope it was useful. Now I’ll go back to studying for the CKS, which is far from done 🤪
Until then…