AKS Networking considerations - part 2
Hi all!
It’s here: part 2 of the AKS network considerations series. Remember, in part 1 we talked mostly about the control plane, which requires some specific Azure PaaS network concepts.
Today we move on to the worker plane, mostly, but also to some other considerations about the use of subnets for AKS.
Let’s get started
1. Network considerations for the worker plane
When talking about Kubernetes, we have to consider how pods talk to each other, or to the outside world.
That’s where the CNI topic comes in. CNI stands for Container Network Interface, and it is the part that allows different network providers to develop their solution and plug it into Kubernetes. There are a lot of CNI solutions in the Kubernetes ecosystem, but from an AKS point of view, for a long time, we had only 2 options:
- Kubenet
- Azure CNI
Let’s start with Kubenet
1.1. Kubenet
Kubenet is the default network configuration for the worker plane, and I guess that’s because there is less planning to do from a network perspective. AKS with kubenet relies on a NAT-based technology: pods have their own IP range, hidden from the virtual network range. The positive impact of this is that we only have to plan for the nodes’ IP consumption. We can, for instance, plan a /26 for the AKS subnet and have up to 59 nodes for our cluster.
What should be planned are the ranges for the pods and for the Kubernetes services, but there are default values for both ranges:
Pod CIDR default value | Service CIDR default value |
---|---|
10.244.0.0/16 | 10.0.0.0/16 |
A cluster with kubenet will display kubenet as the network plugin value, along with a value for both the pod CIDR and the service CIDR:
Note that those ranges are private from an Azure standpoint, which means that we can re-use them on other clusters.
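For reference, here is a hedged sketch of creating a kubenet cluster with explicit ranges instead of the defaults. The VNet, subnet and cluster names below are illustrative, not taken from this demo:
# Node subnet sized for the nodes only: a /26 leaves room for up to 59 nodes
az network vnet create -n aks-vnet1 -g aksntwdemo --address-prefixes 172.16.0.0/16
az network vnet subnet create -n aks-subnet -g aksntwdemo --vnet-name aks-vnet1 --address-prefixes 172.16.0.0/26
# kubenet cluster with explicit pod and service ranges (the defaults apply if omitted)
az aks create -n aks-kubenetdemo -g aksntwdemo \
  --network-plugin kubenet \
  --vnet-subnet-id <aks-subnet resource id> \
  --pod-cidr 10.244.0.0/16 \
  --service-cidr 10.0.0.0/16 \
  --dns-service-ip 10.0.0.10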
Now let’s talk about limitations. Not all AKS features are available with kubenet, even if it has gotten better with time. First, it is not possible to share the AKS subnet between multiple clusters with kubenet: the rule is one cluster per subnet. Second, and maybe the most important point, there is added latency due to the additional hop for flows reaching a workload inside the cluster. This additional latency is not easy to evaluate, but it seems significant enough that Microsoft does not recommend kubenet for production environments.
Looking at the network flows, those go first through one of the nodes and then to the pods. To make this work, a route table with user defined routes (UDR) associated with the AKS subnet is required. This route table is managed and updated by the control plane and defines next hops for chunks of the pod CIDR, associating a /24 from the range with each node:
Additional nodes would be declared with additional routes.
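To see this route table from the CLI, we can look in the managed node resource group. A hedged sketch, assuming the usual MC_<resource group>_<cluster>_<region> naming for that resource group:
# The route table created by AKS for kubenet lives in the managed node resource group
az network route-table list -g <MC_... resource group> -o table
# Each route maps a /24 chunk of the pod CIDR to a node's private IP as next hop
az network route-table route list -g <MC_... resource group> --route-table-name <route table name> -o table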
Let’s also note that kubenet is not compatible with Windows node pools. IMHO, Windows-based containers are not necessarily a good idea, but that’s just my 2 cents on the topic.
If we want better performance or Windows containers, then we need to consider Azure CNI.
1.2. Azure CNI
A cluster configured with Azure CNI will NOT display a range for the pod CIDR.
That’s because of the nature of this CNI, which offers a tighter integration with the virtual network.
Instead of NAT, Azure CNI relies on a bridge so that pods are directly visible inside the VNet. There is no NAT, so no additional hop, which means performance similar to VM-to-VM communication. It’s also possible to have Windows node pools or virtual nodes.
On the downside, we have to plan the virtual network range, since the pods also consume IPs. As described in the documentation, we need to plan for the extra nodes created at upgrade time. Also, the maximum number of pods per node has to be taken into consideration. For an Azure CNI cluster, the default value is 30 pods per node.
The formula (number of nodes + 1) + ((number of nodes + 1) * maximum pods per node that you configure) from the documentation can be used to plan the network range required on the subnet hosting the cluster.
However, this formula does not take into account the maxSurge parameter, which defines how many nodes are created at upgrade time. If maxSurge is configured to 1, then the formula is valid. If, on the other hand, we have a maxSurge configured to 33% on a 9-node cluster, then we would create 3 additional nodes for the upgrade.
So the formula would become something like (number of nodes + additional nodes created for the upgrade) + ((number of nodes + additional nodes created for the upgrade) * maximum pods per node that you configure). Pushing this example to the end with the 9-node cluster and 30 pods per node gives us a requirement of (9+3) + ((9+3) * 30), meaning 372 IPs.
That’s a lot of IPs for a not-so-big cluster.
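Both parameters of that calculation are set at the node pool level. A hedged sketch with the CLI (the cluster and node pool names are illustrative):
# Maximum pods per node is fixed when the node pool is created
az aks nodepool add --cluster-name aks-cnidemo -g aksntwdemo -n np01 --node-count 9 --max-pods 30
# maxSurge drives how many extra nodes are created during an upgrade
az aks nodepool update --cluster-name aks-cnidemo -g aksntwdemo -n np01 --max-surge 33%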
We can see that the cluster reserves the IPs in the VNet if we check the connected devices:
There’s an option to mitigate IP exhaustion described in the documentation: Azure CNI with dynamic IP allocation. In this case, the pods use a different subnet than the nodes. The subnets should be created before the cluster, and afterward the pod subnet appears as a delegated subnet:
yumemaru@azure:~$ az network vnet create -n aks-vnet2 -g aksntwdemo
{
"newVNet": {
"addressSpace": {
"addressPrefixes": [
"10.0.0.0/16"
]
},
"enableDdosProtection": false,
"etag": "W/\"e8b27507-8277-4c4f-8488-5d75153d6de3\"",
"id": "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/aksntwdemo/providers/Microsoft.Network/virtualNetworks/aks-vnet2",
"location": "eastus",
"name": "aks-vnet2",
"provisioningState": "Succeeded",
"resourceGroup": "aksntwdemo",
"resourceGuid": "3d7afbe5-5605-4411-9f60-749d0dc85aaf",
"subnets": [],
"type": "Microsoft.Network/virtualNetworks",
"virtualNetworkPeerings": []
}
}
yumemaru@azure:~$ az network vnet subnet create -n subnet-aks -g aksntwdemo --address-prefixes 10.0.0.0/24 --vnet-name aks-vnet2
{
"addressPrefix": "10.0.0.0/24",
"delegations": [],
"etag": "W/\"9c35d528-88ff-427c-b95a-771b915210f0\"",
"id": "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/aksntwdemo/providers/Microsoft.Network/virtualNetworks/aks-vnet2/subnets/subnet-aks",
"name": "subnet-aks",
"privateEndpointNetworkPolicies": "Disabled",
"privateLinkServiceNetworkPolicies": "Enabled",
"provisioningState": "Succeeded",
"resourceGroup": "aksntwdemo",
"type": "Microsoft.Network/virtualNetworks/subnets"
}
yumemaru@azure:~$ az network vnet subnet create -n subnet-pods -g aksntwdemo --address-prefixes 10.0.2.0/23 --vnet-name aks-vnet2
{
"addressPrefix": "10.0.2.0/23",
"delegations": [],
"etag": "W/\"51ede150-fe87-4228-b766-4b51ff810949\"",
"id": "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/aksntwdemo/providers/Microsoft.Network/virtualNetworks/aks-vnet2/subnets/subnet-pods",
"name": "subnet-pods",
"privateEndpointNetworkPolicies": "Disabled",
"privateLinkServiceNetworkPolicies": "Enabled",
"provisioningState": "Succeeded",
"resourceGroup": "aksntwdemo",
"type": "Microsoft.Network/virtualNetworks/subnets"
}
yumemaru@azure:~$ az aks create -n aks-ntwdemo10 -g aksntwdemo --vnet-subnet-id /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/aksntwdemo/providers/Microsoft.Network/virtualNetworks/aks-vnet2/subnets/subnet-aks --pod-subnet-id /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/aksntwdemo/providers/Microsoft.Network/virtualNetworks/aks-vnet2/subnets/subnet-pods --network-plugin azure --service-cidr 10.1.0.0/16 --dns-service-ip 10.1.0.10
yumemaru@azure:~$ az network vnet subnet show --vnet-name aks-vnet2 --name subnet-pods -g aksntwdemo | jq .delegations
[
{
"actions": [
"Microsoft.Network/virtualNetworks/subnets/join/action"
],
"etag": "W/\"acadaa68-1a02-4481-aae7-08d468a6ddd3\"",
"id": "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/aksntwdemo/providers/Microsoft.Network/virtualNetworks/aks-vnet2/subnets/subnet-pods/delegations/aks-delegation",
"name": "aks-delegation",
"provisioningState": "Succeeded",
"resourceGroup": "aksntwdemo",
"serviceName": "Microsoft.ContainerService/managedClusters",
"type": "Microsoft.Network/virtualNetworks/subnets/delegations"
}
]
Still, an overlay would be good, so are there other options available?
1.3. Azure CNI with overlay
That’s the part that I prefer with AKS: the product team really seems to follow Microsoft customers’ needs. There is now a way to avoid IP exhaustion with Azure CNI overlay. This is kind of a doped kubenet; we could say that the idea is to get the best of both Azure CNI and kubenet. The documentation describes that each node gets assigned a /24 CIDR, much like in kubenet. But there are fewer limitations: more nodes (1000 vs 400), performance on par with Azure CNI, no need for UDRs, and compatibility with both Windows and Linux nodes. Because the overlay traffic isn’t encapsulated, we need to be careful with NSG filtering on the subnet, which must allow the following traffic, in addition to the egress traffic requirements (a sketch of matching NSG rules follows the list):
- Traffic from the node CIDR to the node CIDR on all ports and protocols
- Traffic from the node CIDR to the pod CIDR on all ports and protocols (required for service traffic routing)
- Traffic from the pod CIDR to the pod CIDR on all ports and protocols (required for pod to pod and pod to service traffic, including DNS)
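As a hedged sketch, the matching NSG rules could look like the following, assuming an NSG named aks-overlay-nsg attached to the node subnet, 172.16.0.0/24 as the node CIDR and 192.168.0.0/16 as the overlay pod CIDR (all of these names and ranges are illustrative):
# Node CIDR to node CIDR, all ports and protocols
az network nsg rule create -g aksntwdemo --nsg-name aks-overlay-nsg -n allow-node-to-node \
  --priority 100 --direction Inbound --access Allow --protocol '*' \
  --source-address-prefixes 172.16.0.0/24 --destination-address-prefixes 172.16.0.0/24 \
  --source-port-ranges '*' --destination-port-ranges '*'
# Node CIDR to pod CIDR, required for service traffic routing
az network nsg rule create -g aksntwdemo --nsg-name aks-overlay-nsg -n allow-node-to-pod \
  --priority 110 --direction Inbound --access Allow --protocol '*' \
  --source-address-prefixes 172.16.0.0/24 --destination-address-prefixes 192.168.0.0/16 \
  --source-port-ranges '*' --destination-port-ranges '*'
# Pod CIDR to pod CIDR, required for pod to pod and pod to service traffic, including DNS
az network nsg rule create -g aksntwdemo --nsg-name aks-overlay-nsg -n allow-pod-to-pod \
  --priority 120 --direction Inbound --access Allow --protocol '*' \
  --source-address-prefixes 192.168.0.0/16 --destination-address-prefixes 192.168.0.0/16 \
  --source-port-ranges '*' --destination-port-ranges '*'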
A cluster with Azure CNI overlay will display Azure CNI as the network plugin, but also a pod CIDR:
There are some limitations; one to note is that it is not possible to use Application Gateway as the ingress controller (AGIC). We’ll keep that in mind, but since AGIC is evolving toward a Gateway API option, there’s a chance that this will also change in the near future. We’ll also note that dual-stack IPv4/IPv6 is not available, while it is for kubenet or Azure CNI.
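Creating an overlay cluster is mostly a matter of adding the plugin mode. A hedged sketch (the cluster name and pod CIDR are illustrative):
# Azure CNI overlay: nodes get VNet IPs, pods get IPs from the private overlay pod CIDR
az aks create -n aks-overlaydemo -g aksntwdemo \
  --network-plugin azure \
  --network-plugin-mode overlay \
  --pod-cidr 192.168.0.0/16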
1.4. Azure CNI powered by Cilium
Another kind of doped kubenet is Azure CNI powered by Cilium. It combines Azure CNI overlay networking with a data plane powered by Cilium. It’s interesting because it provides some of Cilium’s features without having to manage the Cilium installation and updates.
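A hedged sketch of what the cluster creation could look like from the CLI; depending on the CLI version, the Cilium data plane is selected with the --network-dataplane parameter on top of an overlay configuration (the cluster name and pod CIDR are illustrative):
# Azure CNI overlay with the eBPF data plane managed by AKS
az aks create -n aks-ciliumdemo -g aksntwdemo \
  --network-plugin azure \
  --network-plugin-mode overlay \
  --pod-cidr 192.168.0.0/16 \
  --network-dataplane cilium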
From a cluster point of view, we don’t see a lot of differences. However, checking the pods in kube-system, we can see a deployment referring to Cilium:
yumemaru@azure:~$ k describe deployments.apps -n kube-system cilium-operator
Name: cilium-operator
Namespace: kube-system
CreationTimestamp: Tue, 24 Oct 2023 16:24:02 +0200
Labels: app.kubernetes.io/managed-by=Helm
helm.toolkit.fluxcd.io/name=cilium-adapter-helmrelease
helm.toolkit.fluxcd.io/namespace=6537d1ecd4ba270001ceae9f
io.cilium/app=operator
kubernetes.azure.com/managedby=aks
name=cilium-operator
Annotations: deployment.kubernetes.io/revision: 1
meta.helm.sh/release-name: cilium
meta.helm.sh/release-namespace: kube-system
Selector: io.cilium/app=operator,name=cilium-operator
Replicas: 1 desired | 1 updated | 1 total | 0 available | 1 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 1 max unavailable, 1 max surge
Pod Template:
Labels: io.cilium/app=operator
kubernetes.azure.com/ebpf-dataplane=cilium
name=cilium-operator
Annotations: cilium.io/cilium-configmap-checksum: fe30fdc0ca01479274381ba26a5e74a606fbe6a1f46fb095407b94c9281777c2
prometheus.io/port: 9963
prometheus.io/scrape: true
Service Account: cilium-operator
Containers:
cilium-operator:
Image: mcr.microsoft.com/oss/cilium/operator-generic:1.12.10
Port: 9963/TCP
Host Port: 9963/TCP
Command:
cilium-operator-generic
Args:
--config-dir=/tmp/cilium/config-map
--debug=$(CILIUM_DEBUG)
Liveness: http-get http://127.0.0.1:9234/healthz delay=60s timeout=3s period=10s #success=1 #failure=3
Environment:
K8S_NODE_NAME: (v1:spec.nodeName)
CILIUM_K8S_NAMESPACE: (v1:metadata.namespace)
CILIUM_DEBUG: <set to the key 'debug' of config map 'cilium-config'> Optional: true
Mounts:
/tmp/cilium/config-map from cilium-config-path (ro)
Volumes:
cilium-config-path:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: cilium-config
Optional: false
Priority Class Name: system-cluster-critical
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: cilium-operator-8cff7865b (1/1 replicas created)
Events: <none>
There’s a lot to tell about Cilium, and this article is long enough already, so we’ll keep it for another time. Let’s have a look at the last option now.
1.5. Bring your own CNI
It’s possible to create a cluster without a CNI, by specifying None
for the network plugin parameter:
az aks create -n aks-ntwdemo8 -g aksntwdemo -l eastus --enable-managed-identity --node-count 1 --network-plugin none
Checking the nodes, we will see that they are in a NotReady state, waiting for a CNI to be deployed:
yumemaru@azure:~$ k get no
NAME STATUS ROLES AGE VERSION
aks-nodepool1-24351847-vmss000001 NotReady agent 140m v1.26.6
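To bring the nodes to the Ready state, a CNI has to be installed. As a purely illustrative sketch, using the Cilium Helm chart with the AKS BYO CNI values documented by the Cilium project (the chart values and namespace are assumptions, not something tested here):
# Install Cilium as the CNI on a cluster created with --network-plugin none
helm repo add cilium https://helm.cilium.io/
helm repo update
helm install cilium cilium/cilium --namespace kube-system \
  --set aksbyocni.enabled=true \
  --set nodeinit.enabled=true
# Once the Cilium pods are running, the nodes should move to Ready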
We could deploy any CNI by following its associated documentation. Again, let’s stop here for now: because Cilium has gained so much traction lately, there’s a good chance another article on this topic will come around anyway.
For now, we’ve seen everything we needed to see regarding the network plugin options. Let’s take a step back and look at the virtual network again.
2. Subnet usage
Up until now, we’ve seen that the nodes live in their own subnet. Depending on the CNI, we may or may not need a bigger range. However, there are some cases where we can add other subnets. We’ve seen one with API Server VNet integration, and another with Azure CNI with dynamic IP allocation. In both cases we get an additional subnet, configured as delegated for use by AKS.
There are two other cases where we can have additional subnets.
The first one is with additional node pools:
As illustrated, we can specify an additional subnet and add another node pool pointing to this subnet. A few things to take into consideration:
- The network cannot be managed by AKS, meaning that we have to prepare the virtual network, with the required subnets, before the cluster creation.
- Because AKS manages an NSG on the node pools directly, adding a new subnet for another node pool does not by itself bring network segregation from a filtering point of view. If, on the other hand, each subnet has its own NSG, then there is indeed additional network segregation.
yumemaru@azure:~$ az network vnet create -n aks-vnet3 -g aksntwdemo --address-prefixes 172.20.0.0/24
{
"newVNet": {
"addressSpace": {
"addressPrefixes": [
"172.20.0.0/24"
]
},
"enableDdosProtection": false,
"etag": "W/\"df04b35c-803d-412d-b0b0-fa045e091b62\"",
"id": "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/aksntwdemo/providers/Microsoft.Network/virtualNetworks/aks-vnet3",
"location": "eastus",
"name": "aks-vnet3",
"provisioningState": "Succeeded",
"resourceGroup": "aksntwdemo",
"resourceGuid": "94ff2b33-f0dc-4d96-8b86-28cab6baf844",
"subnets": [],
"type": "Microsoft.Network/virtualNetworks",
"virtualNetworkPeerings": []
}
}
yumemaru@azure:~$ az network vnet subnet create -n aks-subnet -g aksntwdemo --vnet-name aks-vnet3 --address-prefixes 172.20.0.0/26
{
"addressPrefix": "172.20.0.0/26",
"delegations": [],
"etag": "W/\"04beae60-ba2f-43bb-8f32-0e4b234d6407\"",
"id": "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/aksntwdemo/providers/Microsoft.Network/virtualNetworks/aks-vnet3/subnets/aks-subnet",
"name": "aks-subnet",
"privateEndpointNetworkPolicies": "Disabled",
"privateLinkServiceNetworkPolicies": "Enabled",
"provisioningState": "Succeeded",
"resourceGroup": "aksntwdemo",
"type": "Microsoft.Network/virtualNetworks/subnets"
}
yumemaru@azure:~$ az network vnet subnet create -n np-subnet -g aksntwdemo --vnet-name aks-vnet3 --address-prefixes 172.20.0.64/26
{
"addressPrefix": "172.20.0.64/26",
"delegations": [],
"etag": "W/\"beba6d57-5497-4b5c-a910-0b1f02d7e6db\"",
"id": "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/aksntwdemo/providers/Microsoft.Network/virtualNetworks/aks-vnet3/subnets/np-subnet",
"name": "np-subnet",
"privateEndpointNetworkPolicies": "Disabled",
"privateLinkServiceNetworkPolicies": "Enabled",
"provisioningState": "Succeeded",
"resourceGroup": "aksntwdemo",
"type": "Microsoft.Network/virtualNetworks/subnets"
}
yumemaru@azure:~$ az aks create -n aks-ntwdemo12 -g aksntwdemo --vnet-subnet-id /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/aksntwdemo/providers/Microsoft.Network/virtualNetworks/aks-vnet3/subnets/aks-subnet --network-plugin kubenet
After that we can add a node pool and specify the subnet:
yumemaru@azure:~$ az aks nodepool add --cluster-name aks-ntwdemo12 -g aksntwdemo -n aksnp02 --vnet-subnet-id /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/aksntwdemo/providers/Microsoft.Network/virtualNetworks/aks-vnet3/subnets/np-subnet --node-count 1
Looking at the connected devices in the vnet, we do see the node pools in different subnets:
Lastly, we can also have a dedicated subnet for the internal load balancer that we use for Kubernetes services. This is more on the Kubernetes side, but it does reflect in the Azure plane, so let’s take a few lines to discuss it.
Let’s create a deployment:
apiVersion: v1
kind: Namespace
metadata:
creationTimestamp: null
name: testns
spec: {}
status: {}
---
apiVersion: apps/v1
kind: Deployment
metadata:
creationTimestamp: null
labels:
app: testdeploy
name: testdeploy
namespace: testns
spec:
replicas: 3
selector:
matchLabels:
app: testdeploy
strategy: {}
template:
metadata:
creationTimestamp: null
labels:
app: testdeploy
spec:
containers:
- image: nginx
name: nginx
resources: {}
status: {}
As mentioned, we want to use another subnet for the internal load balancer:
yumemaru@azure:~$ az network vnet subnet create -n ilb-subnet -g aksntwdemo --vnet-name aks-vnet3 --address-prefixes 172.20.0.128/26
{
"addressPrefix": "172.20.0.128/26",
"delegations": [],
"etag": "W/\"d029c31e-9fde-441e-808d-b72b246dc617\"",
"id": "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/aksntwdemo/providers/Microsoft.Network/virtualNetworks/aks-vnet3/subnets/ilb-subnet",
"name": "ilb-subnet",
"privateEndpointNetworkPolicies": "Disabled",
"privateLinkServiceNetworkPolicies": "Enabled",
"provisioningState": "Succeeded",
"resourceGroup": "aksntwdemo",
"type": "Microsoft.Network/virtualNetworks/subnets"
Now let’s create the service on the Kubernetes side. We’ll note the two annotations: service.beta.kubernetes.io/azure-load-balancer-internal, to use an internal load balancer, and service.beta.kubernetes.io/azure-load-balancer-internal-subnet, to target the specific subnet:
apiVersion: v1
kind: Service
metadata:
annotations:
service.beta.kubernetes.io/azure-load-balancer-internal: "true"
service.beta.kubernetes.io/azure-load-balancer-internal-subnet: "ilb-subnet"
labels:
app: testdeploy
name: testdeploy
namespace: testns
spec:
type: LoadBalancer
ports:
- port: 80
protocol: TCP
targetPort: 80
selector:
app: testdeploy
Upon creation, we should see the service on the kubernetes side with a private IP from the subnet:
yumemaru@azure:~$ k get svc -n testns
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
testdeploy LoadBalancer 10.0.175.119 172.20.0.132 80:32709/TCP 44s
And the corresponding internal load balancer in the subnet:
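If we prefer the CLI to the portal for this check, a hedged sketch follows; it assumes the managed node resource group follows the MC_<resource group>_<cluster>_<region> naming pattern and that the internal load balancer created by AKS carries its usual kubernetes-internal name:
# Load balancers created by AKS live in the managed node resource group
az network lb list -g <MC_... resource group> -o table
# The frontend IP configuration should show a private IP taken from ilb-subnet
az network lb frontend-ip list -g <MC_... resource group> --lb-name kubernetes-internal -o table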
We’re finished with subnet usage.
That will be all, so let’s wrap it up.
3. Before leaving
We’ve seen a lot in this two-part article. Let’s summarize a bit:
Area | Actions |
---|---|
Control plane | Protect API server access with either <br> - an accept list <br> - a private cluster <br> - API Server VNet integration |
Worker plane | Configure node and pod networking <br> - with kubenet, which isolates the nodes' IP range in the VNet from the pods' IP range on an overlay <br> - with Azure CNI, which uses the VNet range for both pods and nodes <br> Get the best of both worlds with Azure CNI overlay or Azure CNI powered by Cilium <br> Take full control of your Kubernetes network with BYO CNI |
Subnet usage | Use additional subnets on the Azure plane for <br> - additional node pools <br> - internal load balancers <br> - pods, with Azure CNI dynamic IP allocation <br> - the API server, with the VNet integration option |
We did not take time to talk about it here, but we should also plan for AKS egress traffic. By design, the cluster needs some outbound flows that are detailed in a specific section of the documentation. We can configure filtering either through NSG rules or through Azure Firewall.
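As a hedged illustration of the Azure Firewall path, the outbound type can be switched to user defined routing at cluster creation time; this sketch assumes a subnet whose route table already sends 0.0.0.0/0 to the firewall's private IP (the cluster name is illustrative):
# With userDefinedRouting, AKS does not provision public egress; the subnet's route table
# (pointing to Azure Firewall) carries the outbound traffic instead
az aks create -n aks-egressdemo -g aksntwdemo \
  --vnet-subnet-id <subnet id with a route table to the firewall> \
  --network-plugin azure \
  --outbound-type userDefinedRouting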
Now, I’ll probably come back to AKS with BYO CNI, specifically because I want to dig a bit into Cilium.
Until then, have fun ^^