
Before I start this post, I want to share a realization I’ve come to. Somewhere along the way, my posts lost the personal touch that made me so proud of having this blog in the first place.
Reading them, I feel like I’m reading a generic LinkedIn post, trying to convince the reader of some random thing, blatantly shoving in a CTA when it clearly isn’t natural, or detailing 10 things that going through [insert happy/tragic life event] taught me about B2B SaaS.
Anyway, what I’m trying to say is - I’m the only one who decided that my blog posts need to be these super polished, highly detailed articles. I guess that doesn’t really resonate with me as much these days, so from now on, I’ll try posting more bite-sized, frequent stuff, documenting cool things I find and do, and making this blog more of a log than anything else.
I hope the 250 monthly readers (if you’re reading this, thanks by the way!) I have will stick around to see the pivot.
Anyway, I’m blabbering, you came here for the tech stuff, not the ramblings of a man on the verge of crossing over to his 30s.
Recently, I was tasked with implementing Kubernetes Network Policies on our EKS clusters.
Network policies, if you’re not familiar, are a resource you can deploy to a Kubernetes cluster, detailing which traffic is allowed or denied into and out of your pods. The policies themselves are only part of the equation. Since they’re a Kubernetes resource, they need a controller to actually enforce them. In EKS, the VPC CNI’s network policy agent handles that (if that’s the CNI you use, of course).
The flow goes like this:
```
NetworkPolicy (you write this)
        ↓
PolicyEndpoint Controller
        ↓
PolicyEndpoint objects (concrete IPs + port rules, one per pod)
        ↓
VPC CNI Node Agent
        ↓
eBPF maps in the kernel (packets without a matching ALLOW rule are dropped)
```
eBPF is a mechanism built into the kernel — a way to run safe, fast programs in response to kernel events, letting you inspect and control things like network traffic without modifying the kernel itself. The VPC CNI uses it to enforce network policies: for each pod, it maintains a set of eBPF maps containing the allowed traffic rules, and a kernel program checks every incoming and outgoing packet against them.
Amazon implemented two tiers of maps:
| Tier | Source | eBPF Maps | Entry Size |
|---|---|---|---|
| Namespace (regular) | NetworkPolicy (Kubernetes resource) | ingress_map / egress_map | 12 bytes per rule |
| Cluster (admin) | ClusterNetworkPolicy (EKS-only resource) | cp_ingress_map / cp_egress_map | 16 bytes per rule |
Cluster-tier maps have a 4-byte priority field that namespace-tier maps don’t, which is where the entry size difference comes from. Each map value is a fixed-size array of up to 24 rule slots — one per port/protocol combination for that CIDR — so the full value buffer is 24 × 12 = 288 bytes for namespace maps, and 24 × 16 = 384 bytes for cluster maps. The cluster tier is evaluated first and takes precedence.
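To make those sizes concrete, here’s a rough sketch of what a rule slot could look like as Go structs. The field names are my own guess based on the description above (port/protocol per slot, plus a priority field on the cluster tier), not the agent’s actual types; the point is just that three 4-byte fields give you the 12-byte namespace entry, and the extra priority field bumps the cluster entry to 16.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Hypothetical layouts, inferred from the sizes above - not the agent's real structs.
type namespaceRule struct {
	Protocol  uint32 // 254 = ANY
	StartPort uint32
	EndPort   uint32
} // 12 bytes

type clusterRule struct {
	Protocol  uint32
	StartPort uint32
	EndPort   uint32
	Priority  uint32 // the extra cluster-tier field
} // 16 bytes

func main() {
	// 24 rule slots per map value, as described above.
	fmt.Println(unsafe.Sizeof([24]namespaceRule{})) // 288
	fmt.Println(unsafe.Sizeof([24]clusterRule{}))   // 384
}
```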
I created a pretty lax network policy to test the waters before committing to something more fine-grained, allowing all internal traffic and denying everything else:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: baseline-network-policy
  namespace: default
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - ipBlock:
            cidr: 10.0.0.0/8
        - ipBlock:
            cidr: 172.16.0.0/12
        - ipBlock:
            cidr: 192.168.0.0/16
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8
        - ipBlock:
            cidr: 172.16.0.0/12
        - ipBlock:
            cidr: 192.168.0.0/16
```
“Thank God for abstraction,” I thought. “Everything will go smoothly, I’ll just set and forget this network policy and close the ticket in no time.” I woke up to a Slack channel that looked like a war zone. Pods were failing left and right because they couldn’t communicate with each other.
I started where anyone would: the network policy itself looked fine. The PolicyEndpoint objects the controller generated from it also looked fine — correct IPs, correct CIDRs. The node agent logs showed no errors. Deleting and recreating pods didn’t help. Rolling back the network policy made the problem go away, and re-applying it brought it back, which ruled out anything upstream of the CNI. Whatever was broken lived between the agent and the kernel.
That’s when I started looking at what was actually being written into the kernel — not the Kubernetes resource, not the controller logs, the actual bytes in the eBPF maps on the node, with the help of Claude.
The VPC CNI pins its maps to the BPF filesystem at /host/sys/fs/bpf/globals/aws/maps/, one set per pod. To get to them, I spun up a privileged debug pod on the affected node:
```
kubectl debug node/<node-name> -it --profile=sysadmin --image=nicolaka/netshoot -- bash
```
Once in, I dumped both ingress maps. ingress_map had 191 entries — all the CIDRs and pod /32s the agent had written before it last restarted. Each one looks like this:
```
$ bpftool map dump pinned /host/sys/fs/bpf/globals/aws/maps/my-pod_ingress_map
key:
08 00 00 00 0a 00 00 00
value:
fe 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[... 17 rows of zeros ...]
key:
10 00 00 00 0a 64 00 00
value:
fe 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[... 17 rows of zeros ...]
key:
20 00 00 00 0a 64 01 04
value:
fe 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[... 17 rows of zeros ...]
[... 188 more entries ...]
Found 191 elements
```
The key format is LPM trie: first 4 bytes are the prefix length, next 4 are the IP. 08 00 00 00 0a 00 00 00 = 10.0.0.0/8. 20 00 00 00 = /32. The value is 288 bytes — one 12-byte rule slot followed by 276 bytes of zeros, because there’s only one rule per CIDR here: fe 00 00 00 = protocol 254 (ANY), allow everything.
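If you want to sanity-check that reading of the key layout, it’s easy to decode a couple of keys by hand. Here’s a small standalone Go snippet that does just that, assuming the layout described above (4-byte little-endian prefix length, then the IPv4 address in network byte order); it’s my own helper for illustration, not anything from the agent.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net/netip"
)

// decodeLPMKey turns a raw 8-byte key from the dump into "IP/prefix".
// Assumed layout: 4-byte little-endian prefix length, then the IPv4
// address in network byte order.
func decodeLPMKey(key [8]byte) string {
	prefixLen := binary.LittleEndian.Uint32(key[0:4])
	addr := netip.AddrFrom4([4]byte{key[4], key[5], key[6], key[7]})
	return fmt.Sprintf("%s/%d", addr, prefixLen)
}

func main() {
	fmt.Println(decodeLPMKey([8]byte{0x08, 0, 0, 0, 0x0a, 0, 0, 0}))          // 10.0.0.0/8
	fmt.Println(decodeLPMKey([8]byte{0x20, 0, 0, 0, 0x0a, 0x64, 0x01, 0x05})) // 10.100.1.5/32
}
```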
191 entries. But 0a 64 01 05 — the new pod’s IP, 10.100.1.5 — wasn’t in there. Then I dumped cp_ingress_map:
```
$ bpftool map dump pinned /host/sys/fs/bpf/globals/aws/maps/my-pod_cp_ingress_map
key:
20 00 00 00 0a 64 01 05
value:
fe 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[... 17 rows of zeros ...]
fe 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[... 5 rows of zeros ...]
Found 1 element
```
Two things immediately looked wrong. 0a 64 01 05 is the new pod’s own IP (10.100.1.5). It should be in ingress_map — but it wasn’t there. It was only in cp_ingress_map, a map that should have been empty since we had zero ClusterNetworkPolicy resources in the cluster. And that second fe 00 00 00 block near the bottom of the value — that didn’t belong there either. Something had written namespace-policy data into a cluster-tier map.
At that point I had the evidence but not the explanation. I took the map dumps, pointed Claude at the agent’s source, and just told it network policies weren’t working. It traced the write path back to the recovery phase — the code that runs on agent restart to reconstruct its in-memory state from the eBPF maps already loaded in the kernel. Here’s roughly what that looked like in v1.3.0 of the network policy agent (the version shipped with VPC CNI v1.21.0):
```go
val, found := bpfEntry.Maps["ingress_map"]
if found {
    NewInMemoryBpfMap(&val)
}

val, found = bpfEntry.Maps["cp_ingress_map"]
if found {
    NewInMemoryBpfMap(&val)
}
```
Did you catch the bug?
NewInMemoryBpfMap doesn’t copy the data — it stores the pointer. Both lookups reuse the same val variable, so when val is reassigned on the second lookup, the pointer stored for ingress_map now points at the same memory, which holds the cp_ingress_map data. Both in-memory map objects end up wrapping the same underlying kernel fd: cp_ingress_map.
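The aliasing is easier to see with all the BPF machinery stripped away. Here’s a minimal, self-contained sketch of the same pattern (the types and names are made up for illustration, not the agent’s): two lookups share one variable, so both wrappers end up holding a pointer to whatever that variable was assigned last.

```go
package main

import "fmt"

// inMemoryMap stands in for the agent's in-memory map wrapper:
// it keeps the pointer it was given, it does not copy the data.
type inMemoryMap struct{ data *string }

func newInMemoryMap(p *string) inMemoryMap { return inMemoryMap{data: p} }

func main() {
	maps := map[string]string{
		"ingress_map":    "namespace-tier rules",
		"cp_ingress_map": "cluster-tier rules",
	}

	var ingress, cpIngress inMemoryMap

	// Buggy pattern: one shared variable, one shared address.
	val, found := maps["ingress_map"]
	if found {
		ingress = newInMemoryMap(&val)
	}
	val, found = maps["cp_ingress_map"]
	if found {
		cpIngress = newInMemoryMap(&val)
	}

	// Both wrappers now see the cluster-tier data.
	fmt.Println(*ingress.data)   // "cluster-tier rules" - not what we wanted
	fmt.Println(*cpIngress.data) // "cluster-tier rules"
}
```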
Later, when the agent reconciled the namespace policy rules, it serialized the allow rules into a 288-byte buffer and wrote it to whatever kernel fd the ingressInMemoryMap object wrapped — which, due to the aliasing, was cp_ingress_map. That map was created with 384-byte value slots. The kernel’s bpf_map_update_elem doesn’t validate struct layout — it just reads the full 384 bytes it expects from a pointer to a 288-byte Go buffer. The extra 96 bytes are adjacent stack memory, which is why you see the duplicate fe 00 00 00 at exactly offset 288 in the dump above.
The actual ingress_map was never written to. And that’s why the policy turned into a total blackout rather than just “cluster-tier rules don’t work”.
Before any NetworkPolicy exists, the VPC CNI doesn’t enforce anything — pods talk freely. The moment a policy selects a pod, the CNI creates that pod’s eBPF maps and switches it into enforced mode. In enforced mode, the eBPF program drops everything that doesn’t have an explicit ALLOW entry in ingress_map. The rules that ended up in cp_ingress_map aren’t consulted for namespace-tier decisions — they’re structurally in the wrong place, and the eBPF program doesn’t look there when evaluating a namespace-scoped NetworkPolicy. So as far as the kernel was concerned, ingress_map was empty, and an empty ingress_map in enforced mode means drop everything.
The fix was one PR - give each lookup its own variable so each gets its own address on the stack:
```go
ingressVal, found := bpfEntry.Maps["ingress_map"]
if found {
    NewInMemoryBpfMap(&ingressVal)
}

cpIngressVal, found := bpfEntry.Maps["cp_ingress_map"]
if found {
    NewInMemoryBpfMap(&cpIngressVal)
}
```
The fix for this bug was merged silently into v1.3.1 (VPC CNI v1.21.1) on December 17, 2025, with no linked issue and a one-line description. Almost like AWS wanted no one to notice, but I noticed 😈.
To be honest, this is one of those bugs that are very hard to catch before they hit production. The fix is boring in hindsight - the bug, less so.
So, to recap - I pulled out about 50% of my hair, I was gaslit by AWS’s network policy agent, but at the end of the day, I learned SO MUCH. I’ll admit that I didn’t really know what eBPF was up until now, and to be honest, I had a blast debugging this (although I’d be really happy if it didn’t completely nuke my systems next time).