antiTree | posts and projects
antiTree
posted on Apr 06, 2026

Sandboxes, Seccomp, and Syscalls

  1. Building Insecure and Incomplete Profiles
  2. Bypasses and Breakouts
  3. Measuring Scoring and Scaling with Seccompute

I’ve been describing a common thread throughout these posts which is we can spend hours building artisinally crafted seccomp profiles and say that everything is more secure, but we don’t inherently know that. We just assume. And even when we want to measure if a given seccomp profile is “good” there’s an immense amount of complexity to come to that decision.

For instance, to decide if a seccomp profile is “good” we have 1 simple question:

Which system calls does it allow and which does it block?

To get that answer we have to ask more questions:

  • What happens on a different architecture?
  • What happens when a certain Capability has been granted to the process?
  • Are there any ways of bypassing our seccomp policy through multiplexing?
  • What are the differences between container runtimes?

I think the lack of population-level data is genuinely one of the biggest problems in this space right now – we’re all tuning profiles in the dark. Here’s what we don’t have: public data on what seccomp profiles organizations actually deploy. We can score your profile against the Docker default, but we can’t tell you how you compare to the industry because nobody’s publishing their profiles. Why would they?

This partly why I’ve been running seccompare.com for a year or so. It’s partly an anonymous data collection exercise to figure out how people are using seccomp at scale. What are people uploading? How do those profiles score? The historical timeline shows how Docker’s own default has evolved since 2015 – syscalls added, syscalls blocked, conditional rules refined. Most custom profiles don’t track these changes. They’re frozen in time.

I’ve also been harvesting sources online for OCI seccomp profiles and profiled them. The results of looking at ~800 profiles were interesting to me:

Case Study 1: Capabilities in Moby vs Containerd

I’ve been saying that seccomp is simple: “block”, “allow” – but that’s kind of a lie, right? Seccomp has this vast amount of conditionals you could apply per system call or per filter. You can say things like “If the first argument of the system call is 2114060288 then allow it, else block”

    {
      "names": ["clone"],
      "action": "SCMP_ACT_ALLOW",
      "excludes": {
        "caps": ["CAP_SYS_ADMIN"]
      },
      "args": [
        {
          "index": 0,
          "value": 2114060288,
          "op": "SCMP_CMP_MASKED_EQ"
        }
      ]
    }

This is granting the container the ability to do the initial clone() on startup but blocking it on all subsequent calls.

Or you can say “If the container has been granted CAP_SYS_ADMIN then allow bpf”.

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["bpf"],
      "action": "SCMP_ACT_ALLOW",
      "includes": {
        "caps": ["CAP_BPF"]
      }
    },
  ]
}

This one is pretty practical – if you are granting CAP_BPF then you mind as well allow it to the system calls it needs like bpf. Moby/Docker’s default profile uses this type of conditional extensively.

But looking at the containerd/runc implementation and you’ll see it’s different. Docker leverages the BPF format itself for the conditionals, but containerd generates seccomp profiles dynamically. This is the code from containerd: contrib/seccomp/seccomp_default.go:

// conditionally add bpf if CAP_SYS_ADMIN is set
if !inSlice(s.Capabilities, "CAP_SYS_ADMIN") {
    s.Syscalls = append(s.Syscalls, specs.LinuxSyscall{
        Names:  []string{"bpf", "clone", "fanotify_init", "lookup_dcookie",
            "mount", "name_to_handle_at", "perf_event_open",
            "quotactl", "setdomainname", "sethostname",
            "setns", "syslog", "umount", "umount2",
            "unshare"},
        Action: specs.ActErrno,
        ErrnoRet: &defaultErrno,
    })
}

The profile isn’t static JSON — it’s generated at container start based on what capabilities the process will have. Grant CAP_SYS_ADMIN and a completely different profile gets loaded, one that no longer blocks bpf.

Which one is more secure? They’re effectively the exact same. Which one is more auditable? Unless you’re extracting the raw BPF bytecode from the process, you don’t really know for sure which seccomp policy is being applied for your container.

Case Study 2: Public Seccomp Profiles Good vs Bad

I started out with a naive perspective that there was a bad and good with seccomp profiles. And when I gathered them all together I created charts that like this: Seccomp Profiles Good vs Bad

The problem is good isn’t just based on the number of system calls blocked. Seccomp profiles have at least 2 dimensions that make them good:

  • How well does it restrict the application to a known good baseline?
  • How many dangerous system calls are blocked?

So that means the above chart is too simple. We can’t just say that restricting a profile makes it “good” because we still end up situations where it’s more restrive but then it allows bpf(). I like to think of it like a D&D alignment chart (good/evil on one axis, lawful/chaotic on the other):

Hardening score distribution across generated profiles

  • Lawful Good: Restricts syscalls AND doesn’t include anything dangerous
  • Chaotic Evil: Less restrictive syscalls AND includes dangerous ones

Then there are some interesting types:

  • Lawful Evil: Restricts syscalls… but allows dangerous system calls like bpf()
  • Chaotic Good: It blocks dangerous syscalls… but it defaults to allow all others

If you take all the profiles I’ve found and overlay them on this grid, you see a interesting mix of very good and well intentioned profiles, but many many other instances of dangerous ones – or at least less secure than the default seccomp profile.

Hardening score distribution across generated profiles

Can we put this together into something we can use?

Seccompute: Making “Is This Profile Good?” a Number

Knowing what’s in your profiles is one thing. Knowing if they’re any good is another. “I looked at it and it seems fine” is not a scalable review process. Enter seccompute which scores profiles on a 0-100 scale using tier-weighted syscall analysis:

  • Tier 1 – automatic F grade: bpf, ptrace, kexec_load, init_module, finit_module. If any of these are allowed, the profile is a failure. I don’t care how many other system calls you’ve blocked, you’ve allowed dangerous things.
  • Tier 2 – significant deductions: io_uring_setup, io_uring_enter, perf_event_open, mount, clone, unshare. Each one costs real points because each one enables real attack classes.
  • Tier 3 – minor deductions: clone3, chroot, namespace-adjacent syscalls. Still tracked, still scored, but these are the potential enumeration risks rather than the “stop and fix it” tier.

You can also use this to compare your profile against the Docker default. A positive delta means you’re more restrictive – good. A negative delta means you’ve made things worse than the default – you should know that before deploying.

Combo Detection and Rules Engine

If you remember me writing about the multiplexing risks (Seccomp in K8s: Bypasses and Breakouts) then this is where Seccompute gets interesting. seccompute doesn’t just check individual syscalls – it checks combinations against a rule set and contextualizing the risks by including the underlying architecture and capabilities assigned. You can also customize this rulset yourself.

  • COMBO-io-uring-network-bypass: io_uring_setup + io_uring_enter allowed while network syscalls are blocked means the network blocks are bypassed.
  • COMBO-bpf-escalation: bpf allowed in any form. These patterns don’t show up in a line-by-line review. They require understanding how syscalls interact, which is exactly what a scoring tool should do for you.
  • COMBO-io-uring-file-io-bypass: io_uring_setup + io_uring_enter allowed while file I/O syscalls (read, write, pread64, fsync, etc.) are blocked — file operations bypass the filter.
  • COMBO-io-uring-filesystem-bypass: same pair allows filesystem path operations (openat, unlinkat, renameat, mkdir, symlink, etc.) without touching the blocked syscalls.
  • COMBO-io-uring-poll-bypass: io_uring_enter can submit IORING_OP_POLL_ADD and IORING_OP_EPOLL_CTL, bypassing blocked poll, epoll_wait, epoll_ctl.
  • COMBO-io-uring-splice-bypass: IORING_OP_SPLICE and IORING_OP_TEE bypass blocked splice, tee, vmsplice.
  • COMBO-io-uring-process-bypass: on Linux 6.6+, io_uring_enter can submit IORING_OP_WAITID and futex operations, bypassing blocked futex, waitid, wait4.
  • COMBO-io-uring-ioctl-bypass: IORING_OP_URING_CMD allows device-specific commands (NVMe, ublk) without invoking ioctl(2) directly.
  • COMBO-io-uring-xattr-bypass: xattr get/set/list operations (getxattr, setxattr, listxattr, and their variants) bypass via io_uring_enter.

You can run this as a CLI or as part of a library like I do for Seccompare.com. But the pro move would be to integrate it into your CI.

CI Gates: Put Seccomp Review in the Pipeline

The real challenge of using seccomp is that every line of code, every bump of a dependency, every new node type you deploy to needs to profile the container all over again. That means that seccomp is not a 1 time event. You need to integrate it into your CI pipelines as part of your normal testing.

You can take Seccompute (or rebuild it and make your own, I don’t care) and integrate it into your CI. It could be a Github Action that validates the seccomp profile meets a minimum threshold of a score and if not needs manual human intervention.

Final Thoughts

My talk at BSidesSF made the point that I specifically wanted to convince people to have stronger opinions about seccomp in Kubernetes.It’s either all in or all out. There are some great organizations that are doing this at scale and doing it well. All of this is to say that seccomp in k8s is harder than it should be. I made the point during the talk that if you need strong isolation, we can go back to the classics: MicroVMs or emulated kernels (ie. gVisor). These are the strongest solutions but have the biggest risks to performance.

I also dropped a few “what-ifs” in the talk that I still think are good ideas if we can get some support on it:

  • What if we had different flavors of DefaultSeccompProfiles (ie. web server, database, terminal) instead of one that covers all of them at once?
  • What if someone could do the testing of many of the seccomp profiles before you get them to remove the overhead of teams doing it themselves?
  • What if we could support layers of seccomp policies to start with a base and then add additional more restrive ones on top of that?

But, if nothing else, my hope is for anyone taking their Certified Kubernetes Security (CKS) Specialist exam or reading materials, please question how far you want to say that seccomp is an every day tool for locking down the environment.

How did I do? The only failure mode for me is after reading through all of this you come away with a tepid perspective on seccomp in Kubernetes. And no, I didn’t use AI to slop this blog together. This is an artisanally crafted blog post filled with my own human errors.

My hope is that this seccomp brain worm can finally get out of my head and stay here on these pages, or enter into yours.

Sandboxes, Seccomp and Syscalls - BSides SF 2026 by Mark Manning