antiTree | posts and projects
posted by antitree on Sep 10, 2017

This post goes through building custom Docker seccomp profiles for your container. I'm not recommending you do this especially in enterprise environments, but I'm being charitable to the idea that system call filtering is the basis of a lot of sandboxing technologies and filtering out unnecessary ones should reduce the attack footprint of your application. This is more of an exploration of use-cases for custom Docker seccomp profiles than a suggestion that everyone does this themselves. Or simply to answer the question: “Why does Docker let you load custom seccomp profiles?”

Custom Seccomp Profiles and Docker

Docker has come up with their own way of managing Seccomp rules by using a proprietary JSON format. Other systems like SubgraphOS have done something similar to make a simple configuration file format that can be converted into an actual seccomp profile. In Docker's case, you can save your custom seccomp BPF rules into a JSON file and start your container with a custom profile and have it apply just to that container. Cool. In theory you could create a more secure whitelist of necessary syscalls that your container needs in order to run. I'll come back to that later.

At it's core Seccomp BPF provides a way of filtering out which syscalls you do and don't want your application to make to the kernel. This is completely independent of Docker itself so even in the case of a Docker exploit, your seccomp profile may prevent the exploit from breaking out from the container to the host.

As I've written about before, that's easier said than done but using tools like strace and sysdig, you can capture syscalls being made and figure out which are necessary.

The JSON File Format

Here's a simple example that only allows a container the “chown” and “chdir” syscalls.

{
	"defaultAction": "SCMP_ACT_ERRNO",
	"architectures": [
		"SCMP_ARCH_X86_64",
		"SCMP_ARCH_X86",
		"SCMP_ARCH_X32"
	],
	"syscalls": [
		{
			"name": "chown",
			"action": "SCMP_ACT_ALLOW",
			"args": []
		},
		{
			"name": "chdir",
			"action": "SCMP_ACT_ALLOW",
			"args": []
		}
	]
}

Actions:

  • SCMP_ACT_ERRNO: designed to create a “permission denied” error in the container:
  • SCMP_ACT_ALLOW: Whitelist or allow syscalls
  • SCMP_ACT_KILL: Syscall will be killed by the kernel

And more! At this time, I'm not sure if Docker can handle other types of actions but there are other options including logic operators and ways of tracing a function instead of completely stopping the application.

Name:

The syscall you want to make a rule for. Here's a list of all available syscalls for a few architectures.

Args:

This is where the “BPF” part of Seccomp BPF comes in. Instead of just saying yes or no for a syscall, you can say yes or no but only in the case of certain arguments. For example, I don't want to allow the chown syscall except when it's issued against a file in the “/tmp/” directory. This is very powerful but also very time consuming to trace down. I'm waiting for someone to tell me about a tool that can help make this process easier.

Organization

In the example above, I'm creating an object block for each individual syscall but they both have the exact same action. If you're just coming up with a simple whitelist/blacklist of syscalls, you can use multiple names in a single block to make rule management a bit cleaner:

	...
	"syscalls": [
		{
			"names": [
				"chdir",
				"chown",
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [],
			"comment": "A probably useless rule"
		},
	...

Caveats and Challenges

Filtering applies to the whole container, not just application: If you were to debug your application, and diligently spent time identifying the necessary syscalls it needs, you wouldn't have the complete list. The Docker container itself must have a set of syscalls for itself to start(see https://github.com/moby/moby/issues/22252). That's because the seccomp profile is applied to the whole container so things like starting up, dropping privileges, mounting volumes are all necessary syscalls to the container even if your application doesn't use them.

Policy precedence: As of writing this, Docker's seccomp implementation doesn't allow policy inheritance so when you apply a custom seccomp profile to your container, you're completely removing anyone that was there by default. And the default one isn't bad at all. Seccomp BPF itself supports this feature but Docker has yet to implement it. Which could lead to:

Foot shooting: When you override the default seccomp profile that Docker provides, there's a risk that you or your developers are actually doing more harm than good. A simple misconfiguration, typo, or misinterpretation could introduce a vulnerability that would have been previously defended against.

Policy management at scale: If you're a developer with a single containerized application, maybe this makes sense. If you're orchestrating hundreds or thousands of images, I question the value that this generates. Each container would have its own policy that needs to be kept with the source and integrated into unit testing. Right now, Kubernetes doesn't even support custom seccomp profiles per pod.

Debugging syscalls: If you have never debugged the syscalls in your application, I hope that the take away from these posts is that it's a time-consuming and annoying process to put it nicely. When an unplanned syscall is filtered, the entire container crashes with an ambiguous error or no error at all and you'll need to find where that comes from. This is something that can easily be improved through.

Value: The big question you should ask is how valuable is this to you and your deployment? Are you introducing a heightened risk of a denial-of-service condition of a legitimate syscall crashing all of your containers at once? Or do you have an ultra-secure use-case where preventing a container breakout and future breakouts is important enough to spend the time? (I have yet to see such a scenario where this is a requirement and it's using containers.)

Recommendations

There are organizations that are recommending that you come up with custom seccomp profiles “adapted to their environment” but this is just really hand-wavy compliance bullshit. The NIST recommendations rightfully don't touch seccomp policies besides saying that filtering syscalls is an important control in the security model of Docker and there's a concern that changes to the default policy would have a major impact.

At this point, I believe that other Docker security controls like capability dropping and custom AppArmor profiles provide more value to defending against breakouts with less risk to the stability of the environment but I'm happy to be proved wrong.

What I think would be smarter (if your threat model truly requires it) is to deploy a custom seccomp profile to the daemon itself. This would fit in with some orchestration models so you could apply it to groups of deployments at a consistent level. (Most orchestration tools are coming up with their own security policies though so not sure this is even worth it.)

UPDATE: If you're still into this, then check out my next post about a tool to speed up the process.