Auditing Syscalls for Custom Docker Seccomp Profiles

One of Docker’s many updates this year was adding seccomp support. In short, seccomp/secomp-bpf is a way of filtering the system calls that you want to allow an application to make. It’s used for sandboxing enforcement it a lot of projects including Chromium, bubblewrap, and SubgraphOS.

In Docker, it’s enabled by default (in supported environments) and has a default profile that is fine, but there’s always ways to customize it. (Even you should never do this.)

It’s academic to say “You should use custom seccomp profiles for each of your containers!" and you’ll fall victim to easier-said-than-done-ism. I think that using custom seccomp profiles for each container is a great way to help defend against container breakouts but actually doing that, at scale, is a nightmare. But possible. And something that ~~will~~might get better.

If you’re not concerned about needing to restrict your container to custom seccomp profiles, you can be afraid of the.. we’ll call it… “agile” nature Docker and be concerned that the Docker default profile might change and break something for your container. That would never happen though right? To be fair, AFAIK that’s just reorganizing the profile, not changing the policy.

As of today, there are more than 300 syscalls that are whitelisted by default. Your blog probably doesn’t need that many.

How to seccomp in Docker

How to you run a docker container with seccomp enabled? Run docker. Done. But to load a custom profile it’s just something like this:

docker run --security-opt seccomp=$PWD/customseccomp.json nginx

Docker’s seccomp json format

You can look at Docker’s default seccomp profile yourself. Their default profile is a whitelist of all the possible syscalls that the container can make. You can filter out a containers ability accept network connections like so:

"syscalls": [
               {
                       "name": "accept",
                       "action": "SCMP_ACT_ALLOW",
                       "args": []
               },

I’m stopping there because it gets much more complicated. We’ll come back to that another day.

UPDATE: See this post for more info

Strace

You might know that strace is designed for this job. Here’s a few strace examples that might be useful for containers:

Strace wget with absolute timestamps:

strace -ttt wget https://www.google.com

Strace nginx with multiprocess support:

strace -ff -o /tmp/nginx.strace nginx

Strace ls and summarize all of the syscalls they made:

strace -c ls -la

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.44    0.001058           8       141           lstat
  0.56    0.000006           0       142           write
  0.00    0.000000           0        17           read
  0.00    0.000000           0        30         5 open
  0.00    0.000000           0        31           close
  0.00    0.000000           0        27           fstat
  0.00    0.000000           0        11           lseek
  0.00    0.000000           0        36           mmap
  0.00    0.000000           0        20           mprotect
  0.00    0.000000           0         8           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         2           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         2           ioctl
  0.00    0.000000           0        12        12 access
  0.00    0.000000           0         4           socket
  0.00    0.000000           0         4         4 connect
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         2           getdents
  0.00    0.000000           0         3           readlink
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0         2         2 statfs
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0       233       233 getxattr
  0.00    0.000000           0       141       141 lgetxattr
  0.00    0.000000           0         1           futex
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.001064                   878       397 total

Ok now just run this on Docker. Oh wait. How does that work? How would you run strace on just a single container?

The answer is you can modify the image to include strace and modify the entry-point to run the normal command with strace. Here’s an example using featherduster’s docker file:

FROM ubuntu:xenial

RUN apt-get update -qq && apt-get install -qq \
        build-essential \
        libncurses-dev \
        libgmp3-dev \
        python-crypto \
        python-dev \
        python-pip \
        python-setuptools \
        strace
    && rm -rf /var/lib/apt/lists/*

COPY . /opt/featherduster
WORKDIR /opt/featherduster
RUN pip install -U pip
RUN pip install .
RUN mkdir /strace

ENTRYPOINT ["/usr/bin/strace", "-ff", "-o", "/strace/feather.strace","python", "/opt/featherduster/featherduster/featherduster.py"]

I’m adding strace to the list of packages to install. Creating a directory called “/strace” to hold the logs. And then changing it from executing python pythonscript.py to strace python pythonscript.py.

I can rebuild it with

docker build -t antitree/featherduster .

Run it and grant the SYS_PTRACE capability to the container

docker run -it \
  -v $PWD/strace:/strace \
  --cap-add SYS_PTRACE \
  antitree/featherduster

You’ll see I’m mounting a “/strace” folder to store the logs outside of the container.

Now it’s your job to make sure your container executes every possible function to illicit every possible syscall. Easy right? Are you starting to appreciate what SubgraphOS is doing? :)

Strace output

That was a lot of annoying work right? You’re just getting started. You only now have a log of the syscalls that were caught by strace. Lets get a summary of those syscalls:

cat strace.log| awk -F '(' '{print $1}' | sort | uniq -c

In my case that gave me 32 syscalls.

So you’re done. Now just do this for every Docker image in your organization, compile the results, and port them to individual seccomp profiles. Easy!

sysdig

I hope you’re still reading through all that mundaine explanation and you’ve gotten to this point because strace is the defacto tool for this kind of thing but the company/tool sysdig is making a name for themselves in the container arena by building tools that have native container support. Their sysdig tool fills the gap that we’re talking about here.

Run sysdig and show all the calls from a container named suspicious_ether

sysdig container.name=suspicious_ether

Run the ncurses client csysdig looking at running containers

csysdig vcontainers

csysdig

That’s just the beginning of this tool. It has support for all kinds of different export formats to make scripting pretty easy. Eventually you can see my tool that converts the syscalls from their output directly to Docker seccomp profiles.

Making the profile

UPDATE: See the syscall2seccomp post

This post is already too long and I hope I come back to update it with how to convert these into Docker seccomp profiles. In short, you’ll want to take the list of those syscalls that you’ve captured and make a whitelist of syscalls you need.

Even better, the “bpf” part of “seccomp-bpf” allows you not just control syscalls but also control the syscall arguments. Does your container need to be able to read the “WORKDIR” folder but not the “/etc/” folder? You can make a restriction for this.

(Again, a shout-out to SubgraphOS which made their own seccomp profile system complete with whitelists and blacklists AND they go the extra mile to restrict arguments.)

BTW, what do you do when your tool didn’t correctly capture the syscall? Yes, that’s something you have to worry about.

Summary

Seccomp is great. Custom seccomp profiles are great (in theory).
Finding syscalls is time consuming and doesn’t scale for large oranizations
strace or sysdig (or dtrace for that matter) are the tools to help

This is a gap in the container security world. There are some projects coming out that see this and are trying to take advantage of it to make it easier and scale out to the organization. AquaSec is a start-up trying to get into this space. They wouldn’t admit to me whether or not they’re building custom seccomp profiles but their whole thing is to build a secure container lifecycle.