antiTree | posts and projects
posted on Jan 10, 2019

By default, Docker runs all of its containers with the CAP_NET_RAW capability, I believe to easily support ICMP health checks when needed. Supporting ping makes sense, but this post will go through why CAP_NET_RAW is an unnecessary risk and how you can still send pings without it.

What does CAP_NET_RAW do?

CAP_NET_RAW controls a process's ability to open raw sockets and so build whatever packets it wants: TCP, UDP, ARP, ICMP, etc. For example, you may have noticed that running nmap --privileged as a normal user produces output like this:

Starting Nmap 7.60 ( https://nmap.org ) at 2019-01-10 16:31 EST
Couldn't open a raw socket. Error: Operation not permitted (1)

But running sudo setcap cap_net_raw+ep $(which nmap) and re-running the same command will give you a different result:

./nmap --privileged -sP ioioio.cz

Starting Nmap 7.60 ( https://nmap.org ) at 2019-01-10 16:29 EST
Nmap scan report for ioioio.cz (45.79.145.175)
Host is up (0.028s latency).
...

The setcap command explicitly adds the CAP_NET_RAW capability as an attribute of the nmap binary on disk, so anyone who executes that binary will be able to forge their own packets, including ICMP and UDP.

Hopefully you understood that last paragraph and are going back to undo what you just did. :) To do that, run sudo setcap -r $(which nmap).

Containers Without CAP_NET_RAW

Back to containers: Docker's containers still run with CAP_NET_RAW enabled by default, which means that in theory a container can forge its own packets, send ICMP, and maybe even do something like ARP poisoning in the right situation. For those trying to harden their containers, one of the most basic steps is to drop unnecessary capabilities, so why can't we (almost) always drop CAP_NET_RAW?

Docker and other engines support a per-container --sysctl argument that lets you set the net.ipv4.ping_group_range value, which removes the need for CAP_NET_RAW when sending a ping. On a normal Linux box it would look like this:

sysctl -w net.ipv4.ping_group_range="0 65535"

It means that any user whose group ID falls between 0 (root's group) and 65535 will be able to use the ping command. Note that the range covers group IDs (GIDs), not user IDs. (NOTE: I've seen a much higher upper bound than 65535 as well, depending on your max GID.)
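To make the range semantics concrete, here is a minimal sketch of the check in shell. The kernel documents net.ipv4.ping_group_range as an inclusive range of group IDs, and the function name in_ping_group_range is my own invention for illustration. As far as I know the kernel also considers supplementary groups, which this sketch ignores.

```shell
#!/bin/sh
# in_ping_group_range GID "MIN MAX"   (hypothetical helper, not a real tool)
# Succeeds when GID lies inside the inclusive range MIN..MAX.
in_ping_group_range() {
    gid=$1
    # Intentionally unquoted: split "MIN MAX" (space or tab) into $1 and $2.
    set -- $2
    [ "$gid" -ge "$1" ] && [ "$gid" -le "$2" ]
}

# Compare the current primary GID against the live setting, if readable.
range_file=/proc/sys/net/ipv4/ping_group_range
if [ -r "$range_file" ]; then
    if in_ping_group_range "$(id -g)" "$(cat "$range_file")"; then
        echo "GID $(id -g) may open ICMP echo sockets"
    else
        echo "GID $(id -g) is outside the range: $(cat "$range_file")"
    fi
fi
```

A range whose minimum is above its maximum, such as "1 0", matches nobody, so every GID fails the check.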

Let's do this in a container.

**Normal container:**

> docker run --rm -it willfarrell/ping 
ping 517d215a8fe2 every 300 sec
PING 517d215a8fe2 (172.17.0.3): 56 data bytes
64 bytes from 172.17.0.3: seq=0 ttl=64 time=0.038 ms

--- 517d215a8fe2 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.038/0.038/0.038 ms

Ping works because the container has CAP_NET_RAW.
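If you want to confirm that from inside a container rather than take my word for it, one option is to read the CapEff bitmask from /proc/self/status and test bit 13, which is CAP_NET_RAW's number in linux/capability.h. This is a rough sketch, and the helper names has_cap_net_raw and effective_caps are hypothetical.

```shell
#!/bin/sh
# CAP_NET_RAW is capability number 13 in linux/capability.h.
CAP_NET_RAW_BIT=13

# has_cap_net_raw HEXMASK   (hypothetical helper)
# Succeeds if bit 13 is set in a capability mask given as hex without "0x".
has_cap_net_raw() {
    mask=$((0x$1))
    [ $(( (mask >> CAP_NET_RAW_BIT) & 1 )) -eq 1 ]
}

# effective_caps   (hypothetical helper)
# Prints the current process's effective capability mask as hex.
effective_caps() {
    awk '/^CapEff:/ {print $2}' /proc/self/status
}

# Report on the current process when /proc is available.
if [ -r /proc/self/status ]; then
    if has_cap_net_raw "$(effective_caps)"; then
        echo "CAP_NET_RAW is in the effective set"
    else
        echo "CAP_NET_RAW is not in the effective set"
    fi
fi
```

For example, a mask of 00000000a80425fb has the bit set, while 00000000a80405fb (the same mask with bit 13 cleared) does not.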


**Same image but dropping cap_net_raw:**

> docker run --rm -it \
    --cap-drop=net_raw \
    willfarrell/ping
ping 15a94d4af148 every 300 sec
PING 15a94d4af148 (172.17.0.3): 56 data bytes
ping: permission denied (are you root?)

Ping doesn’t work because I dropped CAP_NET_RAW.


**Same image, still dropping net_raw, but using a custom sysctl argument:**

> docker run --rm -it \
    --cap-drop=net_raw \
    --sysctl "net.ipv4.ping_group_range=0 1000000000" \
    willfarrell/ping
ping 0bedc92d1c86 every 300 sec
PING 0bedc92d1c86 (172.17.0.3): 56 data bytes
64 bytes from 172.17.0.3: seq=0 ttl=42 time=0.036 ms

--- 0bedc92d1c86 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.036/0.036/0.036 ms

Ping works again even though I dropped CAP_NET_RAW.

Results

The point is, until someone tells me otherwise, I don’t think we need CAP_NET_RAW in most scenarios.

So, OK, can we just get rid of CAP_NET_RAW now? Probably not. There are probably some edge scenarios that Docker wants to officially support, so they'll keep it in; the goal of Docker is to be usable first and to let you configure it securely second. But that shouldn't stop you from dropping it from all of your containers. Some other container engines already use this method by default.