Summary of the Book "Container Security" by Liz Rice

This blog post is a summary of key insights from the book "Container Security" authored by Liz Rice. These notes were captured as I went through the book, so certain portions might be missing or not as detailed as in the original text. Nonetheless, this summary highlights the main points and is intended to serve as a helpful reference. For a comprehensive understanding, I highly recommend reading the full book. The original book can be found here.

Chapter 1 ( Container Security Threats )

Risks, Threats, and Mitigations

A risk is a potential problem, and the effects of that problem if it were to occur.

A threat is a path to that risk occurring.

A mitigation is a countermeasure against a threat—something you can do to prevent the threat or at least reduce the likelihood of its success.

Security Principles

  • Least Privilege

  • Defense in Depth

    • Have multiple layers of defense

  • Reducing the Attack Surface

    • Less code and fewer dependencies

    • Systems with fewer components

    • Limit the users and components that can access a service

  • Limiting the Blast Radius

  • Segregation of Duties

    • Give each component or person only the subset of privileges they need

Chapter 2 ( Linux System Calls, Permissions, and Capabilities )

System Calls

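User-space code asks the kernel to do things on its behalf, such as accessing a file or communicating over the network. The programmatic interface that user-space code uses to make these requests of the kernel is known as the system call or syscall interface.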

There are some 300+ different system calls, with the number varying according to the version of Linux kernel.

File Permissions

  • setuid

  • setgid

Linux capabilities provide another way to give an executable such as ping sufficient privileges to open the socket without the executable having all the privileges associated with root.

Because setuid provides a dangerous pathway to privilege escalation, some container image scanners (covered in Chapter 7) will report on the presence of files with the setuid bit set. You can also prevent it from being used with the --security-opt no-new-privileges option on a docker run command.
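The setuid check a scanner performs can be sketched as a walk over a directory tree that tests the setuid bit in each file's mode. This is only an illustration of the idea, not how any particular scanner works; the names has_setuid and find_setuid_files are made up for this example.

```python
import os
import stat
from typing import List


def has_setuid(mode: int) -> bool:
    """Return True if the setuid bit is set in an st_mode value."""
    return bool(mode & stat.S_ISUID)


def find_setuid_files(root: str) -> List[str]:
    """Walk `root` and return the regular files that have the setuid bit set."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # unreadable or vanished file: skip it
            if stat.S_ISREG(st.st_mode) and has_setuid(st.st_mode):
                found.append(path)
    return sorted(found)
```

Pointing find_setuid_files at an unpacked image filesystem would list the candidates worth reviewing.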

Linux Capabilities

There are over 30 different capabilities in today’s Linux kernel. Capabilities can be assigned to a thread to determine whether that thread can perform certain actions. For example, a thread needs the CAP_NET_BIND_SERVICE capability in order to bind to a low-numbered (below 1024) port. CAP_SYS_BOOT exists so that arbitrary executables don’t have permission to reboot the system. CAP_SYS_MODULE is needed to load or unload kernel modules.

You can see the capabilities assigned to a process by using the getpcaps command; a process run by a non-root user typically won’t have any. You can assign capabilities to an executable file with setcap:

getpcaps <pid>
setcap 'cap_net_raw+p' <executable>
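The kernel also exposes a process's capabilities as hexadecimal bitmasks in /proc/<pid>/status (the CapEff line holds the effective set). Here is a small sketch of decoding that mask. The bit positions come from linux/capability.h, but the table only covers the capabilities mentioned in these notes, and the function names are my own.

```python
# Bit positions from linux/capability.h. Only the capabilities named in these
# notes are listed, so this table is deliberately incomplete.
CAP_BITS = {
    10: "CAP_NET_BIND_SERVICE",
    13: "CAP_NET_RAW",
    16: "CAP_SYS_MODULE",
    22: "CAP_SYS_BOOT",
}


def parse_cap_eff(status_text):
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def decode_caps(mask):
    """Return the known capability names whose bits are set in `mask`."""
    return [name for bit, name in sorted(CAP_BITS.items()) if mask & (1 << bit)]
```

For a real process you would pass in the contents of /proc/<pid>/status; a mask of all zeros decodes to an empty list, which matches the "no capabilities" case for a typical non-root process.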

Chapter 3 ( Control Groups )

Cgroup Hierarchies

There is a hierarchy of control groups for each type of resource being managed, and each hierarchy is managed by a cgroup controller. Any Linux process is a member of one cgroup of each type, and when it is first created, a process inherits the cgroups of its parent.

The Linux kernel communicates information about cgroups through a set of pseudo-filesystems that typically reside at /sys/fs/cgroup.
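A process's cgroup membership can also be read from /proc/<pid>/cgroup, where each line has the form hierarchy-ID:controller-list:path. A minimal sketch of parsing it follows; the function name is hypothetical, and the "unified" label for the cgroup v2 entry (which has an empty controller field) is my own choice.

```python
def parse_proc_cgroup(text):
    """Map each controller to the cgroup path listed in /proc/<pid>/cgroup text."""
    membership = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        _hierarchy_id, controllers, path = line.split(":", 2)
        # A cgroup v2 entry has an empty controller field; label it "unified".
        names = controllers.split(",") if controllers else ["unified"]
        for name in names:
            membership[name] = path
    return membership
```

Feeding it the contents of /proc/self/cgroup shows which cgroup the current process belongs to for each controller.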

Assigning a Process to a Cgroup

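# Run inside the cgroup's directory: set its memory limit in bytes (cgroup v1 memory controller)
echo 100000 > memory.limit_in_bytes
# Move the process with PID 29903 into this cgroup
echo 29903 > cgroup.procs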
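The same two writes can be done from a program. This is a sketch that assumes a cgroup v1 memory cgroup directory already exists; limit_and_assign is a name I made up for illustration.

```python
import os


def limit_and_assign(cgroup_dir, pid, limit_bytes):
    """Set a cgroup v1 memory limit, then move `pid` into that cgroup."""
    # Mirrors: echo <limit_bytes> > memory.limit_in_bytes
    with open(os.path.join(cgroup_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))
    # Mirrors: echo <pid> > cgroup.procs
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(str(pid))
```

On a real system this needs root, and cgroup_dir would be a directory under /sys/fs/cgroup. On cgroup v2 the memory limit file is called memory.max instead.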

Chapter 4 (Container Isolation)

Linux Namespaces

If cgroups control the resources that a process can use, namespaces control what it can see. By putting a process in a namespace, you can restrict the resources that are visible to that process.

PID Namespace:

  • Isolates the process ID number space, allowing processes inside the namespace to have the same PID as processes outside the namespace.

  • Enables the creation of containers with their own init process, providing a private process tree.

NET Namespace:

  • Provides isolation of network interfaces, IP addresses, routing tables, and port numbers.

  • Allows containers to have their own network stack, separate from the host and other containers.

MNT Namespace:

  • Isolates the filesystem mount points, so changes to mounts (e.g., adding or removing filesystems) in one namespace do not affect others.

  • Ensures each container can have its own file system layout.

IPC Namespace:

  • Isolates System V IPC objects and POSIX message queues, allowing processes to communicate within the same namespace without interfering with other namespaces.

  • Useful for containers that need shared memory or semaphores.

UTS Namespace:

  • Isolates the hostname and NIS domain name, allowing containers to have a different hostname and domain name from the host.

  • Useful for setting custom hostnames within containers.

USER Namespace:

  • Isolates user and group IDs, enabling processes to have different user IDs and privileges inside the namespace compared to outside.

  • Allows a process to be root inside the container while being mapped to a non-root user on the host.

CGROUP Namespace:

  • Isolates the view of the cgroups, so each namespace sees its own cgroups.

  • Hides the host's cgroup hierarchy from the container; the resource limits themselves (CPU, memory, etc.) are still enforced by the cgroups, not by the namespace.

TIME Namespace:

  • Isolates the monotonic and boot-time clocks (CLOCK_MONOTONIC and CLOCK_BOOTTIME), so processes within the namespace can see different offsets for them; the wall-clock time (CLOCK_REALTIME) is not virtualized.

  • Useful mainly for checkpoint/restore and for migrating containers between hosts, where these clocks need to stay consistent, without affecting the host system.
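The namespaces a process belongs to are visible as symbolic links under /proc/<pid>/ns, with targets such as pid:[4026531836]. Below is a small sketch for reading them on a Linux host; the helper names are hypothetical.

```python
import os
import re

_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target like 'pid:[4026531836]' into (type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError("not a namespace link: %r" % target)
    return match.group(1), int(match.group(2))


def list_namespaces(pid="self"):
    """Return {entry name: (type, inode)} for every link in /proc/<pid>/ns."""
    ns_dir = "/proc/%s/ns" % pid
    result = {}
    for entry in sorted(os.listdir(ns_dir)):
        result[entry] = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
    return result
```

Two processes share a namespace when the inode numbers for that namespace type match, so comparing list_namespaces() for a containerized process and for a host process shows which namespaces actually differ.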

Chapter 5 ( Virtual Machines )

....

Chapter 6 ( Container Images )

Dockerfile Best Practices for Security

Base image

  • Refer to an image from a trusted registry

  • The smaller the base image, the less likely that it includes unnecessary code, and hence the smaller the attack surface.

  • https://github.com/GoogleContainerTools/distroless

  • Be thoughtful about using a tag or a digest to reference the base image.
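A tag can be moved to point at a different image later, while a digest always identifies the same content. As a rough sketch, a build check could classify a base image reference like this; reference_kind is a hypothetical helper, and it only recognizes sha256 digests.

```python
import re

_DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")


def reference_kind(ref):
    """Classify an image reference as 'digest', 'tag', or 'implicit latest'."""
    if _DIGEST.search(ref):
        return "digest"
    # Only look for a tag in the last path component, so that a registry
    # port such as localhost:5000 is not mistaken for a tag.
    last_component = ref.rsplit("/", 1)[-1]
    if ":" in last_component:
        return "tag"
    return "implicit latest"
```

A stricter policy might reject anything that is not "digest"; a looser one might only reject "implicit latest".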

Use multi-stage builds

Non-root USER

RUN commands

Volume mounts

  • Be careful when mounting system directories such as /etc as volumes

Don’t include sensitive data in the Dockerfile

Avoid setuid binaries

Avoid unnecessary code

Include everything that your container needs

Chapter 7 ( Software Vulnerabilities in Images )

https://github.com/aquasecurity/trivy

https://aws.amazon.com/blogs/containers/scanning-images-with-trivy-in-an-aws-codepipeline/

https://bitnami.com/stack/nginx

Chapter 8 ( Strengthening Container Isolation )

https://blog.jessfraz.com/post/how-to-use-new-docker-seccomp-profiles/

https://github.com/nevins-b/falco2seccomp

https://github.com/aquasecurity/tracee

In AppArmor, a profile can be associated with an executable file, determining what that file is allowed to do in terms of capabilities and file access permissions.

To check whether AppArmor is enabled, look at the file /sys/module/apparmor/parameters/enabled, which contains Y when it is.
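Here is a minimal sketch of checking that file from a program. It treats a missing file as "not enabled", which is an assumption of this example, and apparmor_enabled is a name of my own.

```python
def apparmor_enabled(path="/sys/module/apparmor/parameters/enabled"):
    """Return True if the AppArmor kernel module reports itself as enabled."""
    try:
        with open(path) as f:
            return f.read().strip() == "Y"
    except OSError:
        # File missing or unreadable: assume AppArmor is not available.
        return False
```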

AppArmor and other LSMs implement mandatory access controls. A mandatory access control is set by a central administrator, and once set, other users do not have any ability to modify the control or pass it on to another user.

https://github.com/genuinetools/bane (Custom & better AppArmor profile generator for Docker containers.)

gVisor

Google’s gVisor sandboxes a container by intercepting system calls in much the same way that a hypervisor intercepts the system calls of a guest virtual machine.

According to the gVisor documentation, gVisor is a “user-space kernel,” which strikes me as a contradiction in terms but is meant to describe how a number of Linux system calls are implemented in user space through paravirtualization. As you saw in Chapter 5, paravirtualization means reimplementing instructions that would otherwise be run by the host kernel.

Kata Containers

As you’ve seen in Chapter 4, when you run a regular container, the container runtime starts a new process within the host. The idea with Kata Containers is to run containers within a separate virtual machine. This approach gives the ability to run applications from regular OCI format container images, with all the isolation of a virtual machine.

Kata uses a proxy between the container runtime and a separate target host where the application code runs. The runtime proxy creates a separate virtual machine using QEMU to run the container on its behalf.

One criticism of Kata Containers is that you have to wait for a virtual machine to boot up. The folks at AWS have created a lightweight virtual machine that is specifically designed for running containers, with much faster startup times than a normal VM: Firecracker.

Firecracker

As covered in “Disadvantages of Virtual Machines” (Chapter 5), virtual machines are slow to start, making them unsuitable for the ephemeral workloads that typically run in containers. But what if you had a virtual machine that boots extremely quickly? Firecracker is a virtual machine offering the benefits of secure isolation through a hypervisor and no shared kernel, but with startup times around 100ms, it is much more suitable for containers. It has the benefit of becoming field-hardened due to its (as I understand it, gradual) adoption by AWS for its Lambda and Fargate services.

Unikernels

The idea of Unikernels is to create a dedicated machine image consisting of the application and the parts of the operating system that the app needs. This machine image can run directly on the hypervisor, giving the same levels of isolation as regular virtual machines, but with a lightweight startup time similar to what we see in Firecracker.

Every application has to be compiled into a Unikernel image complete with everything it needs to operate. The hypervisor can boot up this machine in just the same way that it would boot a standard Linux virtual machine image.

Chapter 9 ( Breaking Container Isolation )

Containers Run as Root by Default

Rootless Containers

The --privileged Flag and Capabilities

Mounting Sensitive Directories

Mounting the Docker Socket

Sharing Namespaces Between a Container and Its Host

Sidecar Containers

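https://bitnami.com/stack/nginx

https://rootlesscontaine.rs/

To find out which capabilities a container needs, run it while tracing cap_capable events with Tracee, then start it with only those capabilities:

docker run -it --rm nginx
./tracee.py -c -e cap_capable
docker run --cap-drop=all --cap-add=<cap1> --cap-add=<cap2> <image>

Chapter 10 ( Container Network Security )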

Layer 3/4 Routing and Rules

As you already know, routing at Layer 3 is concerned with deciding the next hop for an IP packet. This decision is based on a set of rules about which addresses are reached over which interface. But this is just a subset of things you can do with Layer 3 rules: there are also some fun things that can go on at this level to drop packets or manipulate IP addresses, for example, to implement load balancing, NAT, firewalls, and network security policies. Rules can also act at Layer 4 to take into account the port number. These rules rely on a kernel feature called netfilter.

netfilter is a packet-filtering framework that was first introduced into the Linux kernel in version 2.4. It uses a set of rules that define what to do with a packet based on its source and destination addresses.

There are a few different ways that netfilter rules can be configured in user space. Let’s look at the two most common options: iptables and IPVS.

iptables

The iptables tool is one way of configuring IP packet–handling rules that are dealt with in the kernel using netfilter. There are several different table types; the two most interesting types in the context of container networking are filter and nat:

  • filter—for deciding whether to drop or forward packets

  • nat—for translating addresses

    As a root user, you can see the current set of rules of a particular type by running iptables -t <table type> -L.

IPVS

IP Virtual Server (IPVS) is sometimes referred to as Layer 4 load balancing or Layer 4 LAN switching. It is another rules implementation similar to iptables, but it’s optimized for load balancing by storing the forwarding rules in hash tables.

This optimization makes it very performant for kube-proxy’s use case, but it doesn’t necessarily mean you can draw conclusions about the performance of network plug-ins implementing network policies.

Whether it’s IPVS or iptables that manages netfilter rules, they act within the kernel.
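To get a feel for why storing forwarding rules in hash tables helps, here is a toy model in user-space Python. It is not how the kernel implements either mechanism; the addresses, backend names, and function names are all invented for illustration.

```python
def lookup_sequential(rules, dest):
    """Walk an ordered rule list until one matches, like evaluating a rule chain."""
    for rule_dest, backend in rules:
        if rule_dest == dest:
            return backend
    return None


def lookup_hashed(table, dest):
    """Find the backend with a single hash-table lookup."""
    return table.get(dest)


# 1000 made-up service addresses, each forwarding to one made-up backend.
rules = [
    ("10.96.%d.%d:80" % (i // 256, i % 256), "backend-%d" % i)
    for i in range(1000)
]
table = dict(rules)
```

Both functions return the same backend, but the sequential version has to compare against every rule ahead of the match, so its cost grows with the number of services, while the hashed version stays roughly constant.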

Chapter 11 ( Securely Connecting Components with TLS )

....

Chapter 12 ( Passing Secrets to Containers )

  1. Storing the Secret in the Container Image

  2. Passing the Secret Over the Network

  3. Passing Secrets in Environment Variables

  4. Passing Secrets Through Files
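For the file-based option, the application side can be as simple as reading the mounted file at startup. In this sketch the strict permission check is a policy choice of my own, not something the book prescribes, and read_secret is a hypothetical helper.

```python
import os
import stat


def read_secret(path, require_private=True):
    """Read a secret from a file, optionally refusing group/other-readable files."""
    if require_private:
        mode = os.stat(path).st_mode
        if mode & (stat.S_IRGRP | stat.S_IROTH):
            raise PermissionError(
                "secret file %s is readable by group or others" % path
            )
    with open(path, "r") as f:
        # Strip the trailing newline that editors and echo tend to add.
        return f.read().rstrip("\n")
```

The path would be wherever the orchestrator mounts the secret. If the mount uses more permissive file modes, pass require_private=False or adjust the check.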

Chapter 13 ( Container Runtime Protection )

Container Image Profiles

Network Traffic Profiles

Executable Profiles

  • Observing executables with eBPF

File Access Profiles

User ID Profiles

Other Runtime Profiles

Container Security Tools

  • AppArmor, SELinux, or seccomp profiles.

  • Network traffic can be policed at runtime using network policy or a service mesh.

  • CNCF project Falco.

Prevention or alerting

Chapter 14 ( Containers and the OWASP Top 10 )

....

Tools to play with Container Internals

  • lscgroup

  • libcgroup

  • setns

  • nsenter

  • unshare

  • chroot

  • lsns

Links

https://www.openwall.com/lists/oss-security/2019/02/11/2

https://github.com/0xAX/linux-insides

https://github.com/opencontainers/umoci

https://github.com/containers/skopeo

https://medium.com/microscaling-systems/spot-the-docker-difference-9f99adcc4aaf

https://github.com/GoogleContainerTools/distroless

https://github.com/aquasecurity/trivy

https://aws.amazon.com/blogs/containers/scanning-images-with-trivy-in-an-aws-codepipeline/

https://www.schneier.com/blog/archives/2016/08/the_nsa_is_hoar.html

https://blog.jessfraz.com/post/how-to-use-new-docker-seccomp-profiles/

https://bitnami.com/stack/nginx