Summary of the Book "Container Security" by Liz Rice
This blog post is a summary of key insights from the book "Container Security" authored by Liz Rice. These notes were captured as I went through the book, so certain portions might be missing or not as detailed as in the original text. Nonetheless, this summary highlights the main points and is intended to serve as a helpful reference. For a comprehensive understanding, I highly recommend reading the full book. The original book can be found here.
Chapter 1 ( Container Security Threats )
Risks, Threats, and Mitigations
A risk is a potential problem, and the effects of that problem if it were to occur.
A threat is a path to that risk occurring.
A mitigation is a countermeasure against a threat—something you can do to prevent the threat or at least reduce the likelihood of its success.
Security Principles
Least Privilege
Defense in Depth
- have multiple layers of defense
Reducing the Attack Surface
- Less code and fewer dependencies
- Systems with fewer components
- Limit the users and components that can access a service
Limiting the Blast Radius
Segregation of Duties
- Give each component or person only the subset of privileges they need
Chapter 2 ( Linux System Calls, Permissions, and Capabilities )
System Calls
The programmatic interface that the user space code uses to make these requests of the kernel is known as the system call or syscall interface.
There are some 300+ different system calls, with the number varying according to the version of Linux kernel.
File Permissions
setuid
setgid
Linux Capabilities: there is another way to give ping sufficient privileges to open the socket without the executable having all the privileges associated with root.
Because setuid provides a dangerous pathway to privilege escalation, some container image scanners (covered in Chapter 7) will report on the presence of files with the setuid bit set. You can also prevent it from being used with the --security-opt no-new-privileges option on a docker run command.
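To make the setuid point concrete, here is a minimal sketch of how a scanner might spot setuid files. It is not how any particular image scanner works; the function names are my own, and it only checks the mode bits that the kernel exposes through stat.

```python
import os
import stat


def has_setuid(mode):
    """Return True if the setuid bit is set in a file mode."""
    return bool(mode & stat.S_ISUID)


def has_setgid(mode):
    """Return True if the setgid bit is set in a file mode."""
    return bool(mode & stat.S_ISGID)


def find_setuid_files(root):
    """Walk a directory tree and list regular files with the setuid bit set."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode) and has_setuid(st.st_mode):
                found.append(path)
    return sorted(found)
```

Running find_setuid_files against an unpacked image root filesystem would give a list of candidates to review or strip.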
Linux Capabilities
There are over 30 different capabilities in today’s Linux kernel. Capabilities can be assigned to a thread to determine whether that thread can perform certain actions. For example, a thread needs the CAP_NET_BIND_SERVICE capability in order to bind to a low-numbered (below 1024) port. CAP_SYS_BOOT exists so that arbitrary executables don’t have permission to reboot the system. CAP_SYS_MODULE is needed to load or unload kernel modules.
You can see the capabilities assigned to a process by using the getpcaps command. For example, a process run by a non-root user typically won’t have capabilities:
getpcaps <pid>
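You can assign a capability to an executable file with setcap, for example to let ping open a raw socket without relying on setuid:
setcap 'cap_net_raw+p' <executable>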
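Another way to look at a process's capabilities is the CapEff line in /proc/<pid>/status, which holds the effective set as a hexadecimal bitmask. The sketch below decodes such a mask. The table lists only a handful of the bit positions from linux/capability.h, and the function names are my own.

```python
# Bit positions from <linux/capability.h>; only a few are listed here.
CAPABILITY_BITS = {
    0: "CAP_CHOWN",
    10: "CAP_NET_BIND_SERVICE",
    13: "CAP_NET_RAW",
    16: "CAP_SYS_MODULE",
    21: "CAP_SYS_ADMIN",
    22: "CAP_SYS_BOOT",
}


def decode_capabilities(hex_mask):
    """Turn a hex bitmask such as '0000000000002000' into capability names."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            # Fall back to a generic label for bits missing from the table.
            names.append(CAPABILITY_BITS.get(bit, "cap_%d" % bit))
        bit += 1
    return names


def effective_capabilities(status_text):
    """Extract and decode the CapEff line from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_capabilities(line.split()[1])
    return []
```

For a live process you would pass in the contents of /proc/<pid>/status; an all-zero mask, as is typical for a non-root process, decodes to an empty list.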
Chapter 3 ( Control Groups )
Cgroup Hierarchies
There is a hierarchy of control groups for each type of resource being managed, and each hierarchy is managed by a cgroup controller. Any Linux process is a member of one cgroup of each type, and when it is first created, a process inherits the cgroups of its parent.
The Linux kernel communicates information about cgroups through a set of pseudo-filesystems that typically reside at /sys/fs/cgroup.
Assigning a Process to a Cgroup
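Set a limit by writing to a file in the cgroup's directory (here, a cgroup v1 memory limit in bytes):
echo 100000 > memory.limit_in_bytes
Assign a process to the cgroup by writing its PID to cgroup.procs:
echo 29903 > cgroup.procs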
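The two writes above are all there is to it, which is easy to see in a short sketch. This assumes the cgroup v1 file name memory.limit_in_bytes used in these notes (cgroup v2 calls the equivalent file memory.max). The cgroup_dir argument is a placeholder for a real directory under /sys/fs/cgroup, where writing requires root.

```python
import os


def set_memory_limit(cgroup_dir, limit_bytes):
    """Write a memory limit (in bytes) into a cgroup v1 memory cgroup."""
    with open(os.path.join(cgroup_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))


def add_process(cgroup_dir, pid):
    """Move a process into the cgroup by writing its PID to cgroup.procs."""
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(str(pid))
```

Against the real pseudo-filesystem the kernel interprets each write; against an ordinary directory the functions simply create files, which is handy for trying them out safely.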
Chapter 4 ( Container Isolation )
Linux Namespaces
If cgroups control the resources that a process can use, namespaces control what it can see. By putting a process in a namespace, you can restrict the resources that are visible to that process.
PID Namespace:
Isolates the process ID number space, allowing processes inside the namespace to have the same PID as processes outside the namespace.
Enables the creation of containers with their own init process, providing a private process tree.
NET Namespace:
Provides isolation of network interfaces, IP addresses, routing tables, and port numbers.
Allows containers to have their own network stack, separate from the host and other containers.
MNT Namespace:
Isolates the filesystem mount points, so changes to mounts (e.g., adding or removing filesystems) in one namespace do not affect others.
Ensures each container can have its own file system layout.
IPC Namespace:
Isolates System V IPC objects and POSIX message queues, allowing processes to communicate within the same namespace without interfering with other namespaces.
Useful for containers that need shared memory or semaphores.
UTS Namespace:
Isolates the hostname and NIS domain name, allowing containers to have a different hostname and domain name from the host.
Useful for setting custom hostnames within containers.
USER Namespace:
Isolates user and group IDs, enabling processes to have different user IDs and privileges inside the namespace compared to outside.
Allows running containers as non-root users while providing root privileges within the container.
CGROUP Namespace:
Isolates the view of the cgroups, so each namespace sees its own cgroups.
Hides the host's cgroup hierarchy from processes in the namespace; the resource limits themselves (CPU, memory, etc.) are still enforced by cgroups.
TIME Namespace:
Isolates the system and monotonic clocks, allowing processes within the namespace to see and manipulate different time values.
Useful for testing software behavior at different times or handling time-sensitive applications in containers without affecting the host system.
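One way to see which namespaces a process belongs to is the set of symlinks under /proc/<pid>/ns, whose targets look like pid:[4026531836]. Two processes are in the same namespace when the inode numbers match. The sketch below parses those links and compares two processes; the function names are my own, and proc_root is only there so the code can be pointed at a test directory.

```python
import os
import re

_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target like 'pid:[4026531836]' into (type, inode)."""
    m = _NS_LINK.match(target)
    if m is None:
        raise ValueError("unexpected namespace link: %r" % target)
    return m.group(1), int(m.group(2))


def namespaces_of(pid="self", proc_root="/proc"):
    """Map each entry in /proc/<pid>/ns to its namespace inode number."""
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        _ns_type, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


def shared_namespaces(a, b):
    """Return the namespace names on which two processes agree."""
    return sorted(name for name in a if name in b and a[name] == b[name])
```

Comparing namespaces_of(1) with namespaces_of(<container pid>) on a host (as root) shows which namespaces the container shares with the host.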
Chapter 5 ( Virtual Machines )
....
Chapter 6 ( Container Images )
Dockerfile Best Practices for Security
Base image
Refer to an image from a trusted registry
The smaller the base image, the less likely that it includes unnecessary code, and hence the smaller the attack surface.
Be thoughtful about using a tag or a digest to reference the base image.
Use multi-stage builds
Non-root USER
RUN commands
Volume mounts
- Be careful when mounting system directories such as /etc as volumes
Don’t include sensitive data in the Dockerfile
Avoid setuid binaries
Avoid unnecessary code
Include everything that your container needs
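A couple of the practices above are mechanical enough to check automatically. The following is a rough sketch of such a check, not a real linter: it only looks at how base images are referenced and whether the final stage sets a non-root USER, it ignores line continuations, and the warning wording is my own.

```python
def lint_dockerfile(text):
    """Return a list of warnings for a few of the practices listed above."""
    warnings = []
    stages = set()   # names declared with "FROM ... AS <name>"
    user = None      # last USER seen in the current (final) stage
    for raw in text.splitlines():
        parts = raw.strip().split()
        if not parts or parts[0].startswith("#"):
            continue
        instruction = parts[0].upper()
        if instruction == "FROM":
            user = None  # a new stage starts with the base image's user
            args = [p for p in parts[1:] if not p.startswith("--")]
            if not args:
                continue
            image = args[0]
            is_stage_ref = image in stages
            if len(args) >= 3 and args[1].upper() == "AS":
                stages.add(args[2])
            if image == "scratch" or is_stage_ref or "@sha256:" in image:
                continue  # nothing to pin, or already pinned by digest
            last_segment = image.rsplit("/", 1)[-1]
            tag = last_segment.split(":", 1)[1] if ":" in last_segment else None
            if tag is None or tag == "latest":
                warnings.append(
                    "base image %r is not pinned to a specific tag or digest" % image
                )
        elif instruction == "USER" and len(parts) > 1:
            user = parts[1]
    if user is None or user.split(":", 1)[0] in ("root", "0"):
        warnings.append("final stage does not set a non-root USER")
    return warnings
```

Note that a base image may already define a non-root user, so the USER warning is a prompt to check rather than proof of a problem.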
Chapter 7 ( Software Vulnerabilities in Images )
https://github.com/aquasecurity/trivy
https://aws.amazon.com/blogs/containers/scanning-images-with-trivy-in-an-aws-codepipeline/
https://bitnami.com/stack/nginx
Chapter 8 ( Strengthening Container Isolation )
https://blog.jessfraz.com/post/how-to-use-new-docker-seccomp-profiles/
https://github.com/nevins-b/falco2seccomp
https://github.com/aquasecurity/tracee
In AppArmor, a profile can be associated with an executable file, determining what that file is allowed to do in terms of capabilities and file access permissions.
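To check whether AppArmor is enabled in your kernel, look at the file /sys/module/apparmor/parameters/enabled; it contains Y when AppArmor is enabled.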
AppArmor and other LSMs implement mandatory access controls. A mandatory access control is set by a central administrator, and once set, other users do not have any ability to modify the control or pass it on to another user.
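The check against /sys/module/apparmor/parameters/enabled is easy to script. This is a small sketch under the assumption that the file holds Y when AppArmor is enabled; the function name is my own, and the path is a parameter only so the code can be tried against a test file.

```python
def apparmor_enabled(path="/sys/module/apparmor/parameters/enabled"):
    """Return True if the AppArmor 'enabled' parameter file contains Y."""
    try:
        with open(path) as f:
            return f.read().strip().upper() == "Y"
    except OSError:
        # A missing file means the AppArmor module is not loaded.
        return False
```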
https://github.com/genuinetools/bane (Custom & better AppArmor profile generator for Docker containers.)
gVisor
Google’s gVisor sandboxes a container by intercepting system calls in much the same way that a hypervisor intercepts the system calls of a guest virtual machine.
According to the gVisor documentation, gVisor is a “user-space kernel,” which strikes me as a contradiction in terms but is meant to describe how a number of Linux system calls are implemented in user space through paravirtualization. As you saw in Chapter 5, paravirtualization means reimplementing instructions that would otherwise be run by the host kernel.
Kata Containers
As you’ve seen in Chapter 4, when you run a regular container, the container runtime starts a new process within the host. The idea with Kata Containers is to run containers within a separate virtual machine. This approach gives the ability to run applications from regular OCI format container images, with all the isolation of a virtual machine.
Kata uses a proxy between the container runtime and a separate target host where the application code runs. The runtime proxy creates a separate virtual machine using QEMU to run the container on its behalf.
One criticism of Kata Containers is that you have to wait for a virtual machine to boot up. The folks at AWS have created a lightweight virtual machine that is specifically designed for running containers, with much faster startup times than a normal VM: Firecracker.
Firecracker
As noted in the “Disadvantages of Virtual Machines” section, virtual machines are slow to start, making them unsuitable for the ephemeral workloads that typically run in containers. But what if you had a virtual machine that boots extremely quickly? Firecracker is a virtual machine offering the benefits of secure isolation through a hypervisor and no shared kernel, but with startup times of around 100ms, which makes it much more suitable for containers. It has the benefit of becoming field-hardened due to its (as I understand it, gradual) adoption by AWS for its Lambda and Fargate services.
Unikernels
The idea of Unikernels is to create a dedicated machine image consisting of the application and the parts of the operating system that the app needs. This machine image can run directly on the hypervisor, giving the same levels of isolation as regular virtual machines, but with a lightweight startup time similar to what we see in Firecracker.
Every application has to be compiled into a Unikernel image complete with everything it needs to operate. The hypervisor can boot up this machine in just the same way that it would boot a standard Linux virtual machine image.
Chapter 9 ( Breaking Container Isolation )
Containers Run as Root by Default
Rootless Containers
The --privileged Flag and Capabilities
Mounting Sensitive Directories
Mounting the Docker Socket
Sharing Namespaces Between a Container and Its Host
Sidecar Containers
Chapter 10 ( Container Network Security )
https://bitnami.com/stack/nginx
docker run -it --rm nginx
./tracee.py -c -e cap_capable
docker run --cap-drop=all --cap-add=<cap1> --cap-add=<cap2> <image>
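Once you know which capabilities a container actually uses (for example from tracing it), the drop-all-then-add-back command can be generated rather than typed by hand. A minimal sketch follows; the function names are my own, and the capability list in the usage note is a made-up example rather than the set nginx really needs.

```python
import shlex


def docker_run_args(image, needed_caps=()):
    """Build a docker run argv that drops all capabilities, then adds some back."""
    args = ["docker", "run", "--cap-drop=all"]
    for cap in needed_caps:
        args.append("--cap-add=" + cap)
    args.append(image)
    return args


def command_line(args):
    """Render an argv list as a shell-safe command string."""
    return " ".join(shlex.quote(a) for a in args)
```

For example, command_line(docker_run_args("nginx", ["NET_BIND_SERVICE"])) renders the command with a single capability added back.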
Layer 3/4 Routing and Rules
As you already know, routing at Layer 3 is concerned with deciding the next hop for an IP packet. This decision is based on a set of rules about which addresses are reached over which interface. But this is just a subset of things you can do with Layer 3 rules: there are also some fun things that can go on at this level to drop packets or manipulate IP addresses, for example, to implement load balancing, NAT, firewalls, and network security policies. Rules can also act at Layer 4 to take into account the port number. These rules rely on a kernel feature called netfilter.
netfilter is a packet-filtering framework that was first introduced into the Linux kernel in version 2.4. It uses a set of rules that define what to do with a packet based on its source and destination addresses.
There are a few different ways that netfilter rules can be configured in user space. Let’s look at the two most common options: iptables and IPVS.
iptables
The iptables tool is one way of configuring IP packet–handling rules that are dealt with in the kernel using netfilter. There are several different table types; the two most interesting types in the context of container networking are filter and nat:
filter—for deciding whether to drop or forward packets
nat—for translating addresses
As a root user, you can see the current set of rules of a particular type by running iptables -t <table type> -L.
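To get a feel for how a filter table reaches a decision, here is a toy model: rules are checked in order, the first one that matches the packet's addresses and port decides, and a default policy applies if none match. This only illustrates the idea and is not how netfilter is implemented; the Rule fields and function names are my own.

```python
import ipaddress
from collections import namedtuple

# dport=None means "any destination port".
Rule = namedtuple("Rule", "source destination dport action")


def matches(rule, src, dst, dport):
    """Check a packet's source, destination and port against one rule."""
    return (
        ipaddress.ip_address(src) in ipaddress.ip_network(rule.source)
        and ipaddress.ip_address(dst) in ipaddress.ip_network(rule.destination)
        and (rule.dport is None or rule.dport == dport)
    )


def decide(rules, src, dst, dport, policy="ACCEPT"):
    """Return the action of the first matching rule, else the default policy."""
    for rule in rules:
        if matches(rule, src, dst, dport):
            return rule.action
    return policy
```

With one rule accepting port 80 traffic from 10.0.0.0/8 to a pod address and a second rule dropping everything else to that address, only sources inside 10.0.0.0/8 get through on port 80.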
IPVS
IP Virtual Server (IPVS) is sometimes referred to as Layer 4 load balancing or Layer 4 LAN switching. It is another rules implementation similar to iptables, but it’s optimized for load balancing by storing the forwarding rules in hash tables.
This optimization makes it very performant for kube-proxy’s use case, but it doesn’t necessarily mean you can draw conclusions about the performance of network plug-ins implementing network policies.
Whether it’s IPVS or iptables that manages netfilter rules, they act within the kernel.
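The hash-table point is the key difference from walking a rule list: looking up a service is a single keyed lookup regardless of how many services exist. The toy model below shows that idea with a dictionary keyed by virtual IP and port, and round-robin selection of backends. Real IPVS lives in the kernel and supports several scheduling algorithms; the class and method names here are my own.

```python
import itertools


class VirtualServerTable:
    """Toy IPVS-style table: (virtual IP, port) -> rotating list of backends."""

    def __init__(self):
        self._services = {}

    def add_service(self, vip, port, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("a service needs at least one backend")
        # itertools.cycle gives simple round-robin scheduling.
        self._services[(vip, port)] = itertools.cycle(backends)

    def pick_backend(self, vip, port):
        """Return the next backend for a service, or None if it is unknown."""
        backends = self._services.get((vip, port))
        return next(backends) if backends is not None else None
```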
Chapter 11 ( Securely Connecting Components with TLS )
....
Chapter 12 ( Passing Secrets to Containers )
Storing the Secret in the Container Image
Passing the Secret Over the Network
Passing Secrets in Environment Variables
Passing Secrets Through Files
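For the file-based option, the application side is just a file read. Here is a minimal sketch, assuming the secret has been mounted into the container as a file. The default directory /run/secrets is an assumption borrowed from how Docker secrets are mounted, and the function name is my own.

```python
import os


def read_secret(name, secrets_dir="/run/secrets"):
    """Read a secret from a mounted file, dropping the trailing newline."""
    path = os.path.join(secrets_dir, name)
    with open(path) as f:
        return f.read().rstrip("\n")
```

Unlike an environment variable, a secret read this way does not show up in docker inspect output or in /proc/<pid>/environ.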
Chapter 13 ( Container Runtime Protection )
Container Image Profiles
Network Traffic Profiles
Executable Profiles
- Observing executables with eBPF
File Access Profiles
User ID Profiles
Other Runtime Profiles
Container Security Tools
AppArmor, SELinux, or seccomp profile.
Network traffic can be policed at runtime using network policy or a service mesh, as described in
CNCF project Falco.
Prevention or alerting
Chapter 14 ( Containers and the OWASP Top 10 )
....
Tools to play with Container Internals
lscgroup
lib-cgroup
setns
nsenter
unshare
chroot
lsns
Links
https://www.openwall.com/lists/oss-security/2019/02/11/2
https://github.com/0xAX/linux-insides
https://github.com/opencontainers/umoci
https://github.com/containers/skopeo
https://medium.com/microscaling-systems/spot-the-docker-difference-9f99adcc4aaf
https://github.com/GoogleContainerTools/distroless
https://github.com/aquasecurity/trivy
https://aws.amazon.com/blogs/containers/scanning-images-with-trivy-in-an-aws-codepipeline/
https://www.schneier.com/blog/archives/2016/08/the_nsa_is_hoar.html
https://blog.jessfraz.com/post/how-to-use-new-docker-seccomp-profiles/