Symlink race vulnerabilities are common issues we encounter while performing code audits. One might think that such bugs should not happen anymore and that every developer can easily avoid them; sadly, the reality is different. In this blog post, we take a closer look at a symlink race vulnerability in docker from 2018. We think the vulnerability is quite interesting since it is easy to exploit but not so easy to spot during review. Attentive readers may ask themselves whether they would have noticed the issue while developing or reviewing the affected lines of code.
Due to their time-dependent nature, symlink races are not always easy to exploit and therefore remain underrated. Sometimes, however, they are exploitable with surprisingly little effort. One of these bugs is the docker vulnerability CVE-2018-15664.
docker offers a way to copy files from/to a container. The basic syntax of the copy tool is:
docker cp [OPTIONS] CONTAINER:SRC_PATH DEST_PATH|-
docker cp [OPTIONS] SRC_PATH|- CONTAINER:DEST_PATH
So to update the container’s /etc/passwd file, a system administrator can execute a command such as:
docker cp new_passwd ea7d702924bc:/etc/passwd
where ea7d702924bc is the container ID.
Internally, docker cp shifts execution to the docker daemon, which runs as root. The daemon will resolve the path to the actual host path of the docker storage. ea7d702924bc:/etc/passwd would resolve to /var/lib/docker/btrfs/subvolumes/6383eed3/etc/passwd, where /var/lib/docker/btrfs/subvolumes/6383eed3 is the root directory from the container’s point of view.
Depending on the type of docker storage, the path might be different. In this example, we used btrfs subvolumes as storage. For better illustration, please read BASEDIR as /var/lib/docker/btrfs/subvolumes/6383eed3 from now on.
Things get interesting when symlinks are involved. Let’s assume that within the container, /etc/ is a symlink pointing to /. The docker daemon will determine that the last path component of BASEDIR/etc is a symlink. Simply following the symlink is not allowed: it would redirect to /, which in this context is the host filesystem root. So docker reads the contents of the symlink and applies it to the base directory of the container. The final path is BASEDIR/. Therefore the passwd file will land directly in the container root instead of /etc/. This is expected and nothing to worry about.
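To ensure that all file operations run within the container filesystem, docker sets up a new mount namespace. The interesting steps, as visible in the system call trace, are:
- unshare(CLONE_NEWNS) to create the new mount namespace
- pivot_root() so that the root directory is within the container root
- opening the destination file relative to the new /
So, a simplified sequence of executed system calls could be: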
[pid 1724] lstat("BASEDIR/etc", {st_mode=S_IFLNK|0777, st_size=1, ...}) = 0
[pid 1724] readlinkat(AT_FDCWD, "BASEDIR/etc", "/", 128) = 1
[pid 1724] lstat("BASEDIR", {st_mode=S_IFDIR|0755, st_size=206, ...}) = 0
[pid 1724] fork() = 21681
[pid 21681] unshare(CLONE_NEWNS)
[pid 21681] mkdirat(AT_FDCWD, "BASEDIR/.pivot_root918928353", 0700) = 0
[pid 21681] pivot_root("BASEDIR", "BASEDIR/.pivot_root918928353") = 0
[pid 21681] chdir("/") = 0
[pid 21681] mount("", "/.pivot_root918928353", 0xc420246fd3, MS_REC|MS_PRIVATE, NULL) = 0
[pid 21681] umount2("/.pivot_root918928353", MNT_DETACH) = 0
[pid 21681] unlinkat(AT_FDCWD, "/.pivot_root918928353", AT_REMOVEDIR) = 0
[pid 21681] openat(AT_FDCWD, "/passwd", O_WRONLY|O_CREAT|O_CLOEXEC, 0644) = 6
We observe that docker assumes that the sanitized path cannot change while it is working on it. That’s where the root of the problem lies.
Consider that /etc within the container is not a symlink at the time of the check but is changed to a symlink pointing to / right before mkdirat(). Under this assumption the system call sequence could be:
[pid 15509] lstat("BASEDIR/etc", {st_mode=S_IFDIR|0755, st_size=18, ...}) = 0
[pid 15509] fork() = 21630
[pid 21630] unshare(CLONE_NEWNS)
[pid 21630] mkdirat(AT_FDCWD, "BASEDIR/etc/.pivot_root447772558", 0700) = 0
[pid 21630] pivot_root("BASEDIR/etc/", "BASEDIR/etc/.pivot_root447772558") = 0
[pid 21630] chdir("/") = 0
[pid 21630] mount("", "/.pivot_root447772558", 0xc4208981ed, MS_REC|MS_PRIVATE, NULL) = 0
[pid 21630] umount2("/.pivot_root447772558", MNT_DETACH) = 0
[pid 21630] unlinkat(AT_FDCWD, "/.pivot_root447772558", AT_REMOVEDIR) = 0
[pid 21630] openat(AT_FDCWD, "/passwd", O_WRONLY|O_CREAT|O_CLOEXEC, 0644) = 6
If etc is suddenly a symlink to /, the kernel will resolve BASEDIR/etc/ to / in mkdirat() and pivot_root(). The same applies to BASEDIR/etc/.pivot_root447772558: it will turn into /.pivot_root447772558. This makes the pivot_root() operation basically a no-op since we switched the host root filesystem to /.pivot_root447772558 and back again. As soon as docker opens /passwd inside the new mount namespace, the file will be on the host filesystem instead of in the root of the targeted container.
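The gap between check and use can be reproduced without docker and without root privileges. The following Python sketch is a simplified model under stated assumptions: a temporary directory stands in for the host root, a subdirectory for the container, and the function names and the callback that plays the attacker are hypothetical. The code checks that etc is not a symlink, lets the attacker swap it, and then writes through the path it has already checked.

```python
import os
import tempfile


def copy_with_path_check(dest_dir, name, data, between_check_and_use=None):
    """Write data to dest_dir/name after a path-based symlink check."""
    # Time of check: refuse symlinks.
    if os.path.islink(dest_dir):
        raise ValueError("destination is a symlink")
    # Race window: anything may happen to dest_dir here.
    if between_check_and_use is not None:
        between_check_and_use()
    # Time of use: the kernel resolves the path again from scratch.
    with open(os.path.join(dest_dir, name), "w") as f:
        f.write(data)


def demo():
    root = tempfile.mkdtemp()
    host = os.path.join(root, "host")            # stand-in for the host root
    etc = os.path.join(host, "container", "etc")  # stand-in for BASEDIR/etc
    os.makedirs(etc)

    def attacker():
        # Replace the checked directory with a symlink to the "host root".
        os.rmdir(etc)
        os.symlink(host, etc)

    copy_with_path_check(etc, "passwd", "overwritten\n", attacker)
    # True if the file escaped the container directory.
    return os.path.exists(os.path.join(host, "passwd"))
```

In this model the swap happens deterministically inside the race window. In the real attack the window has to be hit by chance.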
In the attack scenario, the attacker has control over a container and can execute code inside of it. The attacker knows that the system administrator uses docker cp, either manually or automatically, to copy a file into the container.
We saw that docker assumes that the path cannot change while the copy operation is running and that we can get access to the host filesystem if we manage to trick docker by presenting a directory at check time and later a symlink at use time. Hitting exactly the right times to present either a directory or a symlink might sound hard to achieve. There is a simpler approach: a brute force attack. We don’t care if the majority of all tries don’t work, as long as one works.
So we create two files: /etc, a directory, and /etc.sym, a symlink to /. To carry out the brute force attack, the names /etc and /etc.sym are exchanged over and over again. At any point in time, /etc is either a directory or a symlink to /. To achieve that, we use Linux’s renameat2() system call, which can exchange two names atomically. The exploit is basically a program performing
renameat2(AT_FDCWD, "/etc", AT_FDCWD, "/etc.sym", RENAME_EXCHANGE);
in a loop.
The crucial points in time are:
- T1: docker checks the path
- T2: mkdirat() runs to create the pivot directory
- T3: pivot_root() takes place
At each of these points, /etc within the container can either be in state D, being a directory, or in state S, being a symlink. So we have three points in time, each of which can be in one of two states. To sum it up: there are eight possible attack states our brute force exploit can trigger.
We already know that the attack can only be successful if /etc/ is a directory at check time and a symlink afterwards. Therefore only attack state A4 will lead to success.
This rough estimate suggests that one-eighth of all attempts are successful. Not too bad for a brute force attack. Please note that this is just an estimate that leaves out some details. For example, it does not take into account that the time between T1 and T2 is large since execution spans two processes, while the time between T2 and T3 is minimal since both points are in the same process and occur in direct succession. So hitting the race between T1 and T2 is much more likely than hitting it between T2 and T3.
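The one-eighth figure can be reproduced by enumerating all combinations. The short Python sketch below assumes, like the rough estimate itself, that every combination of states at T1 (check), T2 (mkdirat()) and T3 (pivot_root()) is equally likely; the function names are made up for this illustration.

```python
from fractions import Fraction
from itertools import product

STATES = ("D", "S")  # D: etc is a directory, S: etc is a symlink


def attack_succeeds(t1, t2, t3):
    """Success needs a directory at check time and a symlink afterwards."""
    return t1 == "D" and t2 == "S" and t3 == "S"


def success_ratio():
    """Return the winning combinations and their share of all combinations."""
    combos = list(product(STATES, repeat=3))
    wins = [c for c in combos if attack_succeeds(*c)]
    return wins, Fraction(len(wins), len(combos))
```

Since the real timing is not uniform, the result is a counting argument rather than a measured success probability.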
Let’s take a deeper look at the states in which the attack is not successful. We distinguish two types of failures:
- U: the attack is unsuccessful and goes unnoticed
- E: the attack is unsuccessful and an error is reported
For example, if etc is a symlink at the time of mkdirat() and a directory again later at pivot_root(), pivot_root() will fail and docker cp reports an error.
| | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 |
|---|---|---|---|---|---|---|---|---|
| T1 | D | D | D | D | S | S | S | S |
| T2 | D | D | S | S | S | S | D | D |
| T3 | D | S | D | S | S | D | S | D |
| Result | U | E | E | X | U | U | U | U |
D ... etc is a directory
S ... etc is a symlink
U ... attack unsuccessful, unnoticed
E ... attack unsuccessful, errors reported
X ... exploited!
Under optimal circumstances, an attacker within a container can overwrite any file on the host side. On the other hand, the attacker needs to know which file is being written using docker cp, and if the copy operation runs only rarely, the chances of hitting the race window are low.
In a previous section, we saw that there is a chance that docker cp will fail and report an error. If docker cp runs as part of your automation, check the logs for failures: they could indicate an attack.
Another way to detect ongoing attacks is to look for stale .pivot_rootXXXXXXXX directories on the host side, where XXXXXXXX is a number. If the attacker tries to place a file in the host root filesystem, unsuccessful attacks leave a .pivot_rootXXXXXXXX directory behind.
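Looking for such leftovers is easy to automate. The following Python sketch is one possible way to do it, assuming that the stale directories sit directly below the directory being scanned (the host root by default) and that the suffix consists of digits only, as in the traces shown earlier. The function name is hypothetical.

```python
import os
import re

# Name pattern taken from the traces: ".pivot_root" followed by a number.
PIVOT_DIR = re.compile(r"^\.pivot_root\d+$")


def find_stale_pivot_dirs(root="/"):
    """Return directories directly below root named .pivot_root<number>."""
    hits = []
    with os.scandir(root) as entries:
        for entry in entries:
            # Only real directories count, symlinks are ignored.
            if PIVOT_DIR.match(entry.name) and entry.is_dir(follow_symlinks=False):
                hits.append(entry.path)
    return sorted(hits)
```

A hit is only a hint that something is worth investigating, not proof of an attack.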
The currently implemented solution is suspending the container while the copy operation is running. With that, the docker daemon can be sure that nothing inside the container can change paths while it is performing the copy operation. The downside of this approach is that suspending a container is slow and will interrupt workloads.
Symlink races, or, more generally, TOCTOU (Time-of-Check-to-Time-of-Use) issues, are still a problem and not always obvious. To avoid them while writing code, always ask yourself whether a system resource can change after you have performed a check on it. Usually, the answer is yes: it can change when the check is performed by your code and not by the operating system. This is especially true for checks on filesystem resources such as symlinks. If you are using a high-level programming language, try to understand which operations it performs at the system call level. Just because something is a single operation in your favorite programming language does not mean it is atomic.
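One general way to let the operating system perform the check at the moment of use is to work with file descriptors instead of path strings. The Python sketch below illustrates the idea and is not the fix docker implemented: it walks a path one component at a time and opens each one with O_NOFOLLOW relative to the previously opened directory. The function name is hypothetical, and the sketch simply refuses symlinks instead of re-rooting them.

```python
import os


def open_dir_nofollow(base_dir, relative_path):
    """Open base_dir/relative_path component by component.

    Each component is opened relative to the previously opened
    directory descriptor with O_NOFOLLOW, so the kernel itself rejects
    a symlink at the time of use. Returns a directory file descriptor
    that the caller must close.
    """
    flags = os.O_RDONLY | os.O_DIRECTORY | os.O_NOFOLLOW
    fd = os.open(base_dir, os.O_RDONLY | os.O_DIRECTORY)
    try:
        for part in relative_path.strip("/").split("/"):
            if part in ("", "."):
                continue
            if part == "..":
                raise ValueError("'..' is not allowed")
            next_fd = os.open(part, flags, dir_fd=fd)
            os.close(fd)
            fd = next_fd
        return fd
    except BaseException:
        os.close(fd)
        raise
```

A file can then be created relative to the returned descriptor, for example with os.open("passwd", os.O_WRONLY | os.O_CREAT | os.O_NOFOLLOW, 0o644, dir_fd=fd). Renaming a directory inside the container afterwards no longer changes where the write goes, because the descriptor refers to the directory itself and not to its name.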
Publish date
30.07.2021
Category
security
Authors
Richard Weinberger
sigma star gmbh