Restricting file system access using Linux Mount Namespaces

Linux offers a variety of mechanisms to confine a process; one of them is namespaces. Nowadays, they are commonly used as the foundation for Linux containers. In this blog post you’ll learn how namespaces can be used to restrict file system access for a given process and all its children.

The baseline

Suppose you want to make sure that your new application can use all system resources but has no access to certain parts of the file system; for example, access to /home should be impossible. Or an application that runs within your user session should have no access to your .ssh folder, which usually contains valuable assets.

Mount namespaces are one way to achieve this. Let’s write an unprivileged program that can be used as a wrapper for another application and hides .ssh in the home directory. With mount namespaces it is possible to make a copy of the current namespace and edit it. A program can either

use the unshare() system call to disconnect itself from the current namespace and create a new one, or
create a new child process with clone() where the child has a new namespace.

User Namespaces

Creating a new mount namespace is restricted to processes that have the CAP_SYS_ADMIN capability. To use this facility in an unprivileged program we have to use a second namespace: the user namespace. User namespaces allow an unprivileged process to gain all capabilities and to become uid 0 inside the namespace. Linux guarantees that all actions taken by uid 0 within a user namespace adhere to the access rights of the unprivileged process. That way an unprivileged process can do things usually only root can do, but no action will have a visible effect on processes outside of the namespace.
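To see this restriction on a given system, the following sketch calls unshare() in a forked child and reports the resulting errno, so the calling process keeps its own namespaces. The helper name unshare_errno_in_child() is hypothetical and not part of the wrapper program.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: call unshare(flags) in a forked child so that the
 * caller's own namespaces stay untouched. Returns 0 if unshare() succeeded,
 * the child's errno value if it failed, or -1 if fork()/waitpid() failed.
 */
static int unshare_errno_in_child(int flags)
{
        pid_t pid = fork();
        int status;

        if (pid < 0)
                return -1;
        if (pid == 0)
                _exit(unshare(flags) == 0 ? 0 : errno);
        if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
                return -1;
        return WEXITSTATUS(status);
}
```

Called as an unprivileged user, unshare_errno_in_child(CLONE_NEWNS) would typically return EPERM, while unshare_errno_in_child(CLONE_NEWNS | CLONE_NEWUSER) returns 0 on systems that permit unprivileged user namespaces. Some distributions and container runtimes disable them, in which case the second call fails as well.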

Configuration of a user namespace

By calling unshare(CLONE_NEWNS | CLONE_NEWUSER) our program gains a new user and mount namespace. Before we can proceed, we need to configure the user namespace. We want to be uid 0 with all capabilities inside our new user namespace. To achieve this, we must install a user and group mapping. The mapping tells Linux that everything we do inside the namespace should run outside the namespace as our current user id. Unprivileged processes are only allowed to install exactly one mapping, whose target user and group ids match their ids in the outer namespace. Otherwise a process could gain more privileges than it has.

Both user and group id mappings are controlled via two files in the process’s /proc directory, /proc/PID/uid_map and /proc/PID/gid_map. Each line in these files names the id inside the namespace, the id outside the namespace and the number of ids to map. Let’s assume the current uid is 1000 and gid is 100; then we need to write 0 1000 1 and 0 100 1 into the map files, so that id 0 inside the namespace corresponds to exactly one id, 1000 or 100, outside. Furthermore, the process’s setgroups file has to be set to deny before an unprivileged process is allowed to write gid_map. Only privileged processes are allowed to keep the file set to allow and can therefore modify their set of supplementary groups.
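user_namespaces(7) documents each map line as three fields: the first id inside the namespace, the first id outside it, and the length of the range. A minimal sketch of formatting such a single-id line follows; format_id_map() is a hypothetical helper used only for illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format one uid_map/gid_map line that maps a single
 * id. Field order: id inside the namespace, id outside, range length.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int format_id_map(char *buf, size_t size, unsigned int inside,
                         unsigned int outside)
{
        int n = snprintf(buf, size, "%u %u 1\n", inside, outside);

        return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

For a process with uid 1000 that wants to appear as uid 0 inside its new namespace, format_id_map(buf, sizeof(buf), 0, 1000) produces the line "0 1000 1".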

After this overly complicated setup procedure we’re allowed to run setuid(0) and setgid(0) to become uid 0 within the user namespace and can finally work on the mount namespace.

Creating helper functions

To ease things, let’s first create a helper function which writes a single line into a file below /proc/self/:

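/* Headers used by the snippets in this post. */
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_proc_self(const char *file, const char *content)
{
        size_t len = strlen(content);   /* do not write the trailing NUL */
        char *path;
        int fd;

        assert(asprintf(&path, "/proc/self/%s", file) != -1);
        fd = open(path, O_RDWR | O_CLOEXEC);
        free(path);
        assert(fd >= 0);
        assert(write(fd, content, len) == (ssize_t)len);
        close(fd);
}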
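One caveat of this style: assert() expands to nothing when NDEBUG is defined, so calls placed inside it disappear from such builds. As a sketch of an alternative, the hypothetical write_file_once() below reports failure through its return value instead. It takes a full path and is not part of the wrapper program.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical variant of write_proc_self(): write content to an existing
 * file with a single write() call. Returns 0 on success and -1 with errno
 * set on failure, so it keeps working when assert() is compiled out.
 */
static int write_file_once(const char *path, const char *content)
{
        size_t len = strlen(content);
        ssize_t written;
        int saved_errno;
        int fd;

        fd = open(path, O_WRONLY | O_CLOEXEC);
        if (fd < 0)
                return -1;

        written = write(fd, content, len);
        /* Treat a short write as an I/O error. */
        saved_errno = (written < 0) ? errno : EIO;
        close(fd);

        if (written != (ssize_t)len) {
                errno = saved_errno;
                return -1;
        }
        return 0;
}
```

A caller could then decide how to react, for example by printing strerror(errno) and exiting when write_file_once("/proc/self/setgroups", "deny\n") returns -1.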

A further helper is update_uidgid_map(), which constructs and writes the user and group mappings:

/*
 * Install a mapping for a single id: "from" is the id inside the new
 * user namespace, "to" is the id in the parent namespace.
 */
static void update_uidgid_map(uid_t from, uid_t to, bool is_user)
{
        char *map_content;

        assert(asprintf(&map_content, "%u %u 1\n", from, to) != -1);

        if (is_user)
                write_proc_self("uid_map", map_content);
        else
                write_proc_self("gid_map", map_content);

        free(map_content);
}

And finally another helper: deny_setgroups() writes deny to the setgroups file:

static void deny_setgroups(void)
{
        write_proc_self("setgroups", "deny\n");
}

Using the helpers from above and switching to uid and gid 0 happens in become_uid0():

static void become_uid0(uid_t orig_uid, gid_t orig_gid)
{
        deny_setgroups();
        update_uidgid_map(0, orig_gid, false);
        update_uidgid_map(0, orig_uid, true);
        assert(setuid(0) == 0);
        assert(setgid(0) == 0);
}

Mount Namespaces

The very first operation on the mount namespace is to mark it as a whole as a slave of the parent namespace. A slave mount namespace receives new mounts from outside. For example, if a user plugs in a USB thumb drive, this new mount point will also be visible to us. This is achieved with the following call: mount("none", "/", NULL, MS_REC|MS_SLAVE, NULL). The important settings are MS_REC and MS_SLAVE.

By setting the MS_REC flag we tell Linux to apply the new settings to all mount points below /. And MS_SLAVE sets the mounts, as the name describes, into slave mode. If we didn’t want to see new mounts, MS_PRIVATE could be used instead of MS_SLAVE.
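The choice between the two propagation modes can be captured in a small helper. This is only a sketch; propagation_flags() is a hypothetical name and not part of the wrapper program.

```c
#include <stdbool.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: build the flags for changing the propagation type
 * of "/" and everything below it. MS_SLAVE keeps receiving mounts from the
 * parent namespace, MS_PRIVATE cuts propagation in both directions.
 */
static unsigned long propagation_flags(bool receive_outside_mounts)
{
        return MS_REC | (receive_outside_mounts ? MS_SLAVE : MS_PRIVATE);
}
```

The result would be passed as the fourth argument of the mount() call, for example mount("none", "/", NULL, propagation_flags(false), NULL) for a fully private mount tree.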

Adding new mounts

Since we’re still unprivileged despite being uid 0 in our user namespace, we are only allowed to perform certain operations on the mount namespace. Our mount namespace is an immutable copy of the namespace we were in before, so we cannot unmount individual parts of it. However, we are allowed to unmount it as a whole or add new mounts to it. We’ll use the latter. Only very few virtual file systems can be mounted within a user namespace. One of them is tmpfs, which is an in-memory file system. We’ll use it to overmount .ssh, so that .ssh becomes an empty directory where all modifications are done in RAM while the underlying content is hidden. To ensure that no memory is wasted by filling the tmpfs, we’ll mount it read-only, so that nobody can write to it. Changing the working directory to / and back is an important step. If we don’t do so and the current directory is the user’s .ssh, the process will still possess a reference to the original directory and can access it.

All of the above happens in setup_mounts():

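static void setup_mounts(void)
{
        char *curdir = get_current_dir_name();
        const char *homedir = getenv("HOME");
        char *sshdir;

        assert(curdir);
        assert(homedir);
        /* Receive new mounts from the parent, but do not propagate ours. */
        assert(mount("none", "/", NULL, MS_REC | MS_SLAVE, NULL) == 0);

        /* Hide ~/.ssh behind an empty, read-only tmpfs. */
        assert(asprintf(&sshdir, "%s/.ssh", homedir) != -1);
        assert(mount("tmpfs", sshdir, "tmpfs", MS_RDONLY, NULL) == 0);
        /* Re-resolve the working directory so it cannot reference the hidden .ssh. */
        assert(chdir("/") == 0);
        assert(chdir(curdir) == 0);

        free(sshdir);
        free(curdir);
}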
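To check from inside the wrapped program that a path such as .ssh is now backed by tmpfs, one option is to compare the file system type reported by statfs() with TMPFS_MAGIC. The helper is_tmpfs() below is a hypothetical sketch and not part of the wrapper.

```c
#include <linux/magic.h>
#include <sys/vfs.h>

/*
 * Hypothetical check: returns 1 if path lives on a tmpfs, 0 if it lives on
 * another file system, and -1 if statfs() fails (errno is set).
 */
static int is_tmpfs(const char *path)
{
        struct statfs sfs;

        if (statfs(path, &sfs) != 0)
                return -1;
        return sfs.f_type == TMPFS_MAGIC;
}
```

After setup_mounts() has run, is_tmpfs() on the .ssh path would be expected to return 1, assuming the directory was not already on a tmpfs before.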

But we’re not done yet. The mount we just made can still be undone: if somebody finds a way to execute code within the program we’re wrapping, all our modifications to the mount namespace can be reverted.

Sealing

To seal the current mount namespace, we create a new user namespace. User and mount namespaces are tightly connected. Every mount namespace has an owning user namespace and modifications are only allowed by the owner. With the new user namespace we give up the ownership and make sure no further modifications can happen. Additionally, we switch back to the user and group ids we had initially. So we use the unshare() system call again, but this time only with the CLONE_NEWUSER flag, and install the following uid and gid mappings: 1000 0 1 and 100 0 1. The attentive reader will notice that this time ids 1000 and 100 inside the new namespace map to id 0 of the namespace we are leaving, so we are back to 1000 or 100 again.

So we need another helper: become_orig(). It switches back to the original user and group ids:

static void become_orig(uid_t orig_uid, gid_t orig_gid)
{
        update_uidgid_map(orig_gid, 0, false);
        update_uidgid_map(orig_uid, 0, true);
        assert(setuid(orig_uid) == 0);
        assert(setgid(orig_gid) == 0);
}
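One way to convince yourself that the seal holds is to attempt an unmount after switching back. The sketch below uses a hypothetical helper, umount_errno(), which simply reports the error code of umount().

```c
#include <errno.h>
#include <sys/mount.h>

/*
 * Hypothetical check: try to unmount path and return 0 on success or the
 * errno value describing why the kernel refused.
 */
static int umount_errno(const char *path)
{
        return umount(path) == 0 ? 0 : errno;
}
```

Called on the hidden .ssh directory after become_orig(), this would be expected to return EPERM, since the process no longer holds CAP_SYS_ADMIN in the user namespace that owns the mount namespace.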

Finally

In main(), everything comes together:

int main(int argc, char *argv[])
{
        char *const new_argv[] = {"/bin/bash", NULL};
        uid_t my_uid = getuid();
        gid_t my_gid = getgid();

        assert(unshare(CLONE_NEWNS | CLONE_NEWUSER) == 0);
        become_uid0(my_uid, my_gid);
        setup_mounts();

        assert(unshare(CLONE_NEWUSER) == 0);
        become_orig(my_uid, my_gid);
 
        return execvp(new_argv[0], new_argv);
}

Here we’re wrapping /bin/bash; you can run the program directly as a regular user. In the new shell you’ll find your .ssh directory empty. Download the complete program here.
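The program above always starts /bin/bash. As a possible extension, the command to wrap could be taken from the command line, falling back to the shell when none is given. choose_argv() is a hypothetical helper sketched for that purpose.

```c
#include <stddef.h>

/*
 * Hypothetical helper: pick the argument vector to execute. If the wrapper
 * was started with a command (argv[1] onwards), use it; otherwise use the
 * NULL-terminated fallback vector.
 */
static char *const *choose_argv(int argc, char *const argv[],
                                char *const fallback[])
{
        return (argc > 1 && argv[1] != NULL) ? &argv[1] : fallback;
}
```

In main() this could replace the last line with: char *const *cmd = choose_argv(argc, argv, new_argv); return execvp(cmd[0], cmd);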

Another Example

Simply overmounting a directory to hide it is not always a perfect fit. Consider the more complicated case where you want to hide everything in your home directory except one subdirectory. For example, a download application shall have access to /home/user/Downloads/ but to nothing else in /home/user/. Just mounting a tmpfs on /home/user/ will hide everything. The task can be accomplished by using the property that the current working directory holds a reference to the original directory and keeps it reachable after it has been overmounted.

Necessary steps

Change the current directory to /home/user
Mount a tmpfs on /home/user
Create /home/user/Downloads - this has to be an absolute path to target the new tmpfs!
Remount /home/user in read-only mode
Bind mount Downloads to /home/user/Downloads; it is crucial to use a relative path as the source, because mount() resolves it against our current working directory and therefore binds Downloads from the overmounted home directory.

In code:

static void setup_mounts(void)
{
        char *curdir = get_current_dir_name();
        const char *homedir = getenv("HOME");
        char *downloadsdir;

        assert(curdir);
        assert(homedir);
        assert(mount("none", "/", NULL, MS_REC | MS_SLAVE, NULL) == 0);

        assert(asprintf(&downloadsdir, "%s/%s", homedir, "Downloads") != -1);

        /* Keep a reference to the real home directory via the working directory. */
        assert(chdir(homedir) == 0);
        assert(mount("tmpfs", homedir, "tmpfs", 0, NULL) == 0);
        /* The absolute path targets the new tmpfs, not the real home directory. */
        assert(mkdir(downloadsdir, 0700) == 0);
        assert(mount("tmpfs", homedir, "tmpfs", MS_REMOUNT | MS_RDONLY, NULL) == 0);
        /* The relative source resolves against the real home directory. */
        assert(mount("Downloads", downloadsdir, NULL, MS_BIND, NULL) == 0);
        free(downloadsdir);
        assert(chdir("/") == 0);
        assert(chdir(curdir) == 0);
        free(curdir);
}
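To verify the result from within the wrapped program, one could count what is still visible in the home directory. The following sketch uses a hypothetical helper, count_dir_entries(), that is not part of the wrapper.

```c
#include <dirent.h>
#include <string.h>

/*
 * Hypothetical check: count the entries of a directory, not counting "."
 * and "..". Returns -1 if the directory cannot be opened.
 */
static int count_dir_entries(const char *path)
{
        DIR *dir = opendir(path);
        struct dirent *entry;
        int count = 0;

        if (!dir)
                return -1;
        while ((entry = readdir(dir)) != NULL) {
                if (strcmp(entry->d_name, ".") != 0 &&
                    strcmp(entry->d_name, "..") != 0)
                        count++;
        }
        closedir(dir);
        return count;
}
```

With the setup above, count_dir_entries() on the home directory would be expected to return 1, because only Downloads remains visible.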

Summary

Namespaces are not only useful for building containers; they can also be used to restrict access. In this case we used mount namespaces to limit access to certain parts of the file system. For more details on namespaces, see namespaces(7).

Publish date

18.03.2023
