I am reading the mount and clone man pages. I want to clarify how CLONE_NEWNS affects the child process's view of the file system.
(File hierarchy)
Let's consider this tree to be the directory hierarchy. Let's say 5 & 6 are mount points in the parent process. I clarified mount points in another question.
So my understanding is: 5 & 6 being mount points means that the mount command was previously used to 'mount' file systems (directory hierarchies) at 5 & 6 (which means there must be directory trees under 5 & 6 as well).
From the mount man page:
A mount namespace is the set of filesystem mounts that are visible to a process.
From the clone man page:
Every process lives in a mount namespace. The namespace of a process is the data (the set of mounts) describing the file hierarchy as seen by that process. After a fork(2) or clone() where the CLONE_NEWNS flag is not set, the child lives in the same mount namespace as the parent.
Also :
After a clone() where the CLONE_NEWNS flag is set, the cloned child is started in a new mount namespace, initialized with a copy of the namespace of the parent.
Now if I use clone() with CLONE_NEWNS to create a child process, does this mean that the child will get an exact copy of the mount points in the tree (5 & 6) and still be able to access the rest of the original tree? Does it also mean that the child could mount 5 & 6 at will, without affecting what's mounted at 5 or 6 in its parent process's mount namespace?
If yes, does it also mean that the child could mount / unmount a different directory than 5 or 6 and affect what's visible to the parent process?
Thanks.
“Namespaces are a feature of the Linux kernel that partitions kernel resources such that one set of processes sees one set of resources while another set of processes sees a different set of resources.” In other words, the key feature of namespaces is that they isolate processes from each other.
Creating a separate mount namespace allows each of these isolated processes to have a completely different view of the entire system's mountpoint structure from the original one. This allows you to have a different root for each isolated process, as well as other mountpoints that are specific to those processes.
A mount namespace is the set of filesystem mounts that are visible to a process. It makes it so that the superuser has separate mount points visible to it from the rest of the system/apps. I believe the intention is to prevent any issues when remounting partitions, such as remounting /system as read-write.
The “mount namespace” of a process is just the set of mounted filesystems that it sees. Once you go from the traditional situation of having one global mount namespace to having per-process mount namespaces, you must decide what to do when creating a child process with clone().
Traditionally, mounting or unmounting a filesystem changed the filesystem as seen by all processes: there was one global mount namespace, seen by all processes, and if any change was made (e.g. using the mount command) all processes would immediately see that change, irrespective of their relationship to the mount command.
With per-process mount namespaces, a child process can now have a different mount namespace to its parent. The question now arises:
Should changes to the mount namespace made by the child propagate back to the parent?
Clearly, this functionality must at least be supported and, indeed, must probably be the default. Otherwise, launching the mount command itself would effect no change (since the filesystem as seen by the parent shell would be unaffected).
Equally clearly, it must also be possible for this necessary propagation to be suppressed, otherwise we can never create a child process whose mount namespace differs from its parent, and we have one global mount namespace again (the filesystem as seen by init).
Thus, we must decide when creating a child process with clone() whether the child process gets its own copy of the data about mounted filesystems from the parent, which it can change without affecting the parent, or gets a pointer to the same data structures as the parent, which it can change (necessary for changes to propagate back, as when you launch mount from the shell).
If the CLONE_NEWNS flag is passed to clone(), the child gets a copy of its parent's mounted filesystem data, which it can change without affecting the parent's mount namespace. Otherwise, it gets a pointer to the parent's mount data structures, where changes made by the child will be seen by the parent (so the mount command itself can work).
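As a rough illustration of the CLONE_NEWNS case, here is a minimal C sketch, not a definitive implementation. The names mount_privately and child_fn and the target path are hypothetical and chosen only for this example. It assumes Linux with glibc and a caller holding CAP_SYS_ADMIN; without that capability clone() is expected to fail with EPERM. The child first marks every mount in its copy as private, because on hosts where "/" is a shared mount (common with systemd) mount events would otherwise still propagate to the parent's namespace.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)

/* Runs in the child, inside its own copy of the parent's mount namespace. */
static int child_fn(void *arg)
{
    const char *target = arg;   /* hypothetical mount point chosen by the caller */

    /* Assumption: "/" may be a shared mount, so make the whole copied tree
     * private to keep mount events out of the parent's namespace. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
        return 1;

    /* This tmpfs is visible only in the child's namespace; it goes away
     * when the last process in that namespace exits. */
    if (mount("tmpfs", target, "tmpfs", 0, NULL) == -1)
        return 2;

    return 0;
}

/* Returns the child's exit status (0 = the private mount succeeded,
 * 1 or 2 = a mount() call failed in the child), or -1 with errno set
 * if clone() or waitpid() failed. */
int mount_privately(const char *target)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows downward on common architectures, so pass its top. */
    pid_t pid = clone(child_fn, stack + STACK_SIZE,
                      CLONE_NEWNS | SIGCHLD, (void *)target);
    if (pid == -1) {
        int saved = errno;
        free(stack);
        errno = saved;
        return -1;
    }

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    int saved = errno;
    free(stack);
    if (waited == -1) {
        errno = saved;
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling mount_privately("/mnt") from a privileged process and then inspecting /proc/self/mounts in the parent should show no new tmpfs entry at that path, since the mount was made only in the child's namespace.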
Now if I use clone with CLONE_NEWNS to create a child process, does this mean that child will get an exact copy of the mount points in the tree (5 & 6) and still be able to access the rest of the original tree ?
Yes. It sees the exact same tree as its parent after the call to clone().
Does it also mean that the child could mount 5 & 6 at will, without affecting what's mounted at 5 or 6 in its parent process's mount namespace?
Yes. Since you've used CLONE_NEWNS, the child can unmount one device from 5 and mount another device there, and only it (and its children) can see the changes. No other process can see the changes made by the child in this case.
If yes, does it also mean that the child could mount / unmount a different directory than 5 or 6 and affect what's visible to the parent process?
No. If you've used CLONE_NEWNS, the changes made in the child do not propagate back to the parent (assuming the mounts involved are not marked as shared; see the shared subtrees discussion in mount_namespaces(7)).
If you haven't used CLONE_NEWNS, the child would have received a pointer to the same mount namespace data as its parent, and any changes made by the child would be seen by any process that shares those data structures, including the parent. (This is also the case when the new child is created using fork().)
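The fork() case can be checked without any privileges. The following is a small sketch, assuming Linux with /proc mounted and a kernel that exposes /proc/self/ns/mnt; the function names read_mnt_ns and fork_shares_mount_ns are made up for this example. It compares the mount namespace identifier seen by the parent with the one reported by a fork()ed child.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Reads the calling process's mount namespace identifier (a symlink
 * target of the form "mnt:[<inode>]") into buf. Returns 0 or -1. */
static int read_mnt_ns(char *buf, size_t size)
{
    ssize_t n = readlink("/proc/self/ns/mnt", buf, size - 1);
    if (n == -1)
        return -1;
    buf[n] = '\0';
    return 0;
}

/* Returns 1 if a fork()ed child reports the same mount namespace as the
 * parent, 0 if it reports a different one, -1 on error. */
int fork_shares_mount_ns(void)
{
    char parent_ns[64];
    char child_ns[64];
    int fds[2];

    if (read_mnt_ns(parent_ns, sizeof parent_ns) == -1)
        return -1;
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: report our own namespace identifier through the pipe. */
        char ns[64];
        close(fds[0]);
        if (read_mnt_ns(ns, sizeof ns) == -1)
            _exit(1);
        if (write(fds[1], ns, strlen(ns)) == -1)
            _exit(1);
        _exit(0);
    }

    /* Parent: read the child's answer and compare it with our own. */
    close(fds[1]);
    ssize_t n = read(fds[0], child_ns, sizeof child_ns - 1);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    if (n <= 0)
        return -1;
    child_ns[n] = '\0';

    return strcmp(parent_ns, child_ns) == 0;
}
```

Because fork() does not set CLONE_NEWNS, the two identifiers are expected to match; a child created with clone() and CLONE_NEWNS would report a different identifier.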
I don't have enough reputation points to add a comment so instead adding this comment as an answer. It's just an add on to Emmet's answer.
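AFAICU, if a process is created with the CLONE_NEWNS flag set, it can only mount those file systems which have the FS_USERNS_MOUNT flag set, and almost all disk-based file systems do not set this flag (for security reasons). Strictly, the check tests the user namespace (user_ns), so it applies when the mount is attempted from outside the initial user namespace. In do_new_mount, there is this check: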
    if (user_ns != &init_user_ns) {
        if (!(type->fs_flags & FS_USERNS_MOUNT)) {
            put_filesystem(type);
            return -EPERM;
        }
        /* ... */
    }
Please correct me if I am wrong.