[RFC PATCH 00/11] Add support for devtmpfs in user namespaces
Seth Forshee
2014-05-14 21:34:48 UTC
Unprivileged containers cannot run mknod, making it difficult to support
devices which appear at runtime. Using devtmpfs is one possible
solution, and it would have the added benefit of making container setup
simpler. But simply letting containers mount devtmpfs isn't sufficient
since the container may need to see a different, more limited set of
devices, and because different environments making modifications to
the filesystem could lead to conflicts.

This series solves these problems by assigning devices to user
namespaces. Each device has an "owner" namespace which specifies which
devtmpfs mount the device should appear in as well as allowing privileged
operations on the device from that namespace. This defaults to
init_user_ns. There's also an ns_global flag to indicate a device should
appear in all devtmpfs mounts.

devtmpfs is updated to present a different superblock to each user
namespace. Each super block contains nodes for only global devices and
the devices assigned to the associated namespace.
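
The visibility rule described above can be sketched in userspace C. This is
only an illustration under stated assumptions: the two structs are reduced
stand-ins for the kernel types, and dev_visible_in() and count_visible() are
hypothetical helper names that do not appear in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced stand-ins for the kernel types; only the fields used here. */
struct user_namespace {
	int id;
};

struct device {
	struct user_namespace *ns;	/* namespace that "owns" the device */
	bool ns_global;			/* appear in every devtmpfs mount */
};

/* True if a node for @dev belongs in the super block mounted for @sb_ns. */
bool dev_visible_in(const struct device *dev,
		    const struct user_namespace *sb_ns)
{
	assert(dev != NULL);
	return dev->ns_global || dev->ns == sb_ns;
}

/* Count how many of @devs would get a node in the mount for @sb_ns. */
size_t count_visible(const struct device *devs, size_t n,
		     const struct user_namespace *sb_ns)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (dev_visible_in(&devs[i], sb_ns))
			count++;
	return count;
}
```

With one device owned by the initial namespace, one global device, and one
device owned by a container namespace, each of the two mounts would show
two nodes.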

The implementation isn't complete at this point - it's lacking proper
cleanup when a namespace is no longer in use, and only a sampling of
devices has been updated to support use in namespaces. I'm sending the
patches now for feedback on the overall approach and the implementation
so far. I also have a couple of areas where I'd appreciate some
suggestions:

* If devices are owned by a namespace it might be useful to have this
awareness for uevents and sysfs as well. Would it make sense to apply
the ownership to kobjects rather than devices?

* I'd like to be able to do clean up when a namespace is destroyed,
e.g. with loop devices I'd probably free up any devices owned by the
namespace. But that's impossible in the current implementation since
the device has a reference to the namespace. Any suggestions to get
around this? I haven't spent much time thinking about it yet, but my
first thought was to add some kind of weak reference to user
namespaces. Then when the main reference count hits zero the
namespace isn't destroyed, but there would be a notification that
drivers could use to perform cleanup. Once all weak references were
released the memory would actually be freed.
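
The weak-reference idea in the last bullet can be sketched in userspace C.
Everything here is hypothetical and not part of the series: struct ns, the
ns_*() helpers and the driver_ns_dead() notifier are illustration names, the
counters are plain ints rather than atomics, and the "freed" pointer exists
only so the effect can be observed. The sketch assumes the strong references
collectively hold one weak reference, so memory is freed only after both
counts reach zero.

```c
#include <assert.h>
#include <stdlib.h>

struct ns;
typedef void (*ns_dead_fn)(struct ns *ns, void *data);

struct ns {
	int strong;		/* ordinary users of the namespace */
	int weak;		/* weak holders, plus one for all strong refs */
	ns_dead_fn notify;	/* called once when strong hits zero */
	void *data;		/* opaque argument passed to notify */
	int *freed;		/* observation hook: set to 1 on free */
};

struct ns *ns_alloc(ns_dead_fn notify, void *data, int *freed)
{
	struct ns *ns = malloc(sizeof(*ns));

	if (!ns)
		return NULL;
	ns->strong = 1;
	ns->weak = 1;		/* held on behalf of the strong references */
	ns->notify = notify;
	ns->data = data;
	ns->freed = freed;
	return ns;
}

void ns_get_weak(struct ns *ns)
{
	assert(ns->weak > 0);
	ns->weak++;
}

/* Drop a weak reference; the memory goes away with the last one. */
void ns_put_weak(struct ns *ns)
{
	assert(ns->weak > 0);
	if (--ns->weak == 0) {
		if (ns->freed)
			*ns->freed = 1;
		free(ns);
	}
}

/* Drop a strong reference; the last one triggers the notification. */
void ns_put(struct ns *ns)
{
	assert(ns->strong > 0);
	if (--ns->strong == 0) {
		if (ns->notify)
			ns->notify(ns, ns->data);
		ns_put_weak(ns);	/* release the collective weak ref */
	}
}

/* Example notifier: a driver counts the call and drops its weak ref. */
void driver_ns_dead(struct ns *ns, void *data)
{
	(*(int *)data)++;
	ns_put_weak(ns);
}
```

In this model a device would take a weak reference instead of a strong one,
so the namespace can "die" while devices still point at it, and the driver
gets a chance to clean up before the memory is released.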

Thanks,
Seth


Seth Forshee (11):
driver core: Assign owning user namespace to devices
driver core: Add device_create_global()
tmpfs: Add sub-filesystem data pointer to shmem_sb_info
ramfs: Add sub-filesystem data pointer to ram_fs_info
devtmpfs: Add support for mounting in user namespaces
drivers/char/mem.c: Make null/zero/full/random/urandom available to
user namespaces
block: Make partitions inherit namespace from whole disk device
block: Allow blkdev ioctls within user namespaces
misc: Make loop-control available to all user namespaces
loop: Assign devices to current_user_ns()
loop: Allow privileged operations for root in the namespace which owns
a device

block/compat_ioctl.c | 3 +-
block/ioctl.c | 16 +-
block/partition-generic.c | 2 +
drivers/base/core.c | 54 ++++-
drivers/base/devtmpfs.c | 509 ++++++++++++++++++++++++++++++++-------------
drivers/block/loop.c | 22 +-
drivers/char/mem.c | 28 ++-
drivers/char/misc.c | 11 +-
fs/ramfs/inode.c | 8 -
include/linux/device.h | 18 ++
include/linux/miscdevice.h | 1 +
include/linux/ramfs.h | 9 +
include/linux/shmem_fs.h | 1 +
13 files changed, 499 insertions(+), 183 deletions(-)
Seth Forshee
2014-05-14 21:34:49 UTC
Adds a member to struct device named ns to indicate the user
namespace which "owns" a device, which would generally indicate
that root in that namespace is privileged toward the device. It
will also be used by devtmpfs in a later patch to determine which
namespace's mount the device will appear in. This defaults to
init_user_ns. An ns_global flag is also added to struct device,
which indicates the device should appear in all devtmpfs mounts.

Also adds a helper interface, dev_set_ns(), for changing the
namespace which a device has been assigned to.

Signed-off-by: Seth Forshee <seth.forshee at canonical.com>
---
drivers/base/core.c | 3 +++
include/linux/device.h | 13 +++++++++++++
2 files changed, 16 insertions(+)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 20da3ad1696b..1da05f1319fa 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -26,6 +26,7 @@
#include <linux/pm_runtime.h>
#include <linux/netdevice.h>
#include <linux/sysfs.h>
+#include <linux/user_namespace.h>

#include "base.h"
#include "power/power.h"
@@ -661,6 +662,7 @@ void device_initialize(struct device *dev)
INIT_LIST_HEAD(&dev->devres_head);
device_pm_init(dev);
set_dev_node(dev, -1);
+ dev->ns = get_user_ns(&init_user_ns);
}
EXPORT_SYMBOL_GPL(device_initialize);

@@ -1211,6 +1213,7 @@ void device_del(struct device *dev)
*/
if (platform_notify_remove)
platform_notify_remove(dev);
+ put_user_ns(dev->ns);
kobject_uevent(&dev->kobj, KOBJ_REMOVE);
cleanup_device_parent(dev);
kobject_del(&dev->kobj);
diff --git a/include/linux/device.h b/include/linux/device.h
index d1d1c055b48e..41a4ba33b13b 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -27,6 +27,7 @@
#include <linux/ratelimit.h>
#include <linux/uidgid.h>
#include <linux/gfp.h>
+#include <linux/user_namespace.h>
#include <asm/device.h>

struct device;
@@ -704,9 +705,12 @@ struct acpi_dev_node {
* gone away. This should be set by the allocator of the
* device (i.e. the bus driver that discovered the device).
* @iommu_group: IOMMU group the device belongs to.
+ * @ns: User namespace which "owns" this device.
*
* @offline_disabled: If set, the device is permanently online.
* @offline: Set after successful invocation of bus type's .offline().
+ * @ns_global: Set to make device appear in devtmpfs for all user
+ * namespaces.
*
* At the lowest level, every device in a Linux system is represented by an
* instance of struct device. The device structure contains the information
@@ -780,8 +784,11 @@ struct device {
void (*release)(struct device *dev);
struct iommu_group *iommu_group;

+ struct user_namespace *ns;
+
bool offline_disabled:1;
bool offline:1;
+ bool ns_global:1;
};

static inline struct device *kobj_to_dev(struct kobject *kobj)
@@ -804,6 +811,12 @@ static inline const char *dev_name(const struct device *dev)
extern __printf(2, 3)
int dev_set_name(struct device *dev, const char *name, ...);

+static inline void dev_set_ns(struct device *dev, struct user_namespace *ns)
+{
+ put_user_ns(dev->ns);
+ dev->ns = get_user_ns(ns);
+}
+
#ifdef CONFIG_NUMA
static inline int dev_to_node(struct device *dev)
{
--
1.9.1
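
The reference handling behind dev_set_ns() can be modeled in userspace C.
This is a sketch under assumptions: the structs are reduced stand-ins, and
get_user_ns()/put_user_ns() here are simple counters that only borrow the
kernel names. Unlike the patch, this variant takes the new reference before
dropping the old one, a conservative ordering for the case where the old and
new namespace are the same object.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in: the real struct user_namespace uses an atomic count. */
struct user_namespace {
	int count;
};

struct device {
	struct user_namespace *ns;
};

struct user_namespace *get_user_ns(struct user_namespace *ns)
{
	ns->count++;
	return ns;
}

void put_user_ns(struct user_namespace *ns)
{
	assert(ns->count > 0);
	ns->count--;
}

/* Reassign @dev to @ns, keeping exactly one reference per assignment. */
void dev_set_ns(struct device *dev, struct user_namespace *ns)
{
	struct user_namespace *old = dev->ns;

	dev->ns = get_user_ns(ns);	/* take the new reference first */
	put_user_ns(old);		/* then drop the previous owner */
}
```

Reassigning a device to the namespace it already belongs to leaves the count
unchanged and never lets it touch zero in between.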
Seth Forshee
2014-05-14 21:34:51 UTC
devtmpfs is built on top of tmpfs but also needs to store some
data of its own. Add a sub_fs_data pointer to
struct shmem_sb_info for this purpose.

Signed-off-by: Seth Forshee <seth.forshee at canonical.com>
---
include/linux/shmem_fs.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 4d1771c2d29f..7e2c33a0c73a 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -33,6 +33,7 @@ struct shmem_sb_info {
kgid_t gid; /* Mount gid for root directory */
umode_t mode; /* Mount mode for root directory */
struct mempolicy *mpol; /* default memory policy for mappings */
+ void *sub_fs_data; /* data for other shmem-based fses */
};

static inline struct shmem_inode_info *SHMEM_I(struct inode *inode)
--
1.9.1
Seth Forshee
2014-05-14 21:34:50 UTC
This does the same thing as device_create() but also sets the
ns_global flag for the device.

It's likely better to do this as a flag to device_create() or
something like that, but making it a separate interface for now
avoids needing to change the 100+ callers of device_create().

Signed-off-by: Seth Forshee <seth.forshee at canonical.com>
---
drivers/base/core.c | 51 +++++++++++++++++++++++++++++++++++++++++++++-----
include/linux/device.h | 4 ++++
2 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 1da05f1319fa..b2b62743e757 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1575,7 +1575,7 @@ static void device_create_release(struct device *dev)

static struct device *
device_create_groups_vargs(struct class *class, struct device *parent,
- dev_t devt, void *drvdata,
+ dev_t devt, bool ns_global, void *drvdata,
const struct attribute_group **groups,
const char *fmt, va_list args)
{
@@ -1597,6 +1597,7 @@ device_create_groups_vargs(struct class *class, struct device *parent,
dev->parent = parent;
dev->groups = groups;
dev->release = device_create_release;
+ dev->ns_global = ns_global;
dev_set_drvdata(dev, drvdata);

retval = kobject_set_name_vargs(&dev->kobj, fmt, args);
@@ -1643,8 +1644,8 @@ struct device *device_create_vargs(struct class *class, struct device *parent,
dev_t devt, void *drvdata, const char *fmt,
va_list args)
{
- return device_create_groups_vargs(class, parent, devt, drvdata, NULL,
- fmt, args);
+ return device_create_groups_vargs(class, parent, devt, false, drvdata,
+ NULL, fmt, args);
}
EXPORT_SYMBOL_GPL(device_create_vargs);

@@ -1686,6 +1687,46 @@ struct device *device_create(struct class *class, struct device *parent,
EXPORT_SYMBOL_GPL(device_create);

/**
+ * device_create_global - creates a global device and registers it with sysfs
+ * @class: pointer to the struct class that this device should be registered to
+ * @parent: pointer to the parent struct device of this new device, if any
+ * @devt: the dev_t for the char device to be added
+ * @drvdata: the data to be added to the device for callbacks
+ * @fmt: string for the device's name
+ *
+ * This function can be used by char device classes. A struct device
+ * will be created in sysfs, registered to the specified class, and
+ * accessible to all user namespaces.
+ *
+ * A "dev" file will be created, showing the dev_t for the device, if
+ * the dev_t is not 0,0.
+ * If a pointer to a parent struct device is passed in, the newly created
+ * struct device will be a child of that device in sysfs.
+ * The pointer to the struct device will be returned from the call.
+ * Any further sysfs files that might be required can be created using this
+ * pointer.
+ *
+ * Returns &struct device pointer on success, or ERR_PTR() on error.
+ *
+ * Note: the struct class passed to this function must have previously
+ * been created with a call to class_create().
+ */
+struct device *device_create_global(struct class *class, struct device *parent,
+ dev_t devt, void *drvdata,
+ const char *fmt, ...)
+{
+ va_list vargs;
+ struct device *dev;
+
+ va_start(vargs, fmt);
+ dev = device_create_groups_vargs(class, parent, devt, true, drvdata,
+ NULL, fmt, vargs);
+ va_end(vargs);
+ return dev;
+}
+EXPORT_SYMBOL(device_create_global);
+
+/**
* device_create_with_groups - creates a device and registers it with sysfs
* @class: pointer to the struct class that this device should be registered to
* @parent: pointer to the parent struct device of this new device, if any
@@ -1722,8 +1763,8 @@ struct device *device_create_with_groups(struct class *class,
struct device *dev;

va_start(vargs, fmt);
- dev = device_create_groups_vargs(class, parent, devt, drvdata, groups,
- fmt, vargs);
+ dev = device_create_groups_vargs(class, parent, devt, false, drvdata,
+ groups, fmt, vargs);
va_end(vargs);
return dev;
}
diff --git a/include/linux/device.h b/include/linux/device.h
index 41a4ba33b13b..e2dbe19b5f46 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -973,6 +973,10 @@ extern __printf(5, 6)
struct device *device_create(struct class *cls, struct device *parent,
dev_t devt, void *drvdata,
const char *fmt, ...);
+extern __printf(5, 6)
+struct device *device_create_global(struct class *cls, struct device *parent,
+ dev_t devt, void *drvdata,
+ const char *fmt, ...);
extern __printf(6, 7)
struct device *device_create_with_groups(struct class *cls,
struct device *parent, dev_t devt, void *drvdata,
--
1.9.1
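
The pattern used above, two variadic entry points sharing one va_list
worker that takes the extra flag, can be sketched in userspace C. The names
device_init(), device_init_global() and device_init_vargs() are hypothetical,
and the struct is a reduced stand-in for struct device.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Reduced stand-in for struct device. */
struct device {
	char name[32];
	bool ns_global;
};

/* Shared worker: sets the flag and formats the name from a va_list. */
static int device_init_vargs(struct device *dev, bool ns_global,
			     const char *fmt, va_list args)
{
	assert(dev != NULL && fmt != NULL);
	dev->ns_global = ns_global;
	return vsnprintf(dev->name, sizeof(dev->name), fmt, args);
}

/* Existing-style entry point: callers are unchanged, flag is false. */
int device_init(struct device *dev, const char *fmt, ...)
{
	va_list vargs;
	int ret;

	va_start(vargs, fmt);
	ret = device_init_vargs(dev, false, fmt, vargs);
	va_end(vargs);
	return ret;
}

/* New entry point: same arguments, but marks the device global. */
int device_init_global(struct device *dev, const char *fmt, ...)
{
	va_list vargs;
	int ret;

	va_start(vargs, fmt);
	ret = device_init_vargs(dev, true, fmt, vargs);
	va_end(vargs);
	return ret;
}
```

Keeping the flag out of the public signature is what lets the existing
callers compile untouched, at the cost of one more exported wrapper.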
Seth Forshee
2014-05-14 21:34:52 UTC
devtmpfs will use ramfs if tmpfs is not available, and it needs
to store some data of its own. Add a sub_fs_data pointer to
struct ramfs_fs_info and move the struct definition to the shared
header, and export the relevant structs to give devtmpfs access
to this pointer.

Signed-off-by: Seth Forshee <seth.forshee at canonical.com>
---
fs/ramfs/inode.c | 8 --------
include/linux/ramfs.h | 9 +++++++++
2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index d365b1c4eb3c..0f2fa4d7212c 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -163,10 +163,6 @@ static const struct super_operations ramfs_ops = {
.show_options = generic_show_options,
};

-struct ramfs_mount_opts {
- umode_t mode;
-};
-
enum {
Opt_mode,
Opt_err
@@ -177,10 +173,6 @@ static const match_table_t tokens = {
{Opt_err, NULL}
};

-struct ramfs_fs_info {
- struct ramfs_mount_opts mount_opts;
-};
-
static int ramfs_parse_options(char *data, struct ramfs_mount_opts *opts)
{
substring_t args[MAX_OPT_ARGS];
diff --git a/include/linux/ramfs.h b/include/linux/ramfs.h
index ecc730977a5a..cd00e86cc444 100644
--- a/include/linux/ramfs.h
+++ b/include/linux/ramfs.h
@@ -1,6 +1,15 @@
#ifndef _LINUX_RAMFS_H
#define _LINUX_RAMFS_H

+struct ramfs_mount_opts {
+ umode_t mode;
+};
+
+struct ramfs_fs_info {
+ struct ramfs_mount_opts mount_opts;
+ void *sub_fs_data;
+};
+
struct inode *ramfs_get_inode(struct super_block *sb, const struct inode *dir,
umode_t mode, dev_t dev);
extern struct dentry *ramfs_mount(struct file_system_type *fs_type,
--
1.9.1
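
How a filesystem layered on ramfs might use the new pointer can be sketched
in userspace C. This is an assumption-laden illustration: struct super_block
is reduced to a single field, and sb_sub_fs_data() and find_sb() are
hypothetical helpers showing a lookup keyed on sub_fs_data.

```c
#include <assert.h>
#include <stddef.h>

struct ramfs_mount_opts {
	unsigned int mode;
};

struct ramfs_fs_info {
	struct ramfs_mount_opts mount_opts;
	void *sub_fs_data;	/* owned by the layered filesystem */
};

/* Reduced stand-in: only the private-data pointer is modeled. */
struct super_block {
	void *s_fs_info;
};

/* Fetch the layered filesystem's private pointer from a super block. */
void *sb_sub_fs_data(const struct super_block *sb)
{
	const struct ramfs_fs_info *fsi = sb->s_fs_info;

	assert(fsi != NULL);
	return fsi->sub_fs_data;
}

/* Return the super block whose sub_fs_data equals @key, or NULL. */
struct super_block *find_sb(struct super_block *sbs, size_t n,
			    const void *key)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (sbs[i].s_fs_info && sb_sub_fs_data(&sbs[i]) == key)
			return &sbs[i];
	return NULL;
}
```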
Seth Forshee
2014-05-14 21:34:54 UTC
Signed-off-by: Seth Forshee <seth.forshee at canonical.com>
---
drivers/char/mem.c | 28 +++++++++++++++++-----------
1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 917403fe10da..edd71ebf5025 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -811,21 +811,22 @@ static const struct memdev {
umode_t mode;
const struct file_operations *fops;
struct backing_dev_info *dev_info;
+ bool global;
} devlist[] = {
- [1] = { "mem", 0, &mem_fops, &directly_mappable_cdev_bdi },
+ [1] = { "mem", 0, &mem_fops, &directly_mappable_cdev_bdi, false },
#ifdef CONFIG_DEVKMEM
- [2] = { "kmem", 0, &kmem_fops, &directly_mappable_cdev_bdi },
+ [2] = { "kmem", 0, &kmem_fops, &directly_mappable_cdev_bdi, false },
#endif
- [3] = { "null", 0666, &null_fops, NULL },
+ [3] = { "null", 0666, &null_fops, NULL, true },
#ifdef CONFIG_DEVPORT
- [4] = { "port", 0, &port_fops, NULL },
+ [4] = { "port", 0, &port_fops, NULL, false },
#endif
- [5] = { "zero", 0666, &zero_fops, &zero_bdi },
- [7] = { "full", 0666, &full_fops, NULL },
- [8] = { "random", 0666, &random_fops, NULL },
- [9] = { "urandom", 0666, &urandom_fops, NULL },
+ [5] = { "zero", 0666, &zero_fops, &zero_bdi, true },
+ [7] = { "full", 0666, &full_fops, NULL, true },
+ [8] = { "random", 0666, &random_fops, NULL, true },
+ [9] = { "urandom", 0666, &urandom_fops, NULL, true },
#ifdef CONFIG_PRINTK
- [11] = { "kmsg", 0644, &kmsg_fops, NULL },
+ [11] = { "kmsg", 0644, &kmsg_fops, NULL, false },
#endif
};

@@ -897,8 +898,13 @@ static int __init chr_dev_init(void)
if ((minor == DEVPORT_MINOR) && !arch_has_dev_port())
continue;

- device_create(mem_class, NULL, MKDEV(MEM_MAJOR, minor),
- NULL, devlist[minor].name);
+ if (devlist[minor].global)
+ device_create_global(mem_class, NULL,
+ MKDEV(MEM_MAJOR, minor), NULL,
+ devlist[minor].name);
+ else
+ device_create(mem_class, NULL, MKDEV(MEM_MAJOR, minor),
+ NULL, devlist[minor].name);
}

return tty_init();
--
1.9.1
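
The table-driven selection in this patch can be sketched in userspace C. The
table below copies names, modes and flags from the diff but drops the fops
and bdi fields and the config-dependent entries; minor_is_global() and
count_global() are hypothetical helpers. Holes in the sparse array are
detected by a NULL name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct memdev {
	const char *name;
	unsigned int mode;
	bool global;	/* create with the global variant if true */
};

/* Sparse table indexed by minor number; unlisted minors are zeroed. */
static const struct memdev devlist[] = {
	[1]  = { "mem",     0,    false },
	[3]  = { "null",    0666, true  },
	[5]  = { "zero",    0666, true  },
	[7]  = { "full",    0666, true  },
	[8]  = { "random",  0666, true  },
	[9]  = { "urandom", 0666, true  },
	[11] = { "kmsg",    0644, false },
};

#define NUM_MINORS (sizeof(devlist) / sizeof(devlist[0]))

/* True if @minor names a populated entry that is marked global. */
bool minor_is_global(size_t minor)
{
	if (minor >= NUM_MINORS || !devlist[minor].name)
		return false;
	return devlist[minor].global;
}

/* Number of entries that would be created for every namespace. */
size_t count_global(void)
{
	size_t minor, count = 0;

	for (minor = 0; minor < NUM_MINORS; minor++)
		if (minor_is_global(minor))
			count++;
	return count;
}
```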
Seth Forshee
2014-05-14 21:34:53 UTC
devtmpfs is arguably more useful within containers than outside
since containers will often lack the ability to run mknod. So far
this hasn't been permitted since it doesn't make sense to give
containers the same set of devices as the rest of the system.
devtmpfs needs to be aware of device ownership, creating device
nodes only for the namespaces in which a given device should be
accessible.

Add this support by creating multiple devtmpfs super blocks, one
for each user namespace which has devtmpfs mounted. A given super
block only contains device nodes for devices owned by the
associated namespace as well as nodes for global devices. Upon
mount, if no super block already exists for the current user
namespace a new one is created and populated with the appropriate
device nodes.

Under this new structure devtmpfsd can no longer assume that all
files will be created relative to its current working directory,
so this code is also rewritten to create files relative to the
root of the super block.

Signed-off-by: Seth Forshee <seth.forshee at canonical.com>
---
drivers/base/devtmpfs.c | 509 ++++++++++++++++++++++++++++++++++--------------
include/linux/device.h | 1 +
2 files changed, 368 insertions(+), 142 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index 25798db14553..1f77c419ef6a 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -24,10 +24,16 @@
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/kthread.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/user_namespace.h>
#include "base.h"

static struct task_struct *thread;

+static LIST_HEAD(dev_list);
+static DEFINE_MUTEX(dev_list_mutex);
+
#if defined CONFIG_DEVTMPFS_MOUNT
static int mount_dev = 1;
#else
@@ -36,17 +42,77 @@ static int mount_dev;

static DEFINE_SPINLOCK(req_lock);

+enum req_type {
+ REQ_TYPE_CREATE,
+ REQ_TYPE_REMOVE,
+ REQ_TYPE_POPULATE,
+
+ NUM_REQ_TYPES
+};
+
static struct req {
+ enum req_type type;
struct req *next;
struct completion done;
int err;
const char *name;
- umode_t mode; /* 0 => delete */
+ umode_t mode;
kuid_t uid;
kgid_t gid;
struct device *dev;
+ struct super_block *sb;
} *requests;

+#ifdef CONFIG_BLOCK
+static inline int is_blockdev(struct device *dev)
+{
+ return dev->class == &block_class;
+}
+#else
+static inline int is_blockdev(struct device *dev) { return 0; }
+#endif
+
+/* Caller must free returned string */
+static char *dev_get_params(struct device *dev, umode_t *mode, kuid_t *uid,
+ kgid_t *gid)
+{
+ const char *name, *tmp = NULL;
+
+ if (mode)
+ *mode = 0;
+ if (uid)
+ *uid = GLOBAL_ROOT_UID;
+ if (gid)
+ *gid = GLOBAL_ROOT_GID;
+
+ name = device_get_devnode(dev, mode, uid, gid, &tmp);
+ if (!name)
+ return ERR_PTR(-ENOMEM);
+
+ if (mode) {
+ if (*mode == 0)
+ *mode = 0600;
+ *mode |= is_blockdev(dev) ? S_IFBLK : S_IFCHR;
+ }
+
+ /*
+ * If !tmp the name is static memory, so duplicate it for
+ * returning to caller
+ */
+ if (!tmp)
+ tmp = kstrdup(name, GFP_KERNEL);
+ return (char *)tmp;
+}
+
+struct user_namespace *dev_sb_ns(struct super_block *s)
+{
+#ifdef CONFIG_TMPFS
+ return ((struct shmem_sb_info *)s->s_fs_info)->sub_fs_data;
+#else
+ return ((struct ramfs_fs_info *)s->s_fs_info)->sub_fs_data;
+#endif
+}
+
static int __init mount_param(char *str)
{
mount_dev = simple_strtoul(str, NULL, 0);
@@ -54,53 +120,104 @@ static int __init mount_param(char *str)
}
__setup("devtmpfs.mount=", mount_param);

+static int dev_compare_sb(struct super_block *s, void *data)
+{
+ return dev_sb_ns(s) == data;
+}
+
+static int dev_fill_super(struct super_block *s, void *data, int silent)
+{
+#ifdef CONFIG_TMPFS
+ return shmem_fill_super(s, data, silent);
+#else
+ return ramfs_fill_super(s, data, silent);
+#endif
+
+}
+
static struct dentry *dev_mount(struct file_system_type *fs_type, int flags,
- const char *dev_name, void *data)
+ const char *dev_name, void *data)
{
+ struct super_block *s = NULL;
+ struct user_namespace *ns;
+ struct req req;
+ int err;
+
+ if (!thread)
+ return ERR_PTR(-ENODEV);
+
+ ns = get_user_ns(current_user_ns());
+
+ s = sget(fs_type, dev_compare_sb, set_anon_super, flags, ns);
+ if (IS_ERR(s)) {
+ err = PTR_ERR(s);
+ goto error;
+ }
+
+ if (!s->s_root) {
+ err = dev_fill_super(s, data, flags & MS_SILENT ? 1 : 0);
+ if (err)
+ goto error;
+ s->s_flags |= MS_ACTIVE;
+
#ifdef CONFIG_TMPFS
- return mount_single(fs_type, flags, data, shmem_fill_super);
+ ((struct shmem_sb_info *)s->s_fs_info)->sub_fs_data = ns;
#else
- return mount_single(fs_type, flags, data, ramfs_fill_super);
+ ((struct ramfs_fs_info *)s->s_fs_info)->sub_fs_data = ns;
#endif
+
+ req.type = REQ_TYPE_POPULATE;
+ req.sb = s;
+ init_completion(&req.done);
+
+ spin_lock(&req_lock);
+ req.next = requests;
+ requests = &req;
+ spin_unlock(&req_lock);
+
+ wake_up_process(thread);
+ wait_for_completion(&req.done);
+ }
+
+ return dget(s->s_root);
+
+error:
+ if (s)
+ deactivate_locked_super(s);
+ put_user_ns(ns);
+ return ERR_PTR(err);
+}
+
+static void dev_kill_sb(struct super_block *s)
+{
+ struct user_namespace *ns = dev_sb_ns(s);
+
+ kill_litter_super(s);
+ put_user_ns(ns);
}

static struct file_system_type dev_fs_type = {
.name = "devtmpfs",
.mount = dev_mount,
- .kill_sb = kill_litter_super,
+ .kill_sb = dev_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT | FS_USERNS_DEV_MOUNT,
};

-#ifdef CONFIG_BLOCK
-static inline int is_blockdev(struct device *dev)
-{
- return dev->class == &block_class;
-}
-#else
-static inline int is_blockdev(struct device *dev) { return 0; }
-#endif
-
int devtmpfs_create_node(struct device *dev)
{
- const char *tmp = NULL;
struct req req;

+ mutex_lock(&dev_list_mutex);
+ list_add(&dev->devtmpfs_list, &dev_list);
+ mutex_unlock(&dev_list_mutex);
+
if (!thread)
return 0;

- req.mode = 0;
- req.uid = GLOBAL_ROOT_UID;
- req.gid = GLOBAL_ROOT_GID;
- req.name = device_get_devnode(dev, &req.mode, &req.uid, &req.gid, &tmp);
- if (!req.name)
- return -ENOMEM;
-
- if (req.mode == 0)
- req.mode = 0600;
- if (is_blockdev(dev))
- req.mode |= S_IFBLK;
- else
- req.mode |= S_IFCHR;
-
+ req.type = REQ_TYPE_CREATE;
+ req.name = dev_get_params(dev, &req.mode, &req.uid, &req.gid);
+ if (IS_ERR(req.name))
+ return PTR_ERR(req.name);
req.dev = dev;

init_completion(&req.done);
@@ -113,22 +230,26 @@ int devtmpfs_create_node(struct device *dev)
wake_up_process(thread);
wait_for_completion(&req.done);

- kfree(tmp);
+ kfree(req.name);

return req.err;
}

int devtmpfs_delete_node(struct device *dev)
{
- const char *tmp = NULL;
struct req req;

+ mutex_lock(&dev_list_mutex);
+ list_del(&dev->devtmpfs_list);
+ mutex_unlock(&dev_list_mutex);
+
if (!thread)
return 0;

- req.name = device_get_devnode(dev, NULL, NULL, NULL, &tmp);
- if (!req.name)
- return -ENOMEM;
+ req.type = REQ_TYPE_REMOVE;
+ req.name = dev_get_params(dev, NULL, NULL, NULL);
+ if (IS_ERR(req.name))
+ return PTR_ERR(req.name);

req.mode = 0;
req.dev = dev;
@@ -143,113 +264,165 @@ int devtmpfs_delete_node(struct device *dev)
wake_up_process(thread);
wait_for_completion(&req.done);

- kfree(tmp);
+ kfree(req.name);
return req.err;
}

-static int dev_mkdir(const char *name, umode_t mode)
+/*
+ * Looks up the path specified in @nodepath and returns the corresponding
+ * dentry. If @create is true the path will be created if it does not
+ * exist.
+ *
+ * When @create is true: if @nodepath ends in '/', lookup_path() will
+ * create a directory for the last path component if it doesn't exist.
+ * If @nodepath does not end in '/', lookup_path() will create the
+ * dentry but not an inode. It is up to the caller to check d_inode in
+ * the returned dentry and act accordingly.
+ */
+static struct dentry *lookup_path(const char *nodepath, struct dentry *parent,
+ bool create)
{
- struct dentry *dentry;
- struct path path;
- int err;
+ const char *p, *s;
+ struct dentry *next, *de = parent;
+ void *cookie = de->d_sb;
+ bool dir = true;
+ int err = 0;

- dentry = kern_path_create(AT_FDCWD, name, &path, LOOKUP_DIRECTORY);
- if (IS_ERR(dentry))
- return PTR_ERR(dentry);
+ dget(de);
+ for (s = p = nodepath; *s;) {
+ s = strchr(p, '/');
+ if (!s) {
+ if (*p) {
+ s = p + strlen(p);
+ dir = false;
+ } else {
+ break;
+ }
+ }

- err = vfs_mkdir(path.dentry->d_inode, dentry, mode);
- if (!err)
- /* mark as kernel-created inode */
- dentry->d_inode->i_private = &thread;
- done_path_create(&path, dentry);
- return err;
-}
+ mutex_lock(&de->d_inode->i_mutex);
+ next = lookup_one_len(p, de, s - p);
+ if (IS_ERR(next)) {
+ err = PTR_ERR(next);
+ break;
+ }

-static int create_path(const char *nodepath)
-{
- char *path;
- char *s;
- int err = 0;
+ if (!next->d_inode) {
+ if (!create) {
+ err = -ENOENT;
+ dput(next);
+ break;
+ }

- /* parent directories do not exist, create them */
- path = kstrdup(nodepath, GFP_KERNEL);
- if (!path)
- return -ENOMEM;
+ if (dir) {
+ err = vfs_mkdir(de->d_inode, next, 0755);
+ if (err == -EEXIST) {
+ /* SAF: I'm not sure if this is right,
+ * or even necessary. We definitely
+ * should not overwrite i_private in
+ * this case though. */
+ dput(next);
+ err = 0;
+ continue; /* try lookup again */
+ }
+ if (err) {
+ dput(next);
+ break;
+ }
+ next->d_inode->i_private = cookie;
+ }
+ }

- s = path;
- for (;;) {
- s = strchr(s, '/');
- if (!s)
- break;
- s[0] = '\0';
- err = dev_mkdir(path, 0755);
- if (err && err != -EEXIST)
- break;
- s[0] = '/';
- s++;
+ mutex_unlock(&de->d_inode->i_mutex);
+ dput(de);
+ de = next;
+ p = s + 1;
}
- kfree(path);
- return err;
+
+ if (err) {
+ mutex_unlock(&de->d_inode->i_mutex);
+ dput(de);
+ de = ERR_PTR(err);
+ }
+
+ return de;
}

-static int handle_create(const char *nodename, umode_t mode, kuid_t uid,
- kgid_t gid, struct device *dev)
+static void do_handle_create(struct super_block *s, void *arg)
{
+ struct req *req = arg;
+ struct device *dev = req->dev;
struct dentry *dentry;
- struct path path;
int err;

- dentry = kern_path_create(AT_FDCWD, nodename, &path, 0);
- if (dentry == ERR_PTR(-ENOENT)) {
- create_path(nodename);
- dentry = kern_path_create(AT_FDCWD, nodename, &path, 0);
+ if (!dev->ns_global && dev_sb_ns(s) != dev->ns)
+ return;
+
+ dentry = lookup_path(req->name, s->s_root, true);
+ if (IS_ERR(dentry)) {
+ req->err = PTR_ERR(dentry);
+ return;
}
- if (IS_ERR(dentry))
- return PTR_ERR(dentry);

- err = vfs_mknod(path.dentry->d_inode, dentry, mode, dev->devt);
+ if (dentry->d_inode) {
+ dput(dentry);
+ req->err = -EEXIST;
+ return;
+ }
+
+ err = vfs_mknod(dentry->d_parent->d_inode, dentry, req->mode,
+ dev->devt);
if (!err) {
struct iattr newattrs;

- newattrs.ia_mode = mode;
- newattrs.ia_uid = uid;
- newattrs.ia_gid = gid;
+ newattrs.ia_mode = req->mode;
+ /* SAF: Is this right? */
+ newattrs.ia_uid = make_kuid(dev_sb_ns(s), req->uid.val);
+ newattrs.ia_gid = make_kgid(dev_sb_ns(s), req->gid.val);
newattrs.ia_valid = ATTR_MODE|ATTR_UID|ATTR_GID;
mutex_lock(&dentry->d_inode->i_mutex);
notify_change(dentry, &newattrs, NULL);
mutex_unlock(&dentry->d_inode->i_mutex);

/* mark as kernel-created inode */
- dentry->d_inode->i_private = &thread;
+ dentry->d_inode->i_private = s;
}
- done_path_create(&path, dentry);
- return err;
+
+ dput(dentry);
+
+ if (err)
+ req->err = err;
}

-static int dev_rmdir(const char *name)
+static int handle_create(struct req *req)
+{
+ req->err = 0;
+ iterate_supers_type(&dev_fs_type, do_handle_create, req);
+ return req->err;
+}
+
+static int dev_rmdir(struct super_block *s, const char *name)
{
- struct path parent;
struct dentry *dentry;
int err;

- dentry = kern_path_locked(name, &parent);
+ dentry = lookup_path(name, s->s_root, false);
if (IS_ERR(dentry))
return PTR_ERR(dentry);
if (dentry->d_inode) {
- if (dentry->d_inode->i_private == &thread)
- err = vfs_rmdir(parent.dentry->d_inode, dentry);
+ if (dentry->d_inode->i_private == s)
+ err = vfs_rmdir(dentry->d_parent->d_inode, dentry);
else
err = -EPERM;
} else {
err = -ENOENT;
}
+
dput(dentry);
- mutex_unlock(&parent.dentry->d_inode->i_mutex);
- path_put(&parent);
return err;
}

-static int delete_path(const char *nodepath)
+static int delete_path(struct super_block *s, const char *nodepath)
{
const char *path;
int err = 0;
@@ -265,7 +438,7 @@ static int delete_path(const char *nodepath)
if (!base)
break;
base[0] = '\0';
- err = dev_rmdir(path);
+ err = dev_rmdir(s, path);
if (err)
break;
}
@@ -274,10 +447,11 @@ static int delete_path(const char *nodepath)
return err;
}

-static int dev_mynode(struct device *dev, struct inode *inode, struct kstat *stat)
+static int dev_mynode(struct super_block *s, struct device *dev,
+ struct inode *inode, struct kstat *stat)
{
/* did we create it */
- if (inode->i_private != &thread)
+ if (inode->i_private != s)
return 0;

/* does the dev_t match */
@@ -295,36 +469,50 @@ static int dev_mynode(struct device *dev, struct inode *inode, struct kstat *sta
return 1;
}

-static int handle_remove(const char *nodename, struct device *dev)
+static void do_handle_remove(struct super_block *s, void *arg)
{
- struct path parent;
+ struct req *req = arg;
+ struct device *dev = req->dev;
struct dentry *dentry;
int deleted = 0;
- int err;
+ int err = 0;

- dentry = kern_path_locked(nodename, &parent);
- if (IS_ERR(dentry))
- return PTR_ERR(dentry);
+ if (!dev->ns_global && dev_sb_ns(s) != dev->ns)
+ return;
+
+ dentry = lookup_path(req->name, s->s_root, false);
+ if (IS_ERR(dentry)) {
+ req->err = PTR_ERR(dentry);
+ return;
+ }

if (dentry->d_inode) {
struct kstat stat;
- struct path p = {.mnt = parent.mnt, .dentry = dentry};
- err = vfs_getattr(&p, &stat);
- if (!err && dev_mynode(dev, dentry->d_inode, &stat)) {
+ /*
+ * SAF: Should probably call vfs_getattr(), but there's no
+ * obvious way to get a vfsmount. But for both tmpfs and
+ * ramfs it's the same, since neither implement getattr().
+ * So I could leave it like this, or else keep an internal
+ * mount for each super block.
+ */
+ generic_fillattr(dentry->d_inode, &stat);
+ /* SAF: What if !dev_mynode()? Error? */
+ if (dev_mynode(s, dev, dentry->d_inode, &stat)) {
struct iattr newattrs;
/*
* before unlinking this node, reset permissions
* of possible references like hardlinks
*/
- newattrs.ia_uid = GLOBAL_ROOT_UID;
- newattrs.ia_gid = GLOBAL_ROOT_GID;
+ newattrs.ia_uid = make_kuid(dev_sb_ns(s), 0);
+ newattrs.ia_gid = make_kgid(dev_sb_ns(s), 0);
newattrs.ia_mode = stat.mode & ~0777;
newattrs.ia_valid =
ATTR_UID|ATTR_GID|ATTR_MODE;
mutex_lock(&dentry->d_inode->i_mutex);
notify_change(dentry, &newattrs, NULL);
mutex_unlock(&dentry->d_inode->i_mutex);
- err = vfs_unlink(parent.dentry->d_inode, dentry, NULL);
+ err = vfs_unlink(dentry->d_parent->d_inode,
+ dentry, NULL);
if (!err || err == -ENOENT)
deleted = 1;
}
@@ -332,11 +520,43 @@ static int handle_remove(const char *nodename, struct device *dev)
err = -ENOENT;
}
dput(dentry);
- mutex_unlock(&parent.dentry->d_inode->i_mutex);

- path_put(&parent);
- if (deleted && strchr(nodename, '/'))
- delete_path(nodename);
+ if (deleted && strchr(req->name, '/'))
+ delete_path(s, req->name);
+
+ if (err)
+ req->err = err;
+}
+
+static int handle_remove(struct req *req)
+{
+ req->err = 0;
+ iterate_supers_type(&dev_fs_type, do_handle_remove, req);
+ return req->err;
+}
+
+static int handle_populate(struct req *req)
+{
+ struct device *dev;
+ int err = 0;
+
+ mutex_lock(&dev_list_mutex);
+ list_for_each_entry(dev, &dev_list, devtmpfs_list) {
+ if (!dev->ns_global && dev_sb_ns(req->sb) != dev->ns)
+ continue;
+
+ req->name = dev_get_params(dev, &req->mode, &req->uid,
+ &req->gid);
+ if (IS_ERR(req->name)) {
+ err = -ENOMEM;
+ continue;
+ }
+
+ req->dev = dev;
+ do_handle_create(req->sb, req);
+ }
+ mutex_unlock(&dev_list_mutex);
+
return err;
}

@@ -362,31 +582,30 @@ int devtmpfs_mount(const char *mntdir)
return err;
}

-static DECLARE_COMPLETION(setup_done);
-
-static int handle(const char *name, umode_t mode, kuid_t uid, kgid_t gid,
- struct device *dev)
+static int handle(struct req *req)
{
- if (mode)
- return handle_create(name, mode, uid, gid, dev);
- else
- return handle_remove(name, dev);
+ int err;
+
+ switch(req->type) {
+ case REQ_TYPE_CREATE:
+ err = handle_create(req);
+ break;
+ case REQ_TYPE_REMOVE:
+ err = handle_remove(req);
+ break;
+ case REQ_TYPE_POPULATE:
+ err = handle_populate(req);
+ break;
+ default:
+ err = -EINVAL;
+ }
+
+ return err;
}

static int devtmpfsd(void *p)
{
- char options[] = "mode=0755";
- int *err = p;
- *err = sys_unshare(CLONE_NEWNS);
- if (*err)
- goto out;
- *err = sys_mount("devtmpfs", "/", "devtmpfs", MS_SILENT, options);
- if (*err)
- goto out;
- sys_chdir("/.."); /* will traverse into overmounted root */
- sys_chroot(".");
- complete(&setup_done);
- while (1) {
+ while (!kthread_should_stop()) {
spin_lock(&req_lock);
while (requests) {
struct req *req = requests;
@@ -394,8 +613,7 @@ static int devtmpfsd(void *p)
spin_unlock(&req_lock);
while (req) {
struct req *next = req->next;
- req->err = handle(req->name, req->mode,
- req->uid, req->gid, req->dev);
+ req->err = handle(req);
complete(&req->done);
req = next;
}
@@ -406,11 +624,10 @@ static int devtmpfsd(void *p)
schedule();
}
return 0;
-out:
- complete(&setup_done);
- return *err;
}

+struct vfsmount *dev_mnt;
+
/*
* Create devtmpfs instance, driver-core devices will add their device
* nodes here.
@@ -425,9 +642,7 @@ int __init devtmpfs_init(void)
}

thread = kthread_run(devtmpfsd, &err, "kdevtmpfs");
- if (!IS_ERR(thread)) {
- wait_for_completion(&setup_done);
- } else {
+ if (IS_ERR(thread)) {
err = PTR_ERR(thread);
thread = NULL;
}
@@ -438,6 +653,16 @@ int __init devtmpfs_init(void)
return err;
}

+ /* Don't use kern_mount() because tmpfs will set MS_NOUSER */
+ dev_mnt = vfs_kern_mount(&dev_fs_type, 0, dev_fs_type.name, NULL);
+ if (IS_ERR(dev_mnt)) {
+ err = PTR_ERR(dev_mnt);
+ kthread_stop(thread);
+ thread = NULL;
+ unregister_filesystem(&dev_fs_type);
+ return err;
+ }
+
printk(KERN_INFO "devtmpfs: initialized\n");
return 0;
}
diff --git a/include/linux/device.h b/include/linux/device.h
index e2dbe19b5f46..55f0fca24df5 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -785,6 +785,7 @@ struct device {
struct iommu_group *iommu_group;

struct user_namespace *ns;
+ struct list_head devtmpfs_list;

bool offline_disabled:1;
bool offline:1;
--
1.9.1
Seth Forshee
2014-05-14 21:34:55 UTC
Permalink
When adding block devices for partitions the devices default to
ownership by init_user_ns. Instead assign them to the same
namespace as the device for the whole disk.

Signed-off-by: Seth Forshee <seth.forshee at canonical.com>
---
block/partition-generic.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/block/partition-generic.c b/block/partition-generic.c
index 789cdea05893..7c1c83a072a6 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -17,6 +17,7 @@
#include <linux/ctype.h>
#include <linux/genhd.h>
#include <linux/blktrace_api.h>
+#include <linux/user_namespace.h>

#include "partitions/check.h"

@@ -325,6 +326,7 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno,
pdev->class = &block_class;
pdev->type = &part_type;
pdev->parent = ddev;
+ dev_set_ns(pdev, ddev->ns);

err = blk_alloc_devt(p, &devt);
if (err)
--
1.9.1
Seth Forshee
2014-05-14 21:34:56 UTC
Permalink
Many blkdev ioctls require CAP_SYS_ADMIN, preventing them from
being used on block devices owned by unprivileged user
namespaces. Change this to require only CAP_SYS_ADMIN within
the namespace which owns the device. Most devices are owned by
init_user_ns, and in that case this check is equivalent.

Signed-off-by: Seth Forshee <seth.forshee at canonical.com>
---
block/compat_ioctl.c | 3 ++-
block/ioctl.c | 16 ++++++++--------
2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/block/compat_ioctl.c b/block/compat_ioctl.c
index fbd5a67cb773..b16745d14ac2 100644
--- a/block/compat_ioctl.c
+++ b/block/compat_ioctl.c
@@ -10,6 +10,7 @@
#include <linux/syscalls.h>
#include <linux/types.h>
#include <linux/uaccess.h>
+#include <linux/user_namespace.h>

static int compat_put_ushort(unsigned long arg, unsigned short val)
{
@@ -725,7 +726,7 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg)
!blk_queue_nonrot(bdev_get_queue(bdev)));
case BLKRASET: /* compatible, but no compat_ptr (!) */
case BLKFRASET:
- if (!capable(CAP_SYS_ADMIN))
+ if (!ns_capable(disk_to_dev(disk)->ns, CAP_SYS_ADMIN))
return -EACCES;
bdi = blk_get_backing_dev_info(bdev);
if (bdi == NULL)
diff --git a/block/ioctl.c b/block/ioctl.c
index 7d5c3b20af45..45bb28d59b56 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -7,12 +7,13 @@
#include <linux/backing-dev.h>
#include <linux/fs.h>
#include <linux/blktrace_api.h>
+#include <linux/user_namespace.h>
#include <asm/uaccess.h>

static int blkpg_ioctl(struct block_device *bdev, struct blkpg_ioctl_arg __user *arg)
{
struct block_device *bdevp;
- struct gendisk *disk;
+ struct gendisk *disk = bdev->bd_disk;
struct hd_struct *part, *lpart;
struct blkpg_ioctl_arg a;
struct blkpg_partition p;
@@ -20,13 +21,12 @@ static int blkpg_ioctl(struct block_device *bdev, struct blkpg_ioctl_arg __user
long long start, length;
int partno;

- if (!capable(CAP_SYS_ADMIN))
+ if (!ns_capable(disk_to_dev(disk)->ns, CAP_SYS_ADMIN))
return -EACCES;
if (copy_from_user(&a, arg, sizeof(struct blkpg_ioctl_arg)))
return -EFAULT;
if (copy_from_user(&p, a.data, sizeof(struct blkpg_partition)))
return -EFAULT;
- disk = bdev->bd_disk;
if (bdev != bdev->bd_contains)
return -EINVAL;
partno = p.pno;
@@ -157,7 +157,7 @@ static int blkdev_reread_part(struct block_device *bdev)

if (!disk_part_scan_enabled(disk) || bdev != bdev->bd_contains)
return -EINVAL;
- if (!capable(CAP_SYS_ADMIN))
+ if (!ns_capable(disk_to_dev(disk)->ns, CAP_SYS_ADMIN))
return -EACCES;
if (!mutex_trylock(&bdev->bd_mutex))
return -EBUSY;
@@ -281,7 +281,7 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,

switch(cmd) {
case BLKFLSBUF:
- if (!capable(CAP_SYS_ADMIN))
+ if (!ns_capable(disk_to_dev(disk)->ns, CAP_SYS_ADMIN))
return -EACCES;

ret = __blkdev_driver_ioctl(bdev, mode, cmd, arg);
@@ -296,7 +296,7 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
ret = __blkdev_driver_ioctl(bdev, mode, cmd, arg);
if (!is_unrecognized_ioctl(ret))
return ret;
- if (!capable(CAP_SYS_ADMIN))
+ if (!ns_capable(disk_to_dev(disk)->ns, CAP_SYS_ADMIN))
return -EACCES;
if (get_user(n, (int __user *)(arg)))
return -EFAULT;
@@ -380,7 +380,7 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
return put_ushort(arg, !blk_queue_nonrot(bdev_get_queue(bdev)));
case BLKRASET:
case BLKFRASET:
- if(!capable(CAP_SYS_ADMIN))
+ if(!ns_capable(disk_to_dev(disk)->ns, CAP_SYS_ADMIN))
return -EACCES;
bdi = blk_get_backing_dev_info(bdev);
if (bdi == NULL)
@@ -389,7 +389,7 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
return 0;
case BLKBSZSET:
/* set the logical block size */
- if (!capable(CAP_SYS_ADMIN))
+ if (!ns_capable(disk_to_dev(disk)->ns, CAP_SYS_ADMIN))
return -EACCES;
if (!arg)
return -EINVAL;
--
1.9.1
Seth Forshee
2014-05-14 21:34:57 UTC
Permalink
Add a ns_global field to struct miscdevice to allow indicating
that the created struct device should be global to all user
namespaces, and set this flag for loop-control.

Signed-off-by: Seth Forshee <seth.forshee at canonical.com>
---
drivers/block/loop.c | 1 +
drivers/char/misc.c | 11 +++++++++--
include/linux/miscdevice.h | 1 +
3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index f70a230a2945..f0e41a372c24 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1813,6 +1813,7 @@ static struct miscdevice loop_misc = {
.minor = LOOP_CTRL_MINOR,
.name = "loop-control",
.fops = &loop_ctl_fops,
+ .ns_global = true,
};

MODULE_ALIAS_MISCDEV(LOOP_CTRL_MINOR);
diff --git a/drivers/char/misc.c b/drivers/char/misc.c
index ffa97d261cf3..5940ebd98023 100644
--- a/drivers/char/misc.c
+++ b/drivers/char/misc.c
@@ -205,8 +205,15 @@ int misc_register(struct miscdevice * misc)

dev = MKDEV(MISC_MAJOR, misc->minor);

- misc->this_device = device_create(misc_class, misc->parent, dev,
- misc, "%s", misc->name);
+ if (misc->ns_global)
+ misc->this_device = device_create_global(misc_class,
+ misc->parent, dev,
+ misc, "%s",
+ misc->name);
+ else
+ misc->this_device = device_create(misc_class, misc->parent, dev,
+ misc, "%s", misc->name);
+
if (IS_ERR(misc->this_device)) {
int i = DYNAMIC_MINORS - misc->minor - 1;
if (i < DYNAMIC_MINORS && i >= 0)
diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
index 51e26f3cd3b3..a85782339fe2 100644
--- a/include/linux/miscdevice.h
+++ b/include/linux/miscdevice.h
@@ -62,6 +62,7 @@ struct miscdevice {
struct device *this_device;
const char *nodename;
umode_t mode;
+ bool ns_global;
};

extern int misc_register(struct miscdevice * misc);
--
1.9.1
Seth Forshee
2014-05-14 21:34:58 UTC
Permalink
loop-control is now global to user namespaces, meaning that any
namespace can use LOOP_CTL_GET_FREE to request a loop device. The
namespace won't necessarily be able to use the device, however.

Update loop to search only for devices matching current_user_ns()
when finding free devices, and to set the device's owning
namespace to current_user_ns() when a new device is added. This
will cause the device to appear in that namespace's devtmpfs
super block, where it can be used.

This should generally be safe, since only the namespace used to
request the device should see it in devtmpfs, avoiding accidental
use by another namespace. Only a user privileged enough to mknod
will be able to access the same device, and such access is
unlikely to be accidental.

Signed-off-by: Seth Forshee <seth.forshee at canonical.com>
---
drivers/block/loop.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index f0e41a372c24..66bd938bcc1c 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -75,6 +75,8 @@
#include <linux/sysfs.h>
#include <linux/miscdevice.h>
#include <linux/falloc.h>
+#include <linux/notifier.h>
+#include <linux/user_namespace.h>
#include "loop.h"

#include <asm/uaccess.h>
@@ -1674,6 +1676,7 @@ static int loop_add(struct loop_device **l, int i)
disk->private_data = lo;
disk->queue = lo->lo_queue;
sprintf(disk->disk_name, "loop%d", i);
+ dev_set_ns(disk_to_dev(disk), current_user_ns());
add_disk(disk);
*l = lo;
return lo->lo_number;
@@ -1690,6 +1693,7 @@ out:

static void loop_remove(struct loop_device *lo)
{
+ dev_set_ns(disk_to_dev(lo->lo_disk), &init_user_ns);
del_gendisk(lo->lo_disk);
blk_cleanup_queue(lo->lo_queue);
put_disk(lo->lo_disk);
@@ -1701,7 +1705,8 @@ static int find_free_cb(int id, void *ptr, void *data)
struct loop_device *lo = ptr;
struct loop_device **l = data;

- if (lo->lo_state == Lo_unbound) {
+ if (lo->lo_state == Lo_unbound &&
+ disk_to_dev(lo->lo_disk)->ns == current_user_ns()) {
*l = lo;
return 1;
}
--
1.9.1
Seth Forshee
2014-05-14 21:34:59 UTC
Permalink
Signed-off-by: Seth Forshee <seth.forshee at canonical.com>
---
drivers/block/loop.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 66bd938bcc1c..2cc19868ea0d 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1074,7 +1074,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)

if (lo->lo_encrypt_key_size &&
!uid_eq(lo->lo_key_owner, uid) &&
- !capable(CAP_SYS_ADMIN))
+ !ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN))
return -EPERM;
if (lo->lo_state != Lo_bound)
return -ENXIO;
@@ -1164,7 +1164,8 @@ loop_get_status(struct loop_device *lo, struct loop_info64 *info)
memcpy(info->lo_crypt_name, lo->lo_crypt_name, LO_NAME_SIZE);
info->lo_encrypt_type =
lo->lo_encryption ? lo->lo_encryption->number : 0;
- if (lo->lo_encrypt_key_size && capable(CAP_SYS_ADMIN)) {
+ if (lo->lo_encrypt_key_size &&
+ ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN)) {
info->lo_encrypt_key_size = lo->lo_encrypt_key_size;
memcpy(info->lo_encrypt_key, lo->lo_encrypt_key,
lo->lo_encrypt_key_size);
@@ -1309,7 +1310,8 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
break;
case LOOP_SET_STATUS:
err = -EPERM;
- if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
+ if ((mode & FMODE_WRITE) ||
+ ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN))
err = loop_set_status_old(lo,
(struct loop_info __user *)arg);
break;
@@ -1318,7 +1320,8 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
break;
case LOOP_SET_STATUS64:
err = -EPERM;
- if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
+ if ((mode & FMODE_WRITE) ||
+ ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN))
err = loop_set_status64(lo,
(struct loop_info64 __user *) arg);
break;
@@ -1327,7 +1330,8 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
break;
case LOOP_SET_CAPACITY:
err = -EPERM;
- if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
+ if ((mode & FMODE_WRITE) ||
+ ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN))
err = loop_set_capacity(lo, bdev);
break;
default:
--
1.9.1
Marian Marinov
2014-05-23 05:48:25 UTC
Permalink
One question about this patch.

Why don't you use the devices cgroup to check if the root user in that namespace is allowed to use this device?

This way you can be sure that the root in that namespace cannot access devices to which the host system did not give
him access.

Marian
--
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hackman at jabber.org
ICQ: 7556201
Mobile: +359 886 660 270
Seth Forshee
2014-05-26 09:16:14 UTC
Permalink
One question about this patch.
Why don't you use the devices cgroup to check if the root user in that namespace is allowed to use this device?
This way you can be sure that the root in that namespace cannot access devices to which the host system did not give
him access.
That might be possible, but I don't want to require something on the
host to whitelist the device for the container. Then loop would need to
automatically add the device to devices.allow, which doesn't seem
desirable to me. But I'm not entirely opposed to the idea if others
think this is a better way to go.

Seth
Michael H. Warfield
2014-05-26 15:32:05 UTC
Permalink
Post by Seth Forshee
One question about this patch.
Why don't you use the devices cgroup to check if the root user in that namespace is allowed to use this device?
This way you can be sure that the root in that namespace cannot access devices to which the host system did not give
him access.
That might be possible, but I don't want to require something on the
host to whitelist the device for the container. Then loop would need to
automatically add the device to devices.allow, which doesn't seem
desirable to me. But I'm not entirely opposed to the idea if others
think this is a better way to go.
I don't see any safe way to avoid it. The host has to be in control of
what devices can and can not be accessed by the container.
Post by Seth Forshee
Seth
Regards,
Mike
--
Michael H. Warfield (AI4NB) | (770) 978-7061 | mhw at WittsEnd.com
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!

Seth Forshee
2014-05-26 15:45:36 UTC
Permalink
Post by Michael H. Warfield
Post by Seth Forshee
One question about this patch.
Why don't you use the devices cgroup to check if the root user in that namespace is allowed to use this device?
This way you can be sure that the root in that namespace cannot access devices to which the host system did not give
him access.
That might be possible, but I don't want to require something on the
host to whitelist the device for the container. Then loop would need to
automatically add the device to devices.allow, which doesn't seem
desirable to me. But I'm not entirely opposed to the idea if others
think this is a better way to go.
I don't see any safe way to avoid it. The host has to be in control of
what devices can and can not be accessed by the container.
Hmm, for testing I've been giving access to 7:* block devices since my
containers can't mknod and only see device nodes for loop devices they
have access to, but maybe I'm not being sufficiently paranoid.
Serge E. Hallyn
2014-05-27 01:36:34 UTC
Permalink
Post by Michael H. Warfield
Post by Seth Forshee
One question about this patch.
Why don't you use the devices cgroup to check if the root user in that namespace is allowed to use this device?
This way you can be sure that the root in that namespace cannot access devices to which the host system did not give
him access.
That might be possible, but I don't want to require something on the
host to whitelist the device for the container. Then loop would need to
automatically add the device to devices.allow, which doesn't seem
desirable to me. But I'm not entirely opposed to the idea if others
think this is a better way to go.
I don't see any safe way to avoid it. The host has to be in control of
what devices can and can not be accessed by the container.
Disagree. loop%d is meaningless until it is attached to a file. So
whether a container can use loop2 vs loop9 is meaningless. The point
of Seth's loopfs as I understood it is that the container simply gets a
unique (not visible to host or any other containers) set of loop devices
which it can attach to files which it owns. So long as the host can't
see the container's loop devices (i.e. so it unwittingly mounts it when
looking for a particular UUID for /var), it won't get fooled by them.

So in this case *if* we can do it, a purely namespaced approach - meaning
that we restrict visibility of a particular loopdev to one container - is
perfect.
Michael H. Warfield
2014-05-27 02:39:22 UTC
Permalink
Post by Serge E. Hallyn
Post by Michael H. Warfield
Post by Seth Forshee
One question about this patch.
Why don't you use the devices cgroup to check if the root user in that namespace is allowed to use this device?
This way you can be sure that the root in that namespace cannot access devices to which the host system did not give
him access.
That might be possible, but I don't want to require something on the
host to whitelist the device for the container. Then loop would need to
automatically add the device to devices.allow, which doesn't seem
desirable to me. But I'm not entirely opposed to the idea if others
think this is a better way to go.
I don't see any safe way to avoid it. The host has to be in control of
what devices can and can not be accessed by the container.
Disagree. loop%d is meaningless until it is attached to a file. So
whether a container can use loop2 vs loop9 is meaningless. The point
of Seth's loopfs as I understood it is that the container simply gets a
unique (not visible to host or any other containers) set of loop devices
which it can attach to files which it owns. So long as the host can't
see the container's loop devices (i.e. so it unwittingly mounts it when
looking for a particular UUID for /var), it won't get fooled by them.
So in this case *if* we can do it, a purely namespaced approach - meaning
that we restrict visibility of a particular loopdev to one container - is
perfect.
And in that "*if*" is a cloud that says "then a miracle occurs" and that
miracle needs a lot more detail. How that translates into what is and
is not visible and what can be mimicked in a container becomes important
(to say nothing of notifying its udev). I think this loopfs thing is
the answer for the loop device case, we just need to clear up those
details and exorcise the devils we find in them. The loop devices are
unique while they strangely seem to work with minimal leakage already
(all meta data at this time).

Seth remarked that, maybe, he's not paranoid enough. You know that I'm
a well trained professional paranoid and I accept if people think I'm
overly paranoid (is that even possible?). Even paranoids have enemies
and just because you're paranoid it doesn't mean they're not out to get
you. While I admit that total isolation is virtually (excuse the pun)
impossible that doesn't mean I don't strive to maximize the isolation
and analyze the possibilities and consequences of compromise.

As I stated, "I don't see any way to avoid it". I would love to be
proven wrong. It would permit my life to be so much more easy. But how
can we allow this without the host in control of it and directing things
to the containers? A container may request something and the host can
grant it but the container should not be capable of demanding a device
over and above the control of the host. How do we define the rules that
say what a container can do and what it cannot do without it involving
knowledge in the host (whitelisting as Seth calls it) of what is and is
not allowed in the container?

We already have the problem that the container devices.allow and
devices.deny are major and minor based, which we know is fundamentally
flawed in a udev environment. We specify major:minor in the
configuration files as if they are cast in cement (which they are in all
common cases) but they are not in the general case. Greg K-H hammers on
this frequently.

The loop devices are unique and deserve a unique solution, I'll agree.
But I'm also comfortable that the host should have rules and procedures
to whitelist hard devices and loop devices and manage their transfer
and/or sharing into the containers.

Regards,
Mike
--
Michael H. Warfield (AI4NB) | (770) 978-7061 | mhw at WittsEnd.com
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!

Serge Hallyn
2014-05-27 13:16:57 UTC
Permalink
Post by Michael H. Warfield
Post by Serge E. Hallyn
Post by Michael H. Warfield
Post by Seth Forshee
One question about this patch.
Why don't you use the devices cgroup to check if the root user in that namespace is allowed to use this device?
This way you can be sure that the root in that namespace cannot access devices to which the host system did not give
him access.
That might be possible, but I don't want to require something on the
host to whitelist the device for the container. Then loop would need to
automatically add the device to devices.allow, which doesn't seem
desirable to me. But I'm not entirely opposed to the idea if others
think this is a better way to go.
I don't see any safe way to avoid it. The host has to be in control of
what devices can and can not be accessed by the container.
Disagree. loop%d is meaningless until it is attached to a file. So
whether a container can use loop2 vs loop9 is meaningless. The point
of Seth's loopfs as I understood it is that the container simply gets a
unique (not visible to host or any other containers) set of loop devices
which it can attach to files which it owns. So long as the host can't
see the container's loop devices (i.e. so it unwittingly mounts it when
looking for a particular UUID for /var), it won't get fooled by them.
So in this case *if* we can do it, a purely namespaced approach - meaning
that we restrict visibility of a particular loopdev to one container - is
perfect.
And in that "*if*" is a cloud that says "then a miracle occurs" and that
miracle needs a lot more detail.
Naturally. Which is why as Seth says we'll need concrete code to discuss.
But the concept that a well implemented namespace which prevents addressing
a given resource in the first place would suffice is, I think, a well
accepted premise of security in linux. And in this case it is more
appropriate than trying to finagle it into the devices cgroup. Note that
Marian said "to check if root user in that namespace is allowed to use
this device." This first off does not address the concern of root on the
host being tricked by the contents of loop0 which happens to be legitimately
used by container N. In contrast, making it so that loop0 is only
addressable by container N, and not by the host, does.

Anyway, I was reading the above as "why don't we *base* the containerized loop
on devices cgroups". I object to that. Well, at least until we rule out
more elegant solutions. Of course I don't object to defense in depth.

-serge
Seth Forshee
2014-05-27 07:16:26 UTC
Permalink
Post by Michael H. Warfield
Post by Serge E. Hallyn
Post by Michael H. Warfield
Post by Seth Forshee
One question about this patch.
Why don't you use the devices cgroup to check if the root user in that namespace is allowed to use this device?
This way you can be sure that the root in that namespace cannot access devices to which the host system did not give
him access.
That might be possible, but I don't want to require something on the
host to whitelist the device for the container. Then loop would need to
automatically add the device to devices.allow, which doesn't seem
desirable to me. But I'm not entirely opposed to the idea if others
think this is a better way to go.
I don't see any safe way to avoid it. The host has to be in control of
what devices can and can not be accessed by the container.
Disagree. loop%d is meaningless until it is attached to a file. So
whether a container can use loop2 vs loop9 is meaningless. The point
of Seth's loopfs as I understood it is that the container simply gets a
unique (not visible to host or any other containers) set of loop devices
which it can attach to files which it owns. So long as the host can't
see the container's loop devices (i.e. so it unwittingly mounts it when
looking for a particular UUID for /var), it won't get fooled by them.
So in this case *if* we can do it, a purely namespaced approach - meaning
that we restrict visibility of a particular loopdev to one container - is
perfect.
And in that "*if*" is a cloud that says "then a miracle occurs" and that
miracle needs a lot more detail. How that translates into what is and
is not visible and what can be mimicked in a container becomes important
(to say nothing of notifying its udev). I think this loopfs thing is
the answer for the loop device case, we just need to clear up those
details and exorcise the devils we find in them. The loop devices are
unique while they strangely seem to work with minimal leakage already
(all meta data at this time).
Seth remarked that, maybe, he's not paranoid enough. You know that I'm
a well trained professional paranoid and I accept if people think I'm
overly paranoid (is that even possible?). Even paranoids have enemies
and just because you're paranoid it doesn't mean they're not out to get
you. While I admit that total isolation is virtually (excuse the pun)
impossible that doesn't mean I don't strive to maximize the isolation
and analyze the possibilities and consequences of compromise.
As I stated, "I don't see any way to avoid it". I would love to be
proven wrong. It would permit my life to be so much more easy. But how
can we allow this without the host in control of it and directing things
to the containers? A container may request something and the host can
grant it but the container should not be capable of demanding a device
over and above the control of the host. How do we define the rules that
say what a container can do and what it cannot do without it involving
knowledge in the host (whitelisting as Seth calls it) of what is and is
not allowed in the container?
I need to post some code so we have something concrete to discuss, just
haven't gotten to it yet due to travel and meetings. I'll try to work
the code into something presentable and send it out later today.

But in loopfs the kernel would ultimately control directing loop devs
to containers based on the mount. A container can request a free loop
device via loop-control, and one gets assigned to that mount. Userspace
can ask for a specific loop device number, but if that device is
associated with a different mount that will fail, so the container can't
gain access to another context's loop device. The container has access
to only its loop devices by virtue of not having device nodes for any
other devices.
Post by Michael H. Warfield
We already have the problem that the container devices.allow and
devices.deny are major and minor based, which we know is fundamentally
flawed in a udev environment. We specify major:minor in the
configuration files as if they are cast in cement (which they are in all
common cases) but they are not in the general case. Greg K-H hammers on
this frequently.
The loop devices are unique and deserve a unique solution, I'll agree.
But I'm also comfortable that the host should have rules and procedures
to whitelist hard devices and loop devices and manage their transfer
and/or sharing into the containers.
I'm aiming for being able to use the same tools to manage loop devices in
a container as on the host. If the whole thing needs to be managed by a
process on the host then I suspect we need something more like
intercepting ioctls on loop-control within the container so the manager
can handle them.
Greg Kroah-Hartman
2014-05-15 01:32:45 UTC
Permalink
Post by Seth Forshee
Unprivileged containers cannot run mknod, making it difficult to support
devices which appear at runtime.
Wait.

Why would you even want a container to see a "new" device? That's the
whole point, your container should see a "clean" system, not the "this
USB device was just plugged in" system. Otherwise, how are you going to
even tell that container a new device showed up? Are you now going to
add udev support in containers? Hah, no.
Post by Seth Forshee
Using devtmpfs is one possible
solution, and it would have the added benefit of making container setup
simpler. But simply letting containers mount devtmpfs isn't sufficient
since the container may need to see a different, more limited set of
devices, and because different environments making modifications to
the filesystem could lead to conflicts.
This series solves these problems by assigning devices to user
namespaces. Each device has an "owner" namespace which specifies which
devtmpfs mount the device should appear in as well as allowing privileged
operations on the device from that namespace. This defaults to
init_user_ns. There's also an ns_global flag to indicate a device should
appear in all devtmpfs mounts.
I'd strongly argue that this isn't even a "problem" at all. And, as I
said at the Plumbers conference last year, adding namespaces to devices
isn't going to happen, sorry. Please don't continue down this path.

greg k-h
Michael H. Warfield
2014-05-15 02:17:31 UTC
Post by Greg Kroah-Hartman
Post by Seth Forshee
Unprivileged containers cannot run mknod, making it difficult to support
devices which appear at runtime.
Wait.
Why would you even want a container to see a "new" device? That's the
whole point, your container should see a "clean" system, not the "this
USB device was just plugged in" system. Otherwise, how are you going to
even tell that container a new device showed up? Are you now going to
add udev support in containers? Hah, no.
Oooo... I can answer that... Tell me if you've heard this one
before... (You have back in NOLA last summer)...

I use a USB sharing device that controls a multiport USB serial device
controlling serial consoles to 16 servers and shared between 4
controlling servers. The sharing control port (a USB HID device) should
be shared between designated containers so that any designated container
owner can "request" a console to one of the other servers (yeah, I know
there can be contention but that's the way the cookie crumbles - most of
the time it's on the master host). Once they get the sharing device's
attention, they "lose" that HID control device (it disappears from /dev
entirely) and they gain only their designated USBtty{n} device for their
console. Dynamic devices at their finest.

I worked out a way of dealing with it using udev rules in the host and
shifting devices using subdirectories in /dev. I got the infrastructure
implemented but didn't finish the specific udev rules.
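The actual rules are not shown in this thread, so the following is only a guess at the general shape: a small generator for a host-side udev rule that adds a symlink for a matching device under a per-container subdirectory of /dev. The match keys (SUBSYSTEM, ATTRS{idVendor}, ATTRS{idProduct}), the SYMLINK+= assignment and the %k substitution are standard udev rule syntax; the containers/<name>/ layout, the function name and the IDs in the usage line are hypothetical.

```python
def udev_symlink_rule(subsystem: str, vendor_id: str, product_id: str, container: str) -> str:
    """Return one udev rule line that links a matching device into a per-container /dev subdirectory."""
    return (
        f'SUBSYSTEM=="{subsystem}", '
        f'ATTRS{{idVendor}}=="{vendor_id}", '
        f'ATTRS{{idProduct}}=="{product_id}", '
        # %k expands to the kernel name of the device, e.g. ttyUSB3.
        f'SYMLINK+="containers/{container}/%k"'
    )


# Placeholder IDs and container name, for illustration only.
print(udev_symlink_rule("tty", "0403", "6001", "build1"))
```

A container manager could then bind-mount /dev/containers/build1 into the container, so the set of visible nodes follows hotplug events on the host without any kernel changes.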
Post by Greg Kroah-Hartman
Post by Seth Forshee
Using devtmpfs is one possible
solution, and it would have the added benefit of making container setup
simpler. But simply letting containers mount devtmpfs isn't sufficient
since the container may need to see a different, more limited set of
devices, and because different environments making modifications to
the filesystem could lead to conflicts.
This series solves these problems by assigning devices to user
namespaces. Each device has an "owner" namespace which specifies which
devtmpfs mount the device should appear in as well as allowing privileged
operations on the device from that namespace. This defaults to
init_user_ns. There's also an ns_global flag to indicate a device should
appear in all devtmpfs mounts.
I'd strongly argue that this isn't even a "problem" at all. And, as I
said at the Plumbers conference last year, adding namespaces to devices
isn't going to happen, sorry. Please don't continue down this path.
I was just mentioning that to Serge just a week or so ago reminding him
of what you told all of us face to face back then. We were having a
discussion over loop devices into containers and this topic came up.
Post by Greg Kroah-Hartman
greg k-h
Regards,
Mike
--
Michael H. Warfield (AI4NB) | (770) 978-7061 | mhw at WittsEnd.com
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!

Greg Kroah-Hartman
2014-05-15 04:00:32 UTC
Post by Michael H. Warfield
Post by Greg Kroah-Hartman
Post by Seth Forshee
Using devtmpfs is one possible
solution, and it would have the added benefit of making container setup
simpler. But simply letting containers mount devtmpfs isn't sufficient
since the container may need to see a different, more limited set of
devices, and because different environments making modifications to
the filesystem could lead to conflicts.
This series solves these problems by assigning devices to user
namespaces. Each device has an "owner" namespace which specifies which
devtmpfs mount the device should appear in as well as allowing privileged
operations on the device from that namespace. This defaults to
init_user_ns. There's also an ns_global flag to indicate a device should
appear in all devtmpfs mounts.
I'd strongly argue that this isn't even a "problem" at all. And, as I
said at the Plumbers conference last year, adding namespaces to devices
isn't going to happen, sorry. Please don't continue down this path.
I was just mentioning that to Serge just a week or so ago reminding him
of what you told all of us face to face back then. We were having a
discussion over loop devices into containers and this topic came up.
It was the loop device use case that got me started down this path in
the first place, so I don't personally have any interest in physical
devices right now (though I was sure others would).
Why do you want to give access to a loop device to a container?
Shouldn't you set up the loop devices before creating the container and
then pass those mount points into the container? I thought that was how
things worked today, or am I missing something?

Giving the ability for a container to create a loop device at all is a
horrid idea, as you have pointed out, lots of information leakage could
easily happen.

greg k-h
Michael H. Warfield
2014-05-15 13:42:17 UTC
Post by Greg Kroah-Hartman
Post by Michael H. Warfield
Post by Greg Kroah-Hartman
Post by Seth Forshee
Using devtmpfs is one possible
solution, and it would have the added benefit of making container setup
simpler. But simply letting containers mount devtmpfs isn't sufficient
since the container may need to see a different, more limited set of
devices, and because different environments making modifications to
the filesystem could lead to conflicts.
This series solves these problems by assigning devices to user
namespaces. Each device has an "owner" namespace which specifies which
devtmpfs mount the device should appear in as well as allowing privileged
operations on the device from that namespace. This defaults to
init_user_ns. There's also an ns_global flag to indicate a device should
appear in all devtmpfs mounts.
I'd strongly argue that this isn't even a "problem" at all. And, as I
said at the Plumbers conference last year, adding namespaces to devices
isn't going to happen, sorry. Please don't continue down this path.
I was just mentioning that to Serge just a week or so ago reminding him
of what you told all of us face to face back then. We were having a
discussion over loop devices into containers and this topic came up.
It was the loop device use case that got me started down this path in
the first place, so I don't personally have any interest in physical
devices right now (though I was sure others would).
Why do you want to give access to a loop device to a container?
Shouldn't you set up the loop devices before creating the container and
then pass those mount points into the container? I thought that was how
things worked today, or am I missing something?
Ah, you keep feeding me easy ones. I need raw access to loop devices
and loop-control because I'm using containers to build NST (Network
Security Toolkit) distribution iso images (one container is x86_64 while
the other is i686). Each requires 2 loop devices. You can't set up the
loop devices in advance since the containers will be creating the images
and building them. NST tinkers with the base build engine
configuration, so I really DON'T want it running on a hard iron host.
There may be other cases where I need other specialized containers for
building distros. I'm also looking at custom builds of Kali (another
security distribution).
Post by Greg Kroah-Hartman
Giving the ability for a container to create a loop device at all is a
horrid idea, as you have pointed out, lots of information leakage could
easily happen.
It does but only slightly. I noticed that losetup will list all the
devices regardless of the container where it is run or the container
where they were set up. But that seems to be largely cosmetic. You
can't do anything with the loop device in the other container. You
can't disconnect it, read it, or mount it (I've tested it). In the
former case, losetup returns with no error but does nothing. In the
latter case, you get a busy error.
Not clean, not pretty, but no damage. Since loop-control is working on
the global pool of loop devices, it's impossible to know what device to
move to what container when the container runs losetup.
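For readers unfamiliar with loop-control, a minimal sketch of the request losetup makes when it asks for a free device. The ioctl numbers are the ones defined in <linux/loop.h>; the function names are hypothetical, and actually calling get_free_loop_device needs a Linux host and sufficient privileges on /dev/loop-control.

```python
import fcntl
import os

# ioctl request numbers from <linux/loop.h>
LOOP_CTL_ADD = 0x4C80
LOOP_CTL_REMOVE = 0x4C81
LOOP_CTL_GET_FREE = 0x4C82


def loop_node_path(index: int) -> str:
    """Map a loop device index to its conventional node name."""
    return f"/dev/loop{index}"


def get_free_loop_device(control_path: str = "/dev/loop-control") -> str:
    """Ask the kernel for the index of an unused loop device and return its node path."""
    fd = os.open(control_path, os.O_RDWR)
    try:
        # The index comes back as the ioctl's return value.
        index = fcntl.ioctl(fd, LOOP_CTL_GET_FREE)
    finally:
        os.close(fd)
    return loop_node_path(index)
```

The returned index is drawn from the single system-wide pool, which is why, as noted above, nothing in the request tells the host which container the resulting node should be moved into.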

For me, this isn't a serious problem, since it only involves 2
specialized containers out of over 4 dozen containers I have running
across 3 sites. And those two containers are under my explicit and
exclusive control. None of the others need it. I can get away with
adding extra loop devices and adding them to the containers and let
losetup deal with allocation and contention.

Serge mentioned something to me about a loopdevfs (?) thing that someone
else is working on. That would seem to be a better solution in this
particular case but I don't know much about it or where it's at.

Mind you, I heard your arguments at LinuxPlumbers regarding pushing user
space policies into the kernel and all and basically I agree with you,
this should be handled in host system user space and it seems
reasonable. I'm just pointing out real world cases I have in operation
right now and pointing out that I have solutions for them in host user
space, even if some of them may not be esthetically pretty.
Post by Greg Kroah-Hartman
greg k-h
Regards,
Mike
--
Michael H. Warfield (AI4NB) | (770) 978-7061 | mhw at WittsEnd.com
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!

Greg Kroah-Hartman
2014-05-15 14:08:56 UTC
Post by Michael H. Warfield
Post by Greg Kroah-Hartman
Post by Michael H. Warfield
Post by Greg Kroah-Hartman
Post by Seth Forshee
Using devtmpfs is one possible
solution, and it would have the added benefit of making container setup
simpler. But simply letting containers mount devtmpfs isn't sufficient
since the container may need to see a different, more limited set of
devices, and because different environments making modifications to
the filesystem could lead to conflicts.
This series solves these problems by assigning devices to user
namespaces. Each device has an "owner" namespace which specifies which
devtmpfs mount the device should appear in as well as allowing privileged
operations on the device from that namespace. This defaults to
init_user_ns. There's also an ns_global flag to indicate a device should
appear in all devtmpfs mounts.
I'd strongly argue that this isn't even a "problem" at all. And, as I
said at the Plumbers conference last year, adding namespaces to devices
isn't going to happen, sorry. Please don't continue down this path.
I was just mentioning that to Serge just a week or so ago reminding him
of what you told all of us face to face back then. We were having a
discussion over loop devices into containers and this topic came up.
It was the loop device use case that got me started down this path in
the first place, so I don't personally have any interest in physical
devices right now (though I was sure others would).
Why do you want to give access to a loop device to a container?
Shouldn't you set up the loop devices before creating the container and
then pass those mount points into the container? I thought that was how
things worked today, or am I missing something?
Ah, you keep feeding me easy ones. I need raw access to loop devices
and loop-control because I'm using containers to build NST (Network
Security Toolkit) distribution iso images (one container is x86_64 while
the other is i686). Each requires 2 loop devices. You can't set up the
loop devices in advance since the containers will be creating the images
and building them. NST tinkers with the base build engine
configuration, so I really DON'T want it running on a hard iron host.
There may be other cases where I need other specialized containers for
building distros. I'm also looking at custom builds of Kali (another
security distribution).
Then don't use a container to build such a thing, or fix the build
scripts to not do that :)

That is not a "normal" use case for a container at all. Containers are
not for "everything", use a virtual machine for some tasks (like this
one).
Post by Michael H. Warfield
Serge mentioned something to me about a loopdevfs (?) thing that someone
else is working on. That would seem to be a better solution in this
particular case but I don't know much about it or where it's at.
Ok, let's see those patches then.

thanks,

greg k-h
Serge Hallyn
2014-05-15 17:42:54 UTC
Post by Greg Kroah-Hartman
Post by Michael H. Warfield
Post by Greg Kroah-Hartman
Post by Michael H. Warfield
Post by Greg Kroah-Hartman
Post by Seth Forshee
Using devtmpfs is one possible
solution, and it would have the added benefit of making container setup
simpler. But simply letting containers mount devtmpfs isn't sufficient
since the container may need to see a different, more limited set of
devices, and because different environments making modifications to
the filesystem could lead to conflicts.
This series solves these problems by assigning devices to user
namespaces. Each device has an "owner" namespace which specifies which
devtmpfs mount the device should appear in as well as allowing privileged
operations on the device from that namespace. This defaults to
init_user_ns. There's also an ns_global flag to indicate a device should
appear in all devtmpfs mounts.
I'd strongly argue that this isn't even a "problem" at all. And, as I
said at the Plumbers conference last year, adding namespaces to devices
isn't going to happen, sorry. Please don't continue down this path.
I was just mentioning that to Serge just a week or so ago reminding him
of what you told all of us face to face back then. We were having a
discussion over loop devices into containers and this topic came up.
It was the loop device use case that got me started down this path in
the first place, so I don't personally have any interest in physical
devices right now (though I was sure others would).
Why do you want to give access to a loop device to a container?
Shouldn't you set up the loop devices before creating the container and
then pass those mount points into the container? I thought that was how
things worked today, or am I missing something?
Ah, you keep feeding me easy ones. I need raw access to loop devices
and loop-control because I'm using containers to build NST (Network
Security Toolkit) distribution iso images (one container is x86_64 while
the other is i686). Each requires 2 loop devices. You can't set up the
loop devices in advance since the containers will be creating the images
and building them. NST tinkers with the base build engine
configuration, so I really DON'T want it running on a hard iron host.
There may be other cases where I need other specialized containers for
building distros. I'm also looking at custom builds of Kali (another
security distribution).
Then don't use a container to build such a thing, or fix the build
scripts to not do that :)
That is not a "normal" use case for a container at all. Containers are
not for "everything", use a virtual machine for some tasks (like this
one).
Hi Greg,

What exactly defines '"normal" use case for a container'? Not too long
ago much of what we can now do with network namespaces was not a normal
container use case. Neither "you can't do it now" nor "I don't use it
like that" should be grounds for a pre-emptive nack. "It will horribly
break security assumptions" certainly would be.

That's not to say there might not be good reasons why this in particular
is not appropriate, but ISTM if things are going to be nacked without
consideration of the patchset itself, we ought to be having a ksummit
session to come to a consensus [ or receive a decree, presumably by you :)
but after we have a chance to make our case ] on what things are going to
be un/acceptable.
Post by Greg Kroah-Hartman
Post by Michael H. Warfield
Serge mentioned something to me about a loopdevfs (?) thing that someone
else is working on. That would seem to be a better solution in this
particular case but I don't know much about it or where it's at.
Ok, let's see those patches then.
I think Seth has a git tree ready, but not sure which branch he'd want
us to look at.

Splitting a namespaced devtmpfs from loopdevfs discussion might be
sensible. However, in defense of a namespaced devtmpfs I'd say
that for userspace to, at every container startup, bind-mount in
devices from the global devtmpfs into a private tmpfs (for systemd's
sake it can't just be on the container rootfs), seems like something
worth avoiding.
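As a rough illustration of the per-startup work being described, here is a sketch that only builds the list of commands a container manager might run: mount a private tmpfs as the container's /dev, then bind-mount selected nodes from the host's devtmpfs into it. Nothing is executed; the function name, the mount options and the paths in the usage lines are assumptions, and only top-level node names are handled.

```python
import os
from typing import List


def plan_container_dev(container_dev: str, devices: List[str], host_dev: str = "/dev") -> List[List[str]]:
    """Return the argv lists for populating a private /dev; the caller decides whether to run them."""
    commands = [["mount", "-t", "tmpfs", "-o", "mode=0755,nosuid", "none", container_dev]]
    for name in devices:
        source = os.path.join(host_dev, name)
        target = os.path.join(container_dev, name)
        # A bind mount of a single node needs an existing file to mount over.
        commands.append(["touch", target])
        commands.append(["mount", "--bind", source, target])
    return commands


# Hypothetical container path and device list.
for argv in plan_container_dev("/var/lib/lxc/c1/rootfs/dev", ["null", "zero", "loop0"]):
    print(" ".join(argv))
```

Every device costs two extra commands at every container start, and devices that appear later on the host are not picked up, which is the overhead a namespaced devtmpfs was meant to avoid.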

-serge

PS - Apparently both Parallels and Michael independently
project devices which are hot-plugged on the host into containers.
That also seems like something worth talking about (best practices,
shortcomings, use cases not met by it, any ways that the kernel can
help out) at ksummit/linuxcon.
Seth Forshee
2014-05-15 18:12:06 UTC
Post by Serge Hallyn
Post by Greg Kroah-Hartman
Post by Michael H. Warfield
Serge mentioned something to me about a loopdevfs (?) thing that someone
else is working on. That would seem to be a better solution in this
particular case but I don't know much about it or where it's at.
Ok, let's see those patches then.
I think Seth has a git tree ready, but not sure which branch he'd want
us to look at.
I think the most recent code I've got is the devloop branch of
http://kernel.ubuntu.com/git/sforshee/ubuntu-trusty.git, which is still
a bit messy but gets the idea across. I switched from that to the
devtmpfs approach though for several reasons: the pseudo-fs approach
required some (in my opinion) undesirable collateral changes, it would
require changes to userspace tools (though likely small), and it solves
the problem only for loop devices. Plus if you don't push namespace
awareness down to at least the generic block layer you still can't do
partitions or encrypted loop, and then there are still other problems
which need to be solved to get partition blkdevs inside the mount.
Greg Kroah-Hartman
2014-05-15 22:15:51 UTC
Post by Serge Hallyn
What exactly defines '"normal" use case for a container'?
Well, I'd say "acting like a virtual machine" is a good start :)
Post by Serge Hallyn
Not too long ago much of what we can now do with network namespaces
was not a normal container use case. Neither "you can't do it now"
nor "I don't use it like that" should be grounds for a pre-emptive
nack. "It will horribly break security assumptions" certainly would
be.
I agree, and maybe we will get there over time, but this patch is not
the way to do that.
Post by Serge Hallyn
That's not to say there might not be good reasons why this in particular
is not appropriate, but ISTM if things are going to be nacked without
consideration of the patchset itself, we ought to be having a ksummit
session to come to a consensus [ or receive a decree, presumably by you :)
but after we have a chance to make our case ] on what things are going to
be un/acceptable.
I already stood up and publicly said this last year at Plumbers, why
is anything now different?

And this patchset is proof of why it's not a good idea. You really
didn't do anything with all of the namespace stuff, except change loop.
That's the only thing that cares, so, just do it there, like I said to
do so, last August.

And you are ignoring the notifications to userspace and how namespaces
here would deal with that.
Post by Serge Hallyn
Post by Greg Kroah-Hartman
Post by Michael H. Warfield
Serge mentioned something to me about a loopdevfs (?) thing that someone
else is working on. That would seem to be a better solution in this
particular case but I don't know much about it or where it's at.
Ok, let's see those patches then.
I think Seth has a git tree ready, but not sure which branch he'd want
us to look at.
Splitting a namespaced devtmpfs from loopdevfs discussion might be
sensible. However, in defense of a namespaced devtmpfs I'd say
that for userspace to, at every container startup, bind-mount in
devices from the global devtmpfs into a private tmpfs (for systemd's
sake it can't just be on the container rootfs), seems like something
worth avoiding.
I think having to pick and choose what device nodes you want in a
container is a good thing. Besides, you would have to do the same thing
in the kernel anyway, what's wrong with userspace making the decision
here, especially as it knows exactly what it wants to do much more so
than the kernel ever can.
Post by Serge Hallyn
PS - Apparently both Parallels and Michael independently
project devices which are hot-plugged on the host into containers.
That also seems like something worth talking about (best practices,
shortcomings, use cases not met by it, any ways that the kernel can
help out) at ksummit/linuxcon.
I was told that containers would never want devices hotplugged into
them. What use case has this happening / needed?

thanks,

greg k-h
Michael H. Warfield
2014-05-16 01:42:25 UTC
Post by Greg Kroah-Hartman
Post by Serge Hallyn
What exactly defines '"normal" use case for a container'?
Well, I'd say "acting like a virtual machine" is a good start :)
Ok... And virtual machines (VirtualBox, VMware, etc, etc) have hot plug
USB devices. I use the USB hotplug with VirtualBox. I plug a
configured USB device in and the VirtualBox VM grabs it. Virtual
machines have loopback devices. I've used them and using them in
containers is significantly more efficient. VirtualBox has remote audio
and a host of other device features.

Now we have some agreement. Normal is "acting like a virtual machine".
That's a goal I can agree with. I want to work toward that goal of
containers "acting like a virtual machine" just running on a common
kernel with the host. It's a challenge. We're getting there.
Post by Greg Kroah-Hartman
Post by Serge Hallyn
Not too long ago much of what we can now do with network namespaces
was not a normal container use case. Neither "you can't do it now"
nor "I don't use it like that" should be grounds for a pre-emptive
nack. "It will horribly break security assumptions" certainly would
be.
I agree, and maybe we will get there over time, but this patch is not
the way to do that.
Ok... We have a goal. Now we can haggle over the details (to
paraphrase a joke that's as old as I am).
Post by Greg Kroah-Hartman
Post by Serge Hallyn
That's not to say there might not be good reasons why this in particular
is not appropriate, but ISTM if things are going to be nacked without
consideration of the patchset itself, we ought to be having a ksummit
session to come to a consensus [ or receive a decree, presumably by you :)
but after we have a chance to make our case ] on what things are going to
be un/acceptable.
I already stood up and publicly said this last year at Plumbers, why
is anything now different?
Not much really. The reality is that more and more people are trying to
use hotplug devices, network interfaces, and loopback devices in
containers just like they would in full para or hw virt machines. We're
trying to make them work, without it looking like a kludge. I
personally agree with you that much of this can be done in host user
space and, coming out of LinuxPlumbers last year, I've implemented some
ideas that did not require kernel patches that achieve some of my goals.
Post by Greg Kroah-Hartman
And this patchset is proof of why it's not a good idea. You really
didn't do anything with all of the namespace stuff, except change loop.
That's the only thing that cares, so, just do it there, like I said to
do so, last August.
And you are ignoring the notifications to userspace and how namespaces
here would deal with that.
That's a problem to deal with. I don't think anyone is ignoring them.
Post by Greg Kroah-Hartman
Post by Serge Hallyn
Post by Greg Kroah-Hartman
Post by Michael H. Warfield
Serge mentioned something to me about a loopdevfs (?) thing that someone
else is working on. That would seem to be a better solution in this
particular case but I don't know much about it or where it's at.
Ok, let's see those patches then.
I think Seth has a git tree ready, but not sure which branch he'd want
us to look at.
Splitting a namespaced devtmpfs from loopdevfs discussion might be
sensible. However, in defense of a namespaced devtmpfs I'd say
that for userspace to, at every container startup, bind-mount in
devices from the global devtmpfs into a private tmpfs (for systemd's
sake it can't just be on the container rootfs), seems like something
worth avoiding.
I think having to pick and choose what device nodes you want in a
container is a good thing.
Both static and dynamic devices. It's got to support hotplug. We have
(I have) use cases. That's what I'm trying to do with host udev rules
and some custom configurations. I can play games with udev rules.
Maybe we can keep the user space policies in user space and not burden
the kernel.
Post by Greg Kroah-Hartman
Besides, you would have to do the same thing
in the kernel anyway, what's wrong with userspace making the decision
here, especially as it knows exactly what it wants to do much more so
than the kernel ever can.
IMHO, there's nothing wrong with that as long as we agree on how it's to
be done. I'm not convinced that it can all be done in user space and
I'm not convinced that namespaced devtmpfs is the magic pill to make it
all go away either. Making the user space make the decisions and having
the kernel enforce them is a principle worth considering.
Post by Greg Kroah-Hartman
Post by Serge Hallyn
PS - Apparently both Parallels and Michael independently
project devices which are hot-plugged on the host into containers.
That also seems like something worth talking about (best practices,
shortcomings, use cases not met by it, any ways that the kernel can
help out) at ksummit/linuxcon.
I was told that containers would never want devices hotplugged into
them.
Interesting. You were told they (who they?) would never want them? Who
said that? I would have never thought that given that other
implementations can provide that. I would certainly want them. Seems
strange to explicitly relegate LXC containers to being second class
citizens behind OpenVZ, Parallels, BSD Jails, and Solaris Zones.

I might believe you were never told they would need them, but that's a
totally different sense. Are we going to tell RedHat and the Docker
people that LXC is an inferior technology that is complex and unreliable
(to quote another poster) compared to these others? They're saying this
will be enterprise technology. If I go to Amazon AWS or other VPS
services and compare, are we not going to stand on a level playing
field? Admittedly, I don't expect Amazon AWS to provide me with serial
consoles, but I do expect to be able to mount file system images within
my VPS.
Post by Greg Kroah-Hartman
What use case has this happening / needed?
Hello? Dink... Dink... Is this microphone on? I've already detailed
out a use case (serial USB console case) that I'm dealing with now.
Now, I'm dealing with it in host user space and that's probably the
correct answer there. I probably don't need kernel space help in this
particular case. There's still a lot of bolt holes to fill with bolts
though for the more general case. It's not the common case but it is a
valid legitimate use case and one that would be expected of a "virtual
machine" (VirtualBox can handle it - waste of computing cycles that it
is). The loopback device case is even more common and, currently,
rather inconsistent but strangely self-consistent and workable.

In the 80/20 case, I agree we can and should deal with this in the host
user space as much as possible. That's the realm I'm working within.
Seth and others seem to want more in the namespace region and I'm not
convinced. But, I'm not convinced we can accomplish everything in user
space either.

We've got use cases and we've got problem sets. Don't give in to
confirmation bias and automatically discount the use cases that have
been mentioned and then assume there are none. I don't know if Seth's
patches are part of the answer or not. I'm not pro Seth's patches or
against Seth's patches but we've got a need in search of solutions.
Post by Greg Kroah-Hartman
thanks,
greg k-h
Regards,
Mike
--
Michael H. Warfield (AI4NB) | (770) 978-7061 | mhw at WittsEnd.com
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!

Richard Weinberger
2014-05-16 07:56:55 UTC
Post by Michael H. Warfield
Post by Greg Kroah-Hartman
Post by Serge Hallyn
What exactly defines '"normal" use case for a container'?
Well, I'd say "acting like a virtual machine" is a good start :)
Ok... And virtual machines (VirtualBox, VMware, etc, etc) have hot plug
USB devices. I use the USB hotplug with VirtualBox. I plug a
configured USB device in and the VirtualBox VM grabs it. Virtual
machines have loopback devices. I've used them and using them in
containers is significantly more efficient. VirtualBox has remote audio
and a host of other device features.
Now we have some agreement. Normal is "acting like a virtual machine".
That's a goal I can agree with. I want to work toward that goal of
containers "acting like a virtual machine" just running on a common
kernel with the host. It's a challenge. We're getting there.
Post by Greg Kroah-Hartman
Post by Serge Hallyn
Not too long ago much of what we can now do with network namespaces
was not a normal container use case. Neither "you can't do it now"
nor "I don't use it like that" should be grounds for a pre-emptive
nack. "It will horribly break security assumptions" certainly would
be.
I agree, and maybe we will get there over time, but this patch is not
the way to do that.
Ok... We have a goal. Now we can haggle over the details (to
paraphrase a joke that's as old as I am).
Post by Greg Kroah-Hartman
Post by Serge Hallyn
That's not to say there might not be good reasons why this in particular
is not appropriate, but ISTM if things are going to be nacked without
consideration of the patchset itself, we ought to be having a ksummit
session to come to a consensus [ or receive a decree, presumably by you :)
but after we have a chance to make our case ] on what things are going to
be un/acceptable.
I already stood up and publicly said this last year at Plumbers, why
is anything now different?
Not much really. The reality is that more and more people are trying to
use hotplug devices, network interfaces, and loopback devices in
containers just like they would in full para or hw virt machines. We're
trying to make them work, without it looking like a kludge. I
personally agree with you that much of this can be done in host user
space and, coming out of LinuxPlumbers last year, I've implemented some
ideas that did not require kernel patches that achieve some of my goals.
Post by Greg Kroah-Hartman
And this patchset is proof of why it's not a good idea. You really
didn't do anything with all of the namespace stuff, except change loop.
That's the only thing that cares, so, just do it there, like I said to
do so, last August.
And you are ignoring the notifications to userspace and how namespaces
here would deal with that.
That's a problem to deal with. I don't think anyone is ignoring them.
Post by Greg Kroah-Hartman
Post by Serge Hallyn
Post by Greg Kroah-Hartman
Post by Michael H. Warfield
Serge mentioned something to me about a loopdevfs (?) thing that someone
else is working on. That would seem to be a better solution in this
particular case but I don't know much about it or where it's at.
Ok, let's see those patches then.
I think Seth has a git tree ready, but not sure which branch he'd want
us to look at.
Splitting a namespaced devtmpfs from loopdevfs discussion might be
sensible. However, in defense of a namespaced devtmpfs I'd say
that for userspace to, at every container startup, bind-mount in
devices from the global devtmpfs into a private tmpfs (for systemd's
sake it can't just be on the container rootfs), seems like something
worth avoiding.
I think having to pick and choose what device nodes you want in a
container is a good thing.
Both static and dynamic devices. It's got to support hotplug. We have
(I have) use cases. That's what I'm trying to do with host udev rules
and some custom configurations. I can play games with udev rules.
Maybe we can keep the user space policies in user space and not burden
the kernel.
Post by Greg Kroah-Hartman
Besides, you would have to do the same thing
in the kernel anyway, what's wrong with userspace making the decision
here, especially as it knows exactly what it wants to do much more so
than the kernel ever can.
IMHO, there's nothing wrong with that as long as we agree on how it's to
be done. I'm not convinced that it can all be done in user space and
I'm not convinced that namespaced devtmpfs is the magic pill to make it
all go away either. Making the user space make the decisions and having
the kernel enforce them is a principle worth considering.
Post by Greg Kroah-Hartman
Post by Serge Hallyn
PS - Apparently both parallels and Michael independently
project devices which are hot-plugged on the host into containers.
That also seems like something worth talking about (best practices,
shortcomings, use cases not met by it, any ways that the kernel can
help out) at ksummit/linuxcon.
I was told that containers would never want devices hotplugged into
them.
Interesting. You were told they (who they?) would never want them? Who
said that? I would have never thought that given that other
implementations can provide that. I would certainly want them. Seems
strange to explicitly relegate LXC containers to being second class
citizens behind OpenVZ, Parallels, BSD Gaols, and Solaris Zones.
How do these solutions deal with dynamic devices?
Post by Michael H. Warfield
I might believe you were never told they would need them, but that's a
totally different sense. Are we going to tell RedHat and the Docker
people that LXC is an inferior technology that is complex and unreliable
(to quote another poster) compared to these others? They're saying this
will be enterprise technology. If I go to Amazon AWS or other VPS
services and compare, are we not going to stand on a level playing
field? Admittedly, I don't expect Amazon AWS to provide me with serial
consoles, but I do expect to be able to mount file system images within
my VPS.
I didn't say that containers are unreliable. They work.
Red Hat is well aware of the problems (okay, say complexities) of
containers.
Docker is a completely different story. :-)
Post by Michael H. Warfield
Post by Greg Kroah-Hartman
What use case has this happening / needed?
Hello? Dink... Dink... Is this microphone on? I've already detailed
out a use case (serial USB console case) that I'm dealing with now.
Now, I'm dealing with it in host user space and that's probably the
correct answer there. I probably don't need kernel space help in this
particular case. There's still a lot of bolt holes to fill with bolts
though for the more general case. It's not the common case but it is a
valid legitimate use case and one that would be expected of a "virtual
machine" (VirtualBox can handle it - waste of computing cycles that it
is). The loopback device case is even more common and, currently,
rather inconsistent but strangely self-consistent and workable.
In the 80/20 case, I agree we can and should deal with this in the host
user space as much as possible. That's the realm I'm working within.
Seth and others seem to want more in the namespace region and I'm not
convinced. But, I'm not convinced we can accomplish everything in user
space either.
We've got use cases and we've got problem sets. Don't give in to
confirmation bias and automatically discount the use cases that have
been mentioned and then assume there are none. I don't know if Seth's
patches are part of the answer or not. I'm not pro Seth's patches or
against Seth's patches but we've got a need in search of solutions.
Post by Greg Kroah-Hartman
thanks,
greg k-h
Regards,
Mike
--
Michael H. Warfield (AI4NB) | (770) 978-7061 | mhw at WittsEnd.com
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!
_______________________________________________
lxc-devel mailing list
lxc-devel at lists.linuxcontainers.org
http://lists.linuxcontainers.org/listinfo/lxc-devel
--
Thanks,
//richard
Michael H. Warfield
2014-05-16 19:42:40 UTC
Permalink
Post by Michael H. Warfield
Post by Greg Kroah-Hartman
Post by Serge Hallyn
PS - Apparently both parallels and Michael independently
project devices which are hot-plugged on the host into containers.
That also seems like something worth talking about (best practices,
shortcomings, use cases not met by it, any ways that the kernel can
help out) at ksummit/linuxcon.
I was told that containers would never want devices hotplugged into
them.
Interesting. You were told they (who they?) would never want them? Who
said that? I would have never thought that given that other
implementations can provide that. I would certainly want them. Seems
strange to explicitly relegate LXC containers to being second class
citizens behind OpenVZ, Parallels, BSD Gaols, and Solaris Zones.
That would probably be me. Running hotplug inside a container is a
security problem and, since containers are easily entered by the host,
it's very easy to listen for the hotplug in the host and inject it into
the container using nsenter.
In all virtualization... The host, particularly root on the host,
exists as deus ex machina, the "god outside the machine". They are at
my mercy. Even hardware virtualization can not protect you from the
host. You wanna hear some frightening talks on virtualization, catch
Joanna (miss little blue pill) Rutkowska some time. I'm particularly
interested in her takes on the "anti evil-maid attacks" and I sat in on
her talks on the "north bridge" and "south bridge" malware evasion
techniques. She's a good speaker who makes powerful points that make
you sweat but is pleasant in face to face conversation. I've played
with her Qubes distribution a couple of times and the way it works with
the TPM to ensure a secure boot is interesting. But that's a completely
different topic on trusted computing.

OTOH, there are plenty of other things to worry about in all forms of
virtualization. At Internet Security Systems, where I was a founder,
fellow, and "X-Force Senior Wizard", we were looking at the ability to
leak information through the USB subsystem. No isolation is perfect,
especially when you have USB enabled.

But that's my turf.
I don't think the intention is to label anyone's implementation as
preferred. What this shows, I think, is that we all have different
practises when it comes to setting up containers. Some are necessary
because our containers are different. Some could do with serious
examination to see if there's really a best way to do the action which
we would then all use.
And I hope to contribute to the discussion of said actions.
Post by Michael H. Warfield
I might believe you were never told they would need them, but that's a
totally different sense. Are we going to tell RedHat and the Docker
people that LXC is an inferior technology that is complex and unreliable
(to quote another poster) compared to these others? They're saying this
will be enterprise technology. If I go to Amazon AWS or other VPS
services and compare, are we not going to stand on a level playing
field? Admittedly, I don't expect Amazon AWS to provide me with serial
consoles, but I do expect to be able to mount file system images within
my VPS.
Well, that's another nasty, isn't it. We all have different ways of
coping with mount in the container. I think at plumbers we need to sit
down with some of this plumbing and work out which pipes carry the same
fluids and whether we could unify them.
Concur
As an aside (probably requiring a new thread) we were wondering about
some type of notifier on the mount call that we could vector into the
host to perform the action. The main issue for us is mount of procfs,
which really needs to be a bind mount in a container. All of this led
me to speculate that we could use some type of syscall notifier
mechanism to manage capabilities in the host and even intercept and
complete the syscall action within the host rather than having to keep
evolving more and more complex kernel drivers to do this.
Interesting. That could be very useful. That might even help with the
loop device case where the mounts have to go through loop devices for
things like file system images and builds. Very interesting...
James
Regards,
Mike
--
Michael H. Warfield (AI4NB) | (770) 978-7061 | mhw at WittsEnd.com
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!

Michael H. Warfield
2014-05-16 20:04:22 UTC
Permalink
Post by Michael H. Warfield
As an aside (probably requiring a new thread) we were wondering about
some type of notifier on the mount call that we could vector into the
host to perform the action. The main issue for us is mount of procfs,
which really needs to be a bind mount in a container. All of this led
me to speculate that we could use some type of syscall notifier
mechanism to manage capabilities in the host and even intercept and
complete the syscall action within the host rather than having to keep
evolving more and more complex kernel drivers to do this.
Interesting. That could be very useful. That might even help with the
loop device case where the mounts have to go through loop devices for
things like file system images and builds. Very interesting...
Right, it might even make the loop case go away because now we can
present a dummy device in the container and when the host sees an
attempted mount on this, it just projects a bind mount into the
container and says I've *wink* mounted your "device" for you.
Nice. That idea has prospects. I like the concept.
This idea is extremely rough, it came from a conversation I had with
Pavel (cc'd) just before OpenStack about how we might go about
eliminating our OpenVZ interception of the mount system call which
currently does all of this in kernel, so we have no code and no proof
that it's actually feasible (yet).
K. I look forward to hearing more.

I switched from OpenVZ years ago to LXC because OpenVZ was falling too
far behind in kernel support and patches for the leading edge kernels.
At the time, I was working on the MD5 signature code for the Quagga
routing suite for BGP and couldn't maintain my hosts with OpenVZ and
maintain my BGP connections (I have a public ASN and peer on both IPv4
and IPv6) with MD5 signatures at the same time. At the time LXC had
just matured enough to serve my needs. It's interesting to note that
OpenVZ did this by intercepting the mount call. Very interesting...
James
Regards,
Mike
--
Michael H. Warfield (AI4NB) | (770) 978-7061 | mhw at WittsEnd.com
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!

James Bottomley
2014-05-16 19:52:21 UTC
Permalink
Post by Michael H. Warfield
As an aside (probably requiring a new thread) we were wondering about
some type of notifier on the mount call that we could vector into the
host to perform the action. The main issue for us is mount of procfs,
which really needs to be a bind mount in a container. All of this led
me to speculate that we could use some type of syscall notifier
mechanism to manage capabilities in the host and even intercept and
complete the syscall action within the host rather than having to keep
evolving more and more complex kernel drivers to do this.
Interesting. That could be very useful. That might even help with the
loop device case where the mounts have to go through loop devices for
things like file system images and builds. Very interesting...
Right, it might even make the loop case go away because now we can
present a dummy device in the container and when the host sees an
attempted mount on this, it just projects a bind mount into the
container and says I've *wink* mounted your "device" for you.

This idea is extremely rough, it came from a conversation I had with
Pavel (cc'd) just before OpenStack about how we might go about
eliminating our OpenVZ interception of the mount system call which
currently does all of this in kernel, so we have no code and no proof
that it's actually feasible (yet).

James
James Bottomley
2014-05-16 19:20:08 UTC
Permalink
Post by Michael H. Warfield
Post by Greg Kroah-Hartman
Post by Serge Hallyn
PS - Apparently both parallels and Michael independently
project devices which are hot-plugged on the host into containers.
That also seems like something worth talking about (best practices,
shortcomings, use cases not met by it, any ways that the kernel can
help out) at ksummit/linuxcon.
I was told that containers would never want devices hotplugged into
them.
Interesting. You were told they (who they?) would never want them? Who
said that? I would have never thought that given that other
implementations can provide that. I would certainly want them. Seems
strange to explicitly relegate LXC containers to being second class
citizens behind OpenVZ, Parallels, BSD Gaols, and Solaris Zones.
That would probably be me. Running hotplug inside a container is a
security problem and, since containers are easily entered by the host,
it's very easy to listen for the hotplug in the host and inject it into
the container using nsenter.
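
A rough sketch of that host-side injection, under stated assumptions: the
host (as root) already knows the container's init PID and has seen the new
device node appear, for example from a udev rule. The function names
devnum_from_hex and inject_char_node are hypothetical, and only the
character-device case is shown.

```shell
#!/bin/sh
# Sketch only: recreate a host character device node inside a container.
# The container PID (its init process as seen from the host) and the
# device path are placeholders supplied by the caller.

# GNU stat reports major/minor in hex (%t and %T); mknod wants decimal.
devnum_from_hex() {
    echo "$(( 0x$1 )) $(( 0x$2 ))"
}

# Create the same character device node in the container's mount
# namespace. Requires root on the host.
inject_char_node() {
    container_pid=$1
    dev=$2
    # Word splitting is intended here: two fields, major and minor.
    set -- $(stat -c '%t %T' "$dev")
    set -- $(devnum_from_hex "$1" "$2")
    nsenter --target "$container_pid" --mount -- mknod "$dev" c "$1" "$2"
}

# Example (as root): inject_char_node 12345 /dev/ttyUSB0
```

Nothing here notifies userspace inside the container that the node
appeared; that is the uevent question raised elsewhere in this thread.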

I don't think the intention is to label anyone's implementation as
preferred. What this shows, I think, is that we all have different
practises when it comes to setting up containers. Some are necessary
because our containers are different. Some could do with serious
examination to see if there's really a best way to do the action which
we would then all use.
Post by Michael H. Warfield
I might believe you were never told they would need them, but that's a
totally different sense. Are we going to tell RedHat and the Docker
people that LXC is an inferior technology that is complex and unreliable
(to quote another poster) compared to these others? They're saying this
will be enterprise technology. If I go to Amazon AWS or other VPS
services and compare, are we not going to stand on a level playing
field? Admittedly, I don't expect Amazon AWS to provide me with serial
consoles, but I do expect to be able to mount file system images within
my VPS.
Well, that's another nasty, isn't it. We all have different ways of
coping with mount in the container. I think at plumbers we need to sit
down with some of this plumbing and work out which pipes carry the same
fluids and whether we could unify them.

As an aside (probably requiring a new thread) we were wondering about
some type of notifier on the mount call that we could vector into the
host to perform the action. The main issue for us is mount of procfs,
which really needs to be a bind mount in a container. All of this led
me to speculate that we could use some type of syscall notifier
mechanism to manage capabilities in the host and even intercept and
complete the syscall action within the host rather than having to keep
evolving more and more complex kernel drivers to do this.

James
Serge Hallyn
2014-05-16 01:49:59 UTC
Permalink
Post by Greg Kroah-Hartman
Post by Serge Hallyn
What exactly defines '"normal" use case for a container'?
Well, I'd say "acting like a virtual machine" is a good start :)
Post by Serge Hallyn
Not too long ago much of what we can now do with network namespaces
was not a normal container use case. Neither "you can't do it now"
nor "I don't use it like that" should be grounds for a pre-emptive
nack. "It will horribly break security assumptions" certainly would
be.
I agree, and maybe we will get there over time, but this patch is not
the way to do that.
Ok. [ I/we may be asking for more details later, but think there is enough
below :), particularly the point about event forwarding ] Thanks.
Post by Greg Kroah-Hartman
Post by Serge Hallyn
That's not to say there might not be good reasons why this in particular
is not appropriate, but ISTM if things are going to be nacked without
consideration of the patchset itself, we ought to be having a ksummit
session to come to a consensus [ or receive a decree, presumably by you :)
but after we have a chance to make our case ] on what things are going to
be un/acceptable.
I already stood up and publicly said this last year at Plumbers, why
is anything now different?
Well I've simply never had a chance to talk to you since then to find out
exactly what it is that is unacceptable, and why. And, of course, code
makes it easier to discuss these things.
Post by Greg Kroah-Hartman
And this patchset is proof of why it's not a good idea. You really
didn't do anything with all of the namespace stuff, except change loop.
That's the only thing that cares, so, just do it there, like I said to
do so, last August.
Sorry, just do it where?
Post by Greg Kroah-Hartman
And you are ignoring the notifications to userspace and how namespaces
here would deal with that.
Good point. Addressing that is at the same time necessary, interesting,
and complicated.
Post by Greg Kroah-Hartman
Post by Serge Hallyn
Post by Greg Kroah-Hartman
Post by Michael H. Warfield
Serge mentioned something to me about a loopdevfs (?) thing that someone
else is working on. That would seem to be a better solution in this
particular case but I don't know much about it or where it's at.
Ok, let's see those patches then.
I think Seth has a git tree ready, but not sure which branch he'd want
us to look at.
Splitting a namespaced devtmpfs from loopdevfs discussion might be
sensible. However, in defense of a namespaced devtmpfs I'd say
that for userspace to, at every container startup, bind-mount in
devices from the global devtmpfs into a private tmpfs (for systemd's
sake it can't just be on the container rootfs), seems like something
worth avoiding.
I think having to pick and choose what device nodes you want in a
container is a good thing. Besides, you would have to do the same thing
in the kernel anyway, what's wrong with userspace making the decision
here, especially as it knows exactly what it wants to do much more so
than the kernel ever can.
For 'real' devices that sounds sensible. The thing about loop devices
is that we simply want to allow a container to say "give me a loop
device to use" and have it receive a unique loop device (or 3), without
having to pre-assign them. I think that would be cleaner to do using
a pseudofs and loop-control device, rather than having to have a
daemon in userspace on the host farming those out in response to
some, I don't know, dbus request?
Post by Greg Kroah-Hartman
Post by Serge Hallyn
PS - Apparently both parallels and Michael independently
project devices which are hot-plugged on the host into containers.
That also seems like something worth talking about (best practices,
shortcomings, use cases not met by it, any ways that the kernel can
help out) at ksummit/linuxcon.
I was told that containers would never want devices hotplugged into
them. What use case has this happening / needed?
I'm pretty sure I didn't say that <looks around nervously>. But I guess
we are combining two topics here, the loop pseudofs and the namespaced
devtmpfs.

The use case of loop-control device and loop pseudofs is to have
multiple chrooted/namespaced programs be able to grab a loop device
on demand which they can use for the obvious things (building a livecd,
extracting file contents, etc) without stepping on each other's toes. The
namespaced devtmpfs is not required for this.
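
For comparison, this is roughly what "grab a loop device on demand" looks
like on a host today, as a minimal sketch. It assumes root privileges and a
util-linux losetup; the function names grab_loop and release_loop and the
image name are hypothetical. It does nothing to keep two containers from
seeing each other's devices, which is the part the proposed pseudofs is
meant to solve.

```shell
#!/bin/sh
# Sketch only: attach an image to whichever loop device is free and print
# the device name. losetup asks the kernel for a free device (through
# /dev/loop-control on kernels that provide it), so the caller does not
# need a pre-assigned device. Requires root.
grab_loop() {
    image=$1
    losetup --find --show "$image"
}

# Detach the device again when done.
release_loop() {
    losetup --detach "$1"
}

# Example (as root):
#   dev=$(grab_loop livecd.img) && echo "using $dev" && release_loop "$dev"
```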

One advantage of a namespaced devtmpfs would be sane-looking devices
in unprivileged containers. Currently we have to bind-mount the host's
/dev/{full,zero,etc} which, due to uid and gid mappings, then shows up
as:

crw-rw-rw- 1 nobody nogroup 1, 7 May 12 13:35 full
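
Why the owner shows up as nobody can be sketched with the uid_map format
(first id inside the namespace, first id on the host, range length). The
mapping below is a hypothetical example, not taken from any real
container, and host_to_ns_uid is an illustrative helper, not an existing
tool. 65534 is the kernel's default overflow uid.

```shell
#!/bin/sh
# Sketch only: translate a host uid into the uid seen inside a user
# namespace, given uid_map-style lines on stdin:
#   <first id inside ns> <first id on host> <range length>
# Unmapped ids are shown as the overflow uid, 65534 ("nobody") by default.
host_to_ns_uid() {
    host_uid=$1
    overflow_uid=${2:-65534}
    while read -r ns_start host_start length; do
        if [ "$host_uid" -ge "$host_start" ] &&
           [ "$host_uid" -lt $(( host_start + length )) ]; then
            echo $(( ns_start + host_uid - host_start ))
            return 0
        fi
    done
    echo "$overflow_uid"
}

# Hypothetical mapping: container ids 0-65535 are host ids 100000-165535.
# Host root (uid 0) owns the bind-mounted /dev/full, and uid 0 is outside
# the mapped range, so the container sees the owner as the overflow uid.
# printf '0 100000 65536\n' | host_to_ns_uid 0
```

The same reasoning applies to the group via gid_map.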

Also you mentioned uevent forwarding above. Michael has talked several
times about having userspace on the host 'pass' devices into the
container. One thing which I believe he and Eric have discussed
before was how to have userspace in the container be notified when
a device is passed in. It seems to me that at least this is something
that would be simpler done from devtmpfs. I could be wrong on this -
Michael do you have any updates or corrections?

Still I think we may be all agreed that we could wait a bit longer and
see how far we can get with userspace guidance (which we had
originally decided a year ago, and again a year or two before that
before user namespaces were complete).

thanks,
-serge
Greg Kroah-Hartman
2014-05-16 04:35:32 UTC
Permalink
Post by Serge Hallyn
Post by Greg Kroah-Hartman
I think having to pick and choose what device nodes you want in a
container is a good thing. Besides, you would have to do the same thing
in the kernel anyway, what's wrong with userspace making the decision
here, especially as it knows exactly what it wants to do much more so
than the kernel ever can.
For 'real' devices that sounds sensible. The thing about loop devices
is that we simply want to allow a container to say "give me a loop
device to use" and have it receive a unique loop device (or 3), without
having to pre-assign them. I think that would be cleaner to do using
a pseudofs and loop-control device, rather than having to have a
daemon in userspace on the host farming those out in response to
some, I don't know, dbus request?
I agree that loop devices would be nice to have in a container, and that
the existing loop interface doesn't really lend itself to that. So
create a new type of thing that acts like a loop device in a container.
But don't try to mess with the whole driver core just for a single type
of device.

greg k-h
Seth Forshee
2014-05-16 14:06:07 UTC
Permalink
Post by Greg Kroah-Hartman
Post by Serge Hallyn
Post by Greg Kroah-Hartman
I think having to pick and choose what device nodes you want in a
container is a good thing. Besides, you would have to do the same thing
in the kernel anyway, what's wrong with userspace making the decision
here, especially as it knows exactly what it wants to do much more so
than the kernel ever can.
For 'real' devices that sounds sensible. The thing about loop devices
is that we simply want to allow a container to say "give me a loop
device to use" and have it receive a unique loop device (or 3), without
having to pre-assign them. I think that would be cleaner to do using
a pseudofs and loop-control device, rather than having to have a
daemon in userspace on the host farming those out in response to
some, I don't know, dbus request?
I agree that loop devices would be nice to have in a container, and that
the existing loop interface doesn't really lend itself to that. So
create a new type of thing that acts like a loop device in a container.
But don't try to mess with the whole driver core just for a single type
of device.
No matter what I don't think we get out of this without driver core
changes, whether this was done in loop or by creating something new.
Not unless the whole thing is punted to userspace, anyway.

The first problem is that many block device ioctls check for
CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
not really sure. But loop does at minimum support partitions, and to get
that functionality in an unprivileged container at least the block layer
needs to know the namespace which has privileges for that device.

The second is that all block devices automatically appear in devtmpfs.
The scenario I'm concerned about is that the host could unknowingly use
a loop device exposed to a container, then the container could see data
from the host. So we either need a flag to tell the driver core not to
create a node in devtmpfs, or we need a privileged manager in userspace
to remove them (which kind of defeats the purpose). And it gets more
complicated when partition block devs are mixed in, because they can be
created without involvement from the driver - they would need to inherit
the "no devtmpfs node" property from their parent, and if the driver
uses a pseudo fs to create device nodes for userspace then it needs to
be informed about the partitions too so it can create those nodes.

So maybe we could get by without the privileged ioctls, as long as it
was understood that unprivileged containers can't do partitioning. But I
do think the devtmpfs problem would need to be addressed.

Thanks,
Seth
Michael H. Warfield
2014-05-16 15:28:28 UTC
Permalink
Post by Seth Forshee
Post by Greg Kroah-Hartman
Post by Serge Hallyn
Post by Greg Kroah-Hartman
I think having to pick and choose what device nodes you want in a
container is a good thing. Besides, you would have to do the same thing
in the kernel anyway, what's wrong with userspace making the decision
here, especially as it knows exactly what it wants to do much more so
than the kernel ever can.
For 'real' devices that sounds sensible. The thing about loop devices
is that we simply want to allow a container to say "give me a loop
device to use" and have it receive a unique loop device (or 3), without
having to pre-assign them. I think that would be cleaner to do using
a pseudofs and loop-control device, rather than having to have a
daemon in userspace on the host farming those out in response to
some, I don't know, dbus request?
I agree that loop devices would be nice to have in a container, and that
the existing loop interface doesn't really lend itself to that. So
create a new type of thing that acts like a loop device in a container.
But don't try to mess with the whole driver core just for a single type
of device.
No matter what I don't think we get out of this without driver core
changes, whether this was done in loop or by creating something new.
Not unless the whole thing is punted to userspace, anyway.
The first problem is that many block device ioctls check for
CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
not really sure. But loop does at minimum support partitions, and to get
that functionality in an unprivileged container at least the block layer
needs to know the namespace which has privileges for that device.
Whoa! Time out... Sorry, this will be an off topic aside.

Loop devices support partitions? I'd love to know how that works. I've
tried several times in the past to do that but it's failed every time.
I haven't been able to find any how-to in the past. This article was
just a couple of years ago (after the last time I tried this):

http://madduck.net/blog/2006.10.20:loop-mounting-partitions-from-a-disk-image/

This guy didn't use partitions directly but used the offset to the
mount, which is what I had to use. Everything I found always referred
to using mount offsets in order to mount partitions within a loop
device.
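
The offset approach can be sketched as follows. This is a minimal
illustration, assuming 512-byte sectors and a partition start sector read
from the image's partition table (for example with fdisk -l on the image).
The names partition_offset, mount_partition, disk.img and /mnt/part are
hypothetical, and the mount itself needs root.

```shell
#!/bin/sh
# Sketch only: mount one partition of a disk image via a byte offset.
# Assumes 512-byte sectors unless a second argument says otherwise.

# Convert a partition's start sector into the byte offset mount expects.
partition_offset() {
    start_sector=$1
    sector_size=${2:-512}
    echo $(( start_sector * sector_size ))
}

# Loop-mount the partition that starts at the given sector (requires root).
# Image, start sector and mount point are caller-supplied placeholders.
mount_partition() {
    image=$1
    start_sector=$2
    mountpoint=$3
    mount -o "loop,offset=$(partition_offset "$start_sector")" "$image" "$mountpoint"
}

# Example (as root): mount_partition disk.img 2048 /mnt/part
```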

Regards,
Mike
Post by Seth Forshee
The second is that all block devices automatically appear in devtmpfs.
The scenario I'm concerned about is that the host could unknowingly use
a loop device exposed to a container, then the container could see data
from the host. So we either need a flag to tell the driver core not to
create a node in devtmpfs, or we need a privileged manager in userspace
to remove them (which kind of defeats the purpose). And it gets more
complicated when partition block devs are mixed in, because they can be
created without involvement from the driver - they would need to inherit
the "no devtmpfs node" property from their parent, and if the driver
uses a pseudo fs to create device nodes for userspace then it needs to
be informed about the partitions too so it can create those nodes.
So maybe we could get by without the privileged ioctls, as long as it
was understood that unprivileged containers can't do partitioning. But I
do think the devtmpfs problem would need to be addressed.
Thanks,
Seth
--
Michael H. Warfield (AI4NB) | (770) 978-7061 | mhw at WittsEnd.com
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!

Christian Seiler
2014-05-16 20:17:49 UTC
Permalink
Hi,

(removed every CC but yourself and lxc-devel, don't need to spam LKML
for this)
Post by Michael H. Warfield
Whoa! Time out... Sorry, this will be an off topic aside.
Loop devices support partitions? I'd love to know how that works.
Use util-linux >= 2.21 with kernel >= 3.1:

losetup -P -f filename
Creates: /dev/loopXpY
Remove with: losetup -d /dev/loopX
(Also removes partition devices automatically.)

Also, using device mapper is an option (supported by all Linux
distributions that aren't completely ancient):

losetup -f filename
kpartx -a /dev/loopX
Creates: /dev/mapper/loopXpY
Remove with: kpartx -d /dev/loopX && losetup -d /dev/loopX
(Note that not doing kpartx -d /dev/loopX is problematic.)

Regards,
Christian
Michael H. Warfield
2014-05-16 20:28:34 UTC
Permalink
Post by Christian Seiler
Hi,
(removed every CC but yourself and lxc-devel, don't need to spam LKML
for this)
Post by Michael H. Warfield
Whoa! Time out... Sorry, this will be an off topic aside.
Loop devices support partitions? I'd love to know how that works.
losetup -P -f filename
Creates: /dev/loopXpY
Remove with: losetup -d /dev/loopX
(Also removes partition devices automatically.)
Also, using device mapper is an option (supported by all Linux
losetup -f filename
kpartx -a /dev/loopXpY
Creates: /dev/mapper/loopXpY
Remove with: kpartx -d /dev/loopX && losetup -d /dev/loopX
(Note that not doing kpartx -d /dev/loopX is problematic.)
NICE! New toys to play with...

Many Thanks!
Regards,
Mike

Seth Forshee
2014-05-16 15:43:38 UTC
Permalink
Post by Michael H. Warfield
Post by Seth Forshee
Post by Greg Kroah-Hartman
Post by Serge Hallyn
Post by Greg Kroah-Hartman
I think having to pick and choose what device nodes you want in a
container is a good thing. Besides, you would have to do the same thing
in the kernel anyway, what's wrong with userspace making the decision
here, especially as it knows exactly what it wants to do much more so
than the kernel ever can.
For 'real' devices that sounds sensible. The thing about loop devices
is that we simply want to allow a container to say "give me a loop
device to use" and have it receive a unique loop device (or 3), without
having to pre-assign them. I think that would be cleaner to do using
a pseudofs and loop-control device, rather than having to have a
daemon in userspace on the host farming those out in response to
some, I don't know, dbus request?
I agree that loop devices would be nice to have in a container, and that
the existing loop interface doesn't really lend itself to that. So
create a new type of thing that acts like a loop device in a container.
But don't try to mess with the whole driver core just for a single type
of device.
No matter what I don't think we get out of this without driver core
changes, whether this was done in loop or by creating something new.
Not unless the whole thing is punted to userspace, anyway.
The first problem is that many block device ioctls check for
CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
not really sure. But loop does at minimum support partitions, and to get
that functionality in an unprivileged container at least the block layer
needs to know the namespace which has privileges for that device.
Woa! Time out... Sorry, this will be an off topic aside.
Loop devices support partitions? I'd love to know how that works. I've
tried several times in the past to do that but it's failed every time.
I haven't been able to find any how-to in the past. This article was
http://madduck.net/blog/2006.10.20:loop-mounting-partitions-from-a-disk-image/
This guy didn't use partitions directly but used the offset to the
mount, which is what I had to use. Everything I found always referred
to using mount offsets in order to mount partitions within a loop
device.
It's controlled by the loop.max_part module parameter. It defaults to 0,
which means no partition support. For any value > 0 max_part will be the
maximum available partition number, after rounding it up to the nearest
power of 2 minus 1 (so max_part=5 gives you up to 7 partitions,
max_part=8 gives you up to 15, etc).
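
A minimal sketch of that rounding, taking "power of 2 minus 1" literally:
count the bits needed to hold the requested value, then take the largest
number that fits in that many bits. With this reading a requested value of
5 becomes 7 and 8 becomes 15. The function name effective_max_part is
hypothetical, and this mirrors the description above rather than quoting
the kernel source.

```shell
#!/bin/sh
# Sketch only: effective_max_part is a hypothetical helper that models the
# rounding described above; it is not taken from the loop driver source.
effective_max_part() {
    requested=$1
    bits=0
    # Count how many bits are needed to represent the requested value.
    while [ "$requested" -gt 0 ]; do
        requested=$((requested >> 1))
        bits=$((bits + 1))
    done
    # Largest value that fits in that many bits: a power of 2, minus 1.
    echo $(( (1 << bits) - 1 ))
}
```

The parameter itself can be set when loading the module, for example with
"modprobe loop max_part=8", or as loop.max_part=8 on the kernel command
line when the driver is built in.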
Greg Kroah-Hartman
2014-05-16 18:57:49 UTC
Permalink
Post by Seth Forshee
No matter what I don't think we get out of this without driver core
changes, whether this was done in loop or by creating something new.
Not unless the whole thing is punted to userspace, anyway.
The first problem is that many block device ioctls check for
CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
not really sure. But loop does at minimum support partitions, and to get
that functionality in an unprivileged container at least the block layer
needs to know the namespace which has privileges for that device.
That's fine, you should have those permissions in a container if you
want to do something like that on a loop device, right?
Post by Seth Forshee
The second is that all block devices automatically appear in devtmpfs.
The scenario I'm concerned about is that the host could unknowingly use
a loop device exposed to a container, then the container could see data
from the host.
I don't think that's a real issue, the host should know not to do that.
Post by Seth Forshee
So we either need a flag to tell the driver core not to create a node
in devtmpfs, or we need a privileged manager in userspace to remove
them (which kind of defeats the purpose). And it gets more complicated
when partition block devs are mixed in, because they can be created
without involvement from the driver - they would need to inherit the
"no devtmpfs node" property from their parent, and if the driver uses
a pseudo fs to create device nodes for userspace then it needs to be
informed about the partitions too so it can create those nodes.
I don't think that will be needed. Root in a host can do whatever it
wants in the containers, so mixing up block devices is the least of the
issues involved :)
Post by Seth Forshee
So maybe we could get by without the privileged ioctls, as long as it
was understood that unprivileged containers can't do partitioning. But I
do think the devtmpfs problem would need to be addressed.
I don't think unprivileged containers should be able to do partitioning.
An unprivileged user can't do that, so why should a container be any
different?

thanks,

greg k-h
James Bottomley
2014-05-16 19:28:35 UTC
Permalink
Post by Greg Kroah-Hartman
That's fine, you should have those permissions in a container if you
want to do something like that on a loop device, right?
Really, no. CAP_SYS_ADMIN is effectively a pseudo root security hole.
Any user possessing CAP_SYS_ADMIN can do about as much damage as real
root can, whether or not you use user namespaces, so it would compromise
a lot of the security we're just bringing to containers.
Post by Greg Kroah-Hartman
I don't think unprivileged containers should be able to do partitioning.
An unprivileged user can't do that, so why should a container be any
different?
To make sure we're on the same page with terminology, there's an
unprivileged container and a secure container. In the former, there's
no root user (all the processes run as non-root), so the container isn't
expected to perform any actions root would ... that's easy. In a secure
container, root is mapped to a nobody user in the host, so is
effectively unprivileged, but root in the container expects to look like
a real root within the VPS (and thus may expect to partition things,
depending on how they've been given access to the block device). The
big problem is giving back capabilities to the container root such that
a) it loses them if it escapes the container and b) it doesn't get
sufficient capabilities to damage the system.

James
Seth Forshee
2014-05-16 20:18:41 UTC
Permalink
Post by James Bottomley
To make sure we're on the same page with terminology, there's an
unprivileged container and a secure container. In the former, there's
no root user (all the processes run as non-root), so the container isn't
expected to perform any actions root would ... that's easy. In a secure
container, root is mapped to a nobody user in the host, so is
effectively unprivileged, but root in the container expects to look like
a real root within the VPS (and thus may expect to partition things,
depending on how they've been given access to the block device). The
big problem is giving back capabilities to the container root such that
a) it loses them if it escapes the container and b) it doesn't get
sufficient capabilities to damage the system.
Based on your description what I was talking about is a secure
container. Thanks for clearing that up, and sorry for misusing the
terminology.

What I set out for was feature parity between loop devices in a secure
container and loop devices on the host. Since some operations currently
check for system-wide CAP_SYS_ADMIN, the only way I see to accomplish
this is to push knowledge of the user namespace farther down into the
driver stack so the check can instead be for CAP_SYS_ADMIN in the user
namespace associated with the device.

That said, I suspect our current use cases can get by without these
capabilities. Really though I suspect this is just deferring the
discussion rather than settling it, and what we'll end up with is little
more than a fancy way for userspace to ask the kernel to run mknod on
its behalf.

Thanks,
Seth
Eric W. Biederman
2014-05-20 00:04:55 UTC
Permalink
Post by Seth Forshee
What I set out for was feature parity between loop devices in a secure
container and loop devices on the host. Since some operations currently
check for system-wide CAP_SYS_ADMIN, the only way I see to accomplish
this is to push knowledge of the user namespace farther down into the
driver stack so the check can instead be for CAP_SYS_ADMIN in the user
namespace associated with the device.
That said, I suspect our current use cases can get by without these
capabilities. Really though I suspect this is just deferring the
discussion rather than settling it, and what we'll end up with is little
more than a fancy way for userspace to ask the kernel to run mknod on
its behalf.
A fancy way to ask the kernel to run mknod on its behalf is what
/dev/pts is.

When I suggested this I did not mean you should forgo making changes to
allow partitions and the like. What I intended is that you should find a
way to make this safe for users who don't have root capabilities.

Which possibly means that mount needs to learn how to keep a more
privileged user from using your new loop devices.

Getting to the point where this is really and truly usable will, I
expect, be technically daunting.

Ultimately the technical challenge is how do we create a block device
that is safe for a user who does not have any capabilities to use, and
what can we do with that block device to make it useful.

Only when the question is "can this kernel functionality, which is
otherwise safe, confuse a preexisting setuid application?" do namespace
or container bits significantly come into play.

Eric
Michael H. Warfield
2014-05-20 01:14:41 UTC
Permalink
Post by Eric W. Biederman
A fancy way to ask the kernel to run mknod on its behalf is what
/dev/pts is.
When I suggested this I did not mean you should forgo making changes to
allow partitions and the like. What I intended is that you should find a
way to make this safe for users who don't have root capabilities.
I like to think in terms of the "rootless" configurations where "root"
per se is not absolute and everything is framed in terms of
capabilities.
Post by Eric W. Biederman
Which possibly means that mount needs to learn how to keep a more
privileged user from using your new loop devices.
Not sure I got that one. A user with "more" privileges may or may not
have access dependent on the congruence of the privileges. They're not
hierarchical. If someone has that "priv" then they have access. If they
do not, they do not.
Post by Eric W. Biederman
To get to the point where this is really and truly usable I expect to be
technically daunting.
Most technically non-trivial problems generally are.
Post by Eric W. Biederman
Ultimately the technical challenge is how do we create a block device
that is safe for a user who does not have any capabilities to use, and
what can we do with that block device to make it useful.
Concur. It boils down to privilege management and access. Absolutely
concur.
Post by Eric W. Biederman
Only when the question is can this kernel functionality which is
otherwise safe confuse a preexisting setuid application do namespace
or container bits significantly come into play.
Ah... Admittedly it's not as late as our conversation at LinuxPlumbers
last year in NOLA but... Maybe it's just late at night, but I failed to
parse the above.
Regards,
Mike

Serge Hallyn
2014-05-20 14:18:30 UTC
Permalink
Post by Michael H. Warfield
Post by Eric W. Biederman
Which possibly means that mount needs to learn how to keep a more
privileged user from using your new loop devices.
Not sure I got that one. A user with "more" privileges may or may not
have access dependent on the congruence of the privileges. They're not
Yes, so in this case by 'more privileged' he meant a privileged user in a
userns which is an ancestor of the current userns. It is in fact *more*
privileged than any user in the current userns.
Post by Michael H. Warfield
hierarchical. If someone has that "priv" then they have access. If they
They are in fact implicitly hierarchical due to the hierarchical userns
design.
Post by Michael H. Warfield
do not, they do not.
Seth Forshee
2014-05-20 14:21:03 UTC
Permalink
Post by Eric W. Biederman
A fancy way to ask the kernel to run mknod on its behalf is what
/dev/pts is.
When I suggested this I did not mean you should forgo making changes to
allow partitions and the like. What I intended is that you should find a
way to make this safe for users who don't have root capabilities.
But Greg did say that "unprivileged" or "secure" containers (depending
on whose terminology you're using) should not be able to do partitioning
[1]. I don't really understand this stance though, as I don't see what
possible security problems arise from letting root in a user ns do
BLKRRPART on a block device that it's explicitly been granted privileged
use of.

Assuming we come to an agreement that root in a user ns can do BLKRRPART
on some devices, we've got two issues. First, the block layer enforces
this restriction so it has to be aware of what namespace has privileges
for the device, but Greg wants a solution localized to the loop driver.
Second, if we're using a loop pseudo fs then we'd logically want block
devices for the partitions in the loop fs, so we have to create some
mechanism for the loop driver to get notified about these devices being
created.
Post by Eric W. Biederman
Which possibly means that mount needs to learn how to keep a more
privileged user from using your new loop devices.
The patches I posted have mechanisms to at least mitigate the problem.
First, anyone using loop-control to find a free loop device will never
get a device allocated to a different user ns (the loop pseudo fs code I
have also does this). Second, a given loop block device would only show
up in the devtmpfs of the namespace which owned that device. So a
sufficiently privileged user isn't completely prevented from using the
devices, but since they would have to explicitly mknod the block device
node it should prevent accidental use by a more privileged user.

But I also brought this up previously, and Greg argued that it isn't a
real issue [1].
Post by Eric W. Biederman
To get to the point where this is really and truly usable I expect to be
technically daunting.
Ultimately the technical challenge is how do we create a block device
that is safe for a user who does not have any capabilities to use, and
what can we do with that block device to make it useful.
Yes, and I'd like to get started solving those challenges. But I also
don't think we can address these two points (support partition blkdevs,
help prevent more privileged users from using a namespace's loop
devices) sufficiently while having an implementation completely
contained within the loop driver as Greg is requesting.

Thanks,
Seth
[1] http://www.spinics.net/linux/lists/kernel/msg1744750.html
Eric W. Biederman
2014-05-21 22:00:33 UTC
Permalink
Post by Seth Forshee
Post by Eric W. Biederman
Ultimately the technical challenge is how do we create a block device
that is safe for a user who does not have any capabilities to use, and
what can we do with that block device to make it useful.
Yes, and I'd like to get started solving those challenges. But I also
don't think we can address these two points (support partition blkdevs,
help prevent more privileged users from using a namespace's loop
devices) sufficiently while having an implementation completely
contained within the loop driver as Greg is requesting.
My key takeaway from the conversation is that we should reduce the
scope of what is being done to something that makes sense and where the
problems are immediately visible.

Part of me would like to suggest that fuse and its ability to imitate
device nodes might be a more appropriate solution, to something that
just needs block device access and nothing else.

For purposes of discussion let's call it unprivloopfs. That can reuse
code from the loop device or not as appropriate. Not supporting
partitioning I think is a very reasonable first step until it is shown
that we can make good use of partitioning support, and there are not
better ways of solving the problem.

I expect the most productive thing to talk about is what is your
immediate goal? Mounting a filesystem? Building an iso?

We have a long history with the namespace support of punting on issues
and not solving them until a long term maintainable solution becomes
clear. Let's do what we can to make the problem and the solution clear.

Eric
Serge Hallyn
2014-05-21 22:33:19 UTC
Permalink
Post by Eric W. Biederman
My key take away from the conversation is that we should reduce the
scope of what is being done to something that makes sense and the
problems are immediately visible.
Part of me would like to suggest that fuse and its ability to imitate
device nodes might be a more appropriate solution, to something that
Do you have a link to more info on this? Some googling got me to an
interesting but old thread on CUSE, but nothing specifically about fuse
doing this.
Post by Eric W. Biederman
just needs block device access and nothing else.
For purposes of discussion let's call it unprivloopfs. That can reuse
code from the loop device or not as appropriate. Not supporting
partitioning I think is a very reasonable first step until it is shown
that we can make good use of partitioning support, and there are not
better ways of solving the problem.
I expect the most productive thing to talk about is what is your
immediate goal? Mounting a filesystem? Building an iso?
For me it would be taking an iso and making some changes to it to
localize it (i.e. take an install iso and add preseed file).

Now of course in the end there is no reason why we can't do all of
this with a new suite of libraries which simply uses read/write with
knowledge of the fs layouts to parse and modify the backing files.
My concern there is that duplicating all of the fs code seems unlikely
to improve the soundness of either implementation. Perhaps we can
autogenerate this from the kernel source? Does fuse already do
something like that?
Post by Eric W. Biederman
We have a long history with the namespace support of punting on issues
and not solving them until a long term maintainable solution becomes
clear. Let's do what we can to make the problem and the solution clear.
-serge
Michael J Coss
2014-05-22 18:12:27 UTC
Permalink
I've been working on this issue for a while as my use case is having
containers as virtual desktops for users, that run X, and allow sharing
of the desktop via injection of displays to the container, as well as
mice/keyboard using a remote usb ip solution. To make this work, we
needed udev messages. But instead of being broadcast to every
container, which is what happens now, they need to be delivered to the
appropriate container. So the uevents are localized to the host,
and a new daemon (udevns) listens via libudev for events and forwards
them to the appropriate container(s) by injecting the events into
the appropriate network namespace. It is also responsible for creating
device nodes in the container. We create a local dev directory in
/etc/lxc/<containername>/ that is bound during the startup of the
container. The container's udev gets the events, and handles them
locally based on the admin's rules. Device creation is controlled via
the lxc.conf file.
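
The filtering step of such a daemon might look roughly like the Python sketch below. It is not the udevns code: it assumes the raw kernel uevent format (an `action@devpath` header followed by NUL-separated `KEY=VALUE` pairs) rather than libudev's own message format, and the routing table, container names, and function names are hypothetical.

```python
def parse_uevent(payload: bytes) -> dict:
    """Parse a kernel uevent datagram: b'action@devpath\\0KEY=VALUE\\0...'."""
    fields = [f for f in payload.split(b"\0") if f]
    event = {}
    for field in fields[1:]:          # fields[0] is the 'action@devpath' header
        key, sep, value = field.partition(b"=")
        if sep:
            event[key.decode()] = value.decode()
    return event


# Hypothetical routing table: which subsystems each container may see.
ROUTES = {
    "desktop1": {"input", "usb"},
    "desktop2": {"input"},
}


def containers_for(event: dict, routes=ROUTES) -> list:
    """Return the containers whose rules admit this event's SUBSYSTEM."""
    subsystem = event.get("SUBSYSTEM")
    return sorted(name for name, allowed in routes.items() if subsystem in allowed)
```

Forwarding the event into each selected container's network namespace and creating the device node would follow this step and are left out here.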

When I originally looked at this problem, I too thought of FUSE, but
after implementing a /dev FUSE I found the performance penalty was just
too much, as each access required traversing the kernel a few times. If
there was a way to handoff the file descriptor, it might be viable. And
there have been attempts at implementing the handoff but they weren't
very stable.

My current attempt at a FUSE is to provide a filtered view of sysfs, as
this is another kernel filesystem that poses problems for a more
generalized view of containers as a virtual machine replacement. In
this case, the performance issues are less as it just isn't as critical.
Eric W. Biederman
2014-05-23 22:23:50 UTC
Permalink
Post by Serge Hallyn
Post by Eric W. Biederman
Post by Seth Forshee
Post by Eric W. Biederman
Ultimately the technical challenge is how do we create a block device
that is safe for a user who does not have any capabilities to use, and
what can we do with that block device to make it useful.
Yes, and I'd like to get started solving those challenges. But I also
don't think we can address these two points (support partition blkdevs,
help prevent more privileged users from using a namespace's loop
devices) sufficiently while having an implementation completely
contained within the loop driver as Greg is requesting.
My key take away from the conversation is that we should reduce the
scope of what is being done to something that makes sense and the
problems are immediately visible.
Part of me would like to suggest that fuse and its ability to imitate
device nodes might be a more appropriate solution, to something that
Do you have a link to more info on this? Some googling got me to an
interesting but old thread on CUSE, but nothing specifically about fuse
doing this.
CUSE is probably what I was thinking of. It is all part of the fuse
code base in the kernel. And now that I am reminded it is called CUSE
I go Duh that is a character device...

Fuse and everything it can do is definitely the filesystem I would most
like to see audited so that it can be enabled in user namespaces. Fuse
was built to be sufficiently paranoid to allow this, so it should not
take a lot to take fuse the rest of the way.
Post by Serge Hallyn
Post by Eric W. Biederman
just needs block device access and nothing else.
For purposes of discussion let's call it unprivloopfs. That can reuse
code from the loop device or not as appropriate. Not supporting
partitioning I think is a very reasonable first step until it is shown
that we can make good use of partitioning support, and there are not
better ways of solving the problem.
I expect the most productive thing to talk about is what is your
immediate goal? Mounting a filesystem? Building an iso?
For me it would be taking an iso and making some changes to it to
localize it (i.e. take an install iso and add preseed file).
Now of course in the end there is no reason why we can't do all of
this with a new suite of libraries which simply uses read/write with
knowledge of the fs layouts to parse and modify the backing files.
My concern there is that duplicating all of the fs code seems unlikely
to improve the soundness of either implementation. Perhaps we can
autogenerate this from the kernel source? Does fuse already do
something like that?
I am not aware of that. But I have not worked extensively with fuse.

I do agree that finding a way to perform a read-only mount of an ISO by
an unprivileged user is a very interesting use case. Given its nature
as an interchange medium, isofs should be as hardened as humanly possible,
and that is likely easier with a read-only filesystem. And at less than
4000 lines of code isofs is auditable.

So as a target for unprivileged mounts of a block device isofs looks
like a good place to start.

Eric
Seth Forshee
2014-05-28 09:26:55 UTC
Permalink
Post by Eric W. Biederman
Post by Serge Hallyn
Post by Eric W. Biederman
Post by Seth Forshee
Post by Eric W. Biederman
Ultimately the technical challenge is how do we create a block device
that is safe for a user who does not have any capabilities to use, and
what can we do with that block device to make it useful.
Yes, and I'd like to get started solving those challenges. But I also
don't think we can address these two points (support partition blkdevs,
help prevent more privileged users from using a namespace's loop
devices) sufficiently while having an implementation completely
contained within the loop driver as Greg is requesting.
My key take away from the conversation is that we should reduce the
scope of what is being done to something that makes sense and the
problems are immediately visible.
Part of me would like to suggest that fuse and its ability to imitate
device nodes might be a more appropriate solution, to something that
Do you have a link to more info on this? Some googling got me to an
interesting but old thread on CUSE, but nothing specifically about fuse
doing this.
CUSE is probably what I was thinking of. It is all part of the fuse
code base in the kernel. And now that I am reminded it is called CUSE
I go Duh that is a character device...
Fuse and everything it can do is definitely the filesystem I would like
to see most have the audits to be enabled in user namespace. Fuse
was built to be sufficiently paranoid to allow this and so it should not
take a lot to take fuse the rest of the way.
I was aware of FUSE but hadn't ever looked at it much. Looking at it
now, this isn't going to satisfy any of the use cases I know about,
which are wanting to use filesystems supported in-kernel (isofs, ext*).
I don't see that any of these have a FUSE implementation, and I think we
gain more from figuring out how to use in-kernel filesystems in
containers than trying to find a way to shoehorn selected filesystems
into FUSE.
Serge E. Hallyn
2014-05-28 13:12:59 UTC
Permalink
Post by Seth Forshee
Post by Eric W. Biederman
Post by Serge Hallyn
Post by Eric W. Biederman
Post by Seth Forshee
Post by Eric W. Biederman
Ultimately the technical challenge is how do we create a block device
that is safe for a user who does not have any capabilities to use, and
what can we do with that block device to make it useful.
Yes, and I'd like to get started solving those challenges. But I also
don't think we can address these two points (support partition blkdevs,
help prevent more privileged users from using a namespace's loop
devices) sufficiently while having an implementation completely
contained within the loop driver as Greg is requesting.
My key take away from the conversation is that we should reduce the
scope of what is being done to something that makes sense and the
problems are immediately visible.
Part of me would like to suggest that fuse and its ability to imitate
device nodes might be a more appropriate solution, to something that
Do you have a link to more info on this? Some googling got me to an
interesting but old thread on CUSE, but nothing specifically about fuse
doing this.
CUSE is probably what I was thinking of. It is all part of the fuse
code base in the kernel. And now that I am reminded it is called CUSE
I go Duh that is a character device...
Fuse and everything it can do is definitely the filesystem I would like
to see most have the audits to be enabled in user namespace. Fuse
was built to be sufficiently paranoid to allow this and so it should not
take a lot to take fuse the rest of the way.
I was aware of FUSE but hadn't ever looked at it much. Looking at it
now, this isn't going to satisfy any of the use cases I know about,
which are wanting to use filesystems supported in-kernel (isofs, ext*).
I don't see that any of these have a FUSE implementation, and I think we
gain more from figuring out how to use in-kernel filesystems in
containers than trying to find a way to shoehorn selected filesystems
into FUSE.
That's why I was wondering how much work it would be to auto-generate
fuse fs support from the in-kernel source.

-serge
Eric W. Biederman
2014-05-28 20:33:51 UTC
Permalink
Post by Serge E. Hallyn
Post by Seth Forshee
I was aware of FUSE but hadn't ever looked at it much. Looking at it
now, this isn't going to satisfy any of the use cases I know about,
which are wanting to use filesystems supported in-kernel (isofs, ext*).
I don't see that any of these have a FUSE implementation, and I think we
gain more from figuring out how to use in-kernel filesystems in
containers than trying to find a way to shoehorn selected filesystems
into FUSE.
That's why I was wondering how much work it would be to auto-generate
fuse fs support from the in-kernel source.
So at a quick look I have found fuseext2, fuseiso and mountlo-0.5 (which
claims to have supported all the in-kernel filesystems with the help of
user mode linux).

Given that the first two are just an apt-get install away, fuse really
looks like the shortest path to being able to mount an iso and do other
interesting things.

We probably want something more but only when performance becomes a
bottle-neck.

Eric

Serge E. Hallyn
2014-05-18 02:42:00 UTC
Permalink
Post by James Bottomley
Post by Greg Kroah-Hartman
Post by Seth Forshee
Post by Greg Kroah-Hartman
Post by Serge Hallyn
Post by Greg Kroah-Hartman
I think having to pick and choose what device nodes you want in a
container is a good thing. Besides, you would have to do the same thing
in the kernel anyway, what's wrong with userspace making the decision
here, especially as it knows exactly what it wants to do much more so
than the kernel ever can.
For 'real' devices that sounds sensible. The thing about loop devices
is that we simply want to allow a container to say "give me a loop
device to use" and have it receive a unique loop device (or 3), without
having to pre-assign them. I think that would be cleaner to do using
a pseudofs and loop-control device, rather than having to have a
daemon in userspace on the host farming those out in response to
some, I don't know, dbus request?
I agree that loop devices would be nice to have in a container, and that
the existing loop interface doesn't really lend itself to that. So
create a new type of thing that acts like a loop device in a container.
But don't try to mess with the whole driver core just for a single type
of device.
No matter what I don't think we get out of this without driver core
changes, whether this was done in loop or by creating something new.
Not unless the whole thing is punted to userspace, anyway.
The first problem is that many block device ioctls check for
CAP_SYS_ADMIN. Most of these might not ever be used on loop devices, I'm
not really sure. But loop does at minimum support partitions, and to get
that functionality in an unprivileged container at least the block layer
needs to know the namespace which has privileges for that device.
That's fine, you should have those permissions in a container if you
want to do something like that on a loop device, right?
Really, no. CAP_SYS_ADMIN is effectively a pseudo root security hole.
Any user possessing CAP_SYS_ADMIN can do about as much damage as real
root can, whether or not you use user namespaces, so it would compromise
a lot of the security we're just bringing to containers.
Post by Greg Kroah-Hartman
Post by Seth Forshee
The second is that all block devices automatically appear in devtmpfs.
The scenario I'm concerned about is that the host could unknowingly use
a loop device exposed to a container, then the container could see data
from the host.
I don't think that's a real issue, the host should know not to do that.
Post by Seth Forshee
So we either need a flag to tell the driver core not to create a node
in devtmpfs, or we need a privileged manager in userspace to remove
them (which kind of defeats the purpose). And it gets more complicated
when partition block devs are mixed in, because they can be created
without involvement from the driver - they would need to inherit the
"no devtmpfs node" property from their parent, and if the driver uses
a pseudo fs to create device nodes for userspace then it needs to be
informed about the partitions too so it can create those nodes.
I don't think that will be needed. Root in a host can do whatever it
wants in the containers, so mixing up block devices is the least of the
issues involved :)
Post by Seth Forshee
So maybe we could get by without the privileged ioctls, as long as it
was understood that unprivileged containers can't do partitioning. But I
do think the devtmpfs problem would need to be addressed.
I don't think unprivileged containers should be able to do partitioning.
An unprivileged user can't do that, so why should a container be any
different?
To make sure we're on the same page with terminology, there's an
unprivileged container and a secure container. In the former, there's
Hm, that terminology (which isn't what we've been using) could be
useful, but is still not quite precise enough if we're going down
that road.
Post by James Bottomley
no root user (all the processes run as non-root), so the container isn't
"there is no root user" and "all processes run as non-root" are not the
same thing. Is it just that no processes are running as root? Or that
uid 0 in the container is not mapped at all and hence not achievable?

The former really isn't a function of the container itself, and depends
on there really not being any setuid-root or capability-wielding files
available in the container.

If the latter, and you're hoping to claim that the host is saved from
the container exercising kernel code which falls under 'if
(ns_capable(X))', then you're still just one unprivileged
clone(CLONE_NEWUSER) and mapping of nested uid 0 to any actually validly
mapped container uid away from hitting that kernel code. Your container
resources (i.e. networking) are mostly saved from being changed by the
container, although file capabilities will still thwart that.
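
A minimal sketch of that "one clone away" step, in Python via ctypes. It uses `unshare(CLONE_NEWUSER)` in a forked child rather than `clone()`, and the helper names are made up for illustration. Whether it succeeds depends on the kernel allowing unprivileged user namespaces, so the function only reports the outcome.

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000   # from <linux/sched.h>


def uid_map_line(inside_uid: int, outside_uid: int, count: int = 1) -> str:
    """Format one /proc/<pid>/uid_map entry."""
    return f"{inside_uid} {outside_uid} {count}\n"


def try_nested_root() -> bool:
    """Fork a child that enters a new user namespace and maps its own uid to 0.

    Returns True if the child ended up with uid 0 inside the new namespace.
    """
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            outer_uid = os.getuid()            # a uid that is validly mapped here
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUSER) == 0:
                with open("/proc/self/uid_map", "w") as f:
                    f.write(uid_map_line(0, outer_uid))
                status = 0 if os.getuid() == 0 else 1
        except OSError:
            status = 1
        finally:
            os._exit(status)                   # never return into the parent's code
    _, wait_status = os.waitpid(pid, 0)
    return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0
```

If this returns True, the child held a full capability set in its own namespace, which is enough to reach code guarded by ns_capable() checks against that namespace.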
Post by James Bottomley
expected to perform any actions root would ... that's easy. In a secure
container, root is mapped to a nobody user in the host, so is
effectively unprivileged, but root in the container expects to look like
a real root within the VPS (and thus may expect to partition things,
depending on how they've been given access to the block device). The
big problem is giving back capabilities to the container root such that
a) it loses them if it escapes the container and b) it doesn't get
sufficient capabilities to damage the system.
James
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Eric W. Biederman
2014-05-17 04:31:37 UTC
Permalink
Post by Greg Kroah-Hartman
Post by Serge Hallyn
Post by Greg Kroah-Hartman
I think having to pick and choose what device nodes you want in a
container is a good thing. Besides, you would have to do the same thing
in the kernel anyway, what's wrong with userspace making the decision
here, especially as it knows exactly what it wants to do much more so
than the kernel ever can.
For 'real' devices that sounds sensible. The thing about loop devices
is that we simply want to allow a container to say "give me a loop
device to use" and have it receive a unique loop device (or 3), without
having to pre-assign them. I think that would be cleaner to do using
a pseudofs and loop-control device, rather than having to have a
daemon in userspace on the host farming those out in response to
some, I don't know, dbus request?
I agree that loop devices would be nice to have in a container, and that
the existing loop interface doesn't really lend itself to that. So
create a new type of thing that acts like a loop device in a container.
But don't try to mess with the whole driver core just for a single type
of device.
Yes. Something like devpts (without the newinstance option). Built to
allow unprivileged users to create loopback devices.

There is still a huge kettle of fish involved in verifying that a
filesystem is safe from a hostile user who has access to the block
device while the filesystem is mounted.

Having a few filesystems that are robust enough to trust with arbitrary
filesystem corruption would be very interesting.

I assume unprivileged and hostile users because if you trusted the real
root inside of your container this would not be an issue.

Eric
Seth Forshee
2014-05-17 16:01:45 UTC
Permalink
Post by Eric W. Biederman
Post by Greg Kroah-Hartman
Post by Serge Hallyn
Post by Greg Kroah-Hartman
I think having to pick and choose what device nodes you want in a
container is a good thing. Besides, you would have to do the same thing
in the kernel anyway, what's wrong with userspace making the decision
here, especially as it knows exactly what it wants to do much more so
than the kernel ever can.
For 'real' devices that sounds sensible. The thing about loop devices
is that we simply want to allow a container to say "give me a loop
device to use" and have it receive a unique loop device (or 3), without
having to pre-assign them. I think that would be cleaner to do using
a pseudofs and loop-control device, rather than having to have a
daemon in userspace on the host farming those out in response to
some, I don't know, dbus request?
I agree that loop devices would be nice to have in a container, and that
the existing loop interface doesn't really lend itself to that. So
create a new type of thing that acts like a loop device in a container.
But don't try to mess with the whole driver core just for a single type
of device.
Yes. Something like devpts (without the newinstance option). Built to
allow unprivileged users to create loopback devices.
That's where I started, and I've got code, so I guess I'll clean it up
and send patches. If the stance is that only system-wide CAP_SYS_ADMIN
gets to do privileged block device ioctls, including reading partitions
on a block device which has been assigned to a container, then I guess
that approach works well enough.
Post by Eric W. Biederman
There is still a huge kettle of fish in with verifying a filesystem is
safe from a hostile user that has access to the block device while the
filesystem is mounted.
Having a few filesystems that are robust enough to trust with arbitrary
filesystem corruption would be very interesting.
I assume unprivileged and hostile users because if you trusted the real
root inside of your container this would not be an issue.
Eric
Serge E. Hallyn
2014-05-18 02:44:58 UTC
Permalink
Post by Seth Forshee
Post by Eric W. Biederman
Post by Greg Kroah-Hartman
Post by Serge Hallyn
Post by Greg Kroah-Hartman
I think having to pick and choose what device nodes you want in a
container is a good thing. Besides, you would have to do the same thing
in the kernel anyway, what's wrong with userspace making the decision
here, especially as it knows exactly what it wants to do much more so
than the kernel ever can.
For 'real' devices that sounds sensible. The thing about loop devices
is that we simply want to allow a container to say "give me a loop
device to use" and have it receive a unique loop device (or 3), without
having to pre-assign them. I think that would be cleaner to do using
a pseudofs and loop-control device, rather than having to have a
daemon in userspace on the host farming those out in response to
some, I don't know, dbus request?
I agree that loop devices would be nice to have in a container, and that
the existing loop interface doesn't really lend itself to that. So
create a new type of thing that acts like a loop device in a container.
But don't try to mess with the whole driver core just for a single type
of device.
Yes. Something like devpts (without the newinstance option). Built to
allow unprivileged users to create loopback devices.
That's where I started, and I've got code, so I guess I'll clean it up
and send patches. If the stance is that only system-wide CAP_SYS_ADMIN
gets to do privileged block device ioctls, including reading partitions
Sorry, where did that come from? What Eric was referring to below is
the fs superblock readers not being trusted. Maybe I glossed over another
email where it was mentioned?
Post by Seth Forshee
on a block device which has been assigned to a container, then I guess
that approach works well enough.
Post by Eric W. Biederman
There is still a huge kettle of fish in with verifying a filesystem is
safe from a hostile user that has access to the block device while the
filesystem is mounted.
Having a few filesystems that are robust enough to trust with arbitrary
filesystem corruption would be very interesting.
I assume unprivileged and hostile users because if you trusted the real
root inside of your container this would not be an issue.
Eric
Seth Forshee
2014-05-19 13:27:03 UTC
Permalink
Post by Serge E. Hallyn
Post by Seth Forshee
Post by Eric W. Biederman
Post by Greg Kroah-Hartman
Post by Serge Hallyn
Post by Greg Kroah-Hartman
I think having to pick and choose what device nodes you want in a
container is a good thing. Besides, you would have to do the same thing
in the kernel anyway, what's wrong with userspace making the decision
here, especially as it knows exactly what it wants to do much more so
than the kernel ever can.
For 'real' devices that sounds sensible. The thing about loop devices
is that we simply want to allow a container to say "give me a loop
device to use" and have it receive a unique loop device (or 3), without
having to pre-assign them. I think that would be cleaner to do using
a pseudofs and loop-control device, rather than having to have a
daemon in userspace on the host farming those out in response to
some, I don't know, dbus request?
I agree that loop devices would be nice to have in a container, and that
the existing loop interface doesn't really lend itself to that. So
create a new type of thing that acts like a loop device in a container.
But don't try to mess with the whole driver core just for a single type
of device.
Yes. Something like devpts (without the newinstance option). Built to
allow unprivileged users to create loopback devices.
That's where I started, and I've got code, so I guess I'll clean it up
and send patches. If the stance is that only system-wide CAP_SYS_ADMIN
gets to do privileged block device ioctls, including reading partitions
Sorry, where did that come from? What Eric was referring to below is
the fs superblock readers not being trusted. Maybe I glossed over another
email where it was mentioned?
You must have. Take a look at [1].

To repeat the point: the ioctl to reread partitions (along with several
other block device ioctls) has a capable(CAP_SYS_ADMIN) check. We can't
change this to an ns_capable check without at minimum the block layer
knowing about the namespace associated with the block device. Ergo we
can't reread partitions if this is done entirely within the loop driver
via a pseudo fs.
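
From userspace the check shows up as a failed ioctl. A minimal Python sketch, assuming the usual x86/ARM ioctl encoding for BLKRRPART; the function name is made up for illustration:

```python
import errno
import fcntl
import os

# _IO(0x12, 95) from <linux/fs.h>; value assumes the x86/ARM-style ioctl encoding.
BLKRRPART = 0x125F


def reread_partitions(device_path: str) -> int:
    """Ask the kernel to re-read the partition table; return 0 or an errno value."""
    try:
        fd = os.open(device_path, os.O_RDONLY)
    except OSError as e:
        return e.errno
    try:
        fcntl.ioctl(fd, BLKRRPART)
        return 0
    except OSError as e:
        # A failed capability check typically surfaces here as EACCES or EPERM.
        return e.errno
    finally:
        os.close(fd)
```

Run as root inside a user namespace, a call such as `reread_partitions("/dev/loop0")` would be expected to fail this way, because the caller holds CAP_SYS_ADMIN only in its own namespace.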

[1] http://article.gmane.org/gmane.linux.kernel.containers.lxc.devel/8191
Serge Hallyn
2014-05-20 14:15:39 UTC
Permalink
Post by Seth Forshee
Post by Serge E. Hallyn
Post by Seth Forshee
Post by Eric W. Biederman
Post by Greg Kroah-Hartman
Post by Serge Hallyn
Post by Greg Kroah-Hartman
I think having to pick and choose what device nodes you want in a
container is a good thing. Besides, you would have to do the same thing
in the kernel anyway, what's wrong with userspace making the decision
here, especially as it knows exactly what it wants to do much more so
than the kernel ever can.
For 'real' devices that sounds sensible. The thing about loop devices
is that we simply want to allow a container to say "give me a loop
device to use" and have it receive a unique loop device (or 3), without
having to pre-assign them. I think that would be cleaner to do using
a pseudofs and loop-control device, rather than having to have a
daemon in userspace on the host farming those out in response to
some, I don't know, dbus request?
I agree that loop devices would be nice to have in a container, and that
the existing loop interface doesn't really lend itself to that. So
create a new type of thing that acts like a loop device in a container.
But don't try to mess with the whole driver core just for a single type
of device.
Yes. Something like devpts (without the newinstance option). Built to
allow unprivileged users to create loopback devices.
That's where I started, and I've got code, so I guess I'll clean it up
and send patches. If the stance is that only system-wide CAP_SYS_ADMIN
gets to do privileged block device ioctls, including reading partitions
Sorry, where did that come from? What Eric was referring to below is
the fs superblock readers not being trusted. Maybe I glossed over another
email where it was mentioned?
You must have. Take a look at [1].
To repeat the point: the ioctl to reread partitions (along with several
other block device ioctls) has a capable(CAP_SYS_ADMIN) check. We can't
change this to an ns_capable check without at minimum the block layer
knowing about the namespace associated with the block device. Ergo we
Which only means those changes are necessary :)

So far as I understand, a namespaced devtmpfs is nacked, but a loopfs
is interesting (and, depending on the implementation, acceptable). That
necessarily includes the minimal blockdev changes to support it.
Post by Seth Forshee
can't reread partitions if this is done entirely within the loop driver
via a pseudo fs.
[1] http://article.gmane.org/gmane.linux.kernel.containers.lxc.devel/8191
Serge Hallyn
2014-05-20 14:26:29 UTC
Permalink
Post by Serge Hallyn
Post by Seth Forshee
Post by Serge E. Hallyn
Post by Seth Forshee
Post by Eric W. Biederman
Post by Greg Kroah-Hartman
Post by Serge Hallyn
Post by Greg Kroah-Hartman
I think having to pick and choose what device nodes you want in a
container is a good thing. Besides, you would have to do the same thing
in the kernel anyway, what's wrong with userspace making the decision
here, especially as it knows exactly what it wants to do much more so
than the kernel ever can.
For 'real' devices that sounds sensible. The thing about loop devices
is that we simply want to allow a container to say "give me a loop
device to use" and have it receive a unique loop device (or 3), without
having to pre-assign them. I think that would be cleaner to do using
a pseudofs and loop-control device, rather than having to have a
daemon in userspace on the host farming those out in response to
some, I don't know, dbus request?
I agree that loop devices would be nice to have in a container, and that
the existing loop interface doesn't really lend itself to that. So
create a new type of thing that acts like a loop device in a container.
But don't try to mess with the whole driver core just for a single type
of device.
Yes. Something like devpts (without the newinstance option). Built to
allow unprivileged users to create loopback devices.
That's where I started, and I've got code, so I guess I'll clean it up
and send patches. If the stance is that only system-wide CAP_SYS_ADMIN
gets to do privileged block device ioctls, including reading partitions
Sorry, where did that come from? What Eric was referring to below is
the fs superblock readers not being trusted. Maybe I glossed over another
email where it was mentioned?
You must have. Take a look at [1].
To repeat the point: the ioctl to reread partitions (along with several
other block device ioctls) has a capable(CAP_SYS_ADMIN) check. We can't
change this to an ns_capable check without at minimum the block layer
knowing about the namespace associated with the block device. Ergo we
Which only means those changes are necessary :)
So far as I understand, a namespaced devtmpfs is nacked, but a loopfs
is interesting (and, depending on the implementation, acceptable). That
necessarily includes the minimal blockdev changes to support it.
Post by Seth Forshee
can't reread partitions if this is done entirely within the loop driver
via a pseudo fs.
[1] http://article.gmane.org/gmane.linux.kernel.containers.lxc.devel/8191
Hm, yeah, I was confuddling two issues. Nevertheless, for real block devices I
absolutely agree. For loop devices I don't. My answer to
Post by Serge Hallyn
I don't think unprivileged containers should be able to do partitioning.
An unprivileged user can't do that, so why should a container be any
different?
would be that the loop device is a convenience built atop the backing image,
and if the user had the rights to loop-attach the backing image, he can
just as well partition using write(2), so why artificially place this limit?
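
To illustrate the write(2) point: a minimal Python sketch that puts an MBR partition table into an image file with ordinary file writes. The function names and the single-partition layout are made up for illustration. Only the kernel's re-read of the table through the loop device needs privilege, not the creation of the table.

```python
import struct

SECTOR = 512


def mbr_with_one_partition(start_lba: int, num_sectors: int, part_type: int = 0x83) -> bytes:
    """Build a 512-byte MBR holding a single primary partition entry."""
    entry = struct.pack(
        "<B3sB3sII",
        0x00,               # status: not bootable
        b"\xfe\xff\xff",    # CHS start: placeholder, the LBA fields are what matter
        part_type,          # 0x83 = Linux
        b"\xfe\xff\xff",    # CHS end: placeholder
        start_lba,
        num_sectors,
    )
    mbr = bytearray(SECTOR)
    mbr[446:446 + 16] = entry      # first of the four 16-byte partition slots
    mbr[510:512] = b"\x55\xaa"     # boot signature
    return bytes(mbr)


def partition_image(image_path: str, start_lba: int, num_sectors: int) -> None:
    """Write the partition table into sector 0 of an existing image file."""
    with open(image_path, "r+b") as f:
        f.write(mbr_with_one_partition(start_lba, num_sectors))
```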

Nevertheless this is not really a debate worth having until we have a
blockdev fs mountable in a userns.

My main interest currently is with privileged containers. I think we can
learn plenty from that for now.
Michael H. Warfield
2014-05-17 12:57:38 UTC
Permalink
Post by Greg Kroah-Hartman
Post by Serge Hallyn
Post by Greg Kroah-Hartman
I think having to pick and choose what device nodes you want in a
container is a good thing. Besides, you would have to do the same thing
in the kernel anyway, what's wrong with userspace making the decision
here, especially as it knows exactly what it wants to do much more so
than the kernel ever can.
For 'real' devices that sounds sensible. The thing about loop devices
is that we simply want to allow a container to say "give me a loop
device to use" and have it receive a unique loop device (or 3), without
having to pre-assign them. I think that would be cleaner to do using
a pseudofs and loop-control device, rather than having to have a
daemon in userspace on the host farming those out in response to
some, I don't know, dbus request?
I agree that loop devices would be nice to have in a container, and that
the existing loop interface doesn't really lend itself to that. So
create a new type of thing that acts like a loop device in a container.
But don't try to mess with the whole driver core just for a single type
of device.
Yeah, a lot of dynamic devices (like serial devices) can be handled in
user space with the proviso that we could use some way to tickle udev
and hotplug in the container with events.

But the loop device is the real ugly duckling here. It's a unique case
of an on-demand device with a shared control device that's not really
hot-plug and not really deterministic enough to be handled purely in
user space. It presents unique challenges unto itself.
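
For readers unfamiliar with the shared control device, a rough sketch of how a free loop device is requested today via /dev/loop-control. The ioctl number is LOOP_CTL_GET_FREE from <linux/loop.h>; the function name and the choice to return None on failure are assumptions of this sketch.

```python
import fcntl
import os

LOOP_CTL_GET_FREE = 0x4C82  # from <linux/loop.h>

def get_free_loop_index(control_path="/dev/loop-control"):
    """Return the index of an unused loop device, or None if unavailable."""
    try:
        fd = os.open(control_path, os.O_RDWR)
    except OSError:
        # No control node, or no permission to open it.
        return None
    try:
        # The ioctl's return value is the index N of a free /dev/loopN.
        # The kernel may allocate a new device if none is currently free.
        return fcntl.ioctl(fd, LOOP_CTL_GET_FREE)
    except OSError:
        return None
    finally:
        os.close(fd)
```

Because every caller talks to the same control node and draws from one global pool of indices, there is nothing in this interface that scopes the returned device to a container.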

Makes sense to me.
Post by Greg Kroah-Hartman
greg k-h
Regards,
Mike
--
Michael H. Warfield (AI4NB) | (770) 978-7061 | mhw at WittsEnd.com
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!

Richard Weinberger
2014-05-15 18:25:58 UTC
Permalink
On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
Post by Greg Kroah-Hartman
Then don't use a container to build such a thing, or fix the build
scripts to not do that :)
I second this.
To me it looks like some folks try to (ab)use Linux containers
for purposes where KVM would much better fit in.
Please don't put more complexity into containers. They are already
horribly complex and error prone.
--
Thanks,
//richard
Serge Hallyn
2014-05-15 19:50:11 UTC
Permalink
Post by Richard Weinberger
On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
Post by Greg Kroah-Hartman
Then don't use a container to build such a thing, or fix the build
scripts to not do that :)
I second this.
To me it looks like some folks try to (ab)use Linux containers
for purposes where KVM would much better fit in.
Please don't put more complexity into containers. They are already
horrible complex
and error prone.
I, naturally, disagree :) The only use case which is inherently not
valid for containers is running a kernel. Practically speaking there
are other things which likely will never be possible, but if someone
offers a way to do something in containers, "you can't do that in
containers" is not an apropos response.

"That abstraction is wrong" is certainly valid, as when vpids were
originally proposed and rejected, resulting in the development of
pid namespaces. "We have to work out (x) first" can be valid (and
I can think of examples here), assuming it's not just trying to hide
behind a catch-22/chicken-egg problem.

Finally, saying "containers are complex and error prone" is conflating
several large suites of userspace code and many kernel features which
support them. Being more precise would, if the argument is valid,
lend it a lot more weight.
Richard Weinberger
2014-05-15 20:13:46 UTC
Permalink
Post by Serge Hallyn
Post by Richard Weinberger
On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
Post by Greg Kroah-Hartman
Then don't use a container to build such a thing, or fix the build
scripts to not do that :)
I second this.
To me it looks like some folks try to (ab)use Linux containers
for purposes where KVM would much better fit in.
Please don't put more complexity into containers. They are already
horrible complex
and error prone.
I, naturally, disagree :) The only use case which is inherently not
valid for containers is running a kernel. Practically speaking there
are other things which likely will never be possible, but if someone
offers a way to do something in containers, "you can't do that in
containers" is not an apropos response.
"That abstraction is wrong" is certainly valid, as when vpids were
originally proposed and rejected, resulting in the development of
pid namespaces. "We have to work out (x) first" can be valid (and
I can think of examples here), assuming it's not just trying to hide
behind a catch-22/chicken-egg problem.
Finally, saying "containers are complex and error prone" is conflating
several large suites of userspace code and many kernel features which
support them. Being more precise would, if the argument is valid,
lend it a lot more weight.
We (my company) have used Linux containers in production since 2011. First LXC, now libvirt-lxc.
To understand the internals better I also wrote my own userspace to create/start
containers. There are so many things which can hurt you badly.
With user namespaces we expose a really big attack surface to regular users.
E.g., suddenly a user is allowed to mount filesystems.
Ask Andy, he has already found lots of nasty things...
I agree that user namespaces are the way to go, all the papering with LSM
over security issues is much worse.
But we have to make sure that we don't add too many features too fast.

That said, I like containers a lot because they are cheap, but since they are
lightweight, their isolation level is lightweight too.
IMHO containers are not a cheap replacement for KVM.

Thanks,
//richard
Serge E. Hallyn
2014-05-15 20:26:28 UTC
Permalink
Post by Richard Weinberger
Post by Serge Hallyn
Post by Richard Weinberger
On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
Post by Greg Kroah-Hartman
Then don't use a container to build such a thing, or fix the build
scripts to not do that :)
I second this.
To me it looks like some folks try to (ab)use Linux containers
for purposes where KVM would much better fit in.
Please don't put more complexity into containers. They are already
horrible complex
and error prone.
I, naturally, disagree :) The only use case which is inherently not
valid for containers is running a kernel. Practically speaking there
are other things which likely will never be possible, but if someone
offers a way to do something in containers, "you can't do that in
containers" is not an apropos response.
"That abstraction is wrong" is certainly valid, as when vpids were
originally proposed and rejected, resulting in the development of
pid namespaces. "We have to work out (x) first" can be valid (and
I can think of examples here), assuming it's not just trying to hide
behind a catch-22/chicken-egg problem.
Finally, saying "containers are complex and error prone" is conflating
several large suites of userspace code and many kernel features which
support them. Being more precise would, if the argument is valid,
lend it a lot more weight.
We (my company) use Linux containers since 2011 in production. First LXC, now libvirt-lxc.
To understand the internals better I also wrote my own userspace to create/start
containers. There are so many things which can hurt you badly.
With user namespaces we expose a really big attack surface to regular users.
I.e. Suddenly a user is allowed to mount filesystems.
That is currently not the case. They can mount some virtual filesystems
and do bind mounts, but cannot mount most real filesystems. This keeps
us protected (for now) from potentially unsafe superblock readers in the
kernel.
Post by Richard Weinberger
Ask Andy, he found already lots of nasty things...
Yes, of course, and there may be more to come...
Post by Richard Weinberger
I agree that user namespaces are the way to go, all the papering with LSM
over security issues is much worse.
But we have to make sure that we don't add too much features too fast.
Agreed. Like I said, 'we have to work (x) out first' could be valid,
including 'we should wait (a year?) for user ns issues to fall out
before relaxing any of the current user ns constraints.'

On the other hand, not exercising the new code may only mean that
existing flaws stick around longer, undetected (by most).
Post by Richard Weinberger
That said, I like containers a lot because they are cheap but as they are lightweight
also therefore also isolation level is lightweight.
IMHO containers are not a cheap replacement for KVM.
The building blocks for containers can also be used for entirely
new, simpler use cases - e.g. perhaps a new fakeroot alternative based
on user namespace mappings. Which is why "this is not a use case for
containers" is not the right way to push back, whether or not the
feature ends up being appropriate.
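
As a small illustration of what "user namespace mappings" means here: each line of /proc/<pid>/uid_map has the form "inside-id outside-id length". The sketch below only does the arithmetic of translating an ID; the function names and the sample map are made up, and it is not a fakeroot implementation.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside, outside, count) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(field) for field in line.split())
            entries.append((inside, outside, count))
    return entries

def to_outside(ns_id, id_map):
    """Translate an ID seen inside the namespace to the ID outside it."""
    for inside, outside, count in id_map:
        if inside <= ns_id < inside + count:
            return outside + (ns_id - inside)
    return None  # unmapped inside this namespace

# Hypothetical map: root in the namespace is host uid 1000,
# everything else comes from a subordinate range.
sample = parse_id_map("0 1000 1\n1 100000 65535\n")
```

With such a map, a process can appear as uid 0 inside the namespace while the kernel still treats it as uid 1000 for anything outside it.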

-serge
Richard Weinberger
2014-05-15 20:33:11 UTC
Permalink
Post by Serge E. Hallyn
Post by Richard Weinberger
Post by Serge Hallyn
Post by Richard Weinberger
On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
Post by Greg Kroah-Hartman
Then don't use a container to build such a thing, or fix the build
scripts to not do that :)
I second this.
To me it looks like some folks try to (ab)use Linux containers
for purposes where KVM would much better fit in.
Please don't put more complexity into containers. They are already
horrible complex
and error prone.
I, naturally, disagree :) The only use case which is inherently not
valid for containers is running a kernel. Practically speaking there
are other things which likely will never be possible, but if someone
offers a way to do something in containers, "you can't do that in
containers" is not an apropos response.
"That abstraction is wrong" is certainly valid, as when vpids were
originally proposed and rejected, resulting in the development of
pid namespaces. "We have to work out (x) first" can be valid (and
I can think of examples here), assuming it's not just trying to hide
behind a catch-22/chicken-egg problem.
Finally, saying "containers are complex and error prone" is conflating
several large suites of userspace code and many kernel features which
support them. Being more precise would, if the argument is valid,
lend it a lot more weight.
We (my company) use Linux containers since 2011 in production. First LXC, now libvirt-lxc.
To understand the internals better I also wrote my own userspace to create/start
containers. There are so many things which can hurt you badly.
With user namespaces we expose a really big attack surface to regular users.
I.e. Suddenly a user is allowed to mount filesystems.
That is currently not the case. They can mount some virtual filesystems
and do bind mounts, but cannot mount most real filesystems. This keeps
us protected (for now) from potentially unsafe superblock readers in the
kernel.
Yeah, I meant not only "real" filesystems.
I had VFS issues in mind where an attacker could do bad things
using bind mounts for example.
Post by Serge E. Hallyn
Post by Richard Weinberger
Ask Andy, he found already lots of nasty things...
Yes, of course, and there may be more to come...
Post by Richard Weinberger
I agree that user namespaces are the way to go, all the papering with LSM
over security issues is much worse.
But we have to make sure that we don't add too much features too fast.
Agreed. Like I said, 'we have to work (x) out first' could be valid,
including 'we should wait (a year?) for user ns issues to fall out
before relaxing any of the current user ns constraints."
On the other hand, not exercising the new code may only mean that
existing flaws stick around longer, undetected (by most).
Fair point.
Post by Serge E. Hallyn
Post by Richard Weinberger
That said, I like containers a lot because they are cheap but as they are lightweight
also therefore also isolation level is lightweight.
IMHO containers are not a cheap replacement for KVM.
The building blocks for containers can also be used for entirely
new, simpler use cases - i.e. perhaps a new fakeroot alternative based
on user namespace mappings. Which is why "this is not a use case for
containers" is not the right way to push back, whether or not the
feature ends up being appropriate.
Agreed.

Maybe I'm too pessimistic.
We'll see. :-)

Thanks,
//richard
Andy Lutomirski
2014-05-19 20:22:06 UTC
Permalink
Post by Serge E. Hallyn
Post by Richard Weinberger
Post by Serge Hallyn
Post by Richard Weinberger
On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
Post by Greg Kroah-Hartman
Then don't use a container to build such a thing, or fix the build
scripts to not do that :)
I second this.
To me it looks like some folks try to (ab)use Linux containers
for purposes where KVM would much better fit in.
Please don't put more complexity into containers. They are already
horrible complex
and error prone.
I, naturally, disagree :) The only use case which is inherently not
valid for containers is running a kernel. Practically speaking there
are other things which likely will never be possible, but if someone
offers a way to do something in containers, "you can't do that in
containers" is not an apropos response.
"That abstraction is wrong" is certainly valid, as when vpids were
originally proposed and rejected, resulting in the development of
pid namespaces. "We have to work out (x) first" can be valid (and
I can think of examples here), assuming it's not just trying to hide
behind a catch-22/chicken-egg problem.
Finally, saying "containers are complex and error prone" is conflating
several large suites of userspace code and many kernel features which
support them. Being more precise would, if the argument is valid,
lend it a lot more weight.
We (my company) use Linux containers since 2011 in production. First LXC, now libvirt-lxc.
To understand the internals better I also wrote my own userspace to create/start
containers. There are so many things which can hurt you badly.
With user namespaces we expose a really big attack surface to regular users.
I.e. Suddenly a user is allowed to mount filesystems.
That is currently not the case. They can mount some virtual filesystems
and do bind mounts, but cannot mount most real filesystems. This keeps
us protected (for now) from potentially unsafe superblock readers in the
kernel.
Post by Richard Weinberger
Ask Andy, he found already lots of nasty things...
I don't think I have anything brilliant to add to this discussion
right now, except possibly:

ISTM that Linux distributions are, in general, vulnerable to all kinds
of shenanigans that would happen if an untrusted user can cause a
block device to appear. That user doesn't need permission to mount it
or even necessarily to change its contents on the fly.

E.g. what happens if you boot a machine that contains a malicious disk
image that has the same partition UUID as /? Nothing good, I imagine.

So if we're going to go down this road, we really need some way to
tell the host that certain devices are not trusted.
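
A sketch of the kind of check this implies, assuming the host already has a list of (device, UUID) pairs from somewhere like blkid or /dev/disk/by-partuuid. The function name and the sample data are made up for illustration.

```python
from collections import defaultdict

def find_uuid_collisions(devices):
    """Return {uuid: [devices]} for every UUID claimed by more than one device."""
    by_uuid = defaultdict(list)
    for dev, uuid in devices:
        # Compare case-insensitively; tools differ in how they print UUIDs.
        by_uuid[uuid.lower()].append(dev)
    return {u: sorted(devs) for u, devs in by_uuid.items() if len(devs) > 1}

# Hypothetical example: a loop device set up by a container carries
# the same partition UUID as the host's root partition.
seen = [
    ("/dev/sda1", "0A1B2C3D-01"),
    ("/dev/sda2", "0a1b2c3d-02"),
    ("/dev/loop3p1", "0a1b2c3d-01"),
]
```

Detecting the collision is the easy part; the open question raised above is how the host decides which of the colliding devices it should trust.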

--Andy
Serge Hallyn
2014-05-20 14:19:31 UTC
Permalink
Post by Andy Lutomirski
Post by Serge E. Hallyn
Post by Richard Weinberger
Post by Serge Hallyn
Post by Richard Weinberger
On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
Post by Greg Kroah-Hartman
Then don't use a container to build such a thing, or fix the build
scripts to not do that :)
I second this.
To me it looks like some folks try to (ab)use Linux containers
for purposes where KVM would much better fit in.
Please don't put more complexity into containers. They are already
horrible complex
and error prone.
I, naturally, disagree :) The only use case which is inherently not
valid for containers is running a kernel. Practically speaking there
are other things which likely will never be possible, but if someone
offers a way to do something in containers, "you can't do that in
containers" is not an apropos response.
"That abstraction is wrong" is certainly valid, as when vpids were
originally proposed and rejected, resulting in the development of
pid namespaces. "We have to work out (x) first" can be valid (and
I can think of examples here), assuming it's not just trying to hide
behind a catch-22/chicken-egg problem.
Finally, saying "containers are complex and error prone" is conflating
several large suites of userspace code and many kernel features which
support them. Being more precise would, if the argument is valid,
lend it a lot more weight.
We (my company) use Linux containers since 2011 in production. First LXC, now libvirt-lxc.
To understand the internals better I also wrote my own userspace to create/start
containers. There are so many things which can hurt you badly.
With user namespaces we expose a really big attack surface to regular users.
I.e. Suddenly a user is allowed to mount filesystems.
That is currently not the case. They can mount some virtual filesystems
and do bind mounts, but cannot mount most real filesystems. This keeps
us protected (for now) from potentially unsafe superblock readers in the
kernel.
Post by Richard Weinberger
Ask Andy, he found already lots of nasty things...
I don't think I have anything brilliant to add to this discussion
ISTM that Linux distributions are, in general, vulnerable to all kinds
of shenanigans that would happen if an untrusted user can cause a
block device to appear. That user doesn't need permission to mount it
Interesting point. This would further suggest that we absolutely must
ensure that a loop device which shows up in the container does not also
show up in the host.
Post by Andy Lutomirski
or even necessarily to change its contents on the fly.
E.g. what happens if you boot a machine that contains a malicious disk
image that has the same partition UUID as /? Nothing good, I imagine.
So if we're going to go down this road, we really need some way to
tell the host that certain devices are not trusted.
--Andy
Marian Marinov
2014-05-23 08:20:15 UTC
Permalink
Then don't use a container to build such a thing, or fix the build scripts to not do that :)
I second this. To me it looks like some folks try to (ab)use Linux containers for purposes where KVM
would much better fit in. Please don't put more complexity into containers. They are already horrible
complex and error prone.
I, naturally, disagree :) The only use case which is inherently not valid for containers is running a
kernel. Practically speaking there are other things which likely will never be possible, but if someone
offers a way to do something in containers, "you can't do that in containers" is not an apropos response.
"That abstraction is wrong" is certainly valid, as when vpids were originally proposed and rejected,
resulting in the development of pid namespaces. "We have to work out (x) first" can be valid (and I can
think of examples here), assuming it's not just trying to hide behind a catch-22/chicken-egg problem.
Finally, saying "containers are complex and error prone" is conflating several large suites of userspace
code and many kernel features which support them. Being more precise would, if the argument is valid, lend
it a lot more weight.
We (my company) use Linux containers since 2011 in production. First LXC, now libvirt-lxc. To understand the
internals better I also wrote my own userspace to create/start containers. There are so many things which can
hurt you badly. With user namespaces we expose a really big attack surface to regular users. I.e. Suddenly a
user is allowed to mount filesystems.
That is currently not the case. They can mount some virtual filesystems and do bind mounts, but cannot mount
most real filesystems. This keeps us protected (for now) from potentially unsafe superblock readers in the
kernel.
Ask Andy, he found already lots of nasty things...
ISTM that Linux distributions are, in general, vulnerable to all kinds of shenanigans that would happen if an
untrusted user can cause a block device to appear. That user doesn't need permission to mount it
Interesting point. This would further suggest that we absolutely must ensure that a loop device which shows up in
the container does not also show up in the host.
Can I suggest the usage of the devices cgroup to achieve that?
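
For reference, a sketch of what a devices cgroup (v1) rule looks like. Rules are strings of the form "type major:minor access" written to a group's devices.allow or devices.deny file; the helper name is made up. Note that the controller governs whether tasks may open or mknod a device, not which nodes are visible in /dev.

```python
def device_rule(dev_type, major, minor, access="rwm"):
    """Format a devices-cgroup rule such as 'b 7:3 rwm'."""
    if dev_type not in ("a", "b", "c"):      # all, block, char
        raise ValueError("dev_type must be 'a', 'b' or 'c'")
    if not access or set(access) - set("rwm"):  # read, write, mknod
        raise ValueError("access must be a non-empty subset of 'rwm'")
    # major and minor may be numbers or the wildcard '*'.
    return f"{dev_type} {major}:{minor} {access}"

# Block major 7 is the loop driver, so this would cover /dev/loop3.
rule = device_rule("b", 7, 3)
```

The resulting string would be written to the container's devices.allow file by whatever sets up the cgroup.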

Marian
or even necessarily to change its contents on the fly.
E.g. what happens if you boot a machine that contains a malicious disk image that has the same partition UUID as
/? Nothing good, I imagine.
So if we're going to go down this road, we really need some way to tell the host that certain devices are not
trusted.
--Andy
--
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hackman at jabber.org
ICQ: 7556201
Mobile: +359 886 660 270
James Bottomley
2014-05-23 13:16:00 UTC
Permalink
Post by Marian Marinov
Then don't use a container to build such a thing, or fix the build scripts to not do that :)
I second this. To me it looks like some folks try to (ab)use Linux containers for purposes where KVM
would much better fit in. Please don't put more complexity into containers. They are already horrible
complex and error prone.
I, naturally, disagree :) The only use case which is inherently not valid for containers is running a
kernel. Practically speaking there are other things which likely will never be possible, but if someone
offers a way to do something in containers, "you can't do that in containers" is not an apropos response.
"That abstraction is wrong" is certainly valid, as when vpids were originally proposed and rejected,
resulting in the development of pid namespaces. "We have to work out (x) first" can be valid (and I can
think of examples here), assuming it's not just trying to hide behind a catch-22/chicken-egg problem.
Finally, saying "containers are complex and error prone" is conflating several large suites of userspace
code and many kernel features which support them. Being more precise would, if the argument is valid, lend
it a lot more weight.
We (my company) use Linux containers since 2011 in production. First LXC, now libvirt-lxc. To understand the
internals better I also wrote my own userspace to create/start containers. There are so many things which can
hurt you badly. With user namespaces we expose a really big attack surface to regular users. I.e. Suddenly a
user is allowed to mount filesystems.
That is currently not the case. They can mount some virtual filesystems and do bind mounts, but cannot mount
most real filesystems. This keeps us protected (for now) from potentially unsafe superblock readers in the
kernel.
Ask Andy, he found already lots of nasty things...
ISTM that Linux distributions are, in general, vulnerable to all kinds of shenanigans that would happen if an
untrusted user can cause a block device to appear. That user doesn't need permission to mount it
Interesting point. This would further suggest that we absolutely must ensure that a loop device which shows up in
the container does not also show up in the host.
Can I suggest the usage of the devices cgroup to achieve that?
Not really ... cgroups impose resource limits, it's namespaces that
impose visibility separations. In theory this can be done with the
device namespace that's been proposed; however, a simpler way is simply
to rm the device node in the host and mknod it in the guest. I don't
really see host visibility as a huge problem: in a shared OS
virtualisation it's not really possible securely to separate the guest
from the host (only vice versa).
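
A rough sketch of the "mknod it in the guest" step, assuming loop devices use block major 7 and that the minor equals the loop index (true when the loop driver is not reserving minors for partitions). The function names are made up, and actually creating the node needs CAP_MKNOD.

```python
import os
import stat

LOOP_MAJOR = 7  # block major of the loop driver

def loop_node_args(index, perms=0o660):
    """Return (name, mode, dev) describing the node for /dev/loop<index>."""
    # Assumes minor == index, i.e. no minors reserved for partitions.
    return (f"loop{index}", stat.S_IFBLK | perms, os.makedev(LOOP_MAJOR, index))

def make_loop_node(dev_dir, index):
    """Create the device node under dev_dir, e.g. the guest's /dev."""
    name, mode, dev = loop_node_args(index)
    os.mknod(os.path.join(dev_dir, name), mode, dev)
```

The host would call make_loop_node on the guest's /dev directory and remove its own node of the same name if it wants the device to be visible only in the guest.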

But I really don't think we want to do it this way. Giving a container
the ability to do a mount is too dangerous. What we want to do is
intercept the mount in the host and perform it on behalf of the guest as
host root in the guest's mount namespace. If you do it that way, it
doesn't really matter what device actually shows up in the guest, as
long as the host knows what to do when the mount request comes along.

James
Andy Lutomirski
2014-05-23 16:39:04 UTC
Permalink
On Fri, May 23, 2014 at 6:16 AM, James Bottomley
Post by James Bottomley
Post by Marian Marinov
Then don't use a container to build such a thing, or fix the build scripts to not do that :)
I second this. To me it looks like some folks try to (ab)use Linux containers for purposes where KVM
would much better fit in. Please don't put more complexity into containers. They are already horrible
complex and error prone.
I, naturally, disagree :) The only use case which is inherently not valid for containers is running a
kernel. Practically speaking there are other things which likely will never be possible, but if someone
offers a way to do something in containers, "you can't do that in containers" is not an apropos response.
"That abstraction is wrong" is certainly valid, as when vpids were originally proposed and rejected,
resulting in the development of pid namespaces. "We have to work out (x) first" can be valid (and I can
think of examples here), assuming it's not just trying to hide behind a catch-22/chicken-egg problem.
Finally, saying "containers are complex and error prone" is conflating several large suites of userspace
code and many kernel features which support them. Being more precise would, if the argument is valid, lend
it a lot more weight.
We (my company) use Linux containers since 2011 in production. First LXC, now libvirt-lxc. To understand the
internals better I also wrote my own userspace to create/start containers. There are so many things which can
hurt you badly. With user namespaces we expose a really big attack surface to regular users. I.e. Suddenly a
user is allowed to mount filesystems.
That is currently not the case. They can mount some virtual filesystems and do bind mounts, but cannot mount
most real filesystems. This keeps us protected (for now) from potentially unsafe superblock readers in the
kernel.
Ask Andy, he found already lots of nasty things...
ISTM that Linux distributions are, in general, vulnerable to all kinds of shenanigans that would happen if an
untrusted user can cause a block device to appear. That user doesn't need permission to mount it
Interesting point. This would further suggest that we absolutely must ensure that a loop device which shows up in
the container does not also show up in the host.
Can I suggest the usage of the devices cgroup to achieve that?
Not really ... cgroups impose resource limits, it's namespaces that
impose visibility separations. In theory this can be done with the
device namespace that's been proposed; however, a simpler way is simply
to rm the device node in the host and mknod it in the guest. I don't
really see host visibility as a huge problem: in a shared OS
virtualisation it's not really possible securely to separate the guest
from the host (only vice versa).
But I really don't think we want to do it this way. Giving a container
the ability to do a mount is too dangerous. What we want to do is
intercept the mount in the host and perform it on behalf of the guest as
host root in the guest's mount namespace. If you do it that way, it
doesn't really matter what device actually shows up in the guest, as
long as the host knows what to do when the mount request comes along.
This is only useful/safe if the host understands what's going on. By
the host, I mean the host's udev and other system-level stuff. This
is probably fine for disks and such, but it might not be so great for
loop devices, FUSE, etc. I already know of one user of containers
that wants container-local FUSE mounts. This ought to Just Work (tm),
but there's a fair amount of work needed to get there.

--Andy
Serge Hallyn
2014-05-24 22:25:35 UTC
Permalink
Post by James Bottomley
Post by Marian Marinov
Then don't use a container to build such a thing, or fix the build scripts to not do that :)
I second this. To me it looks like some folks try to (ab)use Linux containers for purposes where KVM
would much better fit in. Please don't put more complexity into containers. They are already horrible
complex and error prone.
I, naturally, disagree :) The only use case which is inherently not valid for containers is running a
kernel. Practically speaking there are other things which likely will never be possible, but if someone
offers a way to do something in containers, "you can't do that in containers" is not an apropos response.
"That abstraction is wrong" is certainly valid, as when vpids were originally proposed and rejected,
resulting in the development of pid namespaces. "We have to work out (x) first" can be valid (and I can
think of examples here), assuming it's not just trying to hide behind a catch-22/chicken-egg problem.
Finally, saying "containers are complex and error prone" is conflating several large suites of userspace
code and many kernel features which support them. Being more precise would, if the argument is valid, lend
it a lot more weight.
We (my company) use Linux containers since 2011 in production. First LXC, now libvirt-lxc. To understand the
internals better I also wrote my own userspace to create/start containers. There are so many things which can
hurt you badly. With user namespaces we expose a really big attack surface to regular users. I.e. Suddenly a
user is allowed to mount filesystems.
That is currently not the case. They can mount some virtual filesystems and do bind mounts, but cannot mount
most real filesystems. This keeps us protected (for now) from potentially unsafe superblock readers in the
kernel.
Ask Andy, he found already lots of nasty things...
ISTM that Linux distributions are, in general, vulnerable to all kinds of shenanigans that would happen if an
untrusted user can cause a block device to appear. That user doesn't need permission to mount it
Interesting point. This would further suggest that we absolutely must ensure that a loop device which shows up in
the container does not also show up in the host.
Can I suggest the usage of the devices cgroup to achieve that?
Not really ... cgroups impose resource limits, it's namespaces that
impose visibility separations. In theory this can be done with the
device namespace that's been proposed; however, a simpler way is simply
to rm the device node in the host and mknod it in the guest. I don't
really see host visibility as a huge problem: in a shared OS
virtualisation it's not really possible securely to separate the guest
from the host (only vice versa).
But I really don't think we want to do it this way. Giving a container
the ability to do a mount is too dangerous. What we want to do is
intercept the mount in the host and perform it on behalf of the guest as
host root in the guest's mount namespace. If you do it that way, it
That doesn't help the problem of guests being able to provide bad input
for (basically fuzz) the in-kernel filesystem code. So apparently I'm
suffering a failure of the imagination - what problem exactly does it solve?
Post by James Bottomley
doesn't really matter what device actually shows up in the guest, as
long as the host knows what to do when the mount request comes along.
James
James Bottomley
2014-05-25 08:12:10 UTC
Permalink
Post by Serge Hallyn
Post by James Bottomley
Post by Marian Marinov
Then don't use a container to build such a thing, or fix the build scripts to not do that :)
I second this. To me it looks like some folks try to (ab)use Linux containers for purposes where KVM
would much better fit in. Please don't put more complexity into containers. They are already horrible
complex and error prone.
I, naturally, disagree :) The only use case which is inherently not valid for containers is running a
kernel. Practically speaking there are other things which likely will never be possible, but if someone
offers a way to do something in containers, "you can't do that in containers" is not an apropos response.
"That abstraction is wrong" is certainly valid, as when vpids were originally proposed and rejected,
resulting in the development of pid namespaces. "We have to work out (x) first" can be valid (and I can
think of examples here), assuming it's not just trying to hide behind a catch-22/chicken-egg problem.
Finally, saying "containers are complex and error prone" is conflating several large suites of userspace
code and many kernel features which support them. Being more precise would, if the argument is valid, lend
it a lot more weight.
We (my company) use Linux containers since 2011 in production. First LXC, now libvirt-lxc. To understand the
internals better I also wrote my own userspace to create/start containers. There are so many things which can
hurt you badly. With user namespaces we expose a really big attack surface to regular users. I.e. Suddenly a
user is allowed to mount filesystems.
That is currently not the case. They can mount some virtual filesystems and do bind mounts, but cannot mount
most real filesystems. This keeps us protected (for now) from potentially unsafe superblock readers in the
kernel.
Ask Andy, he found already lots of nasty things...
ISTM that Linux distributions are, in general, vulnerable to all kinds of shenanigans that would happen if an
untrusted user can cause a block device to appear. That user doesn't need permission to mount it
Interesting point. This would further suggest that we absolutely must ensure that a loop device which shows up in
the container does not also show up in the host.
Can I suggest the usage of the devices cgroup to achieve that?
Not really ... cgroups impose resource limits, it's namespaces that
impose visibility separations. In theory this can be done with the
device namespace that's been proposed; however, a simpler way is simply
to rm the device node in the host and mknod it in the guest. I don't
really see host visibility as a huge problem: in a shared OS
virtualisation it's not really possible securely to separate the guest
from the host (only vice versa).
But I really don't think we want to do it this way. Giving a container
the ability to do a mount is too dangerous. What we want to do is
intercept the mount in the host and perform it on behalf of the guest as
host root in the guest's mount namespace. If you do it that way, it
That doesn't help the problem of guests being able to provide bad input
for (basically fuzz) the in-kernel filesystem code. So apparently I'm
suffering a failure of the imagination - what problem exactly does it solve?
Well, there's two types of fuzzing, one is on sys_mount, which this
would help with because the host filters the mount including all
parameters and may even redo the mount (from direct to bind etc).

If you're thinking the system can be compromised by fuzzing within the
filesystem, then yes, I agree, but it's the same vulnerability an
unvirtualised host would have, so I don't necessarily see it as our
problem.

The problem a vectored mount solves is that we don't want root in the
container to have unfettered access to sys_mount: vectoring allows the
host to vet all calls and execute the ones it likes in the context of
real root (possibly after modifying the parameters).
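
To make the vetting step concrete, here is a minimal sketch of the kind of filter the host side could apply. Everything in it is an assumption made for illustration: the MountRequest type, the vet_mount function and the policy (an allow-list of filesystem types, forced nosuid/nodev) are hypothetical and do not come from any patch in this thread. Only the two flag values are the real mount(2) constants.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Hypothetical policy: filesystem types the host will mount on a guest's behalf.
ALLOWED_FSTYPES = {"tmpfs", "proc", "sysfs"}

# Numeric values of the real mount(2) flags MS_NOSUID and MS_NODEV.
MS_NOSUID = 0x2
MS_NODEV = 0x4


@dataclass(frozen=True)
class MountRequest:
    """A guest's mount request as seen by the host (illustrative type)."""
    source: str
    target: str  # path as seen inside the guest
    fstype: str
    flags: int
    data: str = ""


def vet_mount(req: MountRequest) -> Optional[MountRequest]:
    """Return the request the host would execute as real root, or None to refuse."""
    if req.fstype not in ALLOWED_FSTYPES:
        return None
    # String-level check only; a real helper would resolve the target inside
    # the guest's mount namespace instead of trusting the path text.
    if not req.target.startswith("/") or ".." in req.target.split("/"):
        return None
    # "Possibly after modifying the parameters": force restrictive flags.
    return replace(req, flags=req.flags | MS_NOSUID | MS_NODEV)
```

With this policy a tmpfs request comes back with nosuid and nodev added, while an ext4 request on a block device is refused outright. It does nothing about hostile data on a device the guest can write to, which is the separate concern raised elsewhere in the thread.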

James
Serge E. Hallyn
2014-05-25 22:24:43 UTC
Permalink
Post by James Bottomley
Well, there's two types of fuzzing, one is on sys_mount, which this
would help with because the host filters the mount including all
parameters and may even redo the mount (from direct to bind etc).
Sorry - I'm not *trying* to be dense, but am still not seeing it.

Let's assume that we continue to be strict about what a container may
mount - let's say they can only mount using loopdev from blockdev images.
They have to own the file, as well as the mount target. Whatever they
do with sys_mount, the only danger I see is the one where the filesystem
data is bad and causes a DoS or privilege escalation in some bad fs
reading code in the kernel.

What else is there? Are you thinking of the sys_mount flags? I guess
the void *data? (Though I see that as the same problem; we're just
not trusting the fs code to deal with badly formed data)
James Bottomley
2014-05-28 07:02:59 UTC
Permalink
Post by Serge E. Hallyn
Sorry - I'm not *trying* to be dense, but am still not seeing it.
Let's assume that we continue to be strict about what a container may
mount - let's say they can only mount using loopdev from blockdev images.
They have to own the file, as well as the mount target. Whatever they
do with sys_mount, the only danger I see is the one where the filesystem
data is bad and causes a DoS or privilege escalation in some bad fs
reading code in the kernel.
What else is there? Are you thinking of the sys_mount flags? I guess
the void *data? (Though I see that as the same problem; we're just
not trusting the fs code to deal with badly formed data)
OK, so the problem you're worrying about is allowing the user to modify
a block device and then mount it? In that case, I agree, it doesn't
matter who does the mount, because a hostile user is looking to exploit
bad data on the device. By and large, filesystems are tolerant of this
type of fuzzing, but the strict solution is not to allow a container to
mount any block devices it has direct access to.

James
Serge Hallyn
2014-05-28 13:49:05 UTC
Permalink
Post by James Bottomley
OK, so the problem you're worrying about is allowing the user to modify
a block device and then mount it?
That's half of the problem I'm worrying about.

The other half is what Andy mentioned earlier - having a container modify
a loop device and trick the host into mounting it (e.g. by setting its UUID
to match the host's HOME)
Michael J Coss
2014-05-23 15:55:26 UTC
Permalink
Post by Marian Marinov
Can I suggest the usage of the devices cgroup to achieve that?
Marian
We make use of devices cgroup as part of our overall solution. Given
that systemd has some embedded policy for the start of udev in a
container, we needed to enable CAP_MKNOD within the container to get
systemd to launch udev. To constrain what can and cannot be done, we
add a deny-all rule and then enumerate the allowed device accesses (rwm)
within the devices cgroup for the container. It doesn't help with the
visibility issue, but it does provide the needed resource constraints.
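
For readers unfamiliar with the devices cgroup, the deny-all-then-enumerate pattern can be sketched as the sequence of writes made to the cgroup v1 control files. The helper name whitelist_writes and the device list in the usage note are assumptions chosen for illustration, not the actual configuration described above. The file names devices.deny and devices.allow and the "type major:minor access" entry format are those of the cgroup v1 devices controller.

```python
from typing import Iterable, List, Tuple

# (type, major, minor, access), e.g. ("c", 1, 3, "rwm") for /dev/null
DeviceRule = Tuple[str, int, int, str]


def whitelist_writes(allowed: Iterable[DeviceRule]) -> List[Tuple[str, str]]:
    """Build the (control file, entry) writes for a deny-all-then-allow policy."""
    # Writing "a" to devices.deny removes access to every device.
    writes = [("devices.deny", "a")]
    for dev_type, major, minor, access in allowed:
        if dev_type not in ("c", "b"):
            raise ValueError(f"unknown device type: {dev_type!r}")
        if not access or set(access) - set("rwm"):
            raise ValueError(f"bad access string: {access!r}")
        # Each allowed device is then re-enabled individually.
        writes.append(("devices.allow", f"{dev_type} {major}:{minor} {access}"))
    return writes
```

Calling whitelist_writes([("c", 1, 3, "rwm"), ("c", 1, 5, "rwm")]) yields the deny-all entry followed by allow entries for /dev/null and /dev/zero. Each pair would then be written to the corresponding file in the container's devices cgroup directory.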
--
---Michael J Coss
Seth Forshee
2014-05-15 03:15:27 UTC
Permalink
Post by Michael H. Warfield
Post by Greg Kroah-Hartman
Post by Seth Forshee
Using devtmpfs is one possible
solution, and it would have the added benefit of making container setup
simpler. But simply letting containers mount devtmpfs isn't sufficient
since the container may need to see a different, more limited set of
devices, and because different environments making modifications to
the filesystem could lead to conflicts.
This series solves these problems by assigning devices to user
namespaces. Each device has an "owner" namespace which specifies which
devtmpfs mount the device should appear in as well as allowing privileged
operations on the device from that namespace. This defaults to
init_user_ns. There's also an ns_global flag to indicate a device should
appear in all devtmpfs mounts.
I'd strongly argue that this isn't even a "problem" at all. And, as I
said at the Plumbers conference last year, adding namespaces to devices
isn't going to happen, sorry. Please don't continue down this path.
I was just mentioning that to Serge just a week or so ago reminding him
of what you told all of us face to face back then. We were having a
discussion over loop devices into containers and this topic came up.
It was the loop device use case that got me started down this path in
the first place, so I don't personally have any interest in physical
devices right now (though I was sure others would).

As things stand today, to support loop devices lxc would need to do
something like this: grab some unused loop devices, remove them from
/dev, and make device nodes with appropriate ownership/permissions in
the container's /dev. Otherwise there's potential for accidental
duplicate use of the devices, which besides having unexpected results
could result in an information leak into the container. At that point you
have some loop devices that the container can use, but privileged
operations such as re-reading partitions and encrypted loop aren't
possible. Even if you could re-read partitions, the device nodes would
appear in the main /dev and not in the container's.
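
As a rough sketch of the node-creation half of that workaround (not lxc's actual code), the helper below computes which block device nodes would be created in a container's /dev for a set of reserved loop indices. The function names are hypothetical. It also assumes the loop driver's default max_part=0, so that the minor number equals the loop index. Removing the nodes from the host's /dev is left out.

```python
import os
import stat
from typing import Iterable, List, Tuple

LOOP_MAJOR = 7  # block major number of the Linux loop driver


def loop_node_plan(container_dev: str, indices: Iterable[int],
                   mode: int = 0o660) -> List[Tuple[str, int, int]]:
    """Return (path, mode, device number) for each node to create."""
    plan = []
    for i in indices:
        path = os.path.join(container_dev, f"loop{i}")
        # Assumes max_part=0, i.e. minor number == loop index.
        plan.append((path, stat.S_IFBLK | mode, os.makedev(LOOP_MAJOR, i)))
    return plan


def apply_plan(plan: List[Tuple[str, int, int]], uid: int, gid: int) -> None:
    """Create the nodes; needs CAP_MKNOD, so this runs on the host side."""
    for path, mode, dev in plan:
        os.mknod(path, mode, dev)
        os.chown(path, uid, gid)  # hand the node to the container's root
```

Nothing here helps with the limitations listed above: partition re-reads still fail, and any partition nodes would still be created in the host's /dev.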

With these patches the container could mount devtmpfs, and since
loop-control is global it would appear in the mount. The
LOOP_CTL_GET_FREE ioctl can be used to get an unused loop device which
will be owned by the container's user namespace, so it will only appear in
that container's devtmpfs mount. Privileged operations would be allowed
on the loop device by root in the namespace, and if partition devices
were created they would inherit the namespace from the parent and thus
show up in the container's devtmpfs mount.
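
From userspace, requesting a free loop device looks roughly like the sketch below. The helper name get_free_loop is hypothetical. LOOP_CTL_GET_FREE and its value come from <linux/loop.h>, and the ioctl returns the index of a free device. Assigning that device to the caller's user namespace is the behaviour proposed by these patches, not something the ioctl does otherwise.

```python
import fcntl
import os
from typing import Optional

LOOP_CTL_GET_FREE = 0x4C82  # from <linux/loop.h>


def get_free_loop(control_path: str = "/dev/loop-control") -> Optional[str]:
    """Ask loop-control for a free device; return its path, or None on failure."""
    try:
        fd = os.open(control_path, os.O_RDWR)
    except OSError:
        return None  # no loop-control node visible, or no permission
    try:
        # The ioctl's integer return value is the index of a free loop device.
        index = fcntl.ioctl(fd, LOOP_CTL_GET_FREE)
    except OSError:
        return None
    finally:
        os.close(fd)
    return f"/dev/loop{index}"
```

Under the proposed scheme, the returned node would show up only in the calling container's devtmpfs mount.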

I think this use case demonstrates some real problems that currently have
only half-way solutions. I'm certainly open to other suggestions about how
to solve them.

Thanks,
Seth