[RFC] Block IO Controller V1

by Vivek Goyal

Hi All,

This is V1 of the Block IO controller patches on top of 2.6.32-rc5.

A consolidated patch can be found here:

http://people.redhat.com/vgoyal/io-controller/blkio-controller/blkio-controller-v1.patch

After the discussions at the IO minisummit in Tokyo, Japan, it was agreed that
a single IO control policy, whether at leaf nodes or at higher level nodes,
does not meet all the requirements. We need the capability to support more
than one IO control policy (like proportional weight division and max
bandwidth control) and also the capability to implement some of these
policies at higher level logical devices.

It was agreed that CFQ is the right place to implement time based proportional
weight division policy. Other policies like max bandwidth control/throttling
will make more sense at higher level logical devices.

This patch introduces the blkio cgroup controller. It provides the management
interface for block IO control. The idea is to keep the interface common and
be able to switch policies in the background based on user options. Hence the
user can control IO throughout the IO stack with a single cgroup interface.

Apart from blkio cgroup interface, this patchset also modifies CFQ to implement
time based proportional weight division of disk. CFQ already does it in flat
mode. It has been modified to do group IO scheduling also.
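
As a rough illustration of what proportional weight division means, the
sketch below computes the disk time share a group would ideally receive from
its weight. It is not code from the patchset: the function name is made up,
and it assumes every group is continuously backlogged.

```c
#include <stddef.h>

/*
 * Illustrative only (hypothetical helper, not part of the patchset):
 * ideal share of disk time for group `idx`, in parts per thousand,
 * assuming all listed groups always have IO pending.
 */
static unsigned int expected_share_permille(const unsigned int *weights,
					    size_t n, size_t idx)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += weights[i];

	if (total == 0 || idx >= n)
		return 0;

	/* Integer division rounds down. */
	return (unsigned int)((1000UL * weights[idx]) / total);
}
```

With the weights 1000 and 500 used in the HOWTO of patch 01, this gives 666
and 333 parts per thousand, i.e. roughly a 2:1 split of disk time.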

IO control is a huge problem, and the moment we start addressing all the
issues in one patchset, it bloats to unmanageable proportions and then nothing
gets into the kernel. So at the IO minisummit we agreed to take small steps:
once a piece of code is inside the kernel and stabilized, take the next step.
This is the first step.

Some parts of the code are based on BFQ patches posted by Paolo and Fabio.

Your feedback is welcome.

TODO
====
- Support async IO control (buffered writes).

 Buffered writes are a beast: solving the problem requires changes in many
 places and the patchset becomes huge. Hence we plan to support control of
 sync IO only first and then work on async IO.

 Some of the work items identified are.

        - Per memory cgroup dirty ratio
        - Possibly modification of writeback to force writeback from a
          particular cgroup.
        - Implement IO tracking support so that a bio can be mapped to a cgroup.
        - Per group request descriptor infrastructure in block layer.
        - At CFQ level, implement per cfq_group async queues.

  In this patchset, all async IO goes into system wide queues and there are
  no per group async queues. That means we will see service differentiation
  for sync IO only. Async IO will be handled later.

- Support for higher level policies like max BW controller.

Thanks
Vivek

 Documentation/cgroups/blkio-controller.txt |  106 +++
 block/Kconfig                              |   22 +
 block/Kconfig.iosched                      |   17 +
 block/Makefile                             |    1 +
 block/blk-cgroup.c                         |  343 ++++++++
 block/blk-cgroup.h                         |   67 ++
 block/cfq-iosched.c                        | 1187 ++++++++++++++++++++++-----
 include/linux/cgroup_subsys.h              |    6 +
 include/linux/iocontext.h                  |    4 +
 9 files changed, 1533 insertions(+), 220 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 01/20] blkio: Documentation

by Vivek Goyal

Signed-off-by: Vivek Goyal <vgoyal@...>
---
 Documentation/cgroups/blkio-controller.txt |  106 ++++++++++++++++++++++++++++
 1 files changed, 106 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/cgroups/blkio-controller.txt

diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
new file mode 100644
index 0000000..dc8fb1a
--- /dev/null
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -0,0 +1,106 @@
+ Block IO Controller
+ ===================
+Overview
+========
+cgroup subsys "blkio" implements the block io controller. There seems to be
+a need for various kinds of IO control policies (like proportional BW, max BW)
+both at leaf nodes and at intermediate nodes in the storage hierarchy. The plan
+is to use the same cgroup based management interface for the blkio controller
+and, based on user options, switch IO policies in the background.
+
+In the first phase, this patchset implements proportional weight time based
+division of disk policy. It is implemented in CFQ. Hence this policy takes
+effect only on leaf nodes when CFQ is being used.
+
+HOWTO
+=====
+You can do very simple testing by running two dd threads in two different
+cgroups. Here is what you can do.
+
+- Enable group scheduling in CFQ
+ CONFIG_CFQ_GROUP_IOSCHED=y
+
+- Compile and boot into kernel and mount IO controller (blkio).
+
+ mount -t cgroup -o blkio none /cgroup
+
+- Create two cgroups
+ mkdir -p /cgroup/test1/ /cgroup/test2
+
+- Set weights of group test1 and test2
+ echo 1000 > /cgroup/test1/blkio.weight
+ echo 500 > /cgroup/test2/blkio.weight
+
+- Create two same-size files (say 512MB each) on the same disk (zerofile1,
+  zerofile2) and launch two dd threads in different cgroups to read them.
+
+ sync
+ echo 3 > /proc/sys/vm/drop_caches
+
+ dd if=/mnt/sdb/zerofile1 of=/dev/null &
+ echo $! > /cgroup/test1/tasks
+ cat /cgroup/test1/tasks
+
+ dd if=/mnt/sdb/zerofile2 of=/dev/null &
+ echo $! > /cgroup/test2/tasks
+ cat /cgroup/test2/tasks
+
+- At macro level, first dd should finish first. To get more precise data, keep
+  looking (with the help of a script) at the blkio.time and blkio.sectors
+  files of both test1 and test2 groups. This will tell how much disk time (in
+  milliseconds) each group got and how many sectors each group dispatched to
+  the disk. We provide fairness in terms of disk time, so ideally blkio.time
+  of the cgroups should be in proportion to the weight.
+
+Various user visible config options
+===================================
+CONFIG_CFQ_GROUP_IOSCHED
+ - Enables group scheduling in CFQ. Currently only 1 level of group
+  creation is allowed.
+
+CONFIG_DEBUG_CFQ_IOSCHED
+ - Enables some debugging messages in blktrace. Also creates extra
+  cgroup file blkio.dequeue.
+
+Config options selected automatically
+=====================================
+These config options are not user visible and are selected/deselected
+automatically based on IO scheduler configuration.
+
+CONFIG_BLK_CGROUP
+ - Block IO controller. Selected by CONFIG_CFQ_GROUP_IOSCHED.
+
+CONFIG_DEBUG_BLK_CGROUP
+ - Debug help. Selected by CONFIG_DEBUG_CFQ_IOSCHED.
+
+Details of cgroup files
+=======================
+- blkio.ioprio_class
+ - Specifies class of the cgroup (RT, BE, IDLE). This is default io
+  class of the group on all the devices.
+
+  1 = RT; 2 = BE, 3 = IDLE
+
+- blkio.weight
+ - Specifies per cgroup weight.
+
+  Currently allowed range of weights is from 100 to 1000.
+
+- blkio.time
+ - disk time allocated to cgroup per device in milliseconds. First
+  two fields specify the major and minor number of the device and
+  third field specifies the disk time allocated to group in
+  milliseconds.
+
+- blkio.sectors
+ - number of sectors transferred to/from disk by the group. First
+  two fields specify the major and minor number of the device and
+  third field specifies the number of sectors transferred by the
+  group to/from the device.
+
+- blkio.dequeue
+ - Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y. This
+  gives the statistics about how many times a group was dequeued
+  from service tree of the device. First two fields specify the major
+  and minor number of the device and third field specifies the number
+  of times a group was dequeued from a particular device.
--
1.6.2.5
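
The documentation above describes the per-device statistics files
(blkio.time, blkio.sectors, blkio.dequeue) as three fields: major, minor and
a value. A script watching those files could parse each line along the lines
of the sketch below. This is only an illustration: the struct and function
names are invented here, and it assumes the three fields are separated by
whitespace, which the text does not spell out.

```c
#include <stdio.h>

/* Hypothetical container for one line of a per-device blkio file. */
struct blkio_stat {
	unsigned int major;
	unsigned int minor;
	unsigned long long value;	/* ms, sectors or dequeue count */
};

/*
 * Parse one "major minor value" line.
 * ASSUMPTION: fields are whitespace separated.
 * Returns 0 on success and -1 if the line does not match.
 */
static int parse_blkio_stat(const char *line, struct blkio_stat *out)
{
	if (sscanf(line, "%u %u %llu",
		   &out->major, &out->minor, &out->value) != 3)
		return -1;
	return 0;
}
```

Comparing the parsed blkio.time values of test1 and test2 against the ratio
of their weights is then a matter of simple arithmetic.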

[PATCH 02/20] blkio: Change CFQ to use CFS like queue time stamps

by Vivek Goyal

o Currently CFQ provides priority scaled time slices to processes. If a process
  does not use its time slice, either because the process did not have
  sufficient IO to do or because the think time of the process is large and CFQ
  decided to disable idling, then the process loses its time slice share.

o This works well in flat setup where fair share of a process can be achieved
  in one go (by scaled time slices), and CFQ does not have to time stamp the
  queue. But once IO groups are introduced, it does not work very well.
  Consider following case.

                        root
                        / \
                      G1  G2
                      |    |
                     T1    T2

  Here G1 and G2 are two groups of weight 100 each and T1 and T2 are two
  tasks of prio 0 and 4 respectively. Both groups should get 50% of disk time.
  CFQ assigns a slice length of 180ms to T1 (prio 0) and a slice length of
  100ms to T2 (prio 4), so plain round robin of scaled slices does not work at
  group level.
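
  To put a number on the example above, the small sketch below computes the
  share one of two entities gets when they simply alternate fixed slices. The
  helper name is made up for illustration and it assumes both tasks always
  have IO pending.

```c
/*
 * Illustrative only (hypothetical helper): percentage of disk time,
 * rounded down, that the first of two always-busy entities receives
 * when they alternate slices of the given lengths.
 */
static unsigned int rr_share_percent(unsigned int slice_a_ms,
				     unsigned int slice_b_ms)
{
	unsigned int total = slice_a_ms + slice_b_ms;

	if (total == 0)
		return 0;

	return (100U * slice_a_ms) / total;
}
```

  With 180ms and 100ms, G1 ends up with about 64% of the disk time and G2
  with about 35%, instead of the 50% each that equal group weights call for.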

o One possible way to handle this is to implement CFS like time stamping of
  the cfq queues and keep track of vtime. The next queue for execution is
  selected based on which one has the lowest vtime. This patch implements a
  time stamping mechanism for cfq queues based on disk time used.
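
  The effect of always picking the lowest vtime can be seen in a small
  userspace model. This is a simplified sketch, not the patch itself: the
  struct and function names are invented, each entity is charged exactly the
  slice it consumed, and the priority adjustment that the patch applies in
  cfq_delta_fair() is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a schedulable entity (a queue or a group). */
struct sim_entity {
	uint64_t vdisktime;	/* disk time charged so far, in ms */
	unsigned int slice_ms;	/* time consumed each time it is selected */
};

/* Index of the entity with the lowest vdisktime; ties go to the lowest index. */
static size_t pick_lowest(const struct sim_entity *e, size_t n)
{
	size_t best = 0, i;

	for (i = 1; i < n; i++)
		if ((int64_t)(e[i].vdisktime - e[best].vdisktime) < 0)
			best = i;

	return best;
}

/* Select the lowest-vdisktime entity and charge it, `rounds` times. */
static void simulate(struct sim_entity *e, size_t n, unsigned int rounds)
{
	while (rounds--) {
		size_t i = pick_lowest(e, n);

		e[i].vdisktime += e[i].slice_ms;
	}
}
```

  Starting two entities at zero with slices of 180ms and 100ms, 14 selections
  leave both at 900ms: the first is picked 5 times and the second 9 times, so
  the disk time evens out even though the slice lengths differ.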

o min_vdisktime represents the minimum vdisktime of the queue that is either
  being serviced or is the leftmost element on the service tree.
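
  The bookkeeping for min_vdisktime can be modelled in userspace as below.
  The sketch follows the shape of the max_vdisktime(), min_vdisktime() and
  update_min_vdisktime() helpers in this patch, but the names and the
  flag-based signature are made up for illustration. Comparisons go through
  a signed difference so they stay correct if the 64-bit counter wraps.

```c
#include <stdint.h>

/* Later of two timestamps, correct across u64 wraparound. */
static uint64_t vt_max(uint64_t a, uint64_t b)
{
	return (int64_t)(b - a) > 0 ? b : a;
}

/* Earlier of two timestamps, correct across u64 wraparound. */
static uint64_t vt_min(uint64_t a, uint64_t b)
{
	return (int64_t)(b - a) < 0 ? b : a;
}

/*
 * New tree minimum: start from the active queue's vdisktime (if any),
 * take the smaller of that and the leftmost queue's vdisktime (if any),
 * and never let the minimum move backwards.
 * has_active/has_left are hypothetical stand-ins for the NULL checks
 * on st->active and st->left.
 */
static uint64_t next_min_vdisktime(uint64_t cur_min,
				   int has_active, uint64_t active_vt,
				   int has_left, uint64_t left_vt)
{
	uint64_t vt = cur_min;

	if (has_active)
		vt = active_vt;
	if (has_left)
		vt = vt_min(vt, left_vt);

	return vt_max(cur_min, vt);
}
```

  For example, with a current minimum of 100, an active queue at 150 and a
  leftmost queue at 120, the new minimum is 120. If the only candidate is at
  90, the minimum stays at 100.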

o Previously CFQ had one service tree where queues of all three prio classes
  were queued. One side effect of this time stamping approach is that the
  single tree approach might no longer work and we need to keep separate
  service trees for the three prio classes.
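
  A minimal model of the per-class trees is sketched below. It mirrors the
  index calculation in init_cfqq_service_tree() and the scan order in
  cfq_get_next_queue(), but the function names and the use of a plain
  counter array in place of real rb-trees are assumptions made for the
  illustration.

```c
#define IO_IOPRIO_CLASSES 3

/* Class numbering used by the series: 1 = RT, 2 = BE, 3 = IDLE. */
enum { CLASS_RT = 1, CLASS_BE = 2, CLASS_IDLE = 3 };

/* Index of the service tree for a class, or -1 if the class is out of range. */
static int class_to_tree(int ioprio_class)
{
	int idx = ioprio_class - 1;

	return (idx < 0 || idx >= IO_IOPRIO_CLASSES) ? -1 : idx;
}

/*
 * Index of the first non-empty tree in RT, BE, IDLE order, or -1.
 * nr_queued[i] is a hypothetical stand-in for "tree i has a leftmost
 * queue".
 */
static int first_busy_tree(const unsigned int nr_queued[IO_IOPRIO_CLASSES])
{
	int i;

	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
		if (nr_queued[i])
			return i;

	return -1;
}
```

  Because the trees are scanned in order, a BE queue is only selected when
  the RT tree is empty, and an IDLE queue only when both of the others are.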

o Some parts of code of this patch are taken from CFS and BFQ.

Signed-off-by: Vivek Goyal <vgoyal@...>
---
 block/cfq-iosched.c |  480 +++++++++++++++++++++++++++++++++++----------------
 1 files changed, 335 insertions(+), 145 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 069a610..58ac8b7 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -28,6 +28,8 @@ static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
 static int cfq_slice_idle = HZ / 125;
 
+#define IO_IOPRIO_CLASSES 3
+
 /*
  * offset from end of service tree
  */
@@ -64,11 +66,17 @@ static DEFINE_SPINLOCK(ioc_gone_lock);
  * to find it. Idea borrowed from Ingo Molnars CFS scheduler. We should
  * move this into the elevator for the rq sorting as well.
  */
-struct cfq_rb_root {
+struct cfq_service_tree {
  struct rb_root rb;
  struct rb_node *left;
+ u64 min_vdisktime;
+ struct cfq_queue *active;
+};
+#define CFQ_RB_ROOT (struct cfq_service_tree) { RB_ROOT, NULL, 0, NULL}
+
+struct cfq_sched_data {
+ struct cfq_service_tree service_tree[IO_IOPRIO_CLASSES];
 };
-#define CFQ_RB_ROOT (struct cfq_rb_root) { RB_ROOT, NULL, }
 
 /*
  * Per process-grouping structure
@@ -83,7 +91,9 @@ struct cfq_queue {
  /* service_tree member */
  struct rb_node rb_node;
  /* service_tree key */
- unsigned long rb_key;
+ u64 vdisktime;
+ /* service tree we belong to */
+ struct cfq_service_tree *st;
  /* prio tree member */
  struct rb_node p_node;
  /* prio tree root we belong to, if any */
@@ -99,8 +109,9 @@ struct cfq_queue {
  /* fifo list of requests in sort_list */
  struct list_head fifo;
 
+ /* time when first request from queue completed and slice started. */
+ unsigned long slice_start;
  unsigned long slice_end;
- long slice_resid;
  unsigned int slice_dispatch;
 
  /* pending metadata requests */
@@ -111,6 +122,7 @@ struct cfq_queue {
  /* io prio of this group */
  unsigned short ioprio, org_ioprio;
  unsigned short ioprio_class, org_ioprio_class;
+ bool ioprio_class_changed;
 
  pid_t pid;
 };
@@ -124,7 +136,7 @@ struct cfq_data {
  /*
  * rr list of queues with requests and the count of them
  */
- struct cfq_rb_root service_tree;
+ struct cfq_sched_data sched_data;
 
  /*
  * Each priority tree is sorted by next_request position.  These
@@ -234,6 +246,67 @@ static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool,
        struct io_context *, gfp_t);
 static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
  struct io_context *);
+static void cfq_put_queue(struct cfq_queue *cfqq);
+static struct cfq_queue *__cfq_get_next_queue(struct cfq_service_tree *st);
+
+static inline void
+init_cfqq_service_tree(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+{
+ unsigned short idx = cfqq->ioprio_class - 1;
+
+ BUG_ON(idx >= IO_IOPRIO_CLASSES);
+
+ cfqq->st = &cfqd->sched_data.service_tree[idx];
+}
+
+static inline s64
+cfqq_key(struct cfq_service_tree *st, struct cfq_queue *cfqq)
+{
+ return cfqq->vdisktime - st->min_vdisktime;
+}
+
+static inline u64
+cfq_delta_fair(unsigned long delta, struct cfq_queue *cfqq)
+{
+ const int base_slice = cfqq->cfqd->cfq_slice[cfq_cfqq_sync(cfqq)];
+
+ return delta + (base_slice/CFQ_SLICE_SCALE * (cfqq->ioprio - 4));
+}
+
+static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
+{
+ s64 delta = (s64)(vdisktime - min_vdisktime);
+ if (delta > 0)
+ min_vdisktime = vdisktime;
+
+ return min_vdisktime;
+}
+
+static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
+{
+ s64 delta = (s64)(vdisktime - min_vdisktime);
+ if (delta < 0)
+ min_vdisktime = vdisktime;
+
+ return min_vdisktime;
+}
+
+static void update_min_vdisktime(struct cfq_service_tree *st)
+{
+ u64 vdisktime = st->min_vdisktime;
+
+ if (st->active)
+ vdisktime = st->active->vdisktime;
+
+ if (st->left) {
+ struct cfq_queue *cfqq = rb_entry(st->left, struct cfq_queue,
+ rb_node);
+
+ vdisktime = min_vdisktime(vdisktime, cfqq->vdisktime);
+ }
+
+ st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+}
 
 static inline int rq_in_driver(struct cfq_data *cfqd)
 {
@@ -277,7 +350,7 @@ static int cfq_queue_empty(struct request_queue *q)
 {
  struct cfq_data *cfqd = q->elevator->elevator_data;
 
- return !cfqd->busy_queues;
+ return !cfqd->rq_queued;
 }
 
 /*
@@ -304,6 +377,7 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 static inline void
 cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
+ cfqq->slice_start = jiffies;
  cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
  cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
 }
@@ -419,33 +493,6 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
 }
 
 /*
- * The below is leftmost cache rbtree addon
- */
-static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
-{
- if (!root->left)
- root->left = rb_first(&root->rb);
-
- if (root->left)
- return rb_entry(root->left, struct cfq_queue, rb_node);
-
- return NULL;
-}
-
-static void rb_erase_init(struct rb_node *n, struct rb_root *root)
-{
- rb_erase(n, root);
- RB_CLEAR_NODE(n);
-}
-
-static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
-{
- if (root->left == n)
- root->left = NULL;
- rb_erase_init(n, &root->rb);
-}
-
-/*
  * would be nice to take fifo expire time into account as well
  */
 static struct request *
@@ -472,102 +519,192 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
  return cfq_choose_req(cfqd, next, prev);
 }
 
-static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
-      struct cfq_queue *cfqq)
-{
- /*
- * just an approximation, should be ok.
- */
- return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
-       cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
-}
-
-/*
- * The cfqd->service_tree holds all pending cfq_queue's that have
- * requests waiting to be processed. It is sorted in the order that
- * we will service the queues.
- */
-static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
- bool add_front)
+static void
+place_cfqq(struct cfq_service_tree *st, struct cfq_queue *cfqq, int add_front)
 {
- struct rb_node **p, *parent;
+ u64 vdisktime = st->min_vdisktime;
+ struct rb_node *parent;
  struct cfq_queue *__cfqq;
- unsigned long rb_key;
- int left;
 
  if (cfq_class_idle(cfqq)) {
- rb_key = CFQ_IDLE_DELAY;
- parent = rb_last(&cfqd->service_tree.rb);
+ vdisktime = CFQ_IDLE_DELAY;
+ parent = rb_last(&st->rb);
  if (parent && parent != &cfqq->rb_node) {
  __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
- rb_key += __cfqq->rb_key;
+ vdisktime += __cfqq->vdisktime;
  } else
- rb_key += jiffies;
+ vdisktime += st->min_vdisktime;
  } else if (!add_front) {
- /*
- * Get our rb key offset. Subtract any residual slice
- * value carried from last service. A negative resid
- * count indicates slice overrun, and this should position
- * the next service time further away in the tree.
- */
- rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
- rb_key -= cfqq->slice_resid;
- cfqq->slice_resid = 0;
- } else {
- rb_key = -HZ;
- __cfqq = cfq_rb_first(&cfqd->service_tree);
- rb_key += __cfqq ? __cfqq->rb_key : jiffies;
+ parent = rb_last(&st->rb);
+ if (parent && parent != &cfqq->rb_node) {
+ __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
+ vdisktime = __cfqq->vdisktime;
+ }
  }
 
- if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
+ cfqq->vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+}
+
+static inline void cfqq_update_ioprio_class(struct cfq_queue *cfqq)
+{
+ if (unlikely(cfqq->ioprio_class_changed)) {
+ struct cfq_data *cfqd = cfqq->cfqd;
+
  /*
- * same position, nothing more to do
+ * Re-initialize the service tree pointer as ioprio class
+ * change will lead to service tree change.
  */
- if (rb_key == cfqq->rb_key)
- return;
-
- cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+ init_cfqq_service_tree(cfqd, cfqq);
+ cfqq->ioprio_class_changed = 0;
+ cfqq->vdisktime = 0;
  }
+}
 
- left = 1;
- parent = NULL;
- p = &cfqd->service_tree.rb.rb_node;
- while (*p) {
- struct rb_node **n;
+static void __dequeue_cfqq(struct cfq_service_tree *st, struct cfq_queue *cfqq)
+{
+ /* Node is not on tree */
+ if (RB_EMPTY_NODE(&cfqq->rb_node))
+ return;
 
- parent = *p;
+ if (st->left == &cfqq->rb_node)
+ st->left = rb_next(&cfqq->rb_node);
+
+ rb_erase(&cfqq->rb_node, &st->rb);
+ RB_CLEAR_NODE(&cfqq->rb_node);
+}
+
+static void dequeue_cfqq(struct cfq_queue *cfqq)
+{
+ struct cfq_service_tree *st = cfqq->st;
+
+ if (st->active == cfqq)
+ st->active = NULL;
+
+ __dequeue_cfqq(st, cfqq);
+}
+
+static void __enqueue_cfqq(struct cfq_service_tree *st, struct cfq_queue *cfqq,
+ int add_front)
+{
+ struct rb_node **node = &st->rb.rb_node;
+ struct rb_node *parent = NULL;
+ struct cfq_queue *__cfqq;
+ s64 key = cfqq_key(st, cfqq);
+ int leftmost = 1;
+
+ while (*node != NULL) {
+ parent = *node;
  __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
 
- /*
- * sort RT queues first, we always want to give
- * preference to them. IDLE queues goes to the back.
- * after that, sort on the next service time.
- */
- if (cfq_class_rt(cfqq) > cfq_class_rt(__cfqq))
- n = &(*p)->rb_left;
- else if (cfq_class_rt(cfqq) < cfq_class_rt(__cfqq))
- n = &(*p)->rb_right;
- else if (cfq_class_idle(cfqq) < cfq_class_idle(__cfqq))
- n = &(*p)->rb_left;
- else if (cfq_class_idle(cfqq) > cfq_class_idle(__cfqq))
- n = &(*p)->rb_right;
- else if (time_before(rb_key, __cfqq->rb_key))
- n = &(*p)->rb_left;
- else
- n = &(*p)->rb_right;
+ if (key < cfqq_key(st, __cfqq) ||
+ (add_front && (key == cfqq_key(st, __cfqq)))) {
+ node = &parent->rb_left;
+ } else {
+ node = &parent->rb_right;
+ leftmost = 0;
+ }
+ }
+
+ /*
+ * Maintain a cache of leftmost tree entries (it is frequently
+ * used)
+ */
+ if (leftmost)
+ st->left = &cfqq->rb_node;
 
- if (n == &(*p)->rb_right)
- left = 0;
+ rb_link_node(&cfqq->rb_node, parent, node);
+ rb_insert_color(&cfqq->rb_node, &st->rb);
+}
 
- p = n;
+static void enqueue_cfqq(struct cfq_queue *cfqq)
+{
+ cfqq_update_ioprio_class(cfqq);
+ place_cfqq(cfqq->st, cfqq, 0);
+ __enqueue_cfqq(cfqq->st, cfqq, 0);
+}
+
+/* Requeue a cfqq which is already on the service tree */
+static void requeue_cfqq(struct cfq_queue *cfqq, int add_front)
+{
+ struct cfq_service_tree *st = cfqq->st;
+ struct cfq_queue *next_cfqq;
+
+ if (add_front) {
+ next_cfqq = __cfq_get_next_queue(st);
+ if (next_cfqq && next_cfqq == cfqq)
+ return;
+ }
+
+ __dequeue_cfqq(st, cfqq);
+ place_cfqq(st, cfqq, add_front);
+ __enqueue_cfqq(st, cfqq, add_front);
+}
+
+static void __cfqq_served(struct cfq_queue *cfqq, unsigned long served)
+{
+ /*
+ * Can't update entity disk time while it is on sorted rb-tree
+ * as vdisktime is used as key.
+ */
+ __dequeue_cfqq(cfqq->st, cfqq);
+ cfqq->vdisktime += cfq_delta_fair(served, cfqq);
+ update_min_vdisktime(cfqq->st);
+ __enqueue_cfqq(cfqq->st, cfqq, 0);
+}
+
+static void cfqq_served(struct cfq_queue *cfqq, unsigned long served)
+{
+ /*
+ * We don't want to charge more than allocated slice otherwise this
+ * queue can miss one dispatch round doubling max latencies. On the
+ * other hand we don't want to charge less than allocated slice as
+ * we stick to CFQ theme of queue losing its share if it does not
+ * use the slice and moves to the back of service tree (almost).
+ */
+ served = cfq_prio_to_slice(cfqq->cfqd, cfqq);
+ __cfqq_served(cfqq, served);
+
+ /* If cfqq prio class has changed, take that into account */
+ if (unlikely(cfqq->ioprio_class_changed)) {
+ dequeue_cfqq(cfqq);
+ enqueue_cfqq(cfqq);
  }
+}
+
+/*
+ * Handles three operations.
+ * Addition of a new queue to service tree, when a new request comes in.
+ * Resorting of an expiring queue (used after slice expired)
+ * Requeuing a queue at the front (used during preemption).
+ */
+static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
+ bool add_front, unsigned long service)
+{
+ if (RB_EMPTY_NODE(&cfqq->rb_node)) {
+ /* It's a new queue. Add it to service tree */
+ enqueue_cfqq(cfqq);
+ return;
+ }
+
+ if (service) {
+ /*
+ * This queue just got served. Compute the new key and requeue
+ * in the service tree
+ */
+ cfqq_served(cfqq, service);
 
- if (left)
- cfqd->service_tree.left = &cfqq->rb_node;
+ /*
+ * Requeue async ioq so that these will be again placed at the
+ * end of service tree giving a chance to sync queues.
+ * TODO: Handle this case in a better manner.
+ */
+ if (!cfq_cfqq_sync(cfqq))
+ requeue_cfqq(cfqq, 0);
+ return;
+ }
 
- cfqq->rb_key = rb_key;
- rb_link_node(&cfqq->rb_node, parent, p);
- rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
+ /* Just requeuing an existing queue, used during preemption */
+ requeue_cfqq(cfqq, add_front);
 }
 
 static struct cfq_queue *
@@ -634,13 +771,14 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 /*
  * Update cfqq's position in the service tree.
  */
-static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq,
+ unsigned long service)
 {
  /*
  * Resorting requires the cfqq to be on the RR list already.
  */
  if (cfq_cfqq_on_rr(cfqq)) {
- cfq_service_tree_add(cfqd, cfqq, 0);
+ cfq_service_tree_add(cfqd, cfqq, 0, service);
  cfq_prio_tree_add(cfqd, cfqq);
  }
 }
@@ -656,7 +794,7 @@ static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
  cfq_mark_cfqq_on_rr(cfqq);
  cfqd->busy_queues++;
 
- cfq_resort_rr_list(cfqd, cfqq);
+ cfq_resort_rr_list(cfqd, cfqq, 0);
 }
 
 /*
@@ -669,8 +807,7 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
  BUG_ON(!cfq_cfqq_on_rr(cfqq));
  cfq_clear_cfqq_on_rr(cfqq);
 
- if (!RB_EMPTY_NODE(&cfqq->rb_node))
- cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+ dequeue_cfqq(cfqq);
  if (cfqq->p_root) {
  rb_erase(&cfqq->p_node, cfqq->p_root);
  cfqq->p_root = NULL;
@@ -686,7 +823,6 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 static void cfq_del_rq_rb(struct request *rq)
 {
  struct cfq_queue *cfqq = RQ_CFQQ(rq);
- struct cfq_data *cfqd = cfqq->cfqd;
  const int sync = rq_is_sync(rq);
 
  BUG_ON(!cfqq->queued[sync]);
@@ -694,8 +830,17 @@ static void cfq_del_rq_rb(struct request *rq)
 
  elv_rb_del(&cfqq->sort_list, rq);
 
- if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
- cfq_del_cfqq_rr(cfqd, cfqq);
+ if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list)) {
+ /*
+ * Queue will be deleted from service tree when we actually
+ * expire it later. Right now just remove it from prio tree
+ * as it is empty.
+ */
+ if (cfqq->p_root) {
+ rb_erase(&cfqq->p_node, cfqq->p_root);
+ cfqq->p_root = NULL;
+ }
+ }
 }
 
 static void cfq_add_rq_rb(struct request *rq)
@@ -869,6 +1014,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
 {
  if (cfqq) {
  cfq_log_cfqq(cfqd, cfqq, "set_active");
+ cfqq->slice_start = 0;
  cfqq->slice_end = 0;
  cfqq->slice_dispatch = 0;
 
@@ -888,10 +1034,11 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
  * current cfqq expired its slice (or was too idle), select new one
  */
 static void
-__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-    bool timed_out)
+__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
- cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
+ long slice_used = 0;
+
+ cfq_log_cfqq(cfqd, cfqq, "slice expired");
 
  if (cfq_cfqq_wait_request(cfqq))
  del_timer(&cfqd->idle_slice_timer);
@@ -899,14 +1046,20 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
  cfq_clear_cfqq_wait_request(cfqq);
 
  /*
- * store what was left of this slice, if the queue idled/timed out
+ * Queue got expired before even a single request completed or
+ * got expired immediately after first request completion.
  */
- if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
- cfqq->slice_resid = cfqq->slice_end - jiffies;
- cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
- }
+ if (!cfqq->slice_end || cfqq->slice_start == jiffies)
+ slice_used = 1;
+ else
+ slice_used = jiffies - cfqq->slice_start;
 
- cfq_resort_rr_list(cfqd, cfqq);
+ cfq_log_cfqq(cfqd, cfqq, "sl_used=%ld", slice_used);
+
+ if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
+ cfq_del_cfqq_rr(cfqd, cfqq);
+
+ cfq_resort_rr_list(cfqd, cfqq, slice_used);
 
  if (cfqq == cfqd->active_queue)
  cfqd->active_queue = NULL;
@@ -917,12 +1070,22 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
  }
 }
 
-static inline void cfq_slice_expired(struct cfq_data *cfqd, bool timed_out)
+static inline void cfq_slice_expired(struct cfq_data *cfqd)
 {
  struct cfq_queue *cfqq = cfqd->active_queue;
 
  if (cfqq)
- __cfq_slice_expired(cfqd, cfqq, timed_out);
+ __cfq_slice_expired(cfqd, cfqq);
+}
+
+static struct cfq_queue *__cfq_get_next_queue(struct cfq_service_tree *st)
+{
+ struct rb_node *left = st->left;
+
+ if (!left)
+ return NULL;
+
+ return rb_entry(left, struct cfq_queue, rb_node);
 }
 
 /*
@@ -931,10 +1094,24 @@ static inline void cfq_slice_expired(struct cfq_data *cfqd, bool timed_out)
  */
 static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
 {
- if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
+ struct cfq_sched_data *sd = &cfqd->sched_data;
+ struct cfq_service_tree *st = sd->service_tree;
+ struct cfq_queue *cfqq = NULL;
+ int i;
+
+ if (!cfqd->rq_queued)
  return NULL;
 
- return cfq_rb_first(&cfqd->service_tree);
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++, st++) {
+ cfqq = __cfq_get_next_queue(st);
+ if (cfqq) {
+ st->active = cfqq;
+ update_min_vdisktime(cfqq->st);
+ break;
+ }
+ }
+
+ return cfqq;
 }
 
 /*
@@ -1181,6 +1358,9 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
  if (!cfqq)
  goto new_queue;
 
+ if (!cfqd->rq_queued)
+ return NULL;
+
  /*
  * The active queue has run out of time, expire it and select new.
  */
@@ -1216,7 +1396,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
  }
 
 expire:
- cfq_slice_expired(cfqd, 0);
+ cfq_slice_expired(cfqd);
 new_queue:
  cfqq = cfq_set_active_queue(cfqd, new_cfqq);
 keep_queue:
@@ -1233,6 +1413,10 @@ static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
  }
 
  BUG_ON(!list_empty(&cfqq->fifo));
+
+ /* By default cfqq is not expired if it is empty. Do it explicitly */
+ __cfq_slice_expired(cfqq->cfqd, cfqq);
+
  return dispatched;
 }
 
@@ -1245,10 +1429,10 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
  struct cfq_queue *cfqq;
  int dispatched = 0;
 
- while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+ while ((cfqq = cfq_get_next_queue(cfqd)) != NULL)
  dispatched += __cfq_forced_dispatch_cfqq(cfqq);
 
- cfq_slice_expired(cfqd, 0);
+ cfq_slice_expired(cfqd);
 
  BUG_ON(cfqd->busy_queues);
 
@@ -1391,7 +1575,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
     cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
     cfq_class_idle(cfqq))) {
  cfqq->slice_end = jiffies + 1;
- cfq_slice_expired(cfqd, 0);
+ cfq_slice_expired(cfqd);
  }
 
  cfq_log_cfqq(cfqd, cfqq, "dispatched a request");
@@ -1416,13 +1600,14 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
  cfq_log_cfqq(cfqd, cfqq, "put_queue");
  BUG_ON(rb_first(&cfqq->sort_list));
  BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
- BUG_ON(cfq_cfqq_on_rr(cfqq));
 
  if (unlikely(cfqd->active_queue == cfqq)) {
- __cfq_slice_expired(cfqd, cfqq, 0);
+ __cfq_slice_expired(cfqd, cfqq);
  cfq_schedule_dispatch(cfqd);
  }
 
+ BUG_ON(cfq_cfqq_on_rr(cfqq));
+
  kmem_cache_free(cfq_pool, cfqq);
 }
 
@@ -1514,7 +1699,7 @@ static void cfq_free_io_context(struct io_context *ioc)
 static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
  if (unlikely(cfqq == cfqd->active_queue)) {
- __cfq_slice_expired(cfqd, cfqq, 0);
+ __cfq_slice_expired(cfqd, cfqq);
  cfq_schedule_dispatch(cfqd);
  }
 
@@ -1634,6 +1819,8 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
  break;
  }
 
+ if (cfqq->org_ioprio_class != cfqq->ioprio_class)
+ cfqq->ioprio_class_changed = 1;
  /*
  * keep track of original prio settings in case we have to temporarily
  * elevate the priority of this queue
@@ -2079,7 +2266,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
  cfq_log_cfqq(cfqd, cfqq, "preempt");
- cfq_slice_expired(cfqd, 1);
+ cfq_slice_expired(cfqd);
 
  /*
  * Put the new queue at the front of the of the current list,
@@ -2087,7 +2274,7 @@ static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
  */
  BUG_ON(!cfq_cfqq_on_rr(cfqq));
 
- cfq_service_tree_add(cfqd, cfqq, 1);
+ cfq_service_tree_add(cfqd, cfqq, 1, 0);
 
  cfqq->slice_end = 0;
  cfq_mark_cfqq_slice_new(cfqq);
@@ -2229,7 +2416,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
  * of idling.
  */
  if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
- cfq_slice_expired(cfqd, 1);
+ cfq_slice_expired(cfqd);
  else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
  sync && !rq_noidle(rq))
  cfq_arm_slice_timer(cfqd);
@@ -2250,16 +2437,20 @@ static void cfq_prio_boost(struct cfq_queue *cfqq)
  * boost idle prio on transactions that would lock out other
  * users of the filesystem
  */
- if (cfq_class_idle(cfqq))
+ if (cfq_class_idle(cfqq)) {
  cfqq->ioprio_class = IOPRIO_CLASS_BE;
+ cfqq->ioprio_class_changed = 1;
+ }
  if (cfqq->ioprio > IOPRIO_NORM)
  cfqq->ioprio = IOPRIO_NORM;
  } else {
  /*
  * check if we need to unboost the queue
  */
- if (cfqq->ioprio_class != cfqq->org_ioprio_class)
+ if (cfqq->ioprio_class != cfqq->org_ioprio_class) {
  cfqq->ioprio_class = cfqq->org_ioprio_class;
+ cfqq->ioprio_class_changed = 1;
+ }
  if (cfqq->ioprio != cfqq->org_ioprio)
  cfqq->ioprio = cfqq->org_ioprio;
  }
@@ -2391,7 +2582,6 @@ static void cfq_idle_slice_timer(unsigned long data)
  struct cfq_data *cfqd = (struct cfq_data *) data;
  struct cfq_queue *cfqq;
  unsigned long flags;
- int timed_out = 1;
 
  cfq_log(cfqd, "idle timer fired");
 
@@ -2399,7 +2589,6 @@ static void cfq_idle_slice_timer(unsigned long data)
 
  cfqq = cfqd->active_queue;
  if (cfqq) {
- timed_out = 0;
 
  /*
  * We saw a request before the queue expired, let it through
@@ -2427,7 +2616,7 @@ static void cfq_idle_slice_timer(unsigned long data)
  goto out_kick;
  }
 expire:
- cfq_slice_expired(cfqd, timed_out);
+ cfq_slice_expired(cfqd);
 out_kick:
  cfq_schedule_dispatch(cfqd);
 out_cont:
@@ -2465,7 +2654,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
  spin_lock_irq(q->queue_lock);
 
  if (cfqd->active_queue)
- __cfq_slice_expired(cfqd, cfqd->active_queue, 0);
+ __cfq_slice_expired(cfqd, cfqd->active_queue);
 
  while (!list_empty(&cfqd->cic_list)) {
  struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
@@ -2493,7 +2682,8 @@ static void *cfq_init_queue(struct request_queue *q)
  if (!cfqd)
  return NULL;
 
- cfqd->service_tree = CFQ_RB_ROOT;
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+ cfqd->sched_data.service_tree[i] = CFQ_RB_ROOT;
 
  /*
  * Not strictly needed (since RB_ROOT just clears the node and we
--
1.6.2.5
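
The hunks above set ioprio_class_changed whenever a queue's class is boosted or restored. The consumer of that flag, cfqq_update_ioprio_class(), appears in the 04/20 diff. What follows is a minimal userspace sketch of how the two halves fit together, not the kernel code: the names prio_boost() and update_ioprio_class(), the holds_fs_excl parameter and the flat service_tree[] array are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Class values follow the kernel's IOPRIO_CLASS_* numbering. */
enum { CLASS_RT = 1, CLASS_BE = 2, CLASS_IDLE = 3, NR_CLASSES = 3 };

struct tree { uint64_t min_vdisktime; };

struct queue {
	unsigned short ioprio_class, org_ioprio_class;
	bool ioprio_class_changed;
	uint64_t vdisktime;
	struct tree *st;	/* service tree the queue currently belongs to */
};

/* Hypothetical stand-in for cfqd->sched_data.service_tree[] */
static struct tree service_tree[NR_CLASSES];

/* Boost an idle-class queue to BE, or restore the original class. */
static void prio_boost(struct queue *q, bool holds_fs_excl)
{
	if (holds_fs_excl) {
		if (q->ioprio_class == CLASS_IDLE) {
			q->ioprio_class = CLASS_BE;
			q->ioprio_class_changed = true;
		}
	} else if (q->ioprio_class != q->org_ioprio_class) {
		q->ioprio_class = q->org_ioprio_class;
		q->ioprio_class_changed = true;
	}
}

/* On the next enqueue, re-resolve the service tree if the class changed. */
static void update_ioprio_class(struct queue *q)
{
	if (q->ioprio_class_changed) {
		unsigned int idx = q->ioprio_class - 1;

		assert(idx < NR_CLASSES);
		q->st = &service_tree[idx];
		q->ioprio_class_changed = false;
		q->vdisktime = 0;	/* key is only meaningful within one tree */
	}
}
```

The flag defers the tree switch to enqueue time, so the boost path itself never has to touch the rb-tree.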

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 03/20] blkio: Introduce the notion of weights

by Vivek Goyal-2

o Introduce the notion of weights. Priorities are mapped to weights internally.
  These weights will be useful once IO groups are introduced, when a group's
  share will be decided by the group weight.

Signed-off-by: Vivek Goyal <vgoyal@...>
---
 block/cfq-iosched.c |   58 ++++++++++++++++++++++++++++++++++----------------
 1 files changed, 39 insertions(+), 19 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 58ac8b7..ca815ce 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -29,6 +29,10 @@ static const int cfq_slice_async_rq = 2;
 static int cfq_slice_idle = HZ / 125;
 
 #define IO_IOPRIO_CLASSES 3
+#define CFQ_WEIGHT_MIN 100
+#define CFQ_WEIGHT_MAX 1000
+#define CFQ_WEIGHT_DEFAULT 500
+#define CFQ_SERVICE_SHIFT       12
 
 /*
  * offset from end of service tree
@@ -40,7 +44,7 @@ static int cfq_slice_idle = HZ / 125;
  */
 #define CFQ_MIN_TT (2)
 
-#define CFQ_SLICE_SCALE (5)
+#define CFQ_SLICE_SCALE (500)
 #define CFQ_HW_QUEUE_MIN (5)
 
 #define RQ_CIC(rq) \
@@ -123,6 +127,7 @@ struct cfq_queue {
  unsigned short ioprio, org_ioprio;
  unsigned short ioprio_class, org_ioprio_class;
  bool ioprio_class_changed;
+ unsigned int weight;
 
  pid_t pid;
 };
@@ -266,11 +271,22 @@ cfqq_key(struct cfq_service_tree *st, struct cfq_queue *cfqq)
 }
 
 static inline u64
+cfq_delta(u64 service, unsigned int numerator_wt, unsigned int denominator_wt)
+{
+ if (numerator_wt != denominator_wt) {
+ service = service * numerator_wt;
+ do_div(service, denominator_wt);
+ }
+
+ return service;
+}
+
+static inline u64
 cfq_delta_fair(unsigned long delta, struct cfq_queue *cfqq)
 {
- const int base_slice = cfqq->cfqd->cfq_slice[cfq_cfqq_sync(cfqq)];
+ u64 d = delta << CFQ_SERVICE_SHIFT;
 
- return delta + (base_slice/CFQ_SLICE_SCALE * (cfqq->ioprio - 4));
+ return cfq_delta(d, CFQ_WEIGHT_DEFAULT, cfqq->weight);
 }
 
 static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
@@ -308,6 +324,23 @@ static void update_min_vdisktime(struct cfq_service_tree *st)
  st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
 }
 
+static inline unsigned int cfq_ioprio_to_weight(int ioprio)
+{
+ WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
+ /* Map prio 7 - 0 to weights 200 to 900 */
+ return CFQ_WEIGHT_DEFAULT + (CFQ_WEIGHT_DEFAULT/5 * (4 - ioprio));
+}
+
+static inline int
+cfq_weight_slice(struct cfq_data *cfqd, int sync, unsigned int weight)
+{
+ const int base_slice = cfqd->cfq_slice[sync];
+
+ WARN_ON(weight > CFQ_WEIGHT_MAX);
+
+ return cfq_delta(base_slice, weight, CFQ_WEIGHT_DEFAULT);
+}
+
 static inline int rq_in_driver(struct cfq_data *cfqd)
 {
  return cfqd->rq_in_driver[0] + cfqd->rq_in_driver[1];
@@ -353,25 +386,10 @@ static int cfq_queue_empty(struct request_queue *q)
  return !cfqd->rq_queued;
 }
 
-/*
- * Scale schedule slice based on io priority. Use the sync time slice only
- * if a queue is marked sync and has sync io queued. A sync queue with async
- * io only, should not get full sync slice length.
- */
-static inline int cfq_prio_slice(struct cfq_data *cfqd, bool sync,
- unsigned short prio)
-{
- const int base_slice = cfqd->cfq_slice[sync];
-
- WARN_ON(prio >= IOPRIO_BE_NR);
-
- return base_slice + (base_slice/CFQ_SLICE_SCALE * (4 - prio));
-}
-
 static inline int
 cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
- return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
+ return cfq_weight_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->weight);
 }
 
 static inline void
@@ -1819,6 +1837,8 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
  break;
  }
 
+ cfqq->weight = cfq_ioprio_to_weight(cfqq->ioprio);
+
  if (cfqq->org_ioprio_class != cfqq->ioprio_class)
  cfqq->ioprio_class_changed = 1;
  /*
--
1.6.2.5
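
The arithmetic in the patch above can be checked in userspace. The sketch below mirrors cfq_ioprio_to_weight(), cfq_delta(), cfq_weight_slice() and cfq_delta_fair() using plain 64-bit division in place of do_div(); the shortened function names are this sketch's own, and the base slice of 100 used in the note after the code is an illustrative value, not one taken from the patch. The sketch also widens to 64 bits before shifting, whereas the patch shifts an unsigned long, which is only 32 bits wide on 32-bit builds.

```c
#include <assert.h>
#include <stdint.h>

#define IOPRIO_BE_NR       8
#define CFQ_WEIGHT_MAX     1000
#define CFQ_WEIGHT_DEFAULT 500
#define CFQ_SERVICE_SHIFT  12

/* service * numerator_wt / denominator_wt, skipping the work when equal */
static uint64_t delta(uint64_t service, unsigned int numerator_wt,
		      unsigned int denominator_wt)
{
	if (numerator_wt != denominator_wt)
		service = service * numerator_wt / denominator_wt;
	return service;
}

/* ioprio 7..0 maps to weight 200..900 in steps of 100 */
static unsigned int ioprio_to_weight(int ioprio)
{
	assert(ioprio >= 0 && ioprio < IOPRIO_BE_NR);
	return CFQ_WEIGHT_DEFAULT + CFQ_WEIGHT_DEFAULT / 5 * (4 - ioprio);
}

/* The time slice grows in proportion to the weight. */
static unsigned int weight_slice(unsigned int base_slice, unsigned int weight)
{
	assert(weight <= CFQ_WEIGHT_MAX);
	return (unsigned int)delta(base_slice, weight, CFQ_WEIGHT_DEFAULT);
}

/* The virtual time charged for a given service shrinks as weight grows. */
static uint64_t delta_fair(unsigned long served, unsigned int weight)
{
	uint64_t d = (uint64_t)served << CFQ_SERVICE_SHIFT;

	return delta(d, CFQ_WEIGHT_DEFAULT, weight);
}
```

With a base slice of 100, weights 900 and 200 give slices of 180 and 40, the same values the removed cfq_prio_slice() formula produced for prio 0 and prio 7 with the old CFQ_SLICE_SCALE of 5.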

[PATCH 04/20] blkio: Introduce the notion of cfq entity

by Vivek Goyal-2

o Introduce the notion of a cfq entity. This is a common structure which will
  be embedded in both cfq queues and cfq groups. It is similar to the
  scheduling entity of CFS.

o Once groups are introduced, it becomes easier to deal with entities while
  enqueuing/dequeuing queues/groups on the service tree, and we can handle many
  of the operations with single functions dealing in entities instead of
  introducing separate functions for queues and groups.

Signed-off-by: Vivek Goyal <vgoyal@...>
---
 block/cfq-iosched.c |  246 +++++++++++++++++++++++++++++----------------------
 1 files changed, 141 insertions(+), 105 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ca815ce..922aa8e 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -59,8 +59,10 @@ static struct completion *ioc_gone;
 static DEFINE_SPINLOCK(ioc_gone_lock);
 
 #define CFQ_PRIO_LISTS IOPRIO_BE_NR
-#define cfq_class_idle(cfqq) ((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
-#define cfq_class_rt(cfqq) ((cfqq)->ioprio_class == IOPRIO_CLASS_RT)
+#define cfqe_class_idle(cfqe) ((cfqe)->ioprio_class == IOPRIO_CLASS_IDLE)
+#define cfqe_class_rt(cfqe) ((cfqe)->ioprio_class == IOPRIO_CLASS_RT)
+#define cfq_class_idle(cfqq) (cfqe_class_idle(&(cfqq)->entity))
+#define cfq_class_rt(cfqq) (cfqe_class_rt(&(cfqq)->entity))
 
 #define sample_valid(samples) ((samples) > 80)
 
@@ -74,7 +76,7 @@ struct cfq_service_tree {
  struct rb_root rb;
  struct rb_node *left;
  u64 min_vdisktime;
- struct cfq_queue *active;
+ struct cfq_entity *active;
 };
 #define CFQ_RB_ROOT (struct cfq_service_tree) { RB_ROOT, NULL, 0, NULL}
 
@@ -82,22 +84,26 @@ struct cfq_sched_data {
  struct cfq_service_tree service_tree[IO_IOPRIO_CLASSES];
 };
 
+struct cfq_entity {
+ struct rb_node rb_node;
+ u64 vdisktime;
+ unsigned int weight;
+ struct cfq_service_tree *st;
+ unsigned short ioprio_class;
+ bool ioprio_class_changed;
+};
+
 /*
  * Per process-grouping structure
  */
 struct cfq_queue {
+ struct cfq_entity entity;
  /* reference count */
  atomic_t ref;
  /* various state flags, see below */
  unsigned int flags;
  /* parent cfq_data */
  struct cfq_data *cfqd;
- /* service_tree member */
- struct rb_node rb_node;
- /* service_tree key */
- u64 vdisktime;
- /* service tree we belong to */
- struct cfq_service_tree *st;
  /* prio tree member */
  struct rb_node p_node;
  /* prio tree root we belong to, if any */
@@ -125,9 +131,7 @@ struct cfq_queue {
 
  /* io prio of this group */
  unsigned short ioprio, org_ioprio;
- unsigned short ioprio_class, org_ioprio_class;
- bool ioprio_class_changed;
- unsigned int weight;
+ unsigned short org_ioprio_class;
 
  pid_t pid;
 };
@@ -252,22 +256,27 @@ static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool,
 static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
  struct io_context *);
 static void cfq_put_queue(struct cfq_queue *cfqq);
-static struct cfq_queue *__cfq_get_next_queue(struct cfq_service_tree *st);
+static struct cfq_entity *__cfq_get_next_entity(struct cfq_service_tree *st);
+
+static inline struct cfq_queue *cfqq_of(struct cfq_entity *cfqe)
+{
+ return container_of(cfqe, struct cfq_queue, entity);
+}
 
 static inline void
-init_cfqq_service_tree(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+init_cfqe_service_tree(struct cfq_data *cfqd, struct cfq_entity *cfqe)
 {
- unsigned short idx = cfqq->ioprio_class - 1;
+ unsigned short idx = cfqe->ioprio_class - 1;
 
  BUG_ON(idx >= IO_IOPRIO_CLASSES);
 
- cfqq->st = &cfqd->sched_data.service_tree[idx];
+ cfqe->st = &cfqd->sched_data.service_tree[idx];
 }
 
 static inline s64
-cfqq_key(struct cfq_service_tree *st, struct cfq_queue *cfqq)
+cfqe_key(struct cfq_service_tree *st, struct cfq_entity *cfqe)
 {
- return cfqq->vdisktime - st->min_vdisktime;
+ return cfqe->vdisktime - st->min_vdisktime;
 }
 
 static inline u64
@@ -282,11 +291,11 @@ cfq_delta(u64 service, unsigned int numerator_wt, unsigned int denominator_wt)
 }
 
 static inline u64
-cfq_delta_fair(unsigned long delta, struct cfq_queue *cfqq)
+cfq_delta_fair(unsigned long delta, struct cfq_entity *cfqe)
 {
  u64 d = delta << CFQ_SERVICE_SHIFT;
 
- return cfq_delta(d, CFQ_WEIGHT_DEFAULT, cfqq->weight);
+ return cfq_delta(d, CFQ_WEIGHT_DEFAULT, cfqe->weight);
 }
 
 static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
@@ -315,10 +324,10 @@ static void update_min_vdisktime(struct cfq_service_tree *st)
  vdisktime = st->active->vdisktime;
 
  if (st->left) {
- struct cfq_queue *cfqq = rb_entry(st->left, struct cfq_queue,
+ struct cfq_entity *cfqe = rb_entry(st->left, struct cfq_entity,
  rb_node);
 
- vdisktime = min_vdisktime(vdisktime, cfqq->vdisktime);
+ vdisktime = min_vdisktime(vdisktime, cfqe->vdisktime);
  }
 
  st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
@@ -389,7 +398,7 @@ static int cfq_queue_empty(struct request_queue *q)
 static inline int
 cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
- return cfq_weight_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->weight);
+ return cfq_weight_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->entity.weight);
 }
 
 static inline void
@@ -538,84 +547,90 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 }
 
 static void
-place_cfqq(struct cfq_service_tree *st, struct cfq_queue *cfqq, int add_front)
+place_cfqe(struct cfq_service_tree *st, struct cfq_entity *cfqe, int add_front)
 {
  u64 vdisktime = st->min_vdisktime;
  struct rb_node *parent;
- struct cfq_queue *__cfqq;
+ struct cfq_entity *__cfqe;
 
- if (cfq_class_idle(cfqq)) {
+ if (cfqe_class_idle(cfqe)) {
  vdisktime = CFQ_IDLE_DELAY;
  parent = rb_last(&st->rb);
- if (parent && parent != &cfqq->rb_node) {
- __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
- vdisktime += __cfqq->vdisktime;
+ if (parent && parent != &cfqe->rb_node) {
+ __cfqe = rb_entry(parent, struct cfq_entity, rb_node);
+ vdisktime += __cfqe->vdisktime;
  } else
  vdisktime += st->min_vdisktime;
  } else if (!add_front) {
  parent = rb_last(&st->rb);
- if (parent && parent != &cfqq->rb_node) {
- __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
- vdisktime = __cfqq->vdisktime;
+ if (parent && parent != &cfqe->rb_node) {
+ __cfqe = rb_entry(parent, struct cfq_entity, rb_node);
+ vdisktime = __cfqe->vdisktime;
  }
  }
 
- cfqq->vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
+ cfqe->vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
 }
 
-static inline void cfqq_update_ioprio_class(struct cfq_queue *cfqq)
+static inline void cfqe_update_ioprio_class(struct cfq_entity *cfqe)
 {
- if (unlikely(cfqq->ioprio_class_changed)) {
+ if (unlikely(cfqe->ioprio_class_changed)) {
+ struct cfq_queue *cfqq = cfqq_of(cfqe);
  struct cfq_data *cfqd = cfqq->cfqd;
 
  /*
  * Re-initialize the service tree pointer as ioprio class
  * change will lead to service tree change.
  */
- init_cfqq_service_tree(cfqd, cfqq);
- cfqq->ioprio_class_changed = 0;
- cfqq->vdisktime = 0;
+ init_cfqe_service_tree(cfqd, cfqe);
+ cfqe->ioprio_class_changed = 0;
+ cfqe->vdisktime = 0;
  }
 }
 
-static void __dequeue_cfqq(struct cfq_service_tree *st, struct cfq_queue *cfqq)
+static void __dequeue_cfqe(struct cfq_service_tree *st, struct cfq_entity *cfqe)
 {
  /* Node is not on tree */
- if (RB_EMPTY_NODE(&cfqq->rb_node))
+ if (RB_EMPTY_NODE(&cfqe->rb_node))
  return;
 
- if (st->left == &cfqq->rb_node)
- st->left = rb_next(&cfqq->rb_node);
+ if (st->left == &cfqe->rb_node)
+ st->left = rb_next(&cfqe->rb_node);
 
- rb_erase(&cfqq->rb_node, &st->rb);
- RB_CLEAR_NODE(&cfqq->rb_node);
+ rb_erase(&cfqe->rb_node, &st->rb);
+ RB_CLEAR_NODE(&cfqe->rb_node);
 }
 
-static void dequeue_cfqq(struct cfq_queue *cfqq)
+static void dequeue_cfqe(struct cfq_entity *cfqe)
 {
- struct cfq_service_tree *st = cfqq->st;
+ struct cfq_service_tree *st = cfqe->st;
 
- if (st->active == cfqq)
+ if (st->active == cfqe)
  st->active = NULL;
 
- __dequeue_cfqq(st, cfqq);
+ __dequeue_cfqe(st, cfqe);
 }
 
-static void __enqueue_cfqq(struct cfq_service_tree *st, struct cfq_queue *cfqq,
+static void dequeue_cfqq(struct cfq_queue *cfqq)
+{
+ dequeue_cfqe(&cfqq->entity);
+}
+
+static void __enqueue_cfqe(struct cfq_service_tree *st, struct cfq_entity *cfqe,
  int add_front)
 {
  struct rb_node **node = &st->rb.rb_node;
  struct rb_node *parent = NULL;
- struct cfq_queue *__cfqq;
- s64 key = cfqq_key(st, cfqq);
+ struct cfq_entity *__cfqe;
+ s64 key = cfqe_key(st, cfqe);
  int leftmost = 1;
 
  while (*node != NULL) {
  parent = *node;
- __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
+ __cfqe = rb_entry(parent, struct cfq_entity, rb_node);
 
- if (key < cfqq_key(st, __cfqq) ||
- (add_front && (key == cfqq_key(st, __cfqq)))) {
+ if (key < cfqe_key(st, __cfqe) ||
+ (add_front && (key == cfqe_key(st, __cfqe)))) {
  node = &parent->rb_left;
  } else {
  node = &parent->rb_right;
@@ -628,46 +643,56 @@ static void __enqueue_cfqq(struct cfq_service_tree *st, struct cfq_queue *cfqq,
  * used)
  */
  if (leftmost)
- st->left = &cfqq->rb_node;
+ st->left = &cfqe->rb_node;
+
+ rb_link_node(&cfqe->rb_node, parent, node);
+ rb_insert_color(&cfqe->rb_node, &st->rb);
+}
 
- rb_link_node(&cfqq->rb_node, parent, node);
- rb_insert_color(&cfqq->rb_node, &st->rb);
+static void enqueue_cfqe(struct cfq_entity *cfqe)
+{
+ cfqe_update_ioprio_class(cfqe);
+ place_cfqe(cfqe->st, cfqe, 0);
+ __enqueue_cfqe(cfqe->st, cfqe, 0);
 }
 
 static void enqueue_cfqq(struct cfq_queue *cfqq)
 {
- cfqq_update_ioprio_class(cfqq);
- place_cfqq(cfqq->st, cfqq, 0);
- __enqueue_cfqq(cfqq->st, cfqq, 0);
+ enqueue_cfqe(&cfqq->entity);
 }
 
 /* Requeue a cfqq which is already on the service tree */
-static void requeue_cfqq(struct cfq_queue *cfqq, int add_front)
+static void requeue_cfqe(struct cfq_entity *cfqe, int add_front)
 {
- struct cfq_service_tree *st = cfqq->st;
- struct cfq_queue *next_cfqq;
+ struct cfq_service_tree *st = cfqe->st;
+ struct cfq_entity *next_cfqe;
 
  if (add_front) {
- next_cfqq = __cfq_get_next_queue(st);
- if (next_cfqq && next_cfqq == cfqq)
+ next_cfqe = __cfq_get_next_entity(st);
+ if (next_cfqe && next_cfqe == cfqe)
  return;
  }
 
- __dequeue_cfqq(st, cfqq);
- place_cfqq(st, cfqq, add_front);
- __enqueue_cfqq(st, cfqq, add_front);
+ __dequeue_cfqe(st, cfqe);
+ place_cfqe(st, cfqe, add_front);
+ __enqueue_cfqe(st, cfqe, add_front);
 }
 
-static void __cfqq_served(struct cfq_queue *cfqq, unsigned long served)
+static void requeue_cfqq(struct cfq_queue *cfqq, int add_front)
+{
+ requeue_cfqe(&cfqq->entity, add_front);
+}
+
+static void cfqe_served(struct cfq_entity *cfqe, unsigned long served)
 {
  /*
  * Can't update entity disk time while it is on sorted rb-tree
  * as vdisktime is used as key.
  */
- __dequeue_cfqq(cfqq->st, cfqq);
- cfqq->vdisktime += cfq_delta_fair(served, cfqq);
- update_min_vdisktime(cfqq->st);
- __enqueue_cfqq(cfqq->st, cfqq, 0);
+ __dequeue_cfqe(cfqe->st, cfqe);
+ cfqe->vdisktime += cfq_delta_fair(served, cfqe);
+ update_min_vdisktime(cfqe->st);
+ __enqueue_cfqe(cfqe->st, cfqe, 0);
 }
 
 static void cfqq_served(struct cfq_queue *cfqq, unsigned long served)
@@ -680,10 +705,10 @@ static void cfqq_served(struct cfq_queue *cfqq, unsigned long served)
  * use the slice and moves to the back of service tree (almost).
  */
  served = cfq_prio_to_slice(cfqq->cfqd, cfqq);
- __cfqq_served(cfqq, served);
+ cfqe_served(&cfqq->entity, served);
 
  /* If cfqq prio class has changed, take that into account */
- if (unlikely(cfqq->ioprio_class_changed)) {
+ if (unlikely(cfqq->entity.ioprio_class_changed)) {
  dequeue_cfqq(cfqq);
  enqueue_cfqq(cfqq);
  }
@@ -698,7 +723,7 @@ static void cfqq_served(struct cfq_queue *cfqq, unsigned long served)
 static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
  bool add_front, unsigned long service)
 {
- if (RB_EMPTY_NODE(&cfqq->rb_node)) {
+ if (RB_EMPTY_NODE(&cfqq->entity.rb_node)) {
  /* Its a new queue. Add it to service tree */
  enqueue_cfqq(cfqq);
  return;
@@ -1096,14 +1121,32 @@ static inline void cfq_slice_expired(struct cfq_data *cfqd)
  __cfq_slice_expired(cfqd, cfqq);
 }
 
-static struct cfq_queue *__cfq_get_next_queue(struct cfq_service_tree *st)
+static struct cfq_entity *__cfq_get_next_entity(struct cfq_service_tree *st)
 {
  struct rb_node *left = st->left;
 
  if (!left)
  return NULL;
 
- return rb_entry(left, struct cfq_queue, rb_node);
+ return rb_entry(left, struct cfq_entity, rb_node);
+}
+
+static struct cfq_entity *cfq_get_next_entity(struct cfq_sched_data *sd)
+{
+ struct cfq_service_tree *st = sd->service_tree;
+ struct cfq_entity *cfqe = NULL;
+ int i;
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++, st++) {
+ cfqe = __cfq_get_next_entity(st);
+ if (cfqe) {
+ st->active = cfqe;
+ update_min_vdisktime(cfqe->st);
+ break;
+ }
+ }
+
+ return cfqe;
 }
 
 /*
@@ -1112,24 +1155,17 @@ static struct cfq_queue *__cfq_get_next_queue(struct cfq_service_tree *st)
  */
 static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
 {
- struct cfq_sched_data *sd = &cfqd->sched_data;
- struct cfq_service_tree *st = sd->service_tree;
- struct cfq_queue *cfqq = NULL;
- int i;
+ struct cfq_entity *cfqe = NULL;
 
  if (!cfqd->rq_queued)
  return NULL;
 
- for (i = 0; i < IO_IOPRIO_CLASSES; i++, st++) {
- cfqq = __cfq_get_next_queue(st);
- if (cfqq) {
- st->active = cfqq;
- update_min_vdisktime(cfqq->st);
- break;
- }
- }
+ cfqe = cfq_get_next_entity(&cfqd->sched_data);
 
- return cfqq;
+ if (cfqe)
+ return cfqq_of(cfqe);
+ else
+ return NULL;
 }
 
 /*
@@ -1820,33 +1856,33 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
  * no prio set, inherit CPU scheduling settings
  */
  cfqq->ioprio = task_nice_ioprio(tsk);
- cfqq->ioprio_class = task_nice_ioclass(tsk);
+ cfqq->entity.ioprio_class = task_nice_ioclass(tsk);
  break;
  case IOPRIO_CLASS_RT:
  cfqq->ioprio = task_ioprio(ioc);
- cfqq->ioprio_class = IOPRIO_CLASS_RT;
+ cfqq->entity.ioprio_class = IOPRIO_CLASS_RT;
  break;
  case IOPRIO_CLASS_BE:
  cfqq->ioprio = task_ioprio(ioc);
- cfqq->ioprio_class = IOPRIO_CLASS_BE;
+ cfqq->entity.ioprio_class = IOPRIO_CLASS_BE;
  break;
  case IOPRIO_CLASS_IDLE:
- cfqq->ioprio_class = IOPRIO_CLASS_IDLE;
+ cfqq->entity.ioprio_class = IOPRIO_CLASS_IDLE;
  cfqq->ioprio = 7;
  cfq_clear_cfqq_idle_window(cfqq);
  break;
  }
 
- cfqq->weight = cfq_ioprio_to_weight(cfqq->ioprio);
+ cfqq->entity.weight = cfq_ioprio_to_weight(cfqq->ioprio);
 
- if (cfqq->org_ioprio_class != cfqq->ioprio_class)
- cfqq->ioprio_class_changed = 1;
+ if (cfqq->org_ioprio_class != cfqq->entity.ioprio_class)
+ cfqq->entity.ioprio_class_changed = 1;
  /*
  * keep track of original prio settings in case we have to temporarily
  * elevate the priority of this queue
  */
  cfqq->org_ioprio = cfqq->ioprio;
- cfqq->org_ioprio_class = cfqq->ioprio_class;
+ cfqq->org_ioprio_class = cfqq->entity.ioprio_class;
  cfq_clear_cfqq_prio_changed(cfqq);
 }
 
@@ -1888,7 +1924,7 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
 static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
   pid_t pid, bool is_sync)
 {
- RB_CLEAR_NODE(&cfqq->rb_node);
+ RB_CLEAR_NODE(&cfqq->entity.rb_node);
  RB_CLEAR_NODE(&cfqq->p_node);
  INIT_LIST_HEAD(&cfqq->fifo);
 
@@ -2458,8 +2494,8 @@ static void cfq_prio_boost(struct cfq_queue *cfqq)
  * users of the filesystem
  */
  if (cfq_class_idle(cfqq)) {
- cfqq->ioprio_class = IOPRIO_CLASS_BE;
- cfqq->ioprio_class_changed = 1;
+ cfqq->entity.ioprio_class = IOPRIO_CLASS_BE;
+ cfqq->entity.ioprio_class_changed = 1;
  }
  if (cfqq->ioprio > IOPRIO_NORM)
  cfqq->ioprio = IOPRIO_NORM;
@@ -2467,9 +2503,9 @@ static void cfq_prio_boost(struct cfq_queue *cfqq)
  /*
  * check if we need to unboost the queue
  */
- if (cfqq->ioprio_class != cfqq->org_ioprio_class) {
- cfqq->ioprio_class = cfqq->org_ioprio_class;
- cfqq->ioprio_class_changed = 1;
+ if (cfqq->entity.ioprio_class != cfqq->org_ioprio_class) {
+ cfqq->entity.ioprio_class = cfqq->org_ioprio_class;
+ cfqq->entity.ioprio_class_changed = 1;
  }
  if (cfqq->ioprio != cfqq->org_ioprio)
  cfqq->ioprio = cfqq->org_ioprio;
--
1.6.2.5
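
The embedding pattern used by the patch above (struct cfq_entity inside struct cfq_queue, recovered through cfqq_of()) can be illustrated outside the kernel. This is a sketch under assumptions: container_of() is reimplemented with offsetof(), and struct group, group_of() and charge() are hypothetical, since this patch only adds the queue side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's container_of() */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct entity {
	uint64_t vdisktime;
	unsigned int weight;
	unsigned short ioprio_class;
};

struct queue {
	struct entity entity;	/* embedded, not a pointer */
	int pid;
};

/* Hypothetical second owner; the entity need not be the first member. */
struct group {
	int id;
	struct entity entity;
};

static struct queue *queue_of(struct entity *e)
{
	return container_of(e, struct queue, entity);
}

static struct group *group_of(struct entity *e)
{
	return container_of(e, struct group, entity);
}

/* One routine serves both owners because it only touches the entity. */
static void charge(struct entity *e, uint64_t vtime)
{
	assert(e != NULL);
	e->vdisktime += vtime;
}
```

The caller must already know which kind of owner an entity belongs to before converting back; the sketch has no type tag, and neither does the struct cfq_entity added by this patch.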

[PATCH 05/20] blkio: Introduce the notion of cfq groups

by Vivek Goyal-2

o This is the first step in introducing cfq groups. Currently we define only
  one cfq_group (root cfq group), which is embedded in cfq_data.

o Down the line, each cfq_group will have its own service tree. Hence move
  the service tree from cfqd to the root group so that it becomes a property
  of the group.

Signed-off-by: Vivek Goyal <vgoyal@...>
---
 block/cfq-iosched.c |   27 ++++++++++++++++++---------
 1 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 922aa8e..323ed12 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -136,16 +136,17 @@ struct cfq_queue {
  pid_t pid;
 };
 
+/* Per cgroup grouping structure */
+struct cfq_group {
+ struct cfq_sched_data sched_data;
+};
+
 /*
  * Per block device queue structure
  */
 struct cfq_data {
  struct request_queue *queue;
-
- /*
- * rr list of queues with requests and the count of them
- */
- struct cfq_sched_data sched_data;
+ struct cfq_group root_group;
 
  /*
  * Each priority tree is sorted by next_request position.  These
@@ -270,7 +271,7 @@ init_cfqe_service_tree(struct cfq_data *cfqd, struct cfq_entity *cfqe)
 
  BUG_ON(idx >= IO_IOPRIO_CLASSES);
 
- cfqe->st = &cfqd->sched_data.service_tree[idx];
+ cfqe->st = &cfqd->root_group.sched_data.service_tree[idx];
 }
 
 static inline s64
@@ -1160,7 +1161,7 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
  if (!cfqd->rq_queued)
  return NULL;
 
- cfqe = cfq_get_next_entity(&cfqd->sched_data);
+ cfqe = cfq_get_next_entity(&cfqd->root_group.sched_data);
 
  if (cfqe)
  return cfqq_of(cfqe);
@@ -2700,6 +2701,15 @@ static void cfq_put_async_queues(struct cfq_data *cfqd)
  cfq_put_queue(cfqd->async_idle_cfqq);
 }
 
+static void cfq_init_root_group(struct cfq_data *cfqd)
+{
+ struct cfq_group *cfqg = &cfqd->root_group;
+ int i;
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+ cfqg->sched_data.service_tree[i] = CFQ_RB_ROOT;
+}
+
 static void cfq_exit_queue(struct elevator_queue *e)
 {
  struct cfq_data *cfqd = e->elevator_data;
@@ -2738,8 +2748,7 @@ static void *cfq_init_queue(struct request_queue *q)
  if (!cfqd)
  return NULL;
 
- for (i = 0; i < IO_IOPRIO_CLASSES; i++)
- cfqd->sched_data.service_tree[i] = CFQ_RB_ROOT;
+ cfq_init_root_group(cfqd);
 
  /*
  * Not strictly needed (since RB_ROOT just clears the node and we
--
1.6.2.5
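
After the patch above, the per-class service trees hang off the root group rather than off cfq_data directly. The sketch below shows that nesting and the class-ordered scan of cfq_get_next_entity(). It is a simplification under stated assumptions: a sorted singly linked list stands in for the rb-tree, the enqueue() helper is hypothetical, and the update_min_vdisktime() step is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IO_IOPRIO_CLASSES 3	/* index 0 = RT, 1 = BE, 2 = IDLE */

struct entity {
	uint64_t vdisktime;
	struct entity *next;
};

/* A list sorted by vdisktime replaces the rb-tree; left is its head. */
struct service_tree {
	struct entity *left;
	struct entity *active;
};

struct sched_data { struct service_tree service_tree[IO_IOPRIO_CLASSES]; };
struct group { struct sched_data sched_data; };
struct data { struct group root_group; };

static void init_root_group(struct data *d)
{
	for (int i = 0; i < IO_IOPRIO_CLASSES; i++)
		d->root_group.sched_data.service_tree[i] =
			(struct service_tree){ NULL, NULL };
}

/* Insert while keeping ascending vdisktime order. */
static void enqueue(struct service_tree *st, struct entity *e)
{
	struct entity **p = &st->left;

	while (*p && (*p)->vdisktime <= e->vdisktime)
		p = &(*p)->next;
	e->next = *p;
	*p = e;
}

/* The first non-empty class wins; within a class, the smallest vdisktime. */
static struct entity *get_next_entity(struct sched_data *sd)
{
	assert(sd != NULL);
	for (int i = 0; i < IO_IOPRIO_CLASSES; i++) {
		struct service_tree *st = &sd->service_tree[i];

		if (st->left) {
			st->active = st->left;
			return st->left;
		}
	}
	return NULL;
}
```

Because the scan starts at index 0, an RT entity is always chosen ahead of BE entities no matter how large its vdisktime is.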

[PATCH 06/20] blkio: Introduce cgroup interface

by Vivek Goyal-2

o This is the basic blkio controller cgroup interface. This is the common
  interface which will be used by applications to control IO as it flows
  through the IO stack.

o There are some places where it is assumed that the only policy is the one
  implemented by CFQ, hence things have been hardcoded. Once one more
  policy is implemented, we need to introduce some dynamic infrastructure like
  registration of policies and get rid of the hardcoded calls.

o Some parts of this code have been taken from BFQ patches.
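
As a rough illustration of two pieces of the diff below, the range check in blkiocg_weight_write() and the key-based lookup in blkiocg_lookup_group(), here is a userspace sketch. It rests on several assumptions: the values of BLKIO_WEIGHT_MAX and BLKIO_WEIGHT_DEFAULT are not visible in this part of the thread and are set here to mirror CFQ_WEIGHT_MAX and CFQ_WEIGHT_DEFAULT, a plain linked list replaces the RCU-protected hlist, and no locking is modelled. The key is treated as an opaque pointer; what CFQ passes as the key is not shown here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define BLKIO_WEIGHT_MIN     100
#define BLKIO_WEIGHT_MAX     1000	/* assumption: mirrors CFQ_WEIGHT_MAX */
#define BLKIO_WEIGHT_DEFAULT 500	/* assumption: mirrors CFQ_WEIGHT_DEFAULT */

struct blkio_group {
	void *key;			/* opaque identifier chosen by the owner */
	struct blkio_group *next;
};

struct blkio_cgroup {
	unsigned int weight;
	struct blkio_group *blkg_list;	/* plain list instead of an RCU hlist */
};

/* Reject out-of-range weights and leave the old value in place. */
static int weight_write(struct blkio_cgroup *blkcg, uint64_t val)
{
	if (val < BLKIO_WEIGHT_MIN || val > BLKIO_WEIGHT_MAX)
		return -EINVAL;

	blkcg->weight = (unsigned int)val;
	return 0;
}

/* Link a group into the cgroup's list under the given key. */
static void add_group(struct blkio_cgroup *blkcg, struct blkio_group *blkg,
		      void *key)
{
	assert(blkg != NULL);
	blkg->key = key;
	blkg->next = blkcg->blkg_list;
	blkcg->blkg_list = blkg;
}

/* Linear search by key; returns NULL when no group matches. */
static struct blkio_group *lookup_group(struct blkio_cgroup *blkcg, void *key)
{
	for (struct blkio_group *g = blkcg->blkg_list; g; g = g->next)
		if (g->key == key)
			return g;
	return NULL;
}
```

One cgroup can hold several blkio_group entries, one per key, which is why lookup takes both the cgroup and the key.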

Signed-off-by: Vivek Goyal <vgoyal@...>
---
 block/Kconfig                 |   13 +++
 block/Kconfig.iosched         |    8 ++
 block/Makefile                |    1 +
 block/blk-cgroup.c            |  199 +++++++++++++++++++++++++++++++++++++++++
 block/blk-cgroup.h            |   38 ++++++++
 block/cfq-iosched.c           |   15 ++--
 include/linux/cgroup_subsys.h |    6 ++
 include/linux/iocontext.h     |    4 +
 8 files changed, 277 insertions(+), 7 deletions(-)
 create mode 100644 block/blk-cgroup.c
 create mode 100644 block/blk-cgroup.h

diff --git a/block/Kconfig b/block/Kconfig
index 9be0b56..6ba1a8e 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -77,6 +77,19 @@ config BLK_DEV_INTEGRITY
  T10/SCSI Data Integrity Field or the T13/ATA External Path
  Protection.  If in doubt, say N.
 
+config BLK_CGROUP
+ bool
+ depends on CGROUPS
+ default n
+ ---help---
+ Generic block IO controller cgroup interface. This is the common
+ cgroup interface which should be used by various IO controlling
+ policies.
+
+ Currently, CFQ IO scheduler uses it to recognize task groups and
+ control disk bandwidth allocation (proportional time slice allocation)
+ to such task groups.
+
 endif # BLOCK
 
 config BLOCK_COMPAT
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 7e803fc..a521c69 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -40,6 +40,14 @@ config IOSCHED_CFQ
   working environment, suitable for desktop systems.
   This is the default I/O scheduler.
 
+config CFQ_GROUP_IOSCHED
+ bool "CFQ Group Scheduling support"
+ depends on IOSCHED_CFQ && CGROUPS
+ select BLK_CGROUP
+ default n
+ ---help---
+  Enable group IO scheduling in CFQ.
+
 choice
  prompt "Default I/O scheduler"
  default DEFAULT_CFQ
diff --git a/block/Makefile b/block/Makefile
index ba74ca6..16334c9 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
  blk-iopoll.o ioctl.o genhd.o scsi_ioctl.o
 
 obj-$(CONFIG_BLK_DEV_BSG) += bsg.o
+obj-$(CONFIG_BLK_CGROUP) += blk-cgroup.o
 obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
 obj-$(CONFIG_IOSCHED_AS) += as-iosched.o
 obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
new file mode 100644
index 0000000..7bde5c4
--- /dev/null
+++ b/block/blk-cgroup.c
@@ -0,0 +1,199 @@
+/*
+ * Common Block IO controller cgroup interface
+ *
+ * Based on ideas and code from CFQ, CFS and BFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@...>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@...>
+ *      Paolo Valente <paolo.valente@...>
+ *
+ * Copyright (C) 2009 Vivek Goyal <vgoyal@...>
+ *              Nauman Rafique <nauman@...>
+ */
+#include <linux/ioprio.h>
+#include "blk-cgroup.h"
+
+struct blkio_cgroup blkio_root_cgroup = {
+ .weight = BLKIO_WEIGHT_DEFAULT,
+ .ioprio_class = IOPRIO_CLASS_BE,
+};
+
+struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
+{
+ return container_of(cgroup_subsys_state(cgroup, blkio_subsys_id),
+    struct blkio_cgroup, css);
+}
+
+void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
+ struct blkio_group *blkg, void *key)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&blkcg->lock, flags);
+ rcu_assign_pointer(blkg->key, key);
+ hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
+ spin_unlock_irqrestore(&blkcg->lock, flags);
+}
+
+int blkiocg_del_blkio_group(struct blkio_group *blkg)
+{
+ /* Implemented later */
+ return 0;
+}
+
+/* called under rcu_read_lock(). */
+struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key)
+{
+ struct blkio_group *blkg;
+ struct hlist_node *n;
+ void *__key;
+
+ hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {
+ __key = blkg->key;
+ if (__key == key)
+ return blkg;
+ }
+
+ return NULL;
+}
+
+#define SHOW_FUNCTION(__VAR) \
+static u64 blkiocg_##__VAR##_read(struct cgroup *cgroup, \
+       struct cftype *cftype) \
+{ \
+ struct blkio_cgroup *blkcg; \
+ \
+ blkcg = cgroup_to_blkio_cgroup(cgroup); \
+ return (u64)blkcg->__VAR; \
+}
+
+SHOW_FUNCTION(weight);
+SHOW_FUNCTION(ioprio_class);
+#undef SHOW_FUNCTION
+
+static int
+blkiocg_weight_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
+{
+ struct blkio_cgroup *blkcg;
+
+ if (val < BLKIO_WEIGHT_MIN || val > BLKIO_WEIGHT_MAX)
+ return -EINVAL;
+
+ blkcg = cgroup_to_blkio_cgroup(cgroup);
+ blkcg->weight = (unsigned int)val;
+ return 0;
+}
+
+static int blkiocg_ioprio_class_write(struct cgroup *cgroup,
+ struct cftype *cftype, u64 val)
+{
+ struct blkio_cgroup *blkcg;
+
+ if (val < IOPRIO_CLASS_RT || val > IOPRIO_CLASS_IDLE)
+ return -EINVAL;
+
+ blkcg = cgroup_to_blkio_cgroup(cgroup);
+ blkcg->ioprio_class = (unsigned int)val;
+ return 0;
+}
+
+struct cftype blkio_files[] = {
+ {
+ .name = "weight",
+ .read_u64 = blkiocg_weight_read,
+ .write_u64 = blkiocg_weight_write,
+ },
+ {
+ .name = "ioprio_class",
+ .read_u64 = blkiocg_ioprio_class_read,
+ .write_u64 = blkiocg_ioprio_class_write,
+ },
+};
+
+static int blkiocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+ return cgroup_add_files(cgroup, subsys, blkio_files,
+ ARRAY_SIZE(blkio_files));
+}
+
+static void blkiocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+ struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
+
+ free_css_id(&blkio_subsys, &blkcg->css);
+ kfree(blkcg);
+}
+
+static struct cgroup_subsys_state *
+blkiocg_create(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+ struct blkio_cgroup *blkcg, *parent_blkcg;
+
+ if (!cgroup->parent) {
+ blkcg = &blkio_root_cgroup;
+ goto done;
+ }
+
+ /* Currently we do not support hierarchy deeper than two level (0,1) */
+ parent_blkcg = cgroup_to_blkio_cgroup(cgroup->parent);
+ if (css_depth(&parent_blkcg->css) > 0)
+ return ERR_PTR(-EINVAL);
+
+ blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
+ if (!blkcg)
+ return ERR_PTR(-ENOMEM);
+done:
+ spin_lock_init(&blkcg->lock);
+ INIT_HLIST_HEAD(&blkcg->blkg_list);
+ blkcg->weight = BLKIO_WEIGHT_DEFAULT;
+ blkcg->ioprio_class = IOPRIO_CLASS_BE;
+
+ return &blkcg->css;
+}
+
+/*
+ * We cannot support shared io contexts, as we have no mean to support
+ * two tasks with the same ioc in two different groups without major rework
+ * of the main cic data structures.  For now we allow a task to change
+ * its cgroup only if it's the only owner of its ioc.
+ */
+static int blkiocg_can_attach(struct cgroup_subsys *subsys,
+ struct cgroup *cgroup, struct task_struct *tsk,
+ bool threadgroup)
+{
+ struct io_context *ioc;
+ int ret = 0;
+
+ /* task_lock() is needed to avoid races with exit_io_context() */
+ task_lock(tsk);
+ ioc = tsk->io_context;
+ if (ioc && atomic_read(&ioc->nr_tasks) > 1)
+ ret = -EINVAL;
+ task_unlock(tsk);
+
+ return ret;
+}
+
+static void blkiocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+ struct cgroup *prev, struct task_struct *tsk,
+ bool threadgroup)
+{
+ struct io_context *ioc;
+
+ task_lock(tsk);
+ ioc = tsk->io_context;
+ if (ioc)
+ ioc->cgroup_changed = 1;
+ task_unlock(tsk);
+}
+
+struct cgroup_subsys blkio_subsys = {
+ .name = "blkio",
+ .create = blkiocg_create,
+ .can_attach = blkiocg_can_attach,
+ .attach = blkiocg_attach,
+ .destroy = blkiocg_destroy,
+ .populate = blkiocg_populate,
+ .subsys_id = blkio_subsys_id,
+ .use_id = 1,
+};
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
new file mode 100644
index 0000000..49ca84b
--- /dev/null
+++ b/block/blk-cgroup.h
@@ -0,0 +1,38 @@
+/*
+ * Common Block IO controller cgroup interface
+ *
+ * Based on ideas and code from CFQ, CFS and BFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@...>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@...>
+ *      Paolo Valente <paolo.valente@...>
+ *
+ * Copyright (C) 2009 Vivek Goyal <vgoyal@...>
+ *              Nauman Rafique <nauman@...>
+ */
+
+#include <linux/cgroup.h>
+
+struct blkio_cgroup {
+ struct cgroup_subsys_state css;
+ unsigned int weight;
+ unsigned short ioprio_class;
+ spinlock_t lock;
+ struct hlist_head blkg_list;
+};
+
+struct blkio_group {
+ /* An rcu protected unique identifier for the group */
+ void *key;
+ struct hlist_node blkcg_node;
+};
+
+#define BLKIO_WEIGHT_MIN 100
+#define BLKIO_WEIGHT_MAX 1000
+#define BLKIO_WEIGHT_DEFAULT 500
+
+struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
+void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
+ struct blkio_group *blkg, void *key);
+int blkiocg_del_blkio_group(struct blkio_group *blkg);
+struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 323ed12..bc99163 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -12,6 +12,7 @@
 #include <linux/rbtree.h>
 #include <linux/ioprio.h>
 #include <linux/blktrace_api.h>
+#include "blk-cgroup.h"
 
 /*
  * tunables
@@ -29,9 +30,6 @@ static const int cfq_slice_async_rq = 2;
 static int cfq_slice_idle = HZ / 125;
 
 #define IO_IOPRIO_CLASSES 3
-#define CFQ_WEIGHT_MIN 100
-#define CFQ_WEIGHT_MAX 1000
-#define CFQ_WEIGHT_DEFAULT 500
 #define CFQ_SERVICE_SHIFT       12
 
 /*
@@ -139,6 +137,9 @@ struct cfq_queue {
 /* Per cgroup grouping structure */
 struct cfq_group {
  struct cfq_sched_data sched_data;
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ struct blkio_group blkg;
+#endif
 };
 
 /*
@@ -296,7 +297,7 @@ cfq_delta_fair(unsigned long delta, struct cfq_entity *cfqe)
 {
  u64 d = delta << CFQ_SERVICE_SHIFT;
 
- return cfq_delta(d, CFQ_WEIGHT_DEFAULT, cfqe->weight);
+ return cfq_delta(d, BLKIO_WEIGHT_DEFAULT, cfqe->weight);
 }
 
 static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
@@ -338,7 +339,7 @@ static inline unsigned int cfq_ioprio_to_weight(int ioprio)
 {
  WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
  /* Map prio 7 - 0 to weights 200 to 900 */
- return CFQ_WEIGHT_DEFAULT + (CFQ_WEIGHT_DEFAULT/5 * (4 - ioprio));
+ return BLKIO_WEIGHT_DEFAULT + (BLKIO_WEIGHT_DEFAULT/5 * (4 - ioprio));
 }
 
 static inline int
@@ -346,9 +347,9 @@ cfq_weight_slice(struct cfq_data *cfqd, int sync, unsigned int weight)
 {
  const int base_slice = cfqd->cfq_slice[sync];
 
- WARN_ON(weight > CFQ_WEIGHT_MAX);
+ WARN_ON(weight > BLKIO_WEIGHT_MAX);
 
- return cfq_delta(base_slice, weight, CFQ_WEIGHT_DEFAULT);
+ return cfq_delta(base_slice, weight, BLKIO_WEIGHT_DEFAULT);
 }
 
 static inline int rq_in_driver(struct cfq_data *cfqd)
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..ccefff0 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -60,3 +60,9 @@ SUBSYS(net_cls)
 #endif
 
 /* */
+
+#ifdef CONFIG_BLK_CGROUP
+SUBSYS(blkio)
+#endif
+
+/* */
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 4da4a75..5357d5c 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -73,6 +73,10 @@ struct io_context {
  unsigned short ioprio;
  unsigned short ioprio_changed;
 
+#ifdef CONFIG_BLK_CGROUP
+ unsigned short cgroup_changed;
+#endif
+
  /*
  * For request batching
  */
--
1.6.2.5


[PATCH 07/20] blkio: Provide capability to enqueue/dequeue group entities

by Vivek Goyal-2

o This patch embeds cfq_entity object in cfq_group and provides helper routines
  so that group entities can be scheduled.
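
The enqueue/dequeue walk described above can be illustrated with a small userspace sketch. It is not the patch code: the types `entity`, `queue`, `group` and `sched_data` are simplified, hypothetical stand-ins for `cfq_entity`, `cfq_queue`, `cfq_group` and `cfq_sched_data`, and the service trees are left out so that only the `nr_active`/`on_st` bookkeeping is visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct sched_data {
	unsigned int nr_active;		/* entities queued directly under this group */
};

struct entity {
	struct entity *parent;		/* NULL only for the root group's entity */
	bool on_st;			/* true while the entity is queued */
	struct sched_data *my_sd;	/* non-NULL for groups, NULL for queues */
};

struct queue {
	int id;
	struct entity entity;
};

struct group {
	struct entity entity;
	struct sched_data sched_data;
};

/* my_sd tells us which container the embedded entity lives in. */
static struct group *group_of(struct entity *e)
{
	return e->my_sd ? container_of(e, struct group, entity) : NULL;
}

static struct queue *queue_of(struct entity *e)
{
	return e->my_sd ? NULL : container_of(e, struct queue, entity);
}

/* Scheduling data an entity is queued on: that of its parent group. */
static struct sched_data *entity_sched_data(struct entity *e)
{
	return &group_of(e->parent)->sched_data;
}

/* Walk upwards, stopping at the first ancestor that is already queued. */
static void enqueue(struct entity *e)
{
	for (; e && e->parent; e = e->parent) {
		if (e->on_st)
			break;
		e->on_st = true;
		entity_sched_data(e)->nr_active++;
	}
}

/* Walk upwards, leaving a parent queued while it still has active children. */
static void dequeue(struct entity *e)
{
	for (; e && e->parent; e = e->parent) {
		struct sched_data *sd = entity_sched_data(e);

		assert(e->on_st);
		e->on_st = false;
		sd->nr_active--;
		if (sd->nr_active)
			break;
	}
}
```

In this sketch a group entity stays queued on its parent for as long as at least one queue below it is active, which is the behaviour the `nr_active` counter is meant to provide.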

Signed-off-by: Vivek Goyal <vgoyal@...>
---
 block/cfq-iosched.c |  110 +++++++++++++++++++++++++++++++++++++++++++--------
 1 files changed, 93 insertions(+), 17 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index bc99163..8ec8a82 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -79,6 +79,7 @@ struct cfq_service_tree {
 #define CFQ_RB_ROOT (struct cfq_service_tree) { RB_ROOT, NULL, 0, NULL}
 
 struct cfq_sched_data {
+ unsigned int nr_active;
  struct cfq_service_tree service_tree[IO_IOPRIO_CLASSES];
 };
 
@@ -89,6 +90,10 @@ struct cfq_entity {
  struct cfq_service_tree *st;
  unsigned short ioprio_class;
  bool ioprio_class_changed;
+ struct cfq_entity *parent;
+ bool on_st;
+ /* Points to the sched_data of group entity. Null for cfqq */
+ struct cfq_sched_data *my_sd;
 };
 
 /*
@@ -136,6 +141,7 @@ struct cfq_queue {
 
 /* Per cgroup grouping structure */
 struct cfq_group {
+ struct cfq_entity entity;
  struct cfq_sched_data sched_data;
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
  struct blkio_group blkg;
@@ -260,9 +266,23 @@ static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
 static void cfq_put_queue(struct cfq_queue *cfqq);
 static struct cfq_entity *__cfq_get_next_entity(struct cfq_service_tree *st);
 
+static inline struct cfq_entity *parent_entity(struct cfq_entity *cfqe)
+{
+ return cfqe->parent;
+}
+
 static inline struct cfq_queue *cfqq_of(struct cfq_entity *cfqe)
 {
- return container_of(cfqe, struct cfq_queue, entity);
+ if (!cfqe->my_sd)
+ return container_of(cfqe, struct cfq_queue, entity);
+ return NULL;
+}
+
+static inline struct cfq_group *cfqg_of(struct cfq_entity *cfqe)
+{
+ if (cfqe->my_sd)
+ return container_of(cfqe, struct cfq_group, entity);
+ return NULL;
 }
 
 static inline void
@@ -352,6 +372,33 @@ cfq_weight_slice(struct cfq_data *cfqd, int sync, unsigned int weight)
  return cfq_delta(base_slice, weight, BLKIO_WEIGHT_DEFAULT);
 }
 
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+/* check for entity->parent so that loop is not executed for root entity. */
+#define for_each_entity(entity) \
+ for (; entity && entity->parent; entity = entity->parent)
+
+static inline struct cfq_sched_data *
+cfq_entity_sched_data(struct cfq_entity *cfqe)
+{
+ return &cfqg_of(parent_entity(cfqe))->sched_data;
+}
+#else /* CONFIG_CFQ_GROUP_IOSCHED */
+#define for_each_entity(entity) \
+ for (; entity != NULL; entity = NULL)
+static inline struct cfq_data *cfqd_of(struct cfq_entity *cfqe)
+{
+ return cfqq_of(cfqe)->cfqd;
+}
+
+static inline struct cfq_sched_data *
+cfq_entity_sched_data(struct cfq_entity *cfqe)
+{
+ struct cfq_data *cfqd = cfqd_of(cfqe);
+
+ return &cfqd->root_group.sched_data;
+}
+#endif /* CONFIG_CFQ_GROUP_IOSCHED */
+
 static inline int rq_in_driver(struct cfq_data *cfqd)
 {
  return cfqd->rq_in_driver[0] + cfqd->rq_in_driver[1];
@@ -606,16 +653,28 @@ static void __dequeue_cfqe(struct cfq_service_tree *st, struct cfq_entity *cfqe)
 static void dequeue_cfqe(struct cfq_entity *cfqe)
 {
  struct cfq_service_tree *st = cfqe->st;
+ struct cfq_sched_data *sd = cfq_entity_sched_data(cfqe);
 
  if (st->active == cfqe)
  st->active = NULL;
 
  __dequeue_cfqe(st, cfqe);
+ sd->nr_active--;
+ cfqe->on_st = 0;
 }
 
 static void dequeue_cfqq(struct cfq_queue *cfqq)
 {
- dequeue_cfqe(&cfqq->entity);
+ struct cfq_entity *cfqe = &cfqq->entity;
+
+ for_each_entity(cfqe) {
+ struct cfq_sched_data *sd = cfq_entity_sched_data(cfqe);
+
+ dequeue_cfqe(cfqe);
+ /* Do not dequeue parent if it has other entities under it */
+ if (sd->nr_active)
+ break;
+ }
 }
 
 static void __enqueue_cfqe(struct cfq_service_tree *st, struct cfq_entity *cfqe,
@@ -653,6 +712,10 @@ static void __enqueue_cfqe(struct cfq_service_tree *st, struct cfq_entity *cfqe,
 
 static void enqueue_cfqe(struct cfq_entity *cfqe)
 {
+ struct cfq_sched_data *sd = cfq_entity_sched_data(cfqe);
+
+ cfqe->on_st = 1;
+ sd->nr_active++;
  cfqe_update_ioprio_class(cfqe);
  place_cfqe(cfqe->st, cfqe, 0);
  __enqueue_cfqe(cfqe->st, cfqe, 0);
@@ -660,7 +723,13 @@ static void enqueue_cfqe(struct cfq_entity *cfqe)
 
 static void enqueue_cfqq(struct cfq_queue *cfqq)
 {
- enqueue_cfqe(&cfqq->entity);
+ struct cfq_entity *cfqe = &cfqq->entity;
+
+ for_each_entity(cfqe) {
+ if (cfqe->on_st)
+ break;
+ enqueue_cfqe(cfqe);
+ }
 }
 
 /* Requeue a cfqq which is already on the service tree */
@@ -687,14 +756,22 @@ static void requeue_cfqq(struct cfq_queue *cfqq, int add_front)
 
 static void cfqe_served(struct cfq_entity *cfqe, unsigned long served)
 {
- /*
- * Can't update entity disk time while it is on sorted rb-tree
- * as vdisktime is used as key.
- */
- __dequeue_cfqe(cfqe->st, cfqe);
- cfqe->vdisktime += cfq_delta_fair(served, cfqe);
- update_min_vdisktime(cfqe->st);
- __enqueue_cfqe(cfqe->st, cfqe, 0);
+ for_each_entity(cfqe) {
+ /*
+ * Can't update entity disk time while it is on sorted rb-tree
+ * as vdisktime is used as key.
+ */
+ __dequeue_cfqe(cfqe->st, cfqe);
+ cfqe->vdisktime += cfq_delta_fair(served, cfqe);
+ update_min_vdisktime(cfqe->st);
+ __enqueue_cfqe(cfqe->st, cfqe, 0);
+
+ /* If entity prio class has changed, take that into account */
+ if (unlikely(cfqe->ioprio_class_changed)) {
+ dequeue_cfqe(cfqe);
+ enqueue_cfqe(cfqe);
+ }
+ }
 }
 
 static void cfqq_served(struct cfq_queue *cfqq, unsigned long served)
@@ -708,12 +785,6 @@ static void cfqq_served(struct cfq_queue *cfqq, unsigned long served)
  */
  served = cfq_prio_to_slice(cfqq->cfqd, cfqq);
  cfqe_served(&cfqq->entity, served);
-
- /* If cfqq prio class has changed, take that into account */
- if (unlikely(cfqq->entity.ioprio_class_changed)) {
- dequeue_cfqq(cfqq);
- enqueue_cfqq(cfqq);
- }
 }
 
 /*
@@ -1941,6 +2012,8 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
  cfq_mark_cfqq_sync(cfqq);
  }
  cfqq->pid = pid;
+ cfqq->entity.parent = &cfqd->root_group.entity;
+ cfqq->entity.my_sd = NULL;
 }
 
 static struct cfq_queue *
@@ -2707,6 +2780,9 @@ static void cfq_init_root_group(struct cfq_data *cfqd)
  struct cfq_group *cfqg = &cfqd->root_group;
  int i;
 
+ cfqg->entity.parent = NULL;
+ cfqg->entity.my_sd = &cfqg->sched_data;
+
  for (i = 0; i < IO_IOPRIO_CLASSES; i++)
  cfqg->sched_data.service_tree[i] = CFQ_RB_ROOT;
 }
--
1.6.2.5


[PATCH 08/20] blkio: Add support for dynamic creation of cfq_groups

by Vivek Goyal-2

o So far we assumed there is one cfq_group in the system (root group). This
  patch introduces the code to map requests to their cgroup and create more
  cfq_groups dynamically and keep track of these groups.
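
The find-or-create mapping can be sketched in userspace as follows. This is only an illustration under simplifying assumptions: `grp`, `cgrp`, `dev`, `lookup_group()` and `get_group()` are hypothetical names, a plain singly linked list replaces the RCU-protected hlist, and there is no locking. It keeps two points from the patch: the device pointer serves as the lookup key, and allocation failure falls back to the root group.

```c
#include <assert.h>
#include <stdlib.h>

struct grp {
	const void *key;	/* device this group belongs to */
	struct grp *next;
};

struct cgrp {
	struct grp *groups;	/* one entry per device this cgroup has done IO on */
};

struct dev {
	struct grp root_group;	/* always present, used as the fallback */
};

/* Find the group a cgroup owns on the device identified by key. */
static struct grp *lookup_group(struct cgrp *cg, const void *key)
{
	struct grp *g;

	for (g = cg->groups; g; g = g->next)
		if (g->key == key)
			return g;
	return NULL;
}

/*
 * Return the cgroup's group on this device. With create set, allocate it
 * on first use and fall back to the root group if the allocation fails.
 */
static struct grp *get_group(struct dev *d, struct cgrp *cg, int create)
{
	struct grp *g = lookup_group(cg, d);

	if (g || !create)
		return g;

	g = calloc(1, sizeof(*g));
	if (!g)
		return &d->root_group;

	g->key = d;
	g->next = cg->groups;
	cg->groups = g;
	return g;
}
```

A caller that only wants to probe for an existing group passes `create = 0` and must be prepared for a NULL result.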

Signed-off-by: Vivek Goyal <vgoyal@...>
---
 block/cfq-iosched.c |  123 ++++++++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 111 insertions(+), 12 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 8ec8a82..4481917 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -145,6 +145,7 @@ struct cfq_group {
  struct cfq_sched_data sched_data;
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
  struct blkio_group blkg;
+ struct hlist_node cfqd_node;
 #endif
 };
 
@@ -212,6 +213,9 @@ struct cfq_data {
  struct cfq_queue oom_cfqq;
 
  unsigned long last_end_sync_rq;
+
+ /* List of cfq groups being managed on this device*/
+ struct hlist_head cfqg_list;
 };
 
 enum cfqq_state_flags {
@@ -286,13 +290,14 @@ static inline struct cfq_group *cfqg_of(struct cfq_entity *cfqe)
 }
 
 static inline void
-init_cfqe_service_tree(struct cfq_data *cfqd, struct cfq_entity *cfqe)
+init_cfqe_service_tree(struct cfq_entity *cfqe, struct cfq_entity *p_cfqe)
 {
+ struct cfq_group *p_cfqg = cfqg_of(p_cfqe);
  unsigned short idx = cfqe->ioprio_class - 1;
 
  BUG_ON(idx >= IO_IOPRIO_CLASSES);
 
- cfqe->st = &cfqd->root_group.sched_data.service_tree[idx];
+ cfqe->st = &p_cfqg->sched_data.service_tree[idx];
 }
 
 static inline s64
@@ -372,16 +377,93 @@ cfq_weight_slice(struct cfq_data *cfqd, int sync, unsigned int weight)
  return cfq_delta(base_slice, weight, BLKIO_WEIGHT_DEFAULT);
 }
 
+static inline void
+cfq_init_cfqe_parent(struct cfq_entity *cfqe, struct cfq_entity *p_cfqe)
+{
+ cfqe->parent = p_cfqe;
+ init_cfqe_service_tree(cfqe, p_cfqe);
+}
+
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
 /* check for entity->parent so that loop is not executed for root entity. */
 #define for_each_entity(entity) \
  for (; entity && entity->parent; entity = entity->parent)
 
+static inline struct cfq_group *cfqg_of_blkg(struct blkio_group *blkg)
+{
+ if (blkg)
+ return container_of(blkg, struct cfq_group, blkg);
+ return NULL;
+}
+
 static inline struct cfq_sched_data *
 cfq_entity_sched_data(struct cfq_entity *cfqe)
 {
  return &cfqg_of(parent_entity(cfqe))->sched_data;
 }
+
+static void cfq_init_cfqg(struct cfq_group *cfqg, struct blkio_cgroup *blkcg)
+{
+ struct cfq_entity *cfqe = &cfqg->entity;
+
+ cfqe->weight = blkcg->weight;
+ cfqe->ioprio_class = blkcg->ioprio_class;
+ cfqe->ioprio_class_changed = 1;
+ cfqe->my_sd = &cfqg->sched_data;
+}
+
+static struct cfq_group *
+cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
+{
+ struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
+ struct cfq_group *cfqg = NULL;
+ void *key = cfqd;
+
+ /* Do we need to take this reference */
+ if (!css_tryget(&blkcg->css))
+ return NULL;
+
+ cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
+ if (cfqg || !create)
+ goto done;
+
+ cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC |  __GFP_ZERO,
+ cfqd->queue->node);
+ if (!cfqg)
+ goto done;
+
+ cfq_init_cfqg(cfqg, blkcg);
+ cfq_init_cfqe_parent(&cfqg->entity, &cfqd->root_group.entity);
+
+ /* Add group onto cgroup list */
+ blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd);
+
+ /* Add group on cfqd list */
+ hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
+
+done:
+ css_put(&blkcg->css);
+ return cfqg;
+}
+
+/*
+ * Search for the cfq group current task belongs to. If create = 1, then also
+ * create the cfq group if it does not exist.
+ * Should be called under request queue lock.
+ */
+static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+{
+ struct cgroup *cgroup;
+ struct cfq_group *cfqg = NULL;
+
+ rcu_read_lock();
+ cgroup = task_cgroup(current, blkio_subsys_id);
+ cfqg = cfq_find_alloc_cfqg(cfqd, cgroup, create);
+ if (!cfqg && create)
+ cfqg = &cfqd->root_group;
+ rcu_read_unlock();
+ return cfqg;
+}
 #else /* CONFIG_CFQ_GROUP_IOSCHED */
 #define for_each_entity(entity) \
  for (; entity != NULL; entity = NULL)
@@ -397,6 +479,11 @@ cfq_entity_sched_data(struct cfq_entity *cfqe)
 
  return &cfqd->root_group.sched_data;
 }
+
+static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+{
+ return &cfqd->root_group;
+}
 #endif /* CONFIG_CFQ_GROUP_IOSCHED */
 
 static inline int rq_in_driver(struct cfq_data *cfqd)
@@ -624,14 +711,11 @@ place_cfqe(struct cfq_service_tree *st, struct cfq_entity *cfqe, int add_front)
 static inline void cfqe_update_ioprio_class(struct cfq_entity *cfqe)
 {
  if (unlikely(cfqe->ioprio_class_changed)) {
- struct cfq_queue *cfqq = cfqq_of(cfqe);
- struct cfq_data *cfqd = cfqq->cfqd;
-
  /*
  * Re-initialize the service tree pointer as ioprio class
  * change will lead to service tree change.
  */
- init_cfqe_service_tree(cfqd, cfqe);
+ init_cfqe_service_tree(cfqe, parent_entity(cfqe));
  cfqe->ioprio_class_changed = 0;
  cfqe->vdisktime = 0;
  }
@@ -1229,16 +1313,19 @@ static struct cfq_entity *cfq_get_next_entity(struct cfq_sched_data *sd)
 static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
 {
  struct cfq_entity *cfqe = NULL;
+ struct cfq_sched_data *sd;
 
  if (!cfqd->rq_queued)
  return NULL;
 
- cfqe = cfq_get_next_entity(&cfqd->root_group.sched_data);
+ sd = &cfqd->root_group.sched_data;
+ for (; sd ; sd = cfqe->my_sd) {
+ cfqe = cfq_get_next_entity(sd);
+ if (!cfqe)
+ return NULL;
+ }
 
- if (cfqe)
- return cfqq_of(cfqe);
- else
- return NULL;
+ return cfqq_of(cfqe);
 }
 
 /*
@@ -2012,8 +2099,17 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
  cfq_mark_cfqq_sync(cfqq);
  }
  cfqq->pid = pid;
- cfqq->entity.parent = &cfqd->root_group.entity;
+}
+
+static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
+{
  cfqq->entity.my_sd = NULL;
+
+ /* Currently, all async queues are mapped to root group */
+ if (!cfq_cfqq_sync(cfqq))
+ cfqg = &cfqq->cfqd->root_group;
+
+ cfq_init_cfqe_parent(&cfqq->entity, &cfqg->entity);
 }
 
 static struct cfq_queue *
@@ -2022,8 +2118,10 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
 {
  struct cfq_queue *cfqq, *new_cfqq = NULL;
  struct cfq_io_context *cic;
+ struct cfq_group *cfqg;
 
 retry:
+ cfqg = cfq_get_cfqg(cfqd, 1);
  cic = cfq_cic_lookup(cfqd, ioc);
  /* cic always exists here */
  cfqq = cic_to_cfqq(cic, is_sync);
@@ -2054,6 +2152,7 @@ retry:
  if (cfqq) {
  cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync);
  cfq_init_prio_data(cfqq, ioc);
+ cfq_link_cfqq_cfqg(cfqq, cfqg);
  cfq_log_cfqq(cfqd, cfqq, "alloced");
  } else
  cfqq = &cfqd->oom_cfqq;
--
1.6.2.5


[PATCH 09/20] blkio: Propagate blkio cgroup weight or ioprio class updates to cfq groups

by Vivek Goyal-2

o If a user decides to change the weight or ioprio class of a cgroup, this
  information also needs to be passed on to the io controlling policy module
  so that the new values can take effect.
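
A minimal userspace sketch of the propagation step, assuming simplified, hypothetical types (`grp`, `cgrp`) and a hypothetical helper `cgrp_set_weight()`: the new value is range-checked, stored in the cgroup, and then copied to every per-device group hanging off it. The patch does the walk under `blkcg->lock`; locking is omitted here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define WEIGHT_MIN	100
#define WEIGHT_MAX	1000

struct grp {
	unsigned int weight;	/* value the IO scheduler actually uses */
	struct grp *next;
};

struct cgrp {
	unsigned int weight;	/* value the user wrote to the cgroup file */
	struct grp *groups;	/* per-device groups owned by this cgroup */
};

/* Validate the new weight, store it, then push it to every group. */
static int cgrp_set_weight(struct cgrp *cg, unsigned long long val)
{
	struct grp *g;

	if (val < WEIGHT_MIN || val > WEIGHT_MAX)
		return -EINVAL;

	cg->weight = (unsigned int)val;
	for (g = cg->groups; g; g = g->next)
		g->weight = cg->weight;
	return 0;
}
```

An ioprio class update would follow the same shape, with the per-group callback setting the class and a changed flag instead of the weight.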

Signed-off-by: Vivek Goyal <vgoyal@...>
---
 block/blk-cgroup.c  |   16 ++++++++++++++++
 block/cfq-iosched.c |   18 ++++++++++++++++++
 2 files changed, 34 insertions(+), 0 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 7bde5c4..0d52a2c 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -13,6 +13,10 @@
 #include <linux/ioprio.h>
 #include "blk-cgroup.h"
 
+extern void cfq_update_blkio_group_weight(struct blkio_group *, unsigned int);
+extern void cfq_update_blkio_group_ioprio_class(struct blkio_group *,
+ unsigned short);
+
 struct blkio_cgroup blkio_root_cgroup = {
  .weight = BLKIO_WEIGHT_DEFAULT,
  .ioprio_class = IOPRIO_CLASS_BE,
@@ -75,12 +79,18 @@ static int
 blkiocg_weight_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
 {
  struct blkio_cgroup *blkcg;
+ struct blkio_group *blkg;
+ struct hlist_node *n;
 
  if (val < BLKIO_WEIGHT_MIN || val > BLKIO_WEIGHT_MAX)
  return -EINVAL;
 
  blkcg = cgroup_to_blkio_cgroup(cgroup);
+ spin_lock_irq(&blkcg->lock);
  blkcg->weight = (unsigned int)val;
+ hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node)
+ cfq_update_blkio_group_weight(blkg, blkcg->weight);
+ spin_unlock_irq(&blkcg->lock);
  return 0;
 }
 
@@ -88,12 +98,18 @@ static int blkiocg_ioprio_class_write(struct cgroup *cgroup,
  struct cftype *cftype, u64 val)
 {
  struct blkio_cgroup *blkcg;
+ struct blkio_group *blkg;
+ struct hlist_node *n;
 
  if (val < IOPRIO_CLASS_RT || val > IOPRIO_CLASS_IDLE)
  return -EINVAL;
 
  blkcg = cgroup_to_blkio_cgroup(cgroup);
+ spin_lock_irq(&blkcg->lock);
  blkcg->ioprio_class = (unsigned int)val;
+ hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node)
+ cfq_update_blkio_group_ioprio_class(blkg, blkcg->ioprio_class);
+ spin_unlock_irq(&blkcg->lock);
  return 0;
 }
 
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 4481917..3c0fa1b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -464,6 +464,24 @@ static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
  rcu_read_unlock();
  return cfqg;
 }
+
+void
+cfq_update_blkio_group_weight(struct blkio_group *blkg, unsigned int weight)
+{
+ struct cfq_group *cfqg = cfqg_of_blkg(blkg);
+
+ cfqg->entity.weight = weight;
+}
+
+void cfq_update_blkio_group_ioprio_class(struct blkio_group *blkg,
+ unsigned short ioprio_class)
+{
+ struct cfq_group *cfqg = cfqg_of_blkg(blkg);
+
+ cfqg->entity.ioprio_class = ioprio_class;
+ smp_wmb();
+ cfqg->entity.ioprio_class_changed = 1;
+}
 #else /* CONFIG_CFQ_GROUP_IOSCHED */
 #define for_each_entity(entity) \
  for (; entity != NULL; entity = NULL)
--
1.6.2.5


[PATCH 10/20] blkio: Implement cfq group deletion and reference counting support

by Vivek Goyal-2

o With dynamic cfq_groups comes the need to make sure cfq_groups can be
  freed when either the elevator exits or the cgroup is deleted.

o This patch takes care of elevator exit and cgroup deletion paths and also
  implements cfq_group reference counting so that a cgroup can be removed
  even if there are backlogged requests in the associated cfq_groups.
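
The reference counting scheme can be sketched in userspace like this. The names (`grp`, `grp_create()`, `grp_get()`, `grp_put()`, `grp_destroy()`) are hypothetical, a plain `int` stands in for `atomic_t`, and the `freed_counter` pointer exists only so that a caller can observe when the group is released. The group starts with one joint reference owned by the cgroup and the elevator, every queue takes its own reference, and whichever teardown path unlinks the group first drops the joint reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct grp {
	int ref;		/* joint creation reference plus one per queue */
	bool linked;		/* still on the device's group list */
	int *freed_counter;	/* observation hook for this sketch only */
};

/* Create a group holding the initial joint reference. */
static struct grp *grp_create(int *freed_counter)
{
	struct grp *g = calloc(1, sizeof(*g));

	if (!g)
		return NULL;
	g->ref = 1;
	g->linked = true;
	g->freed_counter = freed_counter;
	return g;
}

/* Taken by each queue that is linked to the group. */
static void grp_get(struct grp *g)
{
	g->ref++;
}

/* Drop one reference; the last put frees the group. */
static void grp_put(struct grp *g)
{
	assert(g->ref > 0);
	if (--g->ref)
		return;
	if (g->freed_counter)
		(*g->freed_counter)++;
	free(g);
}

/*
 * Unlink the group and drop the joint reference. Returns false if another
 * teardown path already unlinked it, so the reference is dropped only once.
 */
static bool grp_destroy(struct grp *g)
{
	if (!g->linked)
		return false;
	g->linked = false;
	grp_put(g);
	return true;
}
```

With this arrangement a cgroup can be removed while queues still hold requests: the group is unlinked immediately, and the memory is released when the last queue drops its reference.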

Signed-off-by: Vivek Goyal <vgoyal@...>
Signed-off-by: Nauman Rafique <nauman@...>
---
 block/blk-cgroup.c  |   66 +++++++++++++++++++++++-
 block/blk-cgroup.h  |    2 +
 block/cfq-iosched.c |  143 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 208 insertions(+), 3 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 0d52a2c..a62b8a3 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -16,6 +16,7 @@
 extern void cfq_update_blkio_group_weight(struct blkio_group *, unsigned int);
 extern void cfq_update_blkio_group_ioprio_class(struct blkio_group *,
  unsigned short);
+extern void cfq_delink_blkio_group(void *, struct blkio_group *);
 
 struct blkio_cgroup blkio_root_cgroup = {
  .weight = BLKIO_WEIGHT_DEFAULT,
@@ -35,14 +36,43 @@ void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
 
  spin_lock_irqsave(&blkcg->lock, flags);
  rcu_assign_pointer(blkg->key, key);
+ blkg->blkcg_id = css_id(&blkcg->css);
  hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
  spin_unlock_irqrestore(&blkcg->lock, flags);
 }
 
+static void __blkiocg_del_blkio_group(struct blkio_group *blkg)
+{
+ hlist_del_init_rcu(&blkg->blkcg_node);
+ blkg->blkcg_id = 0;
+}
+
+/*
+ * returns 0 if blkio_group was still on cgroup list. Otherwise returns 1
+ * indicating that blk_group was unhashed by the time we got to it.
+ */
 int blkiocg_del_blkio_group(struct blkio_group *blkg)
 {
- /* Implemented later */
- return 0;
+ struct blkio_cgroup *blkcg;
+ unsigned long flags;
+ struct cgroup_subsys_state *css;
+ int ret = 1;
+
+ rcu_read_lock();
+ css = css_lookup(&blkio_subsys, blkg->blkcg_id);
+ if (!css)
+ goto out;
+
+ blkcg = container_of(css, struct blkio_cgroup, css);
+ spin_lock_irqsave(&blkcg->lock, flags);
+ if (!hlist_unhashed(&blkg->blkcg_node)) {
+ __blkiocg_del_blkio_group(blkg);
+ ret = 0;
+ }
+ spin_unlock_irqrestore(&blkcg->lock, flags);
+out:
+ rcu_read_unlock();
+ return ret;
 }
 
 /* called under rcu_read_lock(). */
@@ -135,8 +165,40 @@ static int blkiocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 static void blkiocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 {
  struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
+ unsigned long flags;
+ struct blkio_group *blkg;
+ void *key;
 
+ rcu_read_lock();
+remove_entry:
+ spin_lock_irqsave(&blkcg->lock, flags);
+
+ if (hlist_empty(&blkcg->blkg_list)) {
+ spin_unlock_irqrestore(&blkcg->lock, flags);
+ goto done;
+ }
+
+ blkg = hlist_entry(blkcg->blkg_list.first, struct blkio_group,
+ blkcg_node);
+ key = rcu_dereference(blkg->key);
+ __blkiocg_del_blkio_group(blkg);
+
+ spin_unlock_irqrestore(&blkcg->lock, flags);
+
+ /*
+ * This blkio_group is being delinked as associated cgroup is going
+ * away. Let all the IO controlling policies know about this event.
+ *
+ * Currently this is static call to one io controlling policy. Once
+ * we have more policies in place, we need some dynamic registration
+ * of callback function.
+ */
+ cfq_delink_blkio_group(key, blkg);
+ goto remove_entry;
+done:
  free_css_id(&blkio_subsys, &blkcg->css);
+ rcu_read_unlock();
+
  kfree(blkcg);
 }
 
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 49ca84b..2bf736b 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -25,12 +25,14 @@ struct blkio_group {
  /* An rcu protected unique identifier for the group */
  void *key;
  struct hlist_node blkcg_node;
+ unsigned short blkcg_id;
 };
 
 #define BLKIO_WEIGHT_MIN 100
 #define BLKIO_WEIGHT_MAX 1000
 #define BLKIO_WEIGHT_DEFAULT 500
 
+extern struct blkio_cgroup blkio_root_cgroup;
 struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
 void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
  struct blkio_group *blkg, void *key);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 3c0fa1b..b9a052b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -146,6 +146,7 @@ struct cfq_group {
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
  struct blkio_group blkg;
  struct hlist_node cfqd_node;
+ atomic_t ref;
 #endif
 };
 
@@ -295,8 +296,18 @@ init_cfqe_service_tree(struct cfq_entity *cfqe, struct cfq_entity *p_cfqe)
  struct cfq_group *p_cfqg = cfqg_of(p_cfqe);
  unsigned short idx = cfqe->ioprio_class - 1;
 
- BUG_ON(idx >= IO_IOPRIO_CLASSES);
+ /*
+ * ioprio class of the entity has not been initialized yet, don't
+ * init service tree right now. This can happen in the case of
+ * oom_cfqq which will inherit its class and prio once first request
+ * gets queued in and at that point of time prio update will make
+ * sure that service tree gets initialized before queue gets onto
+ * tree.
+ */
+ if (cfqe->ioprio_class == IOPRIO_CLASS_NONE)
+ return;
 
+ BUG_ON(idx >= IO_IOPRIO_CLASSES);
  cfqe->st = &p_cfqg->sched_data.service_tree[idx];
 }
 
@@ -402,6 +413,16 @@ cfq_entity_sched_data(struct cfq_entity *cfqe)
  return &cfqg_of(parent_entity(cfqe))->sched_data;
 }
 
+static inline struct cfq_group *cfqq_to_cfqg(struct cfq_queue *cfqq)
+{
+ return cfqg_of(parent_entity(&cfqq->entity));
+}
+
+static inline void cfq_get_cfqg_ref(struct cfq_group *cfqg)
+{
+ atomic_inc(&cfqg->ref);
+}
+
 static void cfq_init_cfqg(struct cfq_group *cfqg, struct blkio_cgroup *blkcg)
 {
  struct cfq_entity *cfqe = &cfqg->entity;
@@ -435,6 +456,14 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
  cfq_init_cfqg(cfqg, blkcg);
  cfq_init_cfqe_parent(&cfqg->entity, &cfqd->root_group.entity);
 
+ /*
+ * Take the initial reference that will be released on destroy
+ * This can be thought of as a joint reference by cgroup and
+ * elevator which will be dropped by either elevator exit
+ * or cgroup deletion path depending on who is exiting first.
+ */
+ cfq_get_cfqg_ref(cfqg);
+
  /* Add group onto cgroup list */
  blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd);
 
@@ -482,9 +511,87 @@ void cfq_update_blkio_group_ioprio_class(struct blkio_group *blkg,
  smp_wmb();
  cfqg->entity.ioprio_class_changed = 1;
 }
+
+static void cfq_put_cfqg(struct cfq_group *cfqg)
+{
+ struct cfq_service_tree *st;
+ int i;
+
+ BUG_ON(atomic_read(&cfqg->ref) <= 0);
+ if (!atomic_dec_and_test(&cfqg->ref))
+ return;
+
+ for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+ st = cfqg->sched_data.service_tree + i;
+ BUG_ON(!RB_EMPTY_ROOT(&st->rb));
+ BUG_ON(st->active != NULL);
+ }
+
+ kfree(cfqg);
+}
+
+static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
+{
+ /* Something wrong if we are trying to remove same group twice */
+ BUG_ON(hlist_unhashed(&cfqg->cfqd_node));
+
+ hlist_del_init(&cfqg->cfqd_node);
+
+ /*
+ * Put the reference taken at the time of creation so that when all
+ * queues are gone, group can be destroyed.
+ */
+ cfq_put_cfqg(cfqg);
+}
+
+static void cfq_release_cfq_groups(struct cfq_data *cfqd)
+{
+ struct hlist_node *pos, *n;
+ struct cfq_group *cfqg;
+
+ hlist_for_each_entry_safe(cfqg, pos, n, &cfqd->cfqg_list, cfqd_node) {
+ /*
+ * If cgroup removal path got to blk_group first and removed
+ * it from cgroup list, then it will take care of destroying
+ * cfqg also.
+ */
+ if (!blkiocg_del_blkio_group(&cfqg->blkg))
+ cfq_destroy_cfqg(cfqd, cfqg);
+ }
+}
+
+/*
+ * Blk cgroup controller notification saying that blkio_group object is being
+ * delinked as associated cgroup object is going away. That also means that
+ * no new IO will come in this group. So get rid of this group as soon as
+ * any pending IO in the group is finished.
+ *
+ * This function is called under rcu_read_lock(). key is the rcu protected
+ * pointer. That means "key" is a valid cfq_data pointer as long as we hold
+ * the rcu read lock.
+ *
+ * "key" was fetched from blkio_group under blkio_cgroup->lock. That means
+ * it should not be NULL as even if elevator was exiting, cgroup deletion
+ * path got to it first.
+ */
+void cfq_delink_blkio_group(void *key, struct blkio_group *blkg)
+{
+ unsigned long  flags;
+ struct cfq_data *cfqd = key;
+
+ spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+ cfq_destroy_cfqg(cfqd, cfqg_of_blkg(blkg));
+ spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+}
+
 #else /* CONFIG_CFQ_GROUP_IOSCHED */
 #define for_each_entity(entity) \
  for (; entity != NULL; entity = NULL)
+
+static void cfq_release_cfq_groups(struct cfq_data *cfqd) {}
+static inline void cfq_get_cfqg_ref(struct cfq_group *cfqg) {}
+static inline void cfq_put_cfqg(struct cfq_group *cfqg) {}
+
 static inline struct cfq_data *cfqd_of(struct cfq_entity *cfqe)
 {
  return cfqq_of(cfqe)->cfqd;
@@ -498,6 +605,11 @@ cfq_entity_sched_data(struct cfq_entity *cfqe)
  return &cfqd->root_group.sched_data;
 }
 
+static inline struct cfq_group *cfqq_to_cfqg(struct cfq_queue *cfqq)
+{
+ return &cfqq->cfqd->root_group;
+}
+
 static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
 {
  return &cfqd->root_group;
@@ -1818,11 +1930,13 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
  * task holds one reference to the queue, dropped when task exits. each rq
  * in-flight on this queue also holds a reference, dropped when rq is freed.
  *
+ * Each cfq queue took a reference on the parent group. Drop it now.
  * queue lock must be held here.
  */
 static void cfq_put_queue(struct cfq_queue *cfqq)
 {
  struct cfq_data *cfqd = cfqq->cfqd;
+ struct cfq_group *cfqg;
 
  BUG_ON(atomic_read(&cfqq->ref) <= 0);
 
@@ -1832,6 +1946,7 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
  cfq_log_cfqq(cfqd, cfqq, "put_queue");
  BUG_ON(rb_first(&cfqq->sort_list));
  BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
+ cfqg = cfqq_to_cfqg(cfqq);
 
  if (unlikely(cfqd->active_queue == cfqq)) {
  __cfq_slice_expired(cfqd, cfqq);
@@ -1841,6 +1956,7 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
  BUG_ON(cfq_cfqq_on_rr(cfqq));
 
  kmem_cache_free(cfq_pool, cfqq);
+ cfq_put_cfqg(cfqg);
 }
 
 /*
@@ -2128,6 +2244,9 @@ static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
  cfqg = &cfqq->cfqd->root_group;
 
  cfq_init_cfqe_parent(&cfqq->entity, &cfqg->entity);
+
+ /* cfqq reference on cfqg */
+ cfq_get_cfqg_ref(cfqg);
 }
 
 static struct cfq_queue *
@@ -2902,6 +3021,23 @@ static void cfq_init_root_group(struct cfq_data *cfqd)
 
  for (i = 0; i < IO_IOPRIO_CLASSES; i++)
  cfqg->sched_data.service_tree[i] = CFQ_RB_ROOT;
+
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ atomic_set(&cfqg->ref, 0);
+ /*
+ * Take a reference to root group which we never drop. This is just
+ * to make sure that cfq_put_cfqg() does not try to kfree root group
+ */
+ cfq_get_cfqg_ref(cfqg);
+ blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg, (void *)cfqd);
+#endif
+}
+
+static void cfq_exit_root_group(struct cfq_data *cfqd)
+{
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ blkiocg_del_blkio_group(&cfqd->root_group.blkg);
+#endif
 }
 
 static void cfq_exit_queue(struct elevator_queue *e)
@@ -2926,10 +3062,14 @@ static void cfq_exit_queue(struct elevator_queue *e)
 
  cfq_put_async_queues(cfqd);
 
+ cfq_release_cfq_groups(cfqd);
+ cfq_exit_root_group(cfqd);
  spin_unlock_irq(q->queue_lock);
 
  cfq_shutdown_timer_wq(cfqd);
 
+ /* Wait for cfqg->blkg->key accessors to exit their grace periods. */
+ synchronize_rcu();
  kfree(cfqd);
 }
 
@@ -2959,6 +3099,7 @@ static void *cfq_init_queue(struct request_queue *q)
  */
  cfq_init_cfqq(cfqd, &cfqd->oom_cfqq, 1, 0);
  atomic_inc(&cfqd->oom_cfqq.ref);
+ cfq_link_cfqq_cfqg(&cfqd->oom_cfqq, &cfqd->root_group);
 
  INIT_LIST_HEAD(&cfqd->cic_list);
 
--
1.6.2.5
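
The reference counting in the patch above can be illustrated with a small userspace sketch. It is not the kernel code: struct group, group_get(), group_put() and group_alloc() are hypothetical stand-ins for cfq_group, cfq_get_cfqg_ref(), cfq_put_cfqg() and the allocation path, and a plain int replaces atomic_t. The idea shown is that every queue linked to a group holds one reference, the group is freed when the last reference is dropped, and the root group keeps one reference forever so it is never freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct cfq_group: only what the sketch needs. */
struct group {
	int ref;        /* plain int here; the patch uses atomic_t */
	bool is_root;   /* the root group is embedded elsewhere and never freed */
};

/* Number of groups freed so far, kept only so the sketch can be checked. */
static int groups_freed;

/* Take one reference on the group. */
static void group_get(struct group *g)
{
	g->ref++;
}

/* Drop one reference; free the group when the last one goes away. */
static void group_put(struct group *g)
{
	assert(g->ref > 0);
	if (--g->ref)
		return;
	assert(!g->is_root);    /* the permanent reference prevents this */
	groups_freed++;
	free(g);
}

/* Allocate a group that starts with one reference held by its creator. */
static struct group *group_alloc(void)
{
	struct group *g = calloc(1, sizeof(*g));

	if (g)
		group_get(g);
	return g;
}
```

A queue would call group_get() when it is linked to a group and group_put() when the queue itself is released, which is the pairing the patch adds to cfq_link_cfqq_cfqg() and cfq_put_queue().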


[PATCH 11/20] blkio: Some CFQ debugging Aid

by Vivek Goyal-2

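o Add some CFQ debugging aids: optionally store the cgroup path in the blkio
  group and make the blktrace messages more verbose.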

Signed-off-by: Vivek Goyal <vgoyal@...>
---
 block/Kconfig         |    9 +++++++++
 block/Kconfig.iosched |    9 +++++++++
 block/blk-cgroup.c    |    4 ++++
 block/blk-cgroup.h    |   13 +++++++++++++
 block/cfq-iosched.c   |   33 +++++++++++++++++++++++++++++++++
 5 files changed, 68 insertions(+), 0 deletions(-)

diff --git a/block/Kconfig b/block/Kconfig
index 6ba1a8e..e20fbde 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -90,6 +90,15 @@ config BLK_CGROUP
  control disk bandwidth allocation (proportional time slice allocation)
  to such task groups.
 
+config DEBUG_BLK_CGROUP
+ bool
+ depends on BLK_CGROUP
+ default n
+ ---help---
+ Enable some debugging help. Currently it stores the cgroup path
+ in the blk group which can be used by cfq for tracing various
+ group related activity.
+
 endif # BLOCK
 
 config BLOCK_COMPAT
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index a521c69..9c5f0b5 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -48,6 +48,15 @@ config CFQ_GROUP_IOSCHED
  ---help---
   Enable group IO scheduling in CFQ.
 
+config DEBUG_CFQ_IOSCHED
+ bool "Debug CFQ Scheduling"
+ depends on CFQ_GROUP_IOSCHED
+ select DEBUG_BLK_CGROUP
+ default n
+ ---help---
+  Enable IO scheduling debugging in CFQ. Currently it makes
+  blktrace output more verbose.
+
 choice
  prompt "Default I/O scheduler"
  default DEFAULT_CFQ
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index a62b8a3..4c68682 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -39,6 +39,10 @@ void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
  blkg->blkcg_id = css_id(&blkcg->css);
  hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
  spin_unlock_irqrestore(&blkcg->lock, flags);
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+ /* Need to take css reference ? */
+ cgroup_path(blkcg->css.cgroup, blkg->path, sizeof(blkg->path));
+#endif
 }
 
 static void __blkiocg_del_blkio_group(struct blkio_group *blkg)
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 2bf736b..cb72c35 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -26,12 +26,25 @@ struct blkio_group {
  void *key;
  struct hlist_node blkcg_node;
  unsigned short blkcg_id;
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+ /* Store cgroup path */
+ char path[128];
+#endif
 };
 
 #define BLKIO_WEIGHT_MIN 100
 #define BLKIO_WEIGHT_MAX 1000
 #define BLKIO_WEIGHT_DEFAULT 500
 
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+static inline char *blkg_path(struct blkio_group *blkg)
+{
+ return blkg->path;
+}
+#else
+static inline char *blkg_path(struct blkio_group *blkg) { return NULL; }
+#endif
+
 extern struct blkio_cgroup blkio_root_cgroup;
 struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
 void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b9a052b..2fde3c4 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -258,8 +258,29 @@ CFQ_CFQQ_FNS(sync);
 CFQ_CFQQ_FNS(coop);
 #undef CFQ_CFQQ_FNS
 
+#ifdef CONFIG_DEBUG_CFQ_IOSCHED
+#define cfq_log_cfqq(cfqd, cfqq, fmt, args...) \
+ blk_add_trace_msg((cfqd)->queue, "cfq%d%c %s " fmt, (cfqq)->pid, \
+ cfq_cfqq_sync((cfqq)) ? 'S' : 'A', \
+ blkg_path(&cfqq_to_cfqg((cfqq))->blkg), ##args);
+
+#define cfq_log_cfqe(cfqd, cfqe, fmt, args...) \
+ do { if (cfqq_of(cfqe)) { \
+ struct cfq_queue *cfqq = cfqq_of(cfqe); \
+ blk_add_trace_msg((cfqd)->queue, "cfq%d%c %s " fmt, \
+ (cfqq)->pid, cfq_cfqq_sync((cfqq)) ? 'S' : 'A', \
+ blkg_path(&cfqq_to_cfqg((cfqq))->blkg), ##args);\
+ } else { \
+ struct cfq_group *cfqg = cfqg_of(cfqe); \
+ blk_add_trace_msg((cfqd)->queue, "%s " fmt, \
+ blkg_path(&(cfqg)->blkg), ##args); \
+ } } while (0)
+#else
 #define cfq_log_cfqq(cfqd, cfqq, fmt, args...) \
  blk_add_trace_msg((cfqd)->queue, "cfq%d " fmt, (cfqq)->pid, ##args)
+#define cfq_log_cfqe(cfqd, cfqe, fmt, args...)
+#endif
+
 #define cfq_log(cfqd, fmt, args...) \
  blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
 
@@ -400,6 +421,8 @@ cfq_init_cfqe_parent(struct cfq_entity *cfqe, struct cfq_entity *p_cfqe)
 #define for_each_entity(entity) \
  for (; entity && entity->parent; entity = entity->parent)
 
+#define cfqe_is_cfqq(cfqe)     (!(cfqe)->my_sd)
+
 static inline struct cfq_group *cfqg_of_blkg(struct blkio_group *blkg)
 {
  if (blkg)
@@ -588,6 +611,8 @@ void cfq_delink_blkio_group(void *key, struct blkio_group *blkg)
 #define for_each_entity(entity) \
  for (; entity != NULL; entity = NULL)
 
+#define cfqe_is_cfqq(cfqe)     1
+
 static void cfq_release_cfq_groups(struct cfq_data *cfqd) {}
 static inline void cfq_get_cfqg_ref(struct cfq_group *cfqg) {}
 static inline void cfq_put_cfqg(struct cfq_group *cfqg) {}
@@ -885,6 +910,10 @@ static void dequeue_cfqq(struct cfq_queue *cfqq)
  struct cfq_sched_data *sd = cfq_entity_sched_data(cfqe);
 
  dequeue_cfqe(cfqe);
+ if (!cfqe_is_cfqq(cfqe)) {
+ cfq_log_cfqe(cfqq->cfqd, cfqe, "del_from_rr group");
+ }
+
  /* Do not dequeue parent if it has other entities under it */
  if (sd->nr_active)
  break;
@@ -970,6 +999,8 @@ static void requeue_cfqq(struct cfq_queue *cfqq, int add_front)
 
 static void cfqe_served(struct cfq_entity *cfqe, unsigned long served)
 {
+ struct cfq_data *cfqd = cfqq_of(cfqe)->cfqd;
+
  for_each_entity(cfqe) {
  /*
  * Can't update entity disk time while it is on sorted rb-tree
@@ -979,6 +1010,8 @@ static void cfqe_served(struct cfq_entity *cfqe, unsigned long served)
  cfqe->vdisktime += cfq_delta_fair(served, cfqe);
  update_min_vdisktime(cfqe->st);
  __enqueue_cfqe(cfqe->st, cfqe, 0);
+ cfq_log_cfqe(cfqd, cfqe, "served: vt=%llx min_vt=%llx",
+ cfqe->vdisktime, cfqe->st->min_vdisktime);
 
  /* If entity prio class has changed, take that into account */
  if (unlikely(cfqe->ioprio_class_changed)) {
--
1.6.2.5
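
To make the message layout of the debug variant of cfq_log_cfqq() easier to picture, here is a minimal userspace sketch. format_queue_msg() is a hypothetical helper, not part of the patch; it only reproduces the "cfq%d%c %s " prefix, where the character is 'S' for a sync queue and 'A' for an async one and the string is the cgroup path. Printing a NULL path as an empty string is a choice made for this sketch only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper mirroring the prefix used by the debug
 * cfq_log_cfqq(): "cfq<pid><S|A> <cgroup path> <message>".
 * A NULL path is printed as an empty string (sketch-only choice).
 */
static int format_queue_msg(char *buf, size_t len, int pid, int sync,
			    const char *path, const char *msg)
{
	assert(buf != NULL && msg != NULL);
	return snprintf(buf, len, "cfq%d%c %s %s", pid, sync ? 'S' : 'A',
			path ? path : "", msg);
}
```

With a pid of 4321, a sync queue, the path "/test1" and the message "put_queue", the helper produces "cfq4321S /test1 put_queue".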


[PATCH 12/20] blkio: Export disk time and sectors dispatched from cgroup interface

by Vivek Goyal-2

Signed-off-by: Vivek Goyal <vgoyal@...>
---
 block/blk-cgroup.c  |   47 ++++++++++++++++++++++++++++++++++++++++++++++-
 block/blk-cgroup.h  |   10 +++++++++-
 block/cfq-iosched.c |   29 +++++++++++++++++++++++++++--
 3 files changed, 82 insertions(+), 4 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4c68682..47c0ce7 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -11,6 +11,8 @@
  *              Nauman Rafique <nauman@...>
  */
 #include <linux/ioprio.h>
+#include <linux/seq_file.h>
+#include <linux/kdev_t.h>
 #include "blk-cgroup.h"
 
 extern void cfq_update_blkio_group_weight(struct blkio_group *, unsigned int);
@@ -29,8 +31,15 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
     struct blkio_cgroup, css);
 }
 
+void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
+ unsigned long time, unsigned long sectors)
+{
+ blkg->time += time;
+ blkg->sectors += sectors;
+}
+
 void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
- struct blkio_group *blkg, void *key)
+ struct blkio_group *blkg, void *key, dev_t dev)
 {
  unsigned long flags;
 
@@ -43,6 +52,7 @@ void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
  /* Need to take css reference ? */
  cgroup_path(blkcg->css.cgroup, blkg->path, sizeof(blkg->path));
 #endif
+ blkg->dev = dev;
 }
 
 static void __blkiocg_del_blkio_group(struct blkio_group *blkg)
@@ -147,6 +157,33 @@ static int blkiocg_ioprio_class_write(struct cgroup *cgroup,
  return 0;
 }
 
+#define SHOW_FUNCTION_PER_GROUP(__VAR) \
+static int blkiocg_##__VAR##_read(struct cgroup *cgroup, \
+ struct cftype *cftype, struct seq_file *m) \
+{ \
+ struct blkio_cgroup *blkcg; \
+ struct blkio_group *blkg; \
+ struct hlist_node *n; \
+ \
+ if (!cgroup_lock_live_group(cgroup)) \
+ return -ENODEV; \
+ \
+ blkcg = cgroup_to_blkio_cgroup(cgroup); \
+ rcu_read_lock(); \
+ hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {\
+ if (blkg->dev) \
+ seq_printf(m, "%u:%u %lu\n", MAJOR(blkg->dev), \
+ MINOR(blkg->dev), blkg->__VAR); \
+ } \
+ rcu_read_unlock(); \
+ cgroup_unlock(); \
+ return 0; \
+}
+
+SHOW_FUNCTION_PER_GROUP(time);
+SHOW_FUNCTION_PER_GROUP(sectors);
+#undef SHOW_FUNCTION_PER_GROUP
+
 struct cftype blkio_files[] = {
  {
  .name = "weight",
@@ -158,6 +195,14 @@ struct cftype blkio_files[] = {
  .read_u64 = blkiocg_ioprio_class_read,
  .write_u64 = blkiocg_ioprio_class_write,
  },
+ {
+ .name = "time",
+ .read_seq_string = blkiocg_time_read,
+ },
+ {
+ .name = "sectors",
+ .read_seq_string = blkiocg_sectors_read,
+ },
 };
 
 static int blkiocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index cb72c35..08f4ef8 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -30,6 +30,12 @@ struct blkio_group {
  /* Store cgroup path */
  char path[128];
 #endif
+ /* The device MKDEV(major, minor), this group has been created for */
+ dev_t   dev;
+
+ /* total disk time and nr sectors dispatched by this group */
+ unsigned long time;
+ unsigned long sectors;
 };
 
 #define BLKIO_WEIGHT_MIN 100
@@ -48,6 +54,8 @@ static inline char *blkg_path(struct blkio_group *blkg) { return NULL; }
 extern struct blkio_cgroup blkio_root_cgroup;
 struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
 void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
- struct blkio_group *blkg, void *key);
+ struct blkio_group *blkg, void *key, dev_t dev);
 int blkiocg_del_blkio_group(struct blkio_group *blkg);
 struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key);
+void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
+ unsigned long time, unsigned long sectors);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 2fde3c4..21d487f 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -137,6 +137,8 @@ struct cfq_queue {
  unsigned short org_ioprio_class;
 
  pid_t pid;
+ /* Sectors dispatched in current dispatch round */
+ unsigned long nr_sectors;
 };
 
 /* Per cgroup grouping structure */
@@ -462,6 +464,8 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
  struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
  struct cfq_group *cfqg = NULL;
  void *key = cfqd;
+ unsigned int major, minor;
+ struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
 
  /* Do we need to take this reference */
  if (!css_tryget(&blkcg->css))
@@ -488,7 +492,9 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
  cfq_get_cfqg_ref(cfqg);
 
  /* Add group onto cgroup list */
- blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd);
+ sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+ blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
+ MKDEV(major, minor));
 
  /* Add group on cfqd list */
  hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
@@ -607,6 +613,18 @@ void cfq_delink_blkio_group(void *key, struct blkio_group *blkg)
  spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
 }
 
+static void cfq_update_cfqq_stats(struct cfq_queue *cfqq,
+ unsigned long slice_used)
+{
+ struct cfq_entity *cfqe = &cfqq->entity;
+
+ for_each_entity(cfqe) {
+ struct cfq_group *cfqg = cfqg_of(parent_entity(cfqe));
+ blkiocg_update_blkio_group_stats(&cfqg->blkg, slice_used,
+ cfqq->nr_sectors);
+ }
+}
+
 #else /* CONFIG_CFQ_GROUP_IOSCHED */
 #define for_each_entity(entity) \
  for (; entity != NULL; entity = NULL)
@@ -639,6 +657,9 @@ static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
 {
  return &cfqd->root_group;
 }
+
+static inline void cfq_update_cfqq_stats(struct cfq_queue *cfqq,
+ unsigned long slice_used) {}
 #endif /* CONFIG_CFQ_GROUP_IOSCHED */
 
 static inline int rq_in_driver(struct cfq_data *cfqd)
@@ -1380,6 +1401,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
  cfqq->slice_start = 0;
  cfqq->slice_end = 0;
  cfqq->slice_dispatch = 0;
+ cfqq->nr_sectors = 0;
 
  cfq_clear_cfqq_wait_request(cfqq);
  cfq_clear_cfqq_must_dispatch(cfqq);
@@ -1418,6 +1440,7 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq)
  slice_used = jiffies - cfqq->slice_start;
 
  cfq_log_cfqq(cfqd, cfqq, "sl_used=%ld", slice_used);
+ cfq_update_cfqq_stats(cfqq, slice_used);
 
  if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
  cfq_del_cfqq_rr(cfqd, cfqq);
@@ -1688,6 +1711,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
 
  if (cfq_cfqq_sync(cfqq))
  cfqd->sync_flight++;
+ cfqq->nr_sectors += blk_rq_sectors(rq);
 }
 
 /*
@@ -3062,7 +3086,8 @@ static void cfq_init_root_group(struct cfq_data *cfqd)
  * to make sure that cfq_put_cfqg() does not try to kfree root group
  */
  cfq_get_cfqg_ref(cfqg);
- blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg, (void *)cfqd);
+ blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg, (void *)cfqd,
+ 0);
 #endif
 }
 
--
1.6.2.5
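
The per-device statistics added by the patch above can be sketched in userspace as follows. struct group_stats, parse_dev(), update_stats() and show_time() are hypothetical names; the MAJOR/MINOR/MKDEV macros are local copies of the kernel's 20-bit minor encoding. One deliberate difference from the patch: this sketch checks the sscanf() return value, whereas the patch uses the parsed major and minor unconditionally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Userspace copies of the kernel's dev_t encoding (20 minor bits). */
#define MINORBITS 20
#define MINORMASK ((1U << MINORBITS) - 1)
#define MAJOR(dev) ((unsigned int)((dev) >> MINORBITS))
#define MINOR(dev) ((unsigned int)((dev) & MINORMASK))
#define MKDEV(ma, mi) (((ma) << MINORBITS) | (mi))

/* Hypothetical cut-down blkio_group: just the exported counters. */
struct group_stats {
	unsigned int dev;       /* 0 means "no device", as for the root group */
	unsigned long time;     /* disk time used, in jiffies */
	unsigned long sectors;  /* sectors dispatched */
};

/* Parse a "major:minor" name and report whether both numbers were found. */
static bool parse_dev(const char *name, unsigned int *dev)
{
	unsigned int major, minor;

	if (sscanf(name, "%u:%u", &major, &minor) != 2)
		return false;
	*dev = MKDEV(major, minor);
	return true;
}

/* Accumulate one expired slice into the group's totals. */
static void update_stats(struct group_stats *gs, unsigned long time,
			 unsigned long sectors)
{
	gs->time += time;
	gs->sectors += sectors;
}

/* Emit "major:minor value"; groups without a device produce no output. */
static int show_time(const struct group_stats *gs, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	if (!gs->dev) {
		buf[0] = '\0';
		return 0;
	}
	return snprintf(buf, len, "%u:%u %lu\n", MAJOR(gs->dev),
			MINOR(gs->dev), gs->time);
}
```

Reading the "time" file of a cgroup would then yield one such line per device on which the cgroup has a group, for example "8:16 30" for 30 jiffies of disk time on device 8:16.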


[PATCH 13/20] blkio: Add a group dequeue interface in cgroup for debugging

by Vivek Goyal-2

o "dequeue" is a debugging interface which keeps track of how many times a
  group was dequeued from the service tree. This helps when debugging why a
  group is not getting its fair share.

Signed-off-by: Vivek Goyal <vgoyal@...>
---
 block/blk-cgroup.c  |   17 +++++++++++++++++
 block/blk-cgroup.h  |    6 ++++++
 block/cfq-iosched.c |    6 ++++++
 3 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 47c0ce7..6a46156 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -182,8 +182,19 @@ static int blkiocg_##__VAR##_read(struct cgroup *cgroup, \
 
 SHOW_FUNCTION_PER_GROUP(time);
 SHOW_FUNCTION_PER_GROUP(sectors);
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+SHOW_FUNCTION_PER_GROUP(dequeue);
+#endif
 #undef SHOW_FUNCTION_PER_GROUP
 
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+void blkiocg_update_blkio_group_dequeue_stats(struct blkio_group *blkg,
+ unsigned long dequeue)
+{
+ blkg->dequeue += dequeue;
+}
+#endif
+
 struct cftype blkio_files[] = {
  {
  .name = "weight",
@@ -203,6 +214,12 @@ struct cftype blkio_files[] = {
  .name = "sectors",
  .read_seq_string = blkiocg_sectors_read,
  },
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+       {
+ .name = "dequeue",
+ .read_seq_string = blkiocg_dequeue_read,
+       },
+#endif
 };
 
 static int blkiocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 08f4ef8..4ca101d 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -29,6 +29,8 @@ struct blkio_group {
 #ifdef CONFIG_DEBUG_BLK_CGROUP
  /* Store cgroup path */
  char path[128];
+ /* How many times this group has been removed from service tree */
+ unsigned long dequeue;
 #endif
  /* The device MKDEV(major, minor), this group has been created for */
  dev_t   dev;
@@ -47,8 +49,12 @@ static inline char *blkg_path(struct blkio_group *blkg)
 {
  return blkg->path;
 }
+void blkiocg_update_blkio_group_dequeue_stats(struct blkio_group *blkg,
+ unsigned long dequeue);
 #else
 static inline char *blkg_path(struct blkio_group *blkg) { return NULL; }
+static inline void blkiocg_update_blkio_group_dequeue_stats(
+ struct blkio_group *blkg, unsigned long dequeue) {}
 #endif
 
 extern struct blkio_cgroup blkio_root_cgroup;
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 21d487f..6936519 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -921,6 +921,12 @@ static void dequeue_cfqe(struct cfq_entity *cfqe)
  __dequeue_cfqe(st, cfqe);
  sd->nr_active--;
  cfqe->on_st = 0;
+
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+ if (!cfqe_is_cfqq(cfqe))
+ blkiocg_update_blkio_group_dequeue_stats(&cfqg_of(cfqe)->blkg,
+ 1);
+#endif
 }
 
 static void dequeue_cfqq(struct cfq_queue *cfqq)
--
1.6.2.5
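
A minimal sketch of the dequeue accounting, assuming a much simplified entity layout: struct entity, struct sched_data and dequeue_entity() are hypothetical, and the counter lives directly in the entity instead of in blkio_group. It shows the point of the patch, namely that only group entities (those with their own scheduling data) bump the counter when they leave the service tree.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal scheduling data: number of active children. */
struct sched_data {
	int nr_active;
};

/* Hypothetical minimal entity; a queue has no scheduling data of its own. */
struct entity {
	struct sched_data *my_sd;   /* NULL for a queue, set for a group */
	int on_st;                  /* currently on a service tree? */
	unsigned long dequeue;      /* debug counter, only used for groups */
};

/* Same test as cfqe_is_cfqq() in the patch. */
static int entity_is_queue(const struct entity *e)
{
	return e->my_sd == NULL;
}

/* Take an entity off its parent's service tree and count group dequeues. */
static void dequeue_entity(struct sched_data *parent_sd, struct entity *e)
{
	assert(e->on_st);
	parent_sd->nr_active--;
	e->on_st = 0;
	if (!entity_is_queue(e))
		e->dequeue++;
}
```

A group whose counter grows quickly while it has a steady workload is being removed from the service tree often, which is the situation the later patches in this series try to avoid.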


[PATCH 14/20] blkio: Do not allow request merging across cfq groups

by Vivek Goyal-2

o Do not allow request merging across cfq groups.

Signed-off-by: Vivek Goyal <vgoyal@...>
---
 block/cfq-iosched.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 6936519..87b1799 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1381,6 +1381,9 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
  struct cfq_io_context *cic;
  struct cfq_queue *cfqq;
 
+ /* Deny merge if bio and rq don't belong to same cfq group */
+ if (cfqq_to_cfqg(RQ_CFQQ(rq)) != cfq_get_cfqg(cfqd, 0))
+ return false;
  /*
  * Disallow merge of a sync bio into an async request.
  */
--
1.6.2.5
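
The merge rule can be sketched as a pure function. struct group, struct queue and allow_merge() are hypothetical simplifications; in the patch the bio's group comes from cfq_get_cfqg(cfqd, 0), which presumably looks up the group of the submitting task, and the real cfq_allow_merge() performs further checks that are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for cfq_group and cfq_queue. */
struct group {
	int id;
};

struct queue {
	const struct group *grp;   /* group the queue belongs to */
	bool sync;                 /* sync or async queue */
};

/*
 * Decide whether a bio may be merged into a request owned by rq_queue.
 * The group check comes first, as in the patch; the sync/async check
 * stands in for the rest of the original function.
 */
static bool allow_merge(const struct queue *rq_queue,
			const struct group *bio_group, bool bio_sync)
{
	assert(rq_queue != NULL);
	if (rq_queue->grp != bio_group)
		return false;       /* never merge across groups */
	if (bio_sync && !rq_queue->sync)
		return false;       /* no sync bio into an async request */
	return true;
}
```

Denying the merge keeps each request charged to exactly one group, so the disk time and sector counts of one group are not inflated by IO from another.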


[PATCH 15/20] blkio: Take care of preemptions across groups

by Vivek Goyal-2

o Additional preemption checks for groups, where we travel up the hierarchy
  and see whether one queue should preempt the other.

o Also prevent preemption across groups in some cases to provide isolation
  between groups.

Signed-off-by: Vivek Goyal <vgoyal@...>
---
 block/cfq-iosched.c |   33 +++++++++++++++++++++++++++++++++
 1 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 87b1799..98dbead 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2636,6 +2636,36 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
  }
 }
 
+static bool cfq_should_preempt_group(struct cfq_data *cfqd,
+ struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
+{
+ struct cfq_entity *cfqe = &cfqq->entity;
+ struct cfq_entity *new_cfqe = &new_cfqq->entity;
+
+ if (cfqq_to_cfqg(cfqq) != &cfqd->root_group)
+ cfqe = parent_entity(&cfqq->entity);
+
+ if (cfqq_to_cfqg(new_cfqq) != &cfqd->root_group)
+ new_cfqe = parent_entity(&new_cfqq->entity);
+
+ /*
+ * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
+ */
+
+ if (new_cfqe->ioprio_class == IOPRIO_CLASS_RT
+    && cfqe->ioprio_class != IOPRIO_CLASS_RT)
+ return true;
+ /*
+ * Allow a BE request to pre-empt an ongoing IDLE class timeslice.
+ */
+
+ if (new_cfqe->ioprio_class == IOPRIO_CLASS_BE
+    && cfqe->ioprio_class == IOPRIO_CLASS_IDLE)
+ return true;
+
+ return false;
+}
+
 /*
  * Check if new_cfqq should preempt the currently active queue. Return 0 for
  * no or if we aren't sure, a 1 will cause a preempt.
@@ -2666,6 +2696,9 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
  if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
  return true;
 
+ if (cfqq_to_cfqg(new_cfqq) != cfqq_to_cfqg(cfqq))
+ return cfq_should_preempt_group(cfqd, cfqq, new_cfqq);
+
  /*
  * So both queues are sync. Let the new request get disk time if
  * it's a metadata request and the current queue is doing regular IO.
--
1.6.2.5
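
The class-based rule in cfq_should_preempt_group() reduces to a small decision table. The sketch below is a userspace restatement under the assumption that only the two ioprio classes matter once the queues are known to be in different groups; the enum values mirror the kernel's IOPRIO_CLASS_* constants, while the names CLASS_* and should_preempt_group() are chosen for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Values mirror the kernel's IOPRIO_CLASS_NONE/RT/BE/IDLE constants. */
enum ioprio_class {
	CLASS_NONE = 0,
	CLASS_RT = 1,
	CLASS_BE = 2,
	CLASS_IDLE = 3,
};

/*
 * Should an entity of class "incoming" preempt the running entity of
 * class "cur" when the two belong to different groups?
 */
static bool should_preempt_group(enum ioprio_class cur,
				 enum ioprio_class incoming)
{
	/* RT preempts anything that is not RT. */
	if (incoming == CLASS_RT && cur != CLASS_RT)
		return true;
	/* BE preempts IDLE. */
	if (incoming == CLASS_BE && cur == CLASS_IDLE)
		return true;
	/* Everything else waits its turn, which isolates the groups. */
	return false;
}
```

Two groups of the same class therefore never preempt each other through this path, regardless of what their individual queues are doing.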


[PATCH 16/20] blkio: do not select co-operating queues from different cfq groups

by Vivek Goyal-2

o Select a co-operating queue only from the same cfq_group, not from a
  different one, to maintain fairness and isolation between groups.

Signed-off-by: Vivek Goyal <vgoyal@...>
---
 block/cfq-iosched.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 98dbead..020d6dd 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1635,6 +1635,10 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
  if (!cfqq)
  return NULL;
 
+ /* If new queue belongs to different cfq_group, don't choose it */
+ if (cfqq_to_cfqg(cur_cfqq) != cfqq_to_cfqg(cfqq))
+ return NULL;
+
  if (cfq_cfqq_coop(cfqq))
  return NULL;
 
--
1.6.2.5
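
As a rough sketch of the filtering order in cfq_close_cooperator() after this patch: a candidate is rejected if it is missing, if it lives in a different group, or if it is already flagged as a cooperator. struct group, struct queue and pick_cooperator() are hypothetical, and how the candidate is found in the first place is outside the scope of this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for cfq_group and cfq_queue. */
struct group {
	int id;
};

struct queue {
	const struct group *grp;   /* owning group */
	int coop;                  /* already flagged as a cooperator */
};

/* Return the candidate if it may cooperate with cur, otherwise NULL. */
static struct queue *pick_cooperator(const struct queue *cur,
				     struct queue *candidate)
{
	assert(cur != NULL);
	if (!candidate)
		return NULL;
	if (candidate->grp != cur->grp)
		return NULL;        /* different group: keep isolation */
	if (candidate->coop)
		return NULL;        /* already a cooperator */
	return candidate;
}
```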


[PATCH 17/20] blkio: Wait for queue to get backlogged before it expires

by Vivek Goyal-2

o CFQ expires a cfqq if it has consumed its time slice. Expiry also means that
  the queue gets deleted from the service tree. For sequential IO, most of the
  time a new IO arrives almost immediately and the cfqq gets backlogged again.

o This additional dequeuing creates issues. Dequeuing means that the associated
  group is also removed from the service tree, so we select a new queue and a
  new group for dispatch, a vdisktime jump takes place and the group loses its
  fair share.

o One solution is to wait for the queue to get busy if it is empty at the time
  of expiry and cfq plans to idle on the queue (it expects a new request to
  arrive within 8ms).

Signed-off-by: Vivek Goyal <vgoyal@...>
---
 block/cfq-iosched.c |   81 ++++++++++++++++++++++++++++++++++----------------
 1 files changed, 55 insertions(+), 26 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 020d6dd..b7ef953 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -411,6 +411,21 @@ cfq_weight_slice(struct cfq_data *cfqd, int sync, unsigned int weight)
  return cfq_delta(base_slice, weight, BLKIO_WEIGHT_DEFAULT);
 }
 
+/*
+ * We need to wrap this check in cfq_cfqq_slice_new(), since ->slice_end
+ * isn't valid until the first request from the dispatch is activated
+ * and the slice time set.
+ */
+static inline bool cfq_slice_used(struct cfq_queue *cfqq)
+{
+ if (cfq_cfqq_slice_new(cfqq))
+ return 0;
+ if (time_before(jiffies, cfqq->slice_end))
+ return 0;
+
+ return 1;
+}
+
 static inline void
 cfq_init_cfqe_parent(struct cfq_entity *cfqe, struct cfq_entity *p_cfqe)
 {
@@ -425,6 +440,17 @@ cfq_init_cfqe_parent(struct cfq_entity *cfqe, struct cfq_entity *p_cfqe)
 
 #define cfqe_is_cfqq(cfqe)     (!(cfqe)->my_sd)
 
+static inline bool cfqq_should_wait_busy(struct cfq_queue *cfqq)
+{
+ if (!RB_EMPTY_ROOT(&cfqq->sort_list) || !cfq_cfqq_idle_window(cfqq))
+ return false;
+
+ if (cfqq->dispatched && !cfq_slice_used(cfqq))
+ return false;
+
+ return true;
+}
+
 static inline struct cfq_group *cfqg_of_blkg(struct blkio_group *blkg)
 {
  if (blkg)
@@ -635,6 +661,11 @@ static void cfq_release_cfq_groups(struct cfq_data *cfqd) {}
 static inline void cfq_get_cfqg_ref(struct cfq_group *cfqg) {}
 static inline void cfq_put_cfqg(struct cfq_group *cfqg) {}
 
+static inline bool cfqq_should_wait_busy(struct cfq_queue *cfqq)
+{
+ return false;
+}
+
 static inline struct cfq_data *cfqd_of(struct cfq_entity *cfqe)
 {
  return cfqq_of(cfqe)->cfqd;
@@ -722,21 +753,6 @@ cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 }
 
 /*
- * We need to wrap this check in cfq_cfqq_slice_new(), since ->slice_end
- * isn't valid until the first request from the dispatch is activated
- * and the slice time set.
- */
-static inline bool cfq_slice_used(struct cfq_queue *cfqq)
-{
- if (cfq_cfqq_slice_new(cfqq))
- return 0;
- if (time_before(jiffies, cfqq->slice_end))
- return 0;
-
- return 1;
-}
-
-/*
  * Lifted from AS - choose which of rq1 and rq2 that is best served now.
  * We choose the request that is closest to the head right now. Distance
  * behind the head is penalized and only allowed to a certain extent.
@@ -1647,19 +1663,22 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
  return cfqq;
 }
 
-static void cfq_arm_slice_timer(struct cfq_data *cfqd)
+static bool cfq_arm_slice_timer(struct cfq_data *cfqd, int reset)
 {
  struct cfq_queue *cfqq = cfqd->active_queue;
  struct cfq_io_context *cic;
  unsigned long sl;
 
+ /* If idle timer is already armed, nothing to do */
+ if (!reset && timer_pending(&cfqd->idle_slice_timer))
+ return true;
  /*
  * SSD device without seek penalty, disable idling. But only do so
  * for devices that support queuing, otherwise we still have a problem
  * with sync vs async workloads.
  */
  if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
- return;
+ return false;
 
  WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
  WARN_ON(cfq_cfqq_slice_new(cfqq));
@@ -1668,20 +1687,20 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
  * idle is disabled, either manually or by past process history
  */
  if (!cfqd->cfq_slice_idle || !cfq_cfqq_idle_window(cfqq))
- return;
+ return false;
 
  /*
  * still requests with the driver, don't idle
  */
  if (rq_in_driver(cfqd))
- return;
+ return false;
 
  /*
  * task has exited, don't wait
  */
  cic = cfqd->active_cic;
  if (!cic || !atomic_read(&cic->ioc->nr_tasks))
- return;
+ return false;
 
  /*
  * If our average think time is larger than the remaining time
@@ -1690,7 +1709,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
  */
  if (sample_valid(cic->ttime_samples) &&
     (cfqq->slice_end - jiffies < cic->ttime_mean))
- return;
+ return false;
 
  cfq_mark_cfqq_wait_request(cfqq);
 
@@ -1704,7 +1723,8 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
  sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
 
  mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
- cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
+ cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu reset=%d", sl, reset);
+ return true;
 }
 
 /*
@@ -1775,6 +1795,12 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
  if (!cfqd->rq_queued)
  return NULL;
 
+ /* Wait for a queue to get busy before we expire it */
+ if (cfqq_should_wait_busy(cfqq) && cfq_arm_slice_timer(cfqd, 0)) {
+ cfqq = NULL;
+ goto keep_queue;
+ }
+
  /*
  * The active queue has run out of time, expire it and select new.
  */
@@ -2786,8 +2812,8 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
     cfqd->busy_queues > 1) {
  del_timer(&cfqd->idle_slice_timer);
  __blk_run_queue(cfqd->queue);
- }
- cfq_mark_cfqq_must_dispatch(cfqq);
+ } else
+ cfq_mark_cfqq_must_dispatch(cfqq);
  }
  } else if (cfq_should_preempt(cfqd, cfqq, rq)) {
  /*
@@ -2886,10 +2912,13 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
  * of idling.
  */
  if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
- cfq_slice_expired(cfqd);
+ if (!cfqq_should_wait_busy(cfqq))
+ cfq_slice_expired(cfqd);
+ else
+ cfq_arm_slice_timer(cfqd, 1);
  else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
  sync && !rq_noidle(rq))
- cfq_arm_slice_timer(cfqd);
+ cfq_arm_slice_timer(cfqd, 1);
  }
 
  if (!rq_in_driver(cfqd))
--
1.6.2.5


[PATCH 18/20] blkio: arm idle timer even if think time is greater than time slice left

by Vivek Goyal-2

o Now we plan to wait for a queue to get backlogged before we expire it. So
  we need to arm the slice timer even if think time is greater than the slice
  left. If the process sends its next IO early and time slice is left, we will
  dispatch it; otherwise we will expire the queue and move on to the next queue.

Signed-off-by: Vivek Goyal <vgoyal@...>
---
 block/cfq-iosched.c |    9 ---------
 1 files changed, 0 insertions(+), 9 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b7ef953..963659a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1702,15 +1702,6 @@ static bool cfq_arm_slice_timer(struct cfq_data *cfqd, int reset)
  if (!cic || !atomic_read(&cic->ioc->nr_tasks))
  return false;
 
- /*
- * If our average think time is larger than the remaining time
- * slice, then don't idle. This avoids overrunning the allotted
- * time slice.
- */
- if (sample_valid(cic->ttime_samples) &&
-    (cfqq->slice_end - jiffies < cic->ttime_mean))
- return false;
-
  cfq_mark_cfqq_wait_request(cfqq);
 
  /*
--
1.6.2.5
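
A small sketch of the behaviour change, under stated assumptions: old_think_time_veto() restates the check this patch removes, and idle_wait_outcome() restates the two outcomes named in the commit message once the idle wait ends. Both function names, the enum and the tick-based parameters are hypothetical; the real code works on cfq_queue and cfq_io_context fields.

```c
#include <assert.h>
#include <stdbool.h>

/* What happens to the queue once the idle wait is over. */
enum wait_result {
	WAIT_DISPATCH,   /* next IO arrived with slice left: dispatch it */
	WAIT_EXPIRE,     /* otherwise: expire the queue and move on */
};

/*
 * The removed check: refuse to idle when the mean think time is valid
 * and larger than the time remaining in the slice.
 */
static bool old_think_time_veto(bool samples_valid, unsigned long now,
				unsigned long slice_end,
				unsigned long ttime_mean)
{
	return samples_valid && (slice_end - now < ttime_mean);
}

/* Outcome after this patch, when the wait ends at tick "now". */
static enum wait_result idle_wait_outcome(bool request_arrived,
					  unsigned long now,
					  unsigned long slice_end)
{
	bool slice_left = (long)(now - slice_end) < 0;

	if (request_arrived && slice_left)
		return WAIT_DISPATCH;
	return WAIT_EXPIRE;
}
```

With the veto gone, a process with a long think time still gets the timer armed; whether it benefits depends only on whether its next request shows up before the slice ends.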


[PATCH 19/20] blkio: Arm slice timer even if there are requests in driver

by Vivek Goyal-2

o To ensure fairness for a group, we need to make sure that at the time of
  expiry the queue is backlogged and does not get deleted from the service
  tree. That means for a sequential workload, wait for the next request before
  expiry.

o Sometimes we dispatch a request from a queue and do not wait busy on the
  queue, because cfq_arm_slice_timer() does not arm the slice idle timer when
  it thinks there are requests in the driver. Further down, in
  cfq_select_queue(), we expire the cfqq because its time slice expired and
  the queue loses its share (vtime jump). Hence arm the idle timer even if
  there are requests in the driver.

Signed-off-by: Vivek Goyal <vgoyal@...>
---
 block/cfq-iosched.c |    6 ------
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 963659a..d609a10 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1690,12 +1690,6 @@ static bool cfq_arm_slice_timer(struct cfq_data *cfqd, int reset)
  return false;
 
  /*
- * still requests with the driver, don't idle
- */
- if (rq_in_driver(cfqd))
- return false;
-
- /*
  * task has exited, don't wait
  */
  cic = cfqd->active_cic;
--
1.6.2.5
