Patches contributed by Eötvös Loránd University


commit aea25401c3347d9f3a64ebdc81043be246a9f631
Author: Ingo Molnar <mingo@elte.hu>
Date:   Thu Aug 9 11:16:46 2007 +0200

    sched: document nice levels
    
    Document the design thinking behind nice levels.
    
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/Documentation/sched-nice-design.txt b/Documentation/sched-nice-design.txt
new file mode 100644
index 000000000000..e2bae5a577e3
--- /dev/null
+++ b/Documentation/sched-nice-design.txt
@@ -0,0 +1,108 @@
+This document explains the thinking about the revamped and streamlined
+nice-levels implementation in the new Linux scheduler.
+
+Nice levels were always pretty weak under Linux and people continuously
+pestered us to make nice +19 tasks use up much less CPU time.
+
+Unfortunately that was not that easy to implement under the old
+scheduler, (otherwise we'd have done it long ago) because nice level
+support was historically coupled to timeslice length, and timeslice
+units were driven by the HZ tick, so the smallest timeslice was 1/HZ.
+
+In the O(1) scheduler (in 2003) we changed negative nice levels to be
+much stronger than they were before in 2.4 (and people were happy about
+that change), and we also intentionally calibrated the linear timeslice
+rule so that nice +19 level would be _exactly_ 1 jiffy. To better
+understand it, the timeslice graph went like this (cheesy ASCII art
+alert!):
+
+
+                   A
+             \     | [timeslice length]
+              \    |
+               \   |
+                \  |
+                 \ |
+                  \|___100msecs
+                   |^ . _
+                   |      ^ . _
+                   |            ^ . _
+ -*----------------------------------*-----> [nice level]
+ -20               |                +19
+                   |
+                   |
+
+So that if someone wanted to really renice tasks, +19 would give a much
+bigger hit than the normal linear rule would do. (The solution of
+changing the ABI to extend priorities was discarded early on.)
+
+This approach worked to some degree for some time, but later on with
+HZ=1000 it caused 1 jiffy to be 1 msec, which meant 0.1% CPU usage which
+we felt to be a bit excessive. Excessive _not_ because it's too small of
+a CPU utilization, but because it causes too frequent (once per
+millisec) rescheduling. (and would thus trash the cache, etc. Remember,
+this was long ago when hardware was weaker and caches were smaller, and
+people were running number crunching apps at nice +19.)
+
+So for HZ=1000 we changed nice +19 to 5msecs, because that felt like the
+right minimal granularity - and this translates to 5% CPU utilization.
+But the fundamental HZ-sensitive property for nice+19 still remained,
+and we never got a single complaint about nice +19 being too _weak_ in
+terms of CPU utilization, we only got complaints about it (still) being
+too _strong_ :-)
+
+To sum it up: we always wanted to make nice levels more consistent, but
+within the constraints of HZ and jiffies and their nasty design level
+coupling to timeslices and granularity it was not really viable.
+
+The second (less frequent but still periodically occuring) complaint
+about Linux's nice level support was its assymetry around the origo
+(which you can see demonstrated in the picture above), or more
+accurately: the fact that nice level behavior depended on the _absolute_
+nice level as well, while the nice API itself is fundamentally
+"relative":
+
+   int nice(int inc);
+
+   asmlinkage long sys_nice(int increment)
+
+(the first one is the glibc API, the second one is the syscall API.)
+Note that the 'inc' is relative to the current nice level. Tools like
+bash's "nice" command mirror this relative API.
+
+With the old scheduler, if you for example started a niced task with +1
+and another task with +2, the CPU split between the two tasks would
+depend on the nice level of the parent shell - if it was at nice -10 the
+CPU split was different than if it was at +5 or +10.
+
+A third complaint against Linux's nice level support was that negative
+nice levels were not 'punchy enough', so lots of people had to resort to
+run audio (and other multimedia) apps under RT priorities such as
+SCHED_FIFO. But this caused other problems: SCHED_FIFO is not starvation
+proof, and a buggy SCHED_FIFO app can also lock up the system for good.
+
+The new scheduler in v2.6.23 addresses all three types of complaints:
+
+To address the first complaint (of nice levels being not "punchy"
+enough), the scheduler was decoupled from 'time slice' and HZ concepts
+(and granularity was made a separate concept from nice levels) and thus
+it was possible to implement better and more consistent nice +19
+support: with the new scheduler nice +19 tasks get a HZ-independent
+1.5%, instead of the variable 3%-5%-9% range they got in the old
+scheduler.
+
+To address the second complaint (of nice levels not being consistent),
+the new scheduler makes nice(1) have the same CPU utilization effect on
+tasks, regardless of their absolute nice levels. So on the new
+scheduler, running a nice +10 and a nice 11 task has the same CPU
+utilization "split" between them as running a nice -5 and a nice -4
+task. (one will get 55% of the CPU, the other 45%.) That is why nice
+levels were changed to be "multiplicative" (or exponential) - that way
+it does not matter which nice level you start out from, the 'relative
+result' will always be the same.
+
+The third complaint (of negative nice levels not being "punchy" enough
+and forcing audio apps to run under the more dangerous SCHED_FIFO
+scheduling policy) is addressed by the new scheduler almost
+automatically: stronger negative nice levels are an automatic
+side-effect of the recalibrated dynamic range of nice levels.

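For illustration, the "multiplicative" rule described in the new document can be reproduced with a few lines of userspace C. The ~1.25 weight step per nice level is an approximation of the CFS weight table and is not stated in this patch, but it is enough to show why only the relative nice difference matters:

    /* build: cc nice_split.c -o nice_split -lm */
    #include <math.h>
    #include <stdio.h>

    /* approximate CFS rule: each nice step scales a task's weight by ~1.25 */
    static double cpu_share(int nice_a, int nice_b)
    {
        double wa = pow(1.25, -nice_a);
        double wb = pow(1.25, -nice_b);

        return wa / (wa + wb);
    }

    int main(void)
    {
        /* the split depends only on the nice *difference*, as claimed above */
        printf("+10 vs +11: %.1f%% / %.1f%%\n",
               100.0 * cpu_share(10, 11), 100.0 * cpu_share(11, 10));
        printf(" -5 vs  -4: %.1f%% / %.1f%%\n",
               100.0 * cpu_share(-5, -4), 100.0 * cpu_share(-4, -5));
        return 0;
    }

Both pairs print 55.6% / 44.4%, in line with the roughly 55%/45% split quoted in the document above.
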
commit fd8bb43e27bbba1b6d49552c3d588cf741dd44af
Author: Ingo Molnar <mingo@elte.hu>
Date:   Thu Aug 9 11:16:46 2007 +0200

    sched: delta_exec accounting fix
    
    small delta_exec accounting fix: increase delta_exec and increase
    sum_exec_runtime even if the task is not on the runqueue anymore.
    
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 037b8245e533..16511e9e5528 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -287,15 +287,15 @@ __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr, u64 now)
 	struct load_weight *lw = &cfs_rq->load;
 	unsigned long load = lw->weight;
 
-	if (unlikely(!load))
-		return;
-
 	delta_exec = curr->delta_exec;
 	schedstat_set(curr->exec_max, max((u64)delta_exec, curr->exec_max));
 
 	curr->sum_exec_runtime += delta_exec;
 	cfs_rq->exec_clock += delta_exec;
 
+	if (unlikely(!load))
+		return;
+
 	delta_fair = calc_delta_fair(delta_exec, lw);
 	delta_mine = calc_delta_mine(delta_exec, curr->load.weight, lw);
 

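The hunk above is purely an ordering change: credit the executed time unconditionally, and only skip the load-weighted math when the runqueue weight is already zero. A hedged userspace sketch of the same shape (the struct and function names here are illustrative stand-ins, not the kernel's internals):

    #include <stdio.h>

    struct entity { unsigned long long sum_exec_runtime; };

    static void update_curr(struct entity *e, unsigned long load,
                            unsigned long long delta_exec)
    {
        e->sum_exec_runtime += delta_exec;  /* always credit the time that ran */

        if (!load)
            return;                         /* task already off the runqueue */

        /* load-weighted bookkeeping (calc_delta_fair()-style) would go here */
    }

    int main(void)
    {
        struct entity e = { 0 };

        update_curr(&e, 0, 1000);           /* load == 0: runtime still counted */
        printf("sum_exec_runtime = %llu\n", e.sum_exec_runtime);
        return 0;
    }
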
commit c5dcfe72aa8d26e924cccca9725a9f7be0d4ab01
Author: Ingo Molnar <mingo@elte.hu>
Date:   Thu Aug 9 11:16:46 2007 +0200

    sched: clean up delta_mine
    
    cleanup: delta_mine is an unsigned value.
    
    no code impact:
    
       text    data     bss     dec     hex filename
       27823    2726      16   30565    7765 sched.o.before
       27823    2726      16   30565    7765 sched.o.after
    
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index edcb4b542bca..037b8245e533 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -283,8 +283,7 @@ add_wait_runtime(struct cfs_rq *cfs_rq, struct sched_entity *se, long delta)
 static inline void
 __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr, u64 now)
 {
-	unsigned long delta, delta_exec, delta_fair;
-	long delta_mine;
+	unsigned long delta, delta_exec, delta_fair, delta_mine;
 	struct load_weight *lw = &cfs_rq->load;
 	unsigned long load = lw->weight;
 

commit 8e717b194ce3f3ac9e6acc63f66fe274cdf9cde1
Author: Ingo Molnar <mingo@elte.hu>
Date:   Thu Aug 9 11:16:46 2007 +0200

    sched: schedule() speedup
    
    speed up schedule(): share the 'now' parameter that deactivate_task()
    was calculating internally.
    
    ( this also fixes the small accounting window between the deactivate
      call and the pick_next_task() call. )
    
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/kernel/sched.c b/kernel/sched.c
index 0112f63ad376..49f5b281c561 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -940,10 +940,9 @@ static inline void activate_idle_task(struct task_struct *p, struct rq *rq)
 /*
  * deactivate_task - remove a task from the runqueue.
  */
-static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep)
+static void
+deactivate_task(struct rq *rq, struct task_struct *p, int sleep, u64 now)
 {
-	u64 now = rq_clock(rq);
-
 	if (p->state == TASK_UNINTERRUPTIBLE)
 		rq->nr_uninterruptible++;
 
@@ -2122,7 +2121,7 @@ void sched_exec(void)
 static void pull_task(struct rq *src_rq, struct task_struct *p,
 		      struct rq *this_rq, int this_cpu)
 {
-	deactivate_task(src_rq, p, 0);
+	deactivate_task(src_rq, p, 0, rq_clock(src_rq));
 	set_task_cpu(p, this_cpu);
 	activate_task(this_rq, p, 0);
 	/*
@@ -3446,13 +3445,14 @@ asmlinkage void __sched schedule(void)
 
 	spin_lock_irq(&rq->lock);
 	clear_tsk_need_resched(prev);
+	now = __rq_clock(rq);
 
 	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
 		if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
 				unlikely(signal_pending(prev)))) {
 			prev->state = TASK_RUNNING;
 		} else {
-			deactivate_task(rq, prev, 1);
+			deactivate_task(rq, prev, 1, now);
 		}
 		switch_count = &prev->nvcsw;
 	}
@@ -3460,7 +3460,6 @@ asmlinkage void __sched schedule(void)
 	if (unlikely(!rq->nr_running))
 		idle_balance(cpu, rq);
 
-	now = __rq_clock(rq);
 	prev->sched_class->put_prev_task(rq, prev, now);
 	next = pick_next_task(rq, prev, now);
 
@@ -4220,7 +4219,7 @@ int sched_setscheduler(struct task_struct *p, int policy,
 	}
 	on_rq = p->se.on_rq;
 	if (on_rq)
-		deactivate_task(rq, p, 0);
+		deactivate_task(rq, p, 0, rq_clock(rq));
 	oldprio = p->prio;
 	__setscheduler(rq, p, policy, param->sched_priority);
 	if (on_rq) {
@@ -4973,7 +4972,7 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
 
 	on_rq = p->se.on_rq;
 	if (on_rq)
-		deactivate_task(rq_src, p, 0);
+		deactivate_task(rq_src, p, 0, rq_clock(rq_src));
 	set_task_cpu(p, dest_cpu);
 	if (on_rq) {
 		activate_task(rq_dest, p, 0);
@@ -5387,7 +5386,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 		rq->migration_thread = NULL;
 		/* Idle task back to normal (off runqueue, low prio) */
 		rq = task_rq_lock(rq->idle, &flags);
-		deactivate_task(rq, rq->idle, 0);
+		deactivate_task(rq, rq->idle, 0, rq_clock(rq));
 		rq->idle->static_prio = MAX_PRIO;
 		__setscheduler(rq, rq->idle, SCHED_NORMAL, 0);
 		rq->idle->sched_class = &idle_sched_class;
@@ -6626,7 +6625,7 @@ void normalize_rt_tasks(void)
 
 		on_rq = p->se.on_rq;
 		if (on_rq)
-			deactivate_task(task_rq(p), p, 0);
+			deactivate_task(task_rq(p), p, 0, rq_clock(task_rq(p)));
 		__setscheduler(rq, p, SCHED_NORMAL, 0);
 		if (on_rq) {
 			activate_task(task_rq(p), p, 0);

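The refactor pattern behind this commit, as a hedged sketch with userspace stand-ins (rq_clock(), deactivate_task() and put_prev_task() below are simplified mock-ups, not the kernel functions): sample the clock once up front, then hand the same timestamp to every step so no accounting gap can open between them.

    #include <stdio.h>
    #include <time.h>

    typedef unsigned long long u64;

    static u64 rq_clock(void)               /* mock of the runqueue clock */
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (u64)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
    }

    /* after the patch: callees take 'now' instead of re-reading the clock */
    static void deactivate_task(u64 now) { printf("deactivate @ %llu\n", now); }
    static void put_prev_task(u64 now)   { printf("put_prev   @ %llu\n", now); }

    int main(void)
    {
        u64 now = rq_clock();               /* one sample, taken up front */

        deactivate_task(now);               /* both steps see the same time, */
        put_prev_task(now);                 /* closing the accounting window  */
        return 0;
    }
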
commit 7bfd0485871df01764ca89d5679f128d870aef1a
Author: Ingo Molnar <mingo@elte.hu>
Date:   Thu Aug 9 11:16:46 2007 +0200

    sched: uninline rq_clock()
    
    uninline rq_clock() to save 263 bytes of code:
    
       text    data     bss     dec     hex filename
       39561    3642      24   43227    a8db sched.o.before
       39298    3642      24   42964    a7d4 sched.o.after
    
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/kernel/sched.c b/kernel/sched.c
index 50c3587b06cb..0112f63ad376 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -353,7 +353,7 @@ static unsigned long long __rq_clock(struct rq *rq)
 	return clock;
 }
 
-static inline unsigned long long rq_clock(struct rq *rq)
+static unsigned long long rq_clock(struct rq *rq)
 {
 	int this_cpu = smp_processor_id();
 

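Why uninlining shrinks the text: an inline body is duplicated into every call site, while an out-of-line function is emitted once and merely called. A hedged sketch with a hypothetical helper (unrelated to the real rq_clock()):

    #include <stdio.h>

    /* declared up front, defined once below: every caller costs only a call
     * instruction instead of a copy of the function body per call site */
    static unsigned long long read_counter(void);

    static unsigned long long a(void) { return read_counter(); }
    static unsigned long long b(void) { return read_counter(); }
    static unsigned long long c(void) { return read_counter(); }

    static unsigned long long read_counter(void)
    {
        static unsigned long long ticks;

        return ++ticks;                     /* stand-in body for illustration */
    }

    int main(void)
    {
        printf("%llu %llu %llu\n", a(), b(), c());
        return 0;
    }

Comparing the object file with 'size' before and after such a change gives numbers like those quoted in the changelog; note that the compiler may still inline a small static function it can see, the change only drops the explicit request.
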
commit f1a438d813d416fa9f4be4e6dbd10b54c5938d89
Author: Ingo Molnar <mingo@elte.hu>
Date:   Thu Aug 9 11:16:45 2007 +0200

    sched: reorder update_cpu_load(rq) with the ->task_tick() call
    
    Peter Williams suggested to flip the order of update_cpu_load(rq) with
    the ->task_tick() call. This is a NOP for the current scheduler (the
    two functions are independent of each other), ->task_tick() might
    create some state for update_cpu_load() in the future (or in PlugSched).
    
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/kernel/sched.c b/kernel/sched.c
index 72bb9483d949..4680f52974e3 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3298,9 +3298,9 @@ void scheduler_tick(void)
 	struct task_struct *curr = rq->curr;
 
 	spin_lock(&rq->lock);
+	update_cpu_load(rq);
 	if (curr != rq->idle) /* FIXME: needed? */
 		curr->sched_class->task_tick(rq, curr);
-	update_cpu_load(rq);
 	spin_unlock(&rq->lock);
 
 #ifdef CONFIG_SMP

commit 0915c4e89d311948b67cdd4c183a2efbcafcc9f9
Author: Ingo Molnar <mingo@elte.hu>
Date:   Thu Aug 9 11:16:45 2007 +0200

    sched: batch sleeper bonus
    
    batch up the sleeper bonus sum a bit more. Anything below
    sched-granularity is too small to make a practical difference
    anyway.
    
    this optimization reduces the math in high-frequency scheduling
    scenarios.
    
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 6f579ff5a9bc..9f401588d509 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -300,7 +300,7 @@ __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr, u64 now)
 	delta_fair = calc_delta_fair(delta_exec, lw);
 	delta_mine = calc_delta_mine(delta_exec, curr->load.weight, lw);
 
-	if (cfs_rq->sleeper_bonus > sysctl_sched_stat_granularity) {
+	if (cfs_rq->sleeper_bonus > sysctl_sched_granularity) {
 		delta = calc_delta_mine(cfs_rq->sleeper_bonus,
 					curr->load.weight, lw);
 		if (unlikely(delta > cfs_rq->sleeper_bonus))

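The batching idea in this commit, sketched in illustrative userspace C (the threshold constant and names below are made up; only the accumulate-until-a-granularity-threshold shape is taken from the patch):

    #include <stdio.h>

    #define GRANULARITY 2000000ULL          /* e.g. 2ms in ns; made-up value */

    static unsigned long long bonus;        /* accumulated sleeper bonus */

    static void add_bonus(unsigned long long delta)
    {
        bonus += delta;
        if (bonus <= GRANULARITY)
            return;                         /* too small to matter yet */

        /* the expensive weighted distribution happens only here */
        printf("distributing %llu ns of bonus\n", bonus);
        bonus = 0;
    }

    int main(void)
    {
        for (int i = 0; i < 10; i++)
            add_bonus(500000);              /* many small wakeup credits */
        return 0;
    }
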
commit 5845b677cf7f64a0f104609e1dfe02a439f69f71
Author: Ingo Molnar <mingo@elte.hu>
Date:   Tue Jul 31 19:07:02 2007 -0500

    atl1: use spin_trylock_irqsave()
    
    use the simpler spin_trylock_irqsave() API to get the adapter lock.
    
    [ this is also a fix for -rt where adapter->lock is a sleeping lock. ]
    
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Jay Cliburn <jacliburn@bellsouth.net>
    Signed-off-by: Jeff Garzik <jeff@garzik.org>

diff --git a/drivers/net/atl1/atl1_main.c b/drivers/net/atl1/atl1_main.c
index 56f6389a300e..3c1984ecf36c 100644
--- a/drivers/net/atl1/atl1_main.c
+++ b/drivers/net/atl1/atl1_main.c
@@ -1704,10 +1704,8 @@ static int atl1_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 		}
 	}
 
-	local_irq_save(flags);
-	if (!spin_trylock(&adapter->lock)) {
+	if (!spin_trylock_irqsave(&adapter->lock, flags)) {
 		/* Can't get lock - tell upper layer to requeue */
-		local_irq_restore(flags);
 		dev_printk(KERN_DEBUG, &adapter->pdev->dev, "tx locked\n");
 		return NETDEV_TX_LOCKED;
 	}

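A hedged usage sketch of the combined API in an xmit-style context (a fragment, kernel context assumed; the surrounding function body is invented and only the locking calls mirror the driver): on failure, spin_trylock_irqsave() restores the interrupt state itself, so there is no separate local_irq_restore() to get wrong.

    unsigned long flags;

    if (!spin_trylock_irqsave(&adapter->lock, flags)) {
        /* lock busy: irq state already put back by the macro, so just
         * ask the network stack to requeue the skb */
        return NETDEV_TX_LOCKED;
    }

    /* ... map the skb and queue the TX descriptors ... */

    spin_unlock_irqrestore(&adapter->lock, flags);
    return NETDEV_TX_OK;
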
commit 94c18227d1e3f02de5b345bd3cd5c960214dc9c8
Author: Ingo Molnar <mingo@elte.hu>
Date:   Thu Aug 2 17:41:40 2007 +0200

    [PATCH] sched: reduce task_struct size
    
    more task_struct size reduction, by moving the debugging/instrumentation
    fields to under CONFIG_SCHEDSTATS:
    
     (i386, nodebug):
    
                              size
                              ----
         pre-CFS              1328
             CFS              1472
             CFS+patch        1376
    
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c9e0c2a6a950..17249fae5014 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -904,23 +904,28 @@ struct sched_entity {
 	struct rb_node		run_node;
 	unsigned int		on_rq;
 
+	u64			exec_start;
+	u64			sum_exec_runtime;
 	u64			wait_start_fair;
+	u64			sleep_start_fair;
+
+#ifdef CONFIG_SCHEDSTATS
 	u64			wait_start;
-	u64			exec_start;
+	u64			wait_max;
+	s64			sum_wait_runtime;
+
 	u64			sleep_start;
-	u64			sleep_start_fair;
-	u64			block_start;
 	u64			sleep_max;
+	s64			sum_sleep_runtime;
+
+	u64			block_start;
 	u64			block_max;
 	u64			exec_max;
-	u64			wait_max;
-	u64			last_ran;
 
-	u64			sum_exec_runtime;
-	s64			sum_wait_runtime;
-	s64			sum_sleep_runtime;
 	unsigned long		wait_runtime_overruns;
 	unsigned long		wait_runtime_underruns;
+#endif
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	struct sched_entity	*parent;
 	/* rq on which this entity is (to be) queued: */

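The space saving comes from compiling the instrumentation fields only when the config option is set. A self-contained demo of the technique (the struct below is a made-up stand-in for the sched_entity fields moved in the hunk above, not the real layout):

    #include <stdio.h>

    /* #define CONFIG_SCHEDSTATS 1 */       /* uncomment for the big layout */

    struct entity_stats_demo {
        unsigned long long exec_start;
        unsigned long long sum_exec_runtime;
    #ifdef CONFIG_SCHEDSTATS
        unsigned long long wait_start, wait_max;
        long long sum_wait_runtime;
        unsigned long long sleep_start, sleep_max;
        long long sum_sleep_runtime;
        unsigned long long block_start, block_max, exec_max;
        unsigned long wait_runtime_overruns, wait_runtime_underruns;
    #endif
    };

    int main(void)
    {
        printf("sizeof = %zu bytes\n", sizeof(struct entity_stats_demo));
        return 0;
    }
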
commit 6cfb0d5d06bea2b8791f32145eae539d524e5f6c
Author: Ingo Molnar <mingo@elte.hu>
Date:   Thu Aug 2 17:41:40 2007 +0200

    [PATCH] sched: reduce debug code
    
    move the rest of the debugging/instrumentation code to under
    CONFIG_SCHEDSTATS too. This reduces code size and speeds code up:
    
        text    data     bss     dec     hex filename
       33044    4122      28   37194    914a sched.o.before
       32708    4122      28   36858    8ffa sched.o.after
    
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/kernel/sched.c b/kernel/sched.c
index a9d374061a46..72bb9483d949 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -983,18 +983,21 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	u64 clock_offset, fair_clock_offset;
 
 	clock_offset = old_rq->clock - new_rq->clock;
-	fair_clock_offset = old_rq->cfs.fair_clock -
-						 new_rq->cfs.fair_clock;
-	if (p->se.wait_start)
-		p->se.wait_start -= clock_offset;
+	fair_clock_offset = old_rq->cfs.fair_clock - new_rq->cfs.fair_clock;
+
 	if (p->se.wait_start_fair)
 		p->se.wait_start_fair -= fair_clock_offset;
+	if (p->se.sleep_start_fair)
+		p->se.sleep_start_fair -= fair_clock_offset;
+
+#ifdef CONFIG_SCHEDSTATS
+	if (p->se.wait_start)
+		p->se.wait_start -= clock_offset;
 	if (p->se.sleep_start)
 		p->se.sleep_start -= clock_offset;
 	if (p->se.block_start)
 		p->se.block_start -= clock_offset;
-	if (p->se.sleep_start_fair)
-		p->se.sleep_start_fair -= fair_clock_offset;
+#endif
 
 	__set_task_cpu(p, new_cpu);
 }
@@ -1555,17 +1558,19 @@ int fastcall wake_up_state(struct task_struct *p, unsigned int state)
 static void __sched_fork(struct task_struct *p)
 {
 	p->se.wait_start_fair		= 0;
-	p->se.wait_start		= 0;
 	p->se.exec_start		= 0;
 	p->se.sum_exec_runtime		= 0;
 	p->se.delta_exec		= 0;
 	p->se.delta_fair_run		= 0;
 	p->se.delta_fair_sleep		= 0;
 	p->se.wait_runtime		= 0;
+	p->se.sleep_start_fair		= 0;
+
+#ifdef CONFIG_SCHEDSTATS
+	p->se.wait_start		= 0;
 	p->se.sum_wait_runtime		= 0;
 	p->se.sum_sleep_runtime		= 0;
 	p->se.sleep_start		= 0;
-	p->se.sleep_start_fair		= 0;
 	p->se.block_start		= 0;
 	p->se.sleep_max			= 0;
 	p->se.block_max			= 0;
@@ -1573,6 +1578,7 @@ static void __sched_fork(struct task_struct *p)
 	p->se.wait_max			= 0;
 	p->se.wait_runtime_overruns	= 0;
 	p->se.wait_runtime_underruns	= 0;
+#endif
 
 	INIT_LIST_HEAD(&p->run_list);
 	p->se.on_rq = 0;
@@ -6579,12 +6585,14 @@ void normalize_rt_tasks(void)
 	do_each_thread(g, p) {
 		p->se.fair_key			= 0;
 		p->se.wait_runtime		= 0;
+		p->se.exec_start		= 0;
 		p->se.wait_start_fair		= 0;
+		p->se.sleep_start_fair		= 0;
+#ifdef CONFIG_SCHEDSTATS
 		p->se.wait_start		= 0;
-		p->se.exec_start		= 0;
 		p->se.sleep_start		= 0;
-		p->se.sleep_start_fair		= 0;
 		p->se.block_start		= 0;
+#endif
 		task_rq(p)->cfs.fair_clock	= 0;
 		task_rq(p)->clock		= 0;
 
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 0eca442b7792..1c61e5315ad2 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -44,11 +44,16 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p, u64 now)
 		(long long)p->se.wait_runtime,
 		(long long)(p->nvcsw + p->nivcsw),
 		p->prio,
+#ifdef CONFIG_SCHEDSTATS
 		(long long)p->se.sum_exec_runtime,
 		(long long)p->se.sum_wait_runtime,
 		(long long)p->se.sum_sleep_runtime,
 		(long long)p->se.wait_runtime_overruns,
-		(long long)p->se.wait_runtime_underruns);
+		(long long)p->se.wait_runtime_underruns
+#else
+		0LL, 0LL, 0LL, 0LL, 0LL
+#endif
+	);
 }
 
 static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu, u64 now)
@@ -171,7 +176,7 @@ static int sched_debug_show(struct seq_file *m, void *v)
 	u64 now = ktime_to_ns(ktime_get());
 	int cpu;
 
-	SEQ_printf(m, "Sched Debug Version: v0.05, %s %.*s\n",
+	SEQ_printf(m, "Sched Debug Version: v0.05-v20, %s %.*s\n",
 		init_utsname()->release,
 		(int)strcspn(init_utsname()->version, " "),
 		init_utsname()->version);
@@ -235,21 +240,24 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 #define P(F) \
 	SEQ_printf(m, "%-25s:%20Ld\n", #F, (long long)p->F)
 
-	P(se.wait_start);
+	P(se.wait_runtime);
 	P(se.wait_start_fair);
 	P(se.exec_start);
-	P(se.sleep_start);
 	P(se.sleep_start_fair);
+	P(se.sum_exec_runtime);
+
+#ifdef CONFIG_SCHEDSTATS
+	P(se.wait_start);
+	P(se.sleep_start);
 	P(se.block_start);
 	P(se.sleep_max);
 	P(se.block_max);
 	P(se.exec_max);
 	P(se.wait_max);
-	P(se.wait_runtime);
 	P(se.wait_runtime_overruns);
 	P(se.wait_runtime_underruns);
 	P(se.sum_wait_runtime);
-	P(se.sum_exec_runtime);
+#endif
 	SEQ_printf(m, "%-25s:%20Ld\n",
 		   "nr_switches", (long long)(p->nvcsw + p->nivcsw));
 	P(se.load.weight);
@@ -269,7 +277,9 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 
 void proc_sched_set_task(struct task_struct *p)
 {
+#ifdef CONFIG_SCHEDSTATS
 	p->se.sleep_max = p->se.block_max = p->se.exec_max = p->se.wait_max = 0;
 	p->se.wait_runtime_overruns = p->se.wait_runtime_underruns = 0;
+#endif
 	p->se.sum_exec_runtime = 0;
 }
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5bf7285ad02c..6f579ff5a9bc 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -349,7 +349,7 @@ static inline void
 update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now)
 {
 	se->wait_start_fair = cfs_rq->fair_clock;
-	se->wait_start = now;
+	schedstat_set(se->wait_start, now);
 }
 
 /*
@@ -447,7 +447,7 @@ update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now)
 	}
 
 	se->wait_start_fair = 0;
-	se->wait_start = 0;
+	schedstat_set(se->wait_start, 0);
 }
 
 static inline void
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index ade20dc422f1..002fcf8d3f64 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -18,8 +18,8 @@ static inline void update_curr_rt(struct rq *rq, u64 now)
 	delta_exec = now - curr->se.exec_start;
 	if (unlikely((s64)delta_exec < 0))
 		delta_exec = 0;
-	if (unlikely(delta_exec > curr->se.exec_max))
-		curr->se.exec_max = delta_exec;
+
+	schedstat_set(curr->se.exec_max, max(curr->se.exec_max, delta_exec));
 
 	curr->se.sum_exec_runtime += delta_exec;
 	curr->se.exec_start = now;
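
schedstat_set(), used in the hunks above, is what lets the instrumentation compile away entirely. A self-contained sketch of the idea (the macro here mirrors the behaviour the patch relies on, assign when CONFIG_SCHEDSTATS is set and do nothing otherwise, while the demo struct is invented):

    #include <stdio.h>

    /* #define CONFIG_SCHEDSTATS 1 */

    #ifdef CONFIG_SCHEDSTATS
    # define schedstat_set(var, val)    do { (var) = (val); } while (0)
    #else
    # define schedstat_set(var, val)    do { } while (0)
    #endif

    struct demo_entity {
        unsigned long long exec_max;    /* in the kernel this field is itself
                                         * under #ifdef CONFIG_SCHEDSTATS */
    };

    int main(void)
    {
        struct demo_entity se = { 0 };
        unsigned long long delta_exec = 1234;

        schedstat_set(se.exec_max, delta_exec); /* no-op when stats are off */
        printf("exec_max = %llu (delta_exec = %llu)\n",
               se.exec_max, delta_exec);
        return 0;
    }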