Patches contributed by Eötvös Loránd University
commit e4233dec749a3519069d9390561b5636a75c7579
Author: Ingo Molnar <mingo@elte.hu>
Date: Fri Jan 26 00:56:55 2007 -0800
[PATCH] ACPI: fix cpufreq regression
Recently cpufreq support on my laptop (Lenovo T60) broke completely: when
it's plugged into AC it would never go higher than 1 GHz - neither 1.3 GHz
nor 1.83 GHz is possible - no matter which governor (userspace, speed or
ondemand) is used.
After some cpufreq debugging I tracked the regression back to the following
(totally correct) bug-fix commit:
commit 0916bd3ebb7cefdd0f432e8491abe24f4b5a101e
Author: Dave Jones <davej@redhat.com>
Date: Wed Nov 22 20:42:01 2006 -0500
[PATCH] Correct bound checking from the value returned from _PPC method.
This bugfix, which makes other laptops work, made a previously hidden
(BIOS) bug visible on my laptop.
The bug is the following: if the _PPC (Performance Present Capabilities)
optional ACPI object is queried /after/ bootup then the BIOS reports an
incorrect value of '2'.
My laptop (Lenovo T60) has the following performance states supported:
0: 1833000
1: 1333000
2: 1000000
Per ACPI specification, a _PPC value of '0' means that all 3 performance
states are usable. A _PPC value of '1' means states 1 .. 2 are usable, a
value of '2' means only state '2' (slowest) is usable.
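To make the effect concrete, here is a minimal sketch (illustrative only - state_khz
and max_usable_khz are made-up names, not driver code) of how a _PPC limit clamps
the usable states; with the spurious post-boot value of '2' the fastest usable
frequency collapses to 1 GHz:
#include <linux/kernel.h>	/* ARRAY_SIZE() */
/* Frequency table of the laptop above, fastest state first (kHz). */
static const unsigned int state_khz[] = { 1833000, 1333000, 1000000 };
/*
 * _PPC is the index of the fastest state that may still be used;
 * states with a smaller index are off limits.
 */
static unsigned int max_usable_khz(unsigned int ppc)
{
	if (ppc >= ARRAY_SIZE(state_khz))	/* guard against bogus BIOS values */
		ppc = ARRAY_SIZE(state_khz) - 1;
	return state_khz[ppc];			/* ppc == 2  =>  1000000 kHz */
}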
Now, the _PPC object is optional, and it also comes with notification.
Furthermore, when a CPU object is initialized, the _PPC object is
initialized as well. So the following evaluation of the _PPC object is
superfluous:
[<c028ba5f>] acpi_processor_get_platform_limit+0xa1/0xaf
[<c028c040>] acpi_processor_register_performance+0x3b9/0x3ef
[<c0111a85>] acpi_cpufreq_cpu_init+0xb7/0x596
[<c03dab74>] cpufreq_add_dev+0x160/0x4a8
[<c02bed90>] sysdev_driver_register+0x5a/0xa0
[<c03d9c4c>] cpufreq_register_driver+0xb4/0x176
[<c068ac08>] acpi_cpufreq_init+0xe5/0xeb
[<c010056e>] init+0x14f/0x3dd
And this is the point where my laptop's BIOS returns the incorrect value of
'2'. Note that it has not sent any notification event, so the value is
probably not really intentional (possibly spurious), and Windows likely
doesn't query it after bootup either. Maybe the value is kept at '2'
normally, and is only set to the real value when a true asynchronous event
(such as AC plug event, battery switch, etc.) occurs.
So I /think/ this is a grey area of the ACPI spec: per the letter of the
spec the _PPC value only changes when notified, so there's no reason to
query it after the system has booted up. So in my opinion the best (and
most compatible) strategy would be to make the change below and not
evaluate the _PPC object in the acpi_processor_get_performance_info() call,
but only evaluate it if _PPC is present during CPU object init, or if it's
notified during an asynchronous event. This change is more permissive than
the previous logic, so it definitely shouldn't break any existing system.
This also happens to fix my laptop, which is merrily chugging along at
1.83 GHz now. Yay!
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Dave Jones <davej@redhat.com>
Acked-by: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff --git a/drivers/acpi/processor_perflib.c b/drivers/acpi/processor_perflib.c
index 5207f9e4b443..cbb6f0814ce2 100644
--- a/drivers/acpi/processor_perflib.c
+++ b/drivers/acpi/processor_perflib.c
@@ -322,10 +322,6 @@ static int acpi_processor_get_performance_info(struct acpi_processor *pr)
if (result)
return result;
- result = acpi_processor_get_platform_limit(pr);
- if (result)
- return result;
-
return 0;
}
commit 1b5180b65122666a36a1a232b7b9b38b21a9dcdd
Author: Ingo Molnar <mingo@elte.hu>
Date: Tue Jan 23 10:45:50 2007 +0100
[PATCH] notifiers: fix blocking_notifier_call_chain() scalability
While lock-profiling the -rt kernel I noticed weird contention during
mmap-intensive workloads, and the tracer showed the following gem in one
of our MM hotpaths:
threaded-2771 1.... 65us : sys_munmap (sysenter_do_call)
threaded-2771 1.... 66us : profile_munmap (sys_munmap)
threaded-2771 1.... 66us : blocking_notifier_call_chain (profile_munmap)
threaded-2771 1.... 66us : rt_down_read (blocking_notifier_call_chain)
Ouch! A global rw-semaphore taken in one of the most performance-
sensitive codepaths of the kernel. And I don't even have oprofile
enabled! All distro kernels have CONFIG_PROFILING enabled, so this
scalability problem affects the majority of Linux users.
The fix is to enhance blocking_notifier_call_chain() to only take the
lock if there appears to be work on the call-chain.
With this patch applied I get a nicely saturated system, and much higher
munmap performance, on SMP systems.
And as a bonus this also fixes a similar scalability bottleneck in the
thread-exit codepath: profile_task_exit() ...
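For context, the only users of these profile notifier chains are clients such as
oprofile that attach via profile_event_register(); on most systems nothing is ever
registered, so with the head check the hotpath never takes the rwsem at all. A
hedged sketch of such a client (module and callback names are invented for
illustration):
#include <linux/module.h>
#include <linux/notifier.h>
#include <linux/profile.h>
/* Illustrative client: called from profile_munmap() on every munmap(). */
static int munmap_event(struct notifier_block *nb, unsigned long val, void *data)
{
	/* 'data' carries the unmapped address; just acknowledge it here. */
	return NOTIFY_OK;
}
static struct notifier_block munmap_nb = {
	.notifier_call	= munmap_event,
};
static int __init munmap_watch_init(void)
{
	/* Registration makes nh->head non-NULL, so the chain is walked again. */
	return profile_event_register(PROFILE_MUNMAP, &munmap_nb);
}
static void __exit munmap_watch_exit(void)
{
	profile_event_unregister(PROFILE_MUNMAP, &munmap_nb);
}
module_init(munmap_watch_init);
module_exit(munmap_watch_exit);
MODULE_LICENSE("GPL");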
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff --git a/kernel/sys.c b/kernel/sys.c
index c7675c1bfdf2..6e2101dec0fc 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -323,11 +323,18 @@ EXPORT_SYMBOL_GPL(blocking_notifier_chain_unregister);
int blocking_notifier_call_chain(struct blocking_notifier_head *nh,
unsigned long val, void *v)
{
- int ret;
+ int ret = NOTIFY_DONE;
- down_read(&nh->rwsem);
- ret = notifier_call_chain(&nh->head, val, v);
- up_read(&nh->rwsem);
+ /*
+ * We check the head outside the lock, but if this access is
+ * racy then it does not matter what the result of the test
+ * is, we re-check the list after having taken the lock anyway:
+ */
+ if (rcu_dereference(nh->head)) {
+ down_read(&nh->rwsem);
+ ret = notifier_call_chain(&nh->head, val, v);
+ up_read(&nh->rwsem);
+ }
return ret;
}
commit 0dbe5a111382fd1320ff4b1d889e5b8c41290619
Author: Ingo Molnar <mingo@elte.hu>
Date: Mon Jan 22 20:40:36 2007 -0800
[PATCH] paravirt: mark the paravirt_ops export internal
The paravirt subsystem is still in flux, so all exports from it are
definitely internal-use only. The APIs around this /will/ change.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <ak@suse.de>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff --git a/arch/i386/kernel/paravirt.c b/arch/i386/kernel/paravirt.c
index 3dceab5828f1..e55fd05da0f5 100644
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -566,4 +566,11 @@ struct paravirt_ops paravirt_ops = {
.irq_enable_sysexit = native_irq_enable_sysexit,
.iret = native_iret,
};
-EXPORT_SYMBOL(paravirt_ops);
+
+/*
+ * NOTE: CONFIG_PARAVIRT is experimental and the paravirt_ops
+ * semantics are subject to change. Hence we only do this
+ * internal-only export of this, until it gets sorted out and
+ * all lowlevel CPU ops used by modules are separately exported.
+ */
+EXPORT_SYMBOL_GPL(paravirt_ops);
commit 07031e14c1127fc7e1a5b98dfcc59f434e025104
Author: Ingo Molnar <mingo@elte.hu>
Date: Wed Jan 10 23:15:38 2007 -0800
[PATCH] KVM: add VM-exit profiling
This adds the profile=kvm boot option, which enables KVM to profile VM
exits.
Use: "readprofile -m ./System.map | sort -n" to see the resulting
output:
[...]
18246 serial_out 148.3415
18945 native_flush_tlb 378.9000
23618 serial_in 212.7748
29279 __spin_unlock_irq 622.9574
43447 native_apic_write 2068.9048
52702 enable_8259A_irq 742.2817
54250 vgacon_scroll 89.3740
67394 ide_inb 6126.7273
79514 copy_page_range 98.1654
84868 do_wp_page 86.6000
140266 pit_read 783.6089
151436 ide_outb 25239.3333
152668 native_io_delay 21809.7143
174783 mask_and_ack_8259A 783.7803
362404 native_set_pte_at 36240.4000
1688747 total 0.5009
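(The first column above is the raw hit count per symbol, the last roughly that
count normalized by the symbol's size.) Conceptually, every profiled RIP is folded
into a histogram bucket indexed by its offset into the kernel text, scaled by the
shift given on the command line (e.g. profile=kvm,2). A simplified sketch of the
bucketing done by profile_hits() (parameter names here are illustrative, not the
exact kernel code):
#include <asm/atomic.h>
static void account_exit_rip(atomic_t *buf, unsigned long buf_len,
			     unsigned long shift,
			     unsigned long rip, unsigned long text_start)
{
	unsigned long bucket = (rip - text_start) >> shift;
	if (bucket >= buf_len)		/* clamp RIPs outside the covered range */
		bucket = buf_len - 1;
	atomic_inc(&buf[bucket]);	/* readprofile later maps buckets to symbols */
}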
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c
index ccc06b1b91b5..714f6a7841cd 100644
--- a/drivers/kvm/svm.c
+++ b/drivers/kvm/svm.c
@@ -17,6 +17,7 @@
#include <linux/module.h>
#include <linux/vmalloc.h>
#include <linux/highmem.h>
+#include <linux/profile.h>
#include <asm/desc.h>
#include "kvm_svm.h"
@@ -1558,6 +1559,13 @@ static int svm_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
reload_tss(vcpu);
+ /*
+ * Profile KVM exit RIPs:
+ */
+ if (unlikely(prof_on == KVM_PROFILING))
+ profile_hit(KVM_PROFILING,
+ (void *)(unsigned long)vcpu->svm->vmcb->save.rip);
+
stgi();
kvm_reput_irq(vcpu);
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index d4701cb4c654..ce219e3f557f 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -21,6 +21,7 @@
#include <linux/module.h>
#include <linux/mm.h>
#include <linux/highmem.h>
+#include <linux/profile.h>
#include <asm/io.h>
#include <asm/desc.h>
@@ -1859,6 +1860,12 @@ static int vmx_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
asm ("mov %0, %%ds; mov %0, %%es" : : "r"(__USER_DS));
#endif
+ /*
+ * Profile KVM exit RIPs:
+ */
+ if (unlikely(prof_on == KVM_PROFILING))
+ profile_hit(KVM_PROFILING, (void *)vmcs_readl(GUEST_RIP));
+
kvm_run->exit_type = 0;
if (fail) {
kvm_run->exit_type = KVM_EXIT_TYPE_FAIL_ENTRY;
diff --git a/include/linux/profile.h b/include/linux/profile.h
index 5670b340c4ef..eec48f5f9348 100644
--- a/include/linux/profile.h
+++ b/include/linux/profile.h
@@ -15,6 +15,7 @@ extern int prof_on __read_mostly;
#define CPU_PROFILING 1
#define SCHED_PROFILING 2
#define SLEEP_PROFILING 3
+#define KVM_PROFILING 4
struct proc_dir_entry;
struct pt_regs;
diff --git a/kernel/profile.c b/kernel/profile.c
index 11550b2290b6..a6574a18514e 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -40,7 +40,10 @@ int (*timer_hook)(struct pt_regs *) __read_mostly;
static atomic_t *prof_buffer;
static unsigned long prof_len, prof_shift;
+
int prof_on __read_mostly;
+EXPORT_SYMBOL_GPL(prof_on);
+
static cpumask_t prof_cpu_mask = CPU_MASK_ALL;
#ifdef CONFIG_SMP
static DEFINE_PER_CPU(struct profile_hit *[2], cpu_profile_hits);
@@ -52,6 +55,7 @@ static int __init profile_setup(char * str)
{
static char __initdata schedstr[] = "schedule";
static char __initdata sleepstr[] = "sleep";
+ static char __initdata kvmstr[] = "kvm";
int par;
if (!strncmp(str, sleepstr, strlen(sleepstr))) {
@@ -72,6 +76,15 @@ static int __init profile_setup(char * str)
printk(KERN_INFO
"kernel schedule profiling enabled (shift: %ld)\n",
prof_shift);
+ } else if (!strncmp(str, kvmstr, strlen(kvmstr))) {
+ prof_on = KVM_PROFILING;
+ if (str[strlen(kvmstr)] == ',')
+ str += strlen(kvmstr) + 1;
+ if (get_option(&str, &par))
+ prof_shift = par;
+ printk(KERN_INFO
+ "kernel KVM profiling enabled (shift: %ld)\n",
+ prof_shift);
} else if (get_option(&str, &par)) {
prof_shift = par;
prof_on = CPU_PROFILING;
@@ -318,6 +331,7 @@ void profile_hits(int type, void *__pc, unsigned int nr_hits)
local_irq_restore(flags);
put_cpu();
}
+EXPORT_SYMBOL_GPL(profile_hits);
static int __devinit profile_cpu_callback(struct notifier_block *info,
unsigned long action, void *__cpu)
commit 68a99f6d37aa65e848e09ec6ea52848e93bd5de2
Author: Ingo Molnar <mingo@elte.hu>
Date: Fri Jan 5 16:36:59 2007 -0800
[PATCH] KVM: Simplify mmu_alloc_roots()
Small optimization/cleanup: kvm_mmu_get_page() already returns the
struct kvm_mmu_page descriptor, and page == page_header(page->page_hpa),
so keep the returned pointer instead of recovering it from the root hpa.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
diff --git a/drivers/kvm/mmu.c b/drivers/kvm/mmu.c
index 4118e504e647..c6f972914f08 100644
--- a/drivers/kvm/mmu.c
+++ b/drivers/kvm/mmu.c
@@ -820,9 +820,9 @@ static void mmu_alloc_roots(struct kvm_vcpu *vcpu)
hpa_t root = vcpu->mmu.root_hpa;
ASSERT(!VALID_PAGE(root));
- root = kvm_mmu_get_page(vcpu, root_gfn, 0,
- PT64_ROOT_LEVEL, 0, NULL)->page_hpa;
- page = page_header(root);
+ page = kvm_mmu_get_page(vcpu, root_gfn, 0,
+ PT64_ROOT_LEVEL, 0, NULL);
+ root = page->page_hpa;
++page->root_count;
vcpu->mmu.root_hpa = root;
return;
@@ -836,10 +836,10 @@ static void mmu_alloc_roots(struct kvm_vcpu *vcpu)
root_gfn = vcpu->pdptrs[i] >> PAGE_SHIFT;
else if (vcpu->mmu.root_level == 0)
root_gfn = 0;
- root = kvm_mmu_get_page(vcpu, root_gfn, i << 30,
+ page = kvm_mmu_get_page(vcpu, root_gfn, i << 30,
PT32_ROOT_LEVEL, !is_paging(vcpu),
- NULL)->page_hpa;
- page = page_header(root);
+ NULL);
+ root = page->page_hpa;
++page->root_count;
vcpu->mmu.pae_root[i] = root | PT_PRESENT_MASK;
}
commit d21225ee2b6fa9f7669526927f2e0bedebd90940
Author: Ingo Molnar <mingo@elte.hu>
Date: Fri Jan 5 16:36:59 2007 -0800
[PATCH] KVM: Make loading cr3 more robust
Prevent the guest's loading of a corrupt cr3 (pointing at no guest physical
page) from crashing the host.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 0675d3e51692..67c1154960f0 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -463,7 +463,19 @@ void set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
vcpu->cr3 = cr3;
spin_lock(&vcpu->kvm->lock);
- vcpu->mmu.new_cr3(vcpu);
+ /*
+ * Does the new cr3 value map to physical memory? (Note, we
+ * catch an invalid cr3 even in real-mode, because it would
+ * cause trouble later on when we turn on paging anyway.)
+ *
+ * A real CPU would silently accept an invalid cr3 and would
+ * attempt to use it - with largely undefined (and often hard
+ * to debug) behavior on the guest side.
+ */
+ if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT)))
+ inject_gp(vcpu);
+ else
+ vcpu->mmu.new_cr3(vcpu);
spin_unlock(&vcpu->kvm->lock);
}
EXPORT_SYMBOL_GPL(set_cr3);
commit 7f7417d67ea6c1538469e3ea005484e807642c0a
Author: Ingo Molnar <mingo@elte.hu>
Date: Fri Jan 5 16:36:57 2007 -0800
[PATCH] KVM: Avoid oom on cr3 switch
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
diff --git a/drivers/kvm/mmu.c b/drivers/kvm/mmu.c
index 6e7381bc2218..4118e504e647 100644
--- a/drivers/kvm/mmu.c
+++ b/drivers/kvm/mmu.c
@@ -905,6 +905,8 @@ static void paging_new_cr3(struct kvm_vcpu *vcpu)
{
pgprintk("%s: cr3 %lx\n", __FUNCTION__, vcpu->cr3);
mmu_free_roots(vcpu);
+ if (unlikely(vcpu->kvm->n_free_mmu_pages < KVM_MIN_FREE_MMU_PAGES))
+ kvm_mmu_free_some_pages(vcpu);
mmu_alloc_roots(vcpu);
kvm_mmu_flush_tlb(vcpu);
kvm_arch_ops->set_cr3(vcpu, vcpu->mmu.root_hpa);
commit a75acf850ca80136a4f845cf9a7cd26e7465c1f4
Author: Ingo Molnar <mingo@elte.hu>
Date: Fri Jan 5 16:36:29 2007 -0800
[PATCH] profiling: fix sched profiling typo
Fix a sched profiling typo introduced by the sleep profiling patch: the
schedule branch compared against sleepstr instead of schedstr, so
profile=sched did not work.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
diff --git a/kernel/profile.c b/kernel/profile.c
index fb5e03d57e9d..11550b2290b6 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -63,7 +63,7 @@ static int __init profile_setup(char * str)
printk(KERN_INFO
"kernel sleep profiling enabled (shift: %ld)\n",
prof_shift);
- } else if (!strncmp(str, sleepstr, strlen(sleepstr))) {
+ } else if (!strncmp(str, schedstr, strlen(schedstr))) {
prof_on = SCHED_PROFILING;
if (str[strlen(schedstr)] == ',')
str += strlen(schedstr) + 1;
commit d3b2c33860d4acdfe3ac29b40b03e655eb8d1e2c
Author: Ingo Molnar <mingo@elte.hu>
Date: Fri Jan 5 16:36:23 2007 -0800
[PATCH] KVM: Use raw_smp_processor_id() instead of smp_processor_id() where applicable
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index fbab07af6574..2d204fd45972 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -116,7 +116,7 @@ static void vmcs_clear(struct vmcs *vmcs)
static void __vcpu_clear(void *arg)
{
struct kvm_vcpu *vcpu = arg;
- int cpu = smp_processor_id();
+ int cpu = raw_smp_processor_id();
if (vcpu->cpu == cpu)
vmcs_clear(vcpu->vmcs);
@@ -541,7 +541,7 @@ static struct vmcs *alloc_vmcs_cpu(int cpu)
static struct vmcs *alloc_vmcs(void)
{
- return alloc_vmcs_cpu(smp_processor_id());
+ return alloc_vmcs_cpu(raw_smp_processor_id());
}
static void free_vmcs(struct vmcs *vmcs)
commit 965b58a550b6f84815cb555e6abb953e863f1610
Author: Ingo Molnar <mingo@elte.hu>
Date: Fri Jan 5 16:36:23 2007 -0800
[PATCH] KVM: Fix GFP_KERNEL alloc in atomic section bug
KVM does a GFP_KERNEL kmalloc() with preemption disabled via
vcpu_load(). Fix this by moving the ->*_msr setup from the vcpu_setup
method to the vcpu_create method.
(This is also a small speedup, since vcpu_setup can in theory be called
more frequently than vcpu_create.)
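The underlying pattern, as a minimal hedged sketch (struct and function names are
invented here, not the KVM API): do the sleeping GFP_KERNEL allocations in the
create path, before any preemption-disabled region, and keep the setup path
allocation-free:
#include <linux/slab.h>
#include <linux/string.h>
#include <asm/page.h>
struct demo_vcpu {			/* illustrative stand-in, not struct kvm_vcpu */
	void *guest_msrs;
	void *host_msrs;
};
static int demo_vcpu_create(struct demo_vcpu *v)	/* may sleep */
{
	v->guest_msrs = kmalloc(PAGE_SIZE, GFP_KERNEL);
	v->host_msrs  = kmalloc(PAGE_SIZE, GFP_KERNEL);
	if (!v->guest_msrs || !v->host_msrs) {
		kfree(v->guest_msrs);	/* kfree(NULL) is a no-op */
		kfree(v->host_msrs);
		return -ENOMEM;
	}
	return 0;
}
static void demo_vcpu_setup(struct demo_vcpu *v)	/* preemption disabled */
{
	/* Only touch the preallocated buffers; no allocation allowed here. */
	memset(v->guest_msrs, 0, PAGE_SIZE);
	memset(v->host_msrs, 0, PAGE_SIZE);
}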
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index d0a2c2d5342a..fbab07af6574 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -1094,14 +1094,6 @@ static int vmx_vcpu_setup(struct kvm_vcpu *vcpu)
rdmsrl(MSR_IA32_SYSENTER_EIP, a);
vmcs_writel(HOST_IA32_SYSENTER_EIP, a); /* 22.2.3 */
- ret = -ENOMEM;
- vcpu->guest_msrs = kmalloc(PAGE_SIZE, GFP_KERNEL);
- if (!vcpu->guest_msrs)
- goto out;
- vcpu->host_msrs = kmalloc(PAGE_SIZE, GFP_KERNEL);
- if (!vcpu->host_msrs)
- goto out_free_guest_msrs;
-
for (i = 0; i < NR_VMX_MSR; ++i) {
u32 index = vmx_msr_index[i];
u32 data_low, data_high;
@@ -1155,8 +1147,6 @@ static int vmx_vcpu_setup(struct kvm_vcpu *vcpu)
return 0;
-out_free_guest_msrs:
- kfree(vcpu->guest_msrs);
out:
return ret;
}
@@ -1906,13 +1896,33 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
{
struct vmcs *vmcs;
+ vcpu->guest_msrs = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!vcpu->guest_msrs)
+ return -ENOMEM;
+
+ vcpu->host_msrs = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!vcpu->host_msrs)
+ goto out_free_guest_msrs;
+
vmcs = alloc_vmcs();
if (!vmcs)
- return -ENOMEM;
+ goto out_free_msrs;
+
vmcs_clear(vmcs);
vcpu->vmcs = vmcs;
vcpu->launched = 0;
+
return 0;
+
+out_free_msrs:
+ kfree(vcpu->host_msrs);
+ vcpu->host_msrs = NULL;
+
+out_free_guest_msrs:
+ kfree(vcpu->guest_msrs);
+ vcpu->guest_msrs = NULL;
+
+ return -ENOMEM;
}
static struct kvm_arch_ops vmx_arch_ops = {