[RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

[RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

by Dan Magenheimer-2

Tmem [PATCH 0/4] (Take 2): Transcendent memory
Transcendent memory - Take 2
Changes since take 1:
1) Patches can be applied serially; function names in diff (Rik van Riel)
2) Descriptions and diffstats for individual patches (Rik van Riel)
3) Restructure of tmem_ops to be more Linux-like (Jeremy Fitzhardinge)
4) Drop shared pools until security implications are understood (Pavel
   Machek and Jeremy Fitzhardinge)
5) Documentation/transcendent-memory.txt added including API description
   (see also below for API description).

Signed-off-by: Dan Magenheimer <dan.magenheimer@...>

Normal memory is directly addressable by the kernel, of a known
normally-fixed size, synchronously accessible, and persistent (though
not across a reboot).

What if there were a class of memory that is of unknown and dynamically
variable size, is addressable only indirectly by the kernel, can be
configured either as persistent or as "ephemeral" (meaning it will be
around for a while, but might disappear without warning), and is still
fast enough to be synchronously accessible?

We call this latter class "transcendent memory" and it provides an
interesting opportunity to more efficiently utilize RAM in a virtualized
environment.  However this "memory but not really memory" may also have
applications in NON-virtualized environments, such as hotplug-memory
deletion, SSDs, and page cache compression.  Others have suggested ideas
such as allowing use of highmem memory without a highmem kernel, or use
of spare video memory.

Transcendent memory, or "tmem" for short, provides a well-defined API to
access this unusual class of memory.  (A summary of the API is provided
below.)  The basic operations are page-copy-based and use a flexible
object-oriented addressing mechanism.  Tmem assumes that some "privileged
entity" is capable of executing tmem requests and storing pages of data;
this entity is currently a hypervisor and operations are performed via
hypercalls, but the entity could be a kernel policy, or perhaps a
"memory node" in a cluster of blades connected by a high-speed
interconnect such as hypertransport or QPI.

Since tmem is not directly accessible and because page copying is done
to/from physical pageframes, it is more suitable for in-kernel memory needs
than for userland applications.  However, there may be as-yet undiscovered
userland possibilities.

With the tmem concept outlined and its broader potential hinted at,
here is an overview of two existing examples of how tmem can be used by
the kernel.

"Precache" can be thought of as a page-granularity victim cache for clean
pages that the kernel's pageframe replacement algorithm (PFRA) would like
to keep around, but can't since there isn't enough memory.   So when the
PFRA "evicts" a page, it first puts it into the precache via a call to
tmem.  And any time a filesystem reads a page from disk, it first attempts
to get the page from precache.  If it's there, a disk access is eliminated.
If not, the filesystem just goes to the disk like normal.  Precache is
"ephemeral" so whether a page is kept in precache (between the "put" and
the "get") is dependent on a number of factors that are invisible to
the kernel.
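
As a rough userspace sketch of the precache read path just described
(not code from the patch), the fragment below models a one-page
ephemeral cache and a fake disk.  All names (precache_put, precache_get,
disk_read, read_page) and the single-slot cache are assumptions made
purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES 4096

/* Hypothetical stand-ins: a one-page ephemeral cache and a "disk". */
static unsigned char cached_page[PAGE_BYTES];
static bool cache_valid;
static uint64_t cached_ino;
static uint32_t cached_index;
static int disk_reads;

/* Called when the PFRA evicts a clean page; the copy may vanish later. */
static void precache_put(uint64_t ino, uint32_t index, const void *page)
{
	memcpy(cached_page, page, PAGE_BYTES);
	cached_ino = ino;
	cached_index = index;
	cache_valid = true;
}

/* Returns true and fills 'page' on a hit; a miss is always legal. */
static bool precache_get(uint64_t ino, uint32_t index, void *page)
{
	if (!cache_valid || cached_ino != ino || cached_index != index)
		return false;
	memcpy(page, cached_page, PAGE_BYTES);
	cache_valid = false;	/* a get on a private pool is destructive */
	return true;
}

/* Fake disk: every page reads back as zeroes, and reads are counted. */
static void disk_read(uint64_t ino, uint32_t index, void *page)
{
	(void)ino;
	(void)index;
	memset(page, 0, PAGE_BYTES);
	disk_reads++;
}

/* Read path: try precache first, fall back to the disk as usual. */
static void read_page(uint64_t ino, uint32_t index, void *page)
{
	if (!precache_get(ino, index, page))
		disk_read(ino, index, page);
}
```

The point of the sketch is that the caller never depends on a hit: a
miss costs one failed lookup and then takes the normal disk path.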

"Preswap" IS persistent, but for various reasons may not always be
available for use, again due to factors that may not be visible to the
kernel (but, briefly, if the kernel is being "good" and has shared its
resources nicely, then it will be able to use preswap, else it will not).
Once a page is put, a get on the page will always succeed.  So when the
kernel finds itself in a situation where it needs to swap out a page, it
first attempts to use preswap.  If the put works, a disk write and
(usually) a disk read are avoided.  If it doesn't, the page is written
to swap as usual.  Unlike precache, whether a page is stored in preswap
vs swap is recorded in kernel data structures, so when a page needs to
be fetched, the kernel does a get if it is in preswap and reads from
swap if it is not in preswap.
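
The preswap flow above can be sketched in the same toy style.  This is
an illustration under stated assumptions, not the patch itself: the
fixed capacity PRESWAP_MAX, the in_preswap[] array standing in for the
"kernel data structures", and all function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES  4096
#define SWAP_SLOTS  4
#define PRESWAP_MAX 2	/* hypothetical; real availability is dynamic */

static unsigned char preswap_mem[SWAP_SLOTS][PAGE_BYTES];
static unsigned char swap_disk[SWAP_SLOTS][PAGE_BYTES];
static bool in_preswap[SWAP_SLOTS];	/* where did each page go? */
static int preswap_used;
static int disk_writes;

/* A put may be rejected; once accepted, the page stays until fetched. */
static bool preswap_put(uint32_t slot, const void *page)
{
	if (preswap_used >= PRESWAP_MAX)
		return false;
	memcpy(preswap_mem[slot], page, PAGE_BYTES);
	preswap_used++;
	return true;
}

/* Swap-out: try preswap first, otherwise write to the swap device. */
static void swap_out(uint32_t slot, const void *page)
{
	if (preswap_put(slot, page)) {
		in_preswap[slot] = true;
		return;
	}
	in_preswap[slot] = false;
	memcpy(swap_disk[slot], page, PAGE_BYTES);
	disk_writes++;
}

/* Swap-in: the recorded location decides between a get and a disk read. */
static void swap_in(uint32_t slot, void *page)
{
	if (in_preswap[slot]) {
		memcpy(page, preswap_mem[slot], PAGE_BYTES);
		in_preswap[slot] = false;
		preswap_used--;
	} else {
		memcpy(page, swap_disk[slot], PAGE_BYTES);
	}
}
```

Unlike the precache case, the kernel side must remember where each page
went, because a persistent page is never looked up speculatively.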

Both precache and preswap may optionally be compressed, trading a roughly
2x space reduction against roughly 10x slower access.  Precache also has a
sharing feature, which allows different nodes in a "virtual cluster"
to share a local page cache.

Tmem has some similarity to IBM's Collaborative Memory Management, but
creates more of a partnership between the kernel and the "privileged
entity" and is not very invasive.  Tmem may be applicable for KVM and
containers; there is some disagreement on the extent of its value.
Tmem is highly complementary to ballooning (aka page granularity hot
plug) and memory deduplication (aka transparent content-based page
sharing) but still has value when neither is present.

Performance is difficult to quantify.  Some benchmarks respond very
favorably to increases in memory, and tmem may do quite well on those,
but the gain depends on how much tmem is available, which may vary
widely and dynamically with conditions completely outside of the
system being measured.  Ideas on how best to provide useful metrics
would be appreciated.

Tmem is now supported in Xen's unstable tree (targeted for the Xen 3.5
release) and in Xen's Linux 2.6.18-xen source tree.  Again, Xen is not
necessarily a requirement, but currently provides the only existing
implementation of tmem.

Lots more information about tmem can be found at:
http://oss.oracle.com/projects/tmem and there will be
a talk about it on the first day of Linux Symposium in July 2009.
Tmem is the result of a group effort, including Dan Magenheimer,
Chris Mason, Dave McCracken, Kurt Hackel and Zhigang Wang, with helpful
input from Jeremy Fitzhardinge, Keir Fraser, Ian Pratt, Sunil Mushran,
Joel Becker, and Jan Beulich.

THE TRANSCENDENT MEMORY API

Transcendent memory is made up of a set of pools.  Each pool is made
up of a set of objects.  And each object contains a set of pages.
The combination of a 32-bit pool id, a 64-bit object id, and a 32-bit
page id uniquely identifies a page of tmem data, and this tuple is called
a "handle." Commonly, the three parts of a handle are used to address
a filesystem, a file within that filesystem, and a page within that file;
however an OS can use any values as long as they uniquely identify
a page of data.
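
For illustration only, the handle could be modeled as a plain C struct.
The names tmem_handle and tmem_handle_eq are hypothetical and do not
appear in the patch; only the field widths are taken from the
description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical handle type; field widths follow the text above. */
struct tmem_handle {
	uint32_t pool_id;	/* e.g. one pool per mounted filesystem */
	uint64_t object;	/* e.g. an inode number in that filesystem */
	uint32_t index;		/* e.g. a page offset within that file */
};

/* Two handles name the same tmem page only if all three parts match. */
static bool tmem_handle_eq(struct tmem_handle a, struct tmem_handle b)
{
	return a.pool_id == b.pool_id &&
	       a.object == b.object &&
	       a.index == b.index;
}
```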

When a tmem pool is created, it is given certain attributes: It can
be private or shared, and it can be persistent or ephemeral.  Each
combination of these attributes provides a different set of useful
functionality and also defines a slightly different set of semantics
for the various operations on the pool.  Other pool attributes include
the size of the page and a version number.

Once a pool is created, operations are performed on the pool.  Pages
are copied between the OS and tmem and are addressed using a handle.
Pages and/or objects may also be flushed from the pool.  When all
operations are completed, a pool can be destroyed.

The specific tmem functions are called in Linux through a set of
accessor functions:

int (*new_pool)(struct tmem_pool_uuid uuid, u32 flags);
int (*destroy_pool)(u32 pool_id);
int (*put_page)(u32 pool_id, u64 object, u32 index, unsigned long pfn);
int (*get_page)(u32 pool_id, u64 object, u32 index, unsigned long pfn);
int (*flush_page)(u32 pool_id, u64 object, u32 index);
int (*flush_object)(u32 pool_id, u64 object);

The new_pool accessor creates a new pool and returns a pool id
which is a non-negative 32-bit integer.  If the flags parameter
specifies that the pool is to be shared, the uuid is a 128-bit "shared
secret" else it is ignored.  The destroy_pool accessor destroys the pool.
(Note: shared pools are not supported until security implications
are better understood.)

The put_page accessor copies a page of data from the specified pageframe
and associates it with the specified handle.

The get_page accessor looks up a page of data in tmem associated with
the specified handle and, if found, copies it to the specified pageframe.

The flush_page accessor ensures that subsequent gets of a page with
the specified handle will fail.  The flush_object accessor ensures
that subsequent gets of any page matching the pool id and object
will fail.
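
One possible way to group these accessors, sketched in userspace C so it
compiles on its own.  The struct layout, the tmem_backend pointer, the
tmem_try_put wrapper, the always_accept_put stub, and the "0 means
rejected, 1 means accepted" return convention are all assumptions for
illustration; the text above only specifies the prototypes and that the
uuid is 128 bits.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t u32;
typedef uint64_t u64;

/* Assumed layout: the text only says the uuid is 128 bits. */
struct tmem_pool_uuid {
	u64 lo;
	u64 hi;
};

/* Accessor table mirroring the prototypes listed above. */
struct tmem_ops {
	int (*new_pool)(struct tmem_pool_uuid uuid, u32 flags);
	int (*destroy_pool)(u32 pool_id);
	int (*put_page)(u32 pool_id, u64 object, u32 index, unsigned long pfn);
	int (*get_page)(u32 pool_id, u64 object, u32 index, unsigned long pfn);
	int (*flush_page)(u32 pool_id, u64 object, u32 index);
	int (*flush_object)(u32 pool_id, u64 object);
};

/* NULL until some backend (e.g. a hypervisor driver) registers itself. */
static const struct tmem_ops *tmem_backend;

/* Hypothetical wrapper: with no backend, every put is simply rejected. */
static int tmem_try_put(u32 pool_id, u64 object, u32 index, unsigned long pfn)
{
	if (!tmem_backend || !tmem_backend->put_page)
		return 0;
	return tmem_backend->put_page(pool_id, object, index, pfn);
}

/* Stub backend operation used only to exercise the wrapper. */
static int always_accept_put(u32 pool_id, u64 object, u32 index,
			     unsigned long pfn)
{
	(void)pool_id;
	(void)object;
	(void)index;
	(void)pfn;
	return 1;
}
```

Because callers must already cope with a rejected put, a missing backend
needs no special handling beyond returning the "rejected" value.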

There are many subtle but critical behaviors for get_page and put_page:
- Any put_page (with one notable exception) may be rejected and the client
  must be prepared to deal with that failure.  A put_page copies, NOT moves,
  data; that is the data exists in both places.  Linux is responsible for
  destroying or overwriting its own copy, or alternately managing any
  coherency between the copies.
- Every page successfully put to a persistent pool must be found by a
  subsequent get_page that specifies the same handle.  A page successfully
  put to an ephemeral pool has an indeterminate lifetime and even an
  immediately subsequent get_page may fail.
- A get_page to a private pool is destructive, that is it behaves as if
  the get_page were atomically followed by a flush_page.  A get_page
  to a shared pool is non-destructive.  A flush_page behaves just like
  a get_page to a private pool except the data is thrown away.
- Put-put-get coherency is guaranteed.  For example, after the sequence:
        put_page(ABC,D1);
        put_page(ABC,D2);
        get_page(ABC,E)
  E may never contain the data from D1.  However, even for a persistent
  pool, the get_page may fail if the second put_page indicates failure.
- Get-get coherency is guaranteed.  For example, in the sequence:
        put_page(ABC,D);
        get_page(ABC,E1);
        get_page(ABC,E2)
  if the first get_page fails, the second must also fail.
- A tmem implementation provides no serialization guarantees (e.g. to
  an SMP Linux).  So if different Linux threads are putting and flushing
  the same page, the results are indeterminate; ordering is not
  guaranteed and must be synchronized by Linux.
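
The behaviors listed above for a private, persistent pool can be
exercised against a toy in-memory model.  This is a userspace sketch,
not the hypervisor implementation: the toy_* names, the fixed slot
table, the use of plain buffers in place of pageframe numbers, and the
0/1 return convention are all assumptions.

```c
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096
#define TOY_SLOTS     8

/* One stored page, keyed by the (pool, object, index) handle. */
struct toy_slot {
	int used;
	uint32_t pool_id;
	uint64_t object;
	uint32_t index;
	unsigned char data[TOY_PAGE_SIZE];
};

static struct toy_slot toy_store[TOY_SLOTS];

static struct toy_slot *toy_find(uint32_t pool_id, uint64_t object,
				 uint32_t index)
{
	for (int i = 0; i < TOY_SLOTS; i++) {
		struct toy_slot *s = &toy_store[i];

		if (s->used && s->pool_id == pool_id &&
		    s->object == object && s->index == index)
			return s;
	}
	return NULL;
}

/* Copy a page in.  Returns 1 on success, 0 if the put is rejected. */
static int toy_put_page(uint32_t pool_id, uint64_t object, uint32_t index,
			const void *page)
{
	struct toy_slot *s = toy_find(pool_id, object, index);

	/* No existing entry for this handle: look for a free slot. */
	for (int i = 0; !s && i < TOY_SLOTS; i++)
		if (!toy_store[i].used)
			s = &toy_store[i];
	if (!s)
		return 0;	/* store is full: the caller must cope */
	s->used = 1;
	s->pool_id = pool_id;
	s->object = object;
	s->index = index;
	memcpy(s->data, page, TOY_PAGE_SIZE);	/* copy, not move */
	return 1;
}

/* Copy a page out and drop it (a private-pool get is destructive). */
static int toy_get_page(uint32_t pool_id, uint64_t object, uint32_t index,
			void *page)
{
	struct toy_slot *s = toy_find(pool_id, object, index);

	if (!s)
		return 0;
	memcpy(page, s->data, TOY_PAGE_SIZE);
	s->used = 0;
	return 1;
}

/* Like a private-pool get, except the data is thrown away. */
static void toy_flush_page(uint32_t pool_id, uint64_t object, uint32_t index)
{
	struct toy_slot *s = toy_find(pool_id, object, index);

	if (s)
		s->used = 0;
}

/* Drop every page that matches the pool id and object. */
static void toy_flush_object(uint32_t pool_id, uint64_t object)
{
	for (int i = 0; i < TOY_SLOTS; i++)
		if (toy_store[i].used && toy_store[i].pool_id == pool_id &&
		    toy_store[i].object == object)
			toy_store[i].used = 0;
}
```

In this model a second put to the same handle overwrites the first, so a
later get can only return the newer data, and a repeated get fails
because the first one removed the page.  The model does no locking,
which matches the lack of serialization guarantees noted above.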

Changed core kernel files:
 fs/buffer.c                              |    5 +
 fs/ext3/super.c                          |    2
 fs/mpage.c                               |    8 ++
 fs/super.c                               |    5 +
 include/linux/fs.h                       |    7 ++
 include/linux/swap.h                     |   57 +++++++++++++++++++++
 include/linux/sysctl.h                   |    1
 kernel/sysctl.c                          |   12 ++++
 mm/Kconfig                               |   26 +++++++++
 mm/Makefile                              |    3 +
 mm/filemap.c                             |   11 ++++
 mm/page_io.c                             |   12 ++++
 mm/swapfile.c                            |   46 ++++++++++++++--
 mm/truncate.c                            |   10 +++
 14 files changed, 199 insertions(+), 6 deletions(-)

Newly added core kernel files:
 Documentation/transcendent-memory.txt    |  175 +++++++++++++
 include/linux/tmem.h                     |   88 ++++++
 mm/precache.c                            |  134 ++++++++++
 mm/preswap.c                             |  273 +++++++++++++++++++++
 4 files changed, 670 insertions(+)

Changed xen-specific files:
 arch/x86/include/asm/xen/hypercall.h     |    8 +++
 drivers/xen/Makefile                     |    1
 include/xen/interface/tmem.h             |   43 +++++++++++++++++++++
 include/xen/interface/xen.h              |   22 ++++++++++
 4 files changed, 74 insertions(+)

Newly added xen-specific files:
 drivers/xen/tmem.c                       |   97 +++++++++++++++++++++
 include/xen/interface/tmem.h             |   43 +++++++++
 2 files changed, 140 insertions(+)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

by Rik van Riel

Dan Magenheimer wrote:

> "Preswap" IS persistent, but for various reasons may not always be
> available for use, again due to factors that may not be visible to the
> kernel (but, briefly, if the kernel is being "good" and has shared its
> resources nicely, then it will be able to use preswap, else it will not).
> Once a page is put, a get on the page will always succeed.

What happens when all of the free memory on a system
has been consumed by preswap by a few guests?

Will the system be unable to start another guest,
or is there some way to free the preswap memory?

RE: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

by Dan Magenheimer-2

> From: Rik van Riel [mailto:riel@...]

> Dan Magenheimer wrote:
> > "Preswap" IS persistent, but for various reasons may not always be
> > available for use, again due to factors that may not be visible to the
> > kernel (but, briefly, if the kernel is being "good" and has shared its
> > resources nicely, then it will be able to use preswap, else it will not).
> > Once a page is put, a get on the page will always succeed.
>
> What happens when all of the free memory on a system
> has been consumed by preswap by a few guests?
> Will the system be unable to start another guest,

The default policy (and only policy implemented as of now) is
that no guest is allowed to use more than max_mem for the
sum of directly-addressable memory (e.g. RAM) and persistent
tmem (e.g. preswap).  So if a guest is using its default
memory==max_mem and is doing no ballooning, nothing can
be put in preswap by that guest.
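
As a minimal sketch of that default policy (the struct, its field names,
and the function name are hypothetical, and all quantities are counted
in pages), the admission check for a persistent put might look like:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-guest accounting, all in pages. */
struct guest_mem {
	uint64_t max_mem;		/* configured ceiling for the guest */
	uint64_t ram_pages;		/* directly addressable memory held */
	uint64_t persistent_tmem;	/* pages held in persistent pools */
};

/*
 * One more persistent page is accepted only if the sum of RAM and
 * persistent tmem would still not exceed max_mem afterwards.
 */
static bool persistent_put_allowed(const struct guest_mem *g)
{
	return g->ram_pages + g->persistent_tmem < g->max_mem;
}
```

A guest sitting at memory == max_mem therefore fails the check on every
put, while a guest that has ballooned down gains exactly that much
headroom for preswap.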
 
> or is there some way to free the preswap memory?

Yes and no.  There is no way externally to free preswap
memory, but an in-guest userland root service can write to sysfs
to affect preswap size.  This essentially does a partial
swapoff on preswap if there is sufficient (directly addressable)
guest RAM available.  (I have this prototyped as part of
the xenballoond self-ballooning service in xen-unstable.)

Dan

Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

by Anthony Liguori-2

Dan Magenheimer wrote:

> Tmem [PATCH 0/4] (Take 2): Transcendent memory
> Transcendent memory - Take 2
> Changes since take 1:
> 1) Patches can be applied serially; function names in diff (Rik van Riel)
> 2) Descriptions and diffstats for individual patches (Rik van Riel)
> 3) Restructure of tmem_ops to be more Linux-like (Jeremy Fitzhardinge)
> 4) Drop shared pools until security implications are understood (Pavel
>    Machek and Jeremy Fitzhardinge)
> 5) Documentation/transcendent-memory.txt added including API description
>    (see also below for API description).
>
> Signed-off-by: Dan Magenheimer <dan.magenheimer@...>
>
> Normal memory is directly addressable by the kernel, of a known
> normally-fixed size, synchronously accessible, and persistent (though
> not across a reboot).
>
> What if there was a class of memory that is of unknown and dynamically
> variable size, is addressable only indirectly by the kernel, can be
> configured either as persistent or as "ephemeral" (meaning it will be
> around for awhile, but might disappear without warning), and is still
> fast enough to be synchronously accessible?

I have trouble mapping this to a VMM capable of overcommit without just
coming back to CMM2.

In CMM2 parlance, an ephemeral tmem pool is just normal kernel memory
marked in the volatile state, no?

It seems to me that an architecture built around hinting would be more
robust than having to use separate memory pools for this type of memory
(especially since you are requiring a copy to/from the pool).

For instance, you can mark data DMA'd from disk (perhaps by read-ahead)
as volatile without ever bringing it into the CPU cache.  With tmem, if
you wanted to use a tmem pool for all of the page cache, you'd likely
suffer significant overhead due to copying.

Regards,

Anthony Liguori

RE: [Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

by Dan Magenheimer-2

Hi Anthony --

Thanks for the comments.

> I have trouble mapping this to a VMM capable of overcommit
> without just coming back to CMM2.
>
> In CMM2 parlance, ephemeral tmem pools is just normal kernel memory
> marked in the volatile state, no?

They are similar in concept, but a volatile-marked kernel page
is still a kernel page, can be changed by a kernel (or user)
store instruction, and counts as part of the memory used
by the VM.  An ephemeral tmem page cannot be directly written
by a kernel (or user) store, can only be read via a "get" (which
may or may not succeed), and doesn't count against the memory
used by the VM (even though it likely contains -- for awhile --
data useful to the VM).

> It seems to me that an architecture built around hinting would be more
> robust than having to use separate memory pools for this type of memory
> (especially since you are requiring a copy to/from the pool).

Depends on what you mean by robust, I suppose.  Once you
understand the basics of tmem, it is very simple and this
is borne out in the low invasiveness of the Linux patch.
Simplicity is another form of robustness.

> For instance, you can mark data DMA'd from disk (perhaps by read-ahead)
> as volatile without ever bringing it into the CPU cache.  With tmem, if
> you wanted to use a tmem pool for all of the page cache, you'd likely
> suffer significant overhead due to copying.

The copy may be expensive on an older machine, but on newer
machines copying a page is relatively inexpensive.  On a reasonable
multi-VM-kernbench-like benchmark I'll be presenting at Linux
Symposium next week, the overhead is on the order of 0.01%
for a fairly significant savings in IOs.

Re: [Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

by Anthony Liguori-2

Dan Magenheimer wrote:

> Hi Anthony --
>
> Thanks for the comments.
>
>  
>> I have trouble mapping this to a VMM capable of overcommit
>> without just coming back to CMM2.
>>
>> In CMM2 parlance, ephemeral tmem pools is just normal kernel memory
>> marked in the volatile state, no?
>>    
>
> They are similar in concept, but a volatile-marked kernel page
> is still a kernel page, can be changed by a kernel (or user)
> store instruction, and counts as part of the memory used
> by the VM.  An ephemeral tmem page cannot be directly written
> by a kernel (or user) store,

Why does tmem require a special store?

A VMM can trap write operations, and pages can be stored on disk
transparently by the VMM if necessary.  I guess that's the bit I'm missing.

>> It seems to me that an architecture built around hinting would be more
>> robust than having to use separate memory pools for this type of memory
>> (especially since you are requiring a copy to/from the pool).
>>    
>
> Depends on what you mean by robust, I suppose.  Once you
> understand the basics of tmem, it is very simple and this
> is borne out in the low invasiveness of the Linux patch.
> Simplicity is another form of robustness.
>  

The main disadvantage I see is that you need to explicitly convert
portions of the kernel to use a data copying API.  That seems like an
invasive change to me.  Hinting on the other hand can be done in a
less-invasive way.

I'm not really arguing against tmem, just the need to have explicit
get/put mechanisms for the transcendent memory areas.

> The copy may be expensive on an older machine, but on newer
> machines copying a page is relatively inexpensive.

I don't think that's a true statement at all :-)  If you had a workload
where data never came into the CPU cache (zero-copy) and now you
introduce a copy, even with a new system, you're going to see a
significant performance hit.

>   On a reasonable
> multi-VM-kernbench-like benchmark I'll be presenting at Linux
> Symposium next week, the overhead is on the order of 0.01%
> for a fairly significant savings in IOs.
>  
But how would something like specweb do where you should be doing
zero-copy IO from the disk to the network?  This is the area where I
would be concerned.  For something like kernbench, you're already
bringing the disk data into the CPU cache anyway so I can appreciate
that the copy could get lost in the noise.

Regards,

Anthony Liguori

Re: [Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

by Jeremy Fitzhardinge

On 07/08/09 16:57, Anthony Liguori wrote:
> Why does tmem require a special store?
>
> A VMM can trap write operations pages can be stored on disk
> transparently by the VMM if necessary.  I guess that's the bit I'm
> missing.

tmem doesn't store anything to disk.  It's more about making sure that
free host memory can be quickly and efficiently handed out to guests
as they need it; to increase "memory liquidity" as it were.  Guests need
to explicitly ask to use tmem, rather than having the host/hypervisor
try to intuit what to do based on access patterns and hints; typically
they'll use tmem as the first line storage for memory which they were
about to swap out anyway.  There's no point in making tmem swappable,
because the guest is perfectly capable of swapping its own memory.

The copying interface avoids a lot of the delicate corners of the CMM
code, in which subtle races can lurk in fairly hard-to-test-for ways.

>> The copy may be expensive on an older machine, but on newer
>> machines copying a page is relatively inexpensive.
>
> I don't think that's a true statement at all :-)  If you had a
> workload where data never came into the CPU cache (zero-copy) and now
> you introduce a copy, even with new system, you're going to see a
> significant performance hit.

If the copy helps avoid physical disk IO, then it is cheap at the
price.  A guest generally wouldn't push a page into tmem unless it was
about to evict it anyway, so it has already determined the page is
cold/unwanted, and the copy isn't a great cost.  Hot/busy pages
shouldn't be anywhere near tmem; if they are, it suggests you've cut
your domain's memory too aggressively.

    J

Re: [Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

by Anthony Liguori-2

Jeremy Fitzhardinge wrote:

> On 07/08/09 16:57, Anthony Liguori wrote:
>  
>> Why does tmem require a special store?
>>
>> A VMM can trap write operations pages can be stored on disk
>> transparently by the VMM if necessary.  I guess that's the bit I'm
>> missing.
>>    
>
> tmem doesn't store anything to disk.  It's more about making sure that
> free host memory can be quickly and efficiently be handed out to guests
> as they need it; to increase "memory liquidity" as it were.  Guests need
> to explicitly ask to use tmem, rather than having the host/hypervisor
> try to intuit what to do based on access patterns and hints; typically
> they'll use tmem as the first line storage for memory which they were
> about to swap out anyway.

If the primary use of tmem is to avoid swapping when memory pressure
would have forced it, how is this different from using ballooning along
with a shrinker callback?

With virtio-balloon, a guest can touch any of the memory it's ballooned
to immediately reclaim that memory.  I think the main difference with
tmem is that you can also mark a page as being volatile.  The hypervisor
can then reclaim that page without swapping it (it can always reclaim
memory and swap it) and generate a special fault to the guest if it
attempts to access it.

You can fail to put with tmem, right?  You can also fail to get?  In
both cases though, these failures can be handled because Linux is able
to recreate the page on its own (by doing disk IO).  So why not just
generate a special fault instead of having to introduce special accessors?

Regards,

Anthony Liguori

Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

by Rik van Riel

Anthony Liguori wrote:

> I have trouble mapping this to a VMM capable of overcommit without just
> coming back to CMM2.

Same for me.  CMM2 has a more complex mechanism, but way
easier policy than anything else out there.

> In CMM2 parlance, ephemeral tmem pools is just normal kernel memory
> marked in the volatile state, no?

Basically.

> It seems to me that an architecture built around hinting would be more
> robust than having to use separate memory pools for this type of memory
> (especially since you are requiring a copy to/from the pool).

I agree.  Something along the lines of CMM2 needs more
infrastructure, but will be infinitely easier to get right
from the policy side.

Automatic ballooning is an option too, with fairly simple
infrastructure, but potentially insanely complex policy
issues to sort out...

--
All rights reversed.

RE: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

by Dan Magenheimer-2

> > I have trouble mapping this to a VMM capable of overcommit without just
> > coming back to CMM2.
>
> Same for me.  CMM2 has a more complex mechanism, but way
> easier policy than anything else out there.

Although tmem and CMM2 have similar conceptual objectives,
let me try to describe what I see as a fundamental
difference in approach.

The primary objective of both is to utilize RAM more
efficiently.  Both are ideally complemented with some
longer term "memory shaping" mechanism such as automatic
ballooning or hotplug.

CMM2's focus is on increasing the number of VMs that
can run on top of the hypervisor.  To do this, it
depends on hints provided by Linux to surreptitiously
steal memory away from Linux.  The stolen memory still
"belongs" to Linux and if Linux goes to use it but the
hypervisor has already given it to another Linux, the
hypervisor must jump through hoops to give it back.
If it guesses wrong and overcommits too aggressively,
the hypervisor must swap some memory to a "hypervisor
swap disk" (which btw has some policy challenges).
IMHO this is more of a "mainframe" model.

Tmem's focus is on helping Linux to aggressively manage
the amount of memory it uses (and thus reduce the amount
of memory it would get "billed" for using).  To do this, it
provides two "safety valve" services, one to reduce the
cost of "refaults" (Rik's term) and the other to reduce
the cost of swapping.  Both services are almost
always available, but if the memory of the physical
machine gets overcommitted, the most aggressive Linux
guests must fall back to using their disks (because the
hypervisor does not have a "hypervisor swap disk").  But
when physical memory is undercommitted, it is still being
used usefully without compromising "memory liquidity".
(I like this term Jeremy!) IMHO this is more of a "cloud"
model.

In other words, CMM2, despite its name, is more of a
"subservient" memory management system (Linux is
subservient to the hypervisor) and tmem is more
collaborative (Linux and the hypervisor share the
responsibilities and the benefits/costs).

I'm not saying either one is bad or good -- and I'm sure
each can be adapted to approximately deliver the value
of the other -- they are just approaching the same problem
from different perspectives.

Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

by Rik van Riel

Dan Magenheimer wrote:

> I'm not saying either one is bad or good -- and I'm sure
> each can be adapted to approximately deliver the value
> of the other -- they are just approaching the same problem
> from different perspectives.

Indeed.  Tmem and auto-ballooning have a simple mechanism,
but the policy required to make it work right could well
be too complex to ever get right.

CMM2 has a more complex mechanism, but the policy is
absolutely trivial.

CMM2 and auto-ballooning seem to give about similar
performance gains on zSystem.

I suspect that for Xen and KVM, we'll want to choose
the approach that has the simpler policy, because
relying on different versions of different operating
systems to all get the policy of auto-ballooning or
tmem right is likely to result in bad interactions
between guests and other intractable issues.

--
All rights reversed.

Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

by Anthony Liguori-2

Dan Magenheimer wrote:
> CMM2's focus is on increasing the number of VM's that
> can run on top of the hypervisor.  To do this, it
> depends on hints provided by Linux to surreptitiously
> steal memory away from Linux.  The stolen memory still
> "belongs" to Linux and if Linux goes to use it but the
> hypervisor has already given it to another Linux, the
> hypervisor must jump through hoops to give it back.
>  

It depends on how you define "jump through hoops".

> If it guesses wrong and overcommits too aggressively,
> the hypervisor must swap some memory to a "hypervisor
> swap disk" (which btw has some policy challenges).
> IMHO this is more of a "mainframe" model.
>  

No, not at all.  A guest marks a page as being "volatile", which tells
the hypervisor it never needs to swap that page.  It can discard it
whenever it likes.

If the guest later tries to access that page, it will get a special
"discard fault".  For a lot of types of memory, the discard fault
handler can then restore that page transparently to the code that
generated the discard fault.
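
To make the mechanism concrete, here is a minimal Python sketch of a
volatile page and its discard fault.  It is a toy model under stated
assumptions, not CMM2 code: ToyGuestMemory, DiscardFault and every
method name are hypothetical, and the "backing store" is just a dict
standing in for the filesystem.

```python
class DiscardFault(Exception):
    """Raised when the guest touches a page the hypervisor discarded."""

    def __init__(self, pfn):
        super().__init__(f"discard fault on pfn {pfn}")
        self.pfn = pfn


class ToyGuestMemory:
    """Hypothetical guest memory with hint-based volatile pages."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # pfn -> bytes, stands in for disk
        self.pages = {}                     # pages currently resident
        self.volatile = set()               # pages the guest marked volatile

    def load(self, pfn):
        # Fetch the page content from its source.
        self.pages[pfn] = self.backing_store[pfn]

    def mark_volatile(self, pfn):
        # Hint to the hypervisor: this page can be recreated, never swap it.
        self.volatile.add(pfn)

    def hypervisor_discard(self, pfn):
        # The hypervisor may drop a volatile page whenever it likes.
        if pfn not in self.volatile:
            raise ValueError("only volatile pages may be discarded")
        self.pages.pop(pfn, None)

    def _raw_access(self, pfn):
        if pfn not in self.pages:
            raise DiscardFault(pfn)
        return self.pages[pfn]

    def access(self, pfn):
        # The except branch plays the role of the discard fault handler:
        # it restores the page transparently to the caller.
        try:
            return self._raw_access(pfn)
        except DiscardFault as fault:
            self.load(fault.pfn)
            return self._raw_access(pfn)


mem = ToyGuestMemory({7: b"file data"})
mem.load(7)
mem.mark_volatile(7)
mem.hypervisor_discard(7)
restored = mem.access(7)  # faults once, then succeeds
```

The caller of access() never sees the fault, which is the sense in
which the restore is transparent.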

AFAICT, ephemeral tmem has the exact same characteristics as volatile
CMM2 pages.  The difference is that tmem introduces an API to explicitly
manage this memory behind a copy interface whereas CMM2 uses hinting and
a special fault handler to allow any piece of memory to be marked in
this way.

> In other words, CMM2, despite its name, is more of a
> "subservient" memory management system (Linux is
> subservient to the hypervisor) and tmem is more
> collaborative (Linux and the hypervisor share the
> responsibilities and the benefits/costs).
>  

I don't really agree with your analysis of CMM2.  We can map CMM2
operations directly to ephemeral tmem interfaces so tmem is a subset of
CMM2, no?

What's appealing to me about CMM2 is that it doesn't change the guest
semantically but rather just gives the VMM more information about how
the guest is using its memory.  This suggests that it allows greater
flexibility in the long term to the VMM and more importantly, provides
an easier implementation across a wide range of guests.

Regards,

Anthony Liguori

RE: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

by Dan Magenheimer-2

> > I'm not saying either one is bad or good -- and I'm sure
> > each can be adapted to approximately deliver the value
> > of the other -- they are just approaching the same problem
> > from different perspectives.
>
> Indeed.  Tmem and auto-ballooning have a simple mechanism,
> but the policy required to make it work right could well
> be too complex to ever get right.
>
> CMM2 has a more complex mechanism, but the policy is
> absolutely trivial.

Could you elaborate a bit more on what policy you
are referring to and what decisions the policies are
trying to guide?  And are you looking at the policies
in Linux or in the hypervisor or the sum of both?

The Linux-side policies in the tmem patch seem trivial
to me and the Xen-side implementation is certainly
working correctly, though "working right" is a hard
objective to measure.  But depending on how you define
"working right", the pageframe replacement algorithm
in Linux may also be "too complex to ever get right"
but it's been working well enough for a long time.

> CMM2 and auto-ballooning seem to give about similar
> performance gains on zSystem.

Tmem provides a huge advantage over my self-ballooning
implementation, but maybe that's because my implementation is
more aggressive than the CMM auto-ballooning, resulting
in more refaults that must be "fixed".

> I suspect that for Xen and KVM, we'll want to choose
> the approach that has the simpler policy, because
> relying on different versions of different operating
> systems to all get the policy of auto-ballooning or
> tmem right is likely to result in bad interactions
> between guests and other intractable issues.

Again, not sure what tmem policy in Linux you are referring
to or what bad interactions you foresee.  Could you
clarify?

Auto-ballooning policy is certainly a challenge, but
that's true whether CMM or tmem, right?

Thanks,
Dan

RE: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

by Dan Magenheimer-2

> > If it guesses wrong and overcommits too aggressively,
> > the hypervisor must swap some memory to a "hypervisor
> > swap disk" (which btw has some policy challenges).
> > IMHO this is more of a "mainframe" model.
>
> No, not at all.  A guest marks a page as being "volatile",
> which tells
> the hypervisor it never needs to swap that page.  It can discard it
> whenever it likes.
>
> If the guest later tries to access that page, it will get a special
> "discard fault".  For a lot of types of memory, the discard fault
> handler can then restore that page transparently to the code that
> generated the discard fault.

But this means that either the content of that page must have been
preserved somewhere or the discard fault handler has sufficient
information to go back and get the content from the source (e.g.
the filesystem).  Or am I misunderstanding?

With tmem, the equivalent of the "failure to access a discarded page"
is inline and synchronous, so if the tmem access "fails", the
normal code immediately executes.
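
A rough Python sketch of that inline path, assuming a dict as a
stand-in for an ephemeral tmem pool and a caller-supplied
read_from_disk function (both hypothetical, neither is from the patch):

```python
def read_page(key, tmem_cache, read_from_disk):
    """Return (page, source); a tmem miss falls straight through to disk."""
    page = tmem_cache.get(key)      # stand-in for an ephemeral tmem get
    if page is not None:
        return page, "tmem"
    # The get "failed": no fault, no handler, just the normal read path.
    return read_from_disk(key), "disk"


cache = {"inode1:0": b"cached page"}
hit = read_page("inode1:0", cache, lambda key: b"page from disk")
miss = read_page("inode1:1", cache, lambda key: b"page from disk")
```

The miss is handled in the same call that detected it, so there is no
separate fault handler to write.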

> AFAICT, ephemeral tmem has the exact same characteristics as volatile
> CMM2 pages.  The difference is that tmem introduces an API to
> explicitly
> manage this memory behind a copy interface whereas CMM2 uses
> hinting and
> a special fault handler to allow any piece of memory to be marked in
> this way.
> :
> I don't really agree with your analysis of CMM2.  We can map CMM2
> operations directly to ephemeral tmem interfaces so tmem is a
> subset of CMM2, no?

Not really.  I suppose one *could* use tmem that way, immediately
writing every page read from disk into tmem, though that would
probably cause some real coherency challenges.  But the patch as
proposed only puts ready-to-be-replaced pages (as determined by
Linux's PFRA) into ephemeral tmem.

The two services provided to Linux (in the proposed patch) by
tmem are:

1) "I have a page of memory that I'm about to throw away because
    I'm not sure I need it any more and I have a better use for
    that pageframe right now.  Mr Tmem might you have someplace
    you can squirrel it away for me in case I need it again?
    Oh, and by the way, if you can't or you lose it, no big deal
    as I can go get it from disk if I need to."
2) "I'm out of memory and have to put this page somewhere.  Mr
    Tmem, can you take it?  But if you do take it, you have to
    promise to give it back when I ask for it!  If you can't
    promise, never mind, I'll find something else to do with it."
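
As a toy illustration of these two services (not the patch's API;
ToyTmem, its method names and its eviction rule are all assumptions
made for the sketch):

```python
class ToyTmem:
    """Hypothetical backend offering ephemeral and persistent puts."""

    def __init__(self, capacity):
        self.capacity = capacity  # total pages the backend is willing to hold
        self.ephemeral = {}       # service 1: may vanish without warning
        self.persistent = {}      # service 2: must be returned on request

    def _make_room(self):
        # Assumption: only ephemeral pages may be evicted to free space.
        if len(self.ephemeral) + len(self.persistent) < self.capacity:
            return True
        if not self.ephemeral:
            return False
        oldest = next(iter(self.ephemeral))
        del self.ephemeral[oldest]
        return True

    def put_ephemeral(self, key, page):
        # "If you can't or you lose it, no big deal."
        if not self._make_room():
            return False
        self.ephemeral[key] = bytes(page)
        return True

    def get_ephemeral(self, key):
        # None means the page is gone; the caller rereads it from disk.
        return self.ephemeral.get(key)

    def put_persistent(self, key, page):
        # "If you can't promise, never mind": refuse up front.
        if not self._make_room():
            return False
        self.persistent[key] = bytes(page)
        return True

    def get_persistent(self, key):
        # An accepted persistent page is always still here.
        return self.persistent.pop(key)


tmem = ToyTmem(capacity=2)
tmem.put_ephemeral("file:0", b"clean page")
tmem.put_persistent("swap:1", b"dirty page 1")
tmem.put_persistent("swap:2", b"dirty page 2")        # evicts "file:0"
refused = not tmem.put_persistent("swap:3", b"dirty page 3")
```

In this model the only difference between the two services is who is
allowed to break the promise: the backend may drop an ephemeral page
at any time, but it must refuse a persistent put it cannot honor.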

> > In other words, CMM2, despite its name, is more of a
> > "subservient" memory management system (Linux is
> > subservient to the hypervisor) and tmem is more
> > collaborative (Linux and the hypervisor share the
> > responsibilities and the benefits/costs).
>
> What's appealing to me about CMM2 is that it doesn't change the guest
> semantically but rather just gives the VMM more information about how
> the guest is using its memory.  This suggests that it allows greater
> flexibility in the long term to the VMM and more importantly,
> provides an easier implementation across a wide range of guests.

I suppose changing Linux to utilize the two tmem services
as described above is a semantic change.  But to me it
seems no more of a semantic change than requiring a new
special page fault handler because a page of memory might
disappear behind the OS's back.

But IMHO this is a corollary of the fundamental difference.  CMM2's
is more the "VMware" approach which is that OS's should never have
to be modified to run in a virtual environment.  (Oh, but maybe
modified just slightly to make the hypervisor a little less
clueless about the OS's resource utilization.)  Tmem asks: If an
OS is going to often run in a virtualized environment, what
can be done to share the responsibility for resource management
so that the OS does what it can with the knowledge that it has
and the hypervisor can most flexibly manage resources across
all the guests?  I do agree that adding an additional API
binds the user and provider of the API less flexibly than without
the API, but as long as the API is optional (as it is for both
tmem and CMM2), I don't see why CMM2 provides more flexibility.

Thanks,
Dan

Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

by Rik van Riel

Dan Magenheimer wrote:

> But this means that either the content of that page must have been
> preserved somewhere or the discard fault handler has sufficient
> information to go back and get the content from the source (e.g.
> the filesystem).  Or am I misunderstanding?

The latter.  Only pages which can be fetched from
source again are marked as volatile.

> But IMHO this is a corollary of the fundamental difference.  CMM2's
> is more the "VMware" approach which is that OS's should never have
> to be modified to run in a virtual environment.

Actually, the CMM2 mechanism is quite invasive in
the guest operating system's kernel.

> I don't see why CMM2 provides more flexibility.

I don't think anyone is arguing that.  One thing
that people have argued is that CMM2 can be more
efficient, and easier to get the policy right in
the face of multiple guest operating systems.

--
All rights reversed.

Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

by Anthony Liguori-2

Dan Magenheimer wrote:
> But this means that either the content of that page must have been
> preserved somewhere or the discard fault handler has sufficient
> information to go back and get the content from the source (e.g.
> the filesystem).  Or am I misunderstanding?
>  

As Rik said, it's the latter.

> With tmem, the equivalent of the "failure to access a discarded page"
> is inline and synchronous, so if the tmem access "fails", the
> normal code immediately executes.
>  

Yup.  This is the main difference AFAICT.  It's really just API
semantics within Linux.

You could clearly use the volatile state of CMM2 to implement tmem as an
API in Linux.  The get/put functions would set a flag so that, if the
discard handler was invoked while that operation was in progress, the
operation could safely fail.  That's why I claimed tmem is a subset of CMM2.
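
A minimal sketch of that idea in Python, assuming a pool of guest
memory whose slots are marked volatile.  VolatilePool, the in_tmem_op
flag and the handler name are hypothetical, chosen only to show the
control flow:

```python
class DiscardFault(Exception):
    """Raised when a discarded volatile slot is touched."""


class VolatilePool:
    """Hypothetical tmem-style get/put layered on volatile guest memory."""

    def __init__(self):
        self.slots = {}          # key -> bytes, all marked volatile
        self.in_tmem_op = False  # the flag set by get/put

    def hypervisor_discard(self, key):
        # The hypervisor may drop any volatile slot at any time.
        self.slots.pop(key, None)

    def _touch(self, key):
        if key not in self.slots:
            raise DiscardFault(key)
        return self.slots[key]

    def _on_discard_fault(self, key):
        # With the flag set, the fault simply fails the operation.
        if self.in_tmem_op:
            return None
        raise RuntimeError("discard fault outside a tmem operation")

    def put(self, key, page):
        self.slots[key] = bytes(page)

    def get(self, key):
        self.in_tmem_op = True
        try:
            return self._touch(key)
        except DiscardFault:
            return self._on_discard_fault(key)
        finally:
            self.in_tmem_op = False


pool = VolatilePool()
pool.put("a", b"x")
first = pool.get("a")
pool.hypervisor_discard("a")
second = pool.get("a")  # the get fails cleanly instead of faulting
```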

> I suppose changing Linux to utilize the two tmem services
> as described above is a semantic change.  But to me it
> seems no more of a semantic change than requiring a new
> special page fault handler because a page of memory might
> disappear behind the OS's back.
>
> But IMHO this is a corollary of the fundamental difference.  CMM2's
> is more the "VMware" approach which is that OS's should never have
> to be modified to run in a virtual environment.  (Oh, but maybe
> modified just slightly to make the hypervisor a little less
> clueless about the OS's resource utilization.)

While I always enjoy a good holy war, I'd like to avoid one here because
I want to stay on the topic at hand.

If there was one change to tmem that would make it more palatable, for
me it would be changing the way pools are "allocated".  Instead of
getting an opaque handle from the hypervisor, I would force the guest to
allocate its own memory and to tell the hypervisor that it's a tmem
pool.  You could then introduce semantics about whether the guest was
allowed to directly manipulate the memory as long as it was in the
pool.  It would be required to access the memory via get/put functions
that under Xen, would end up being a hypercall and a copy.  Presumably
you would do some tricks with ballooning to allocate empty memory in Xen
and then use those addresses as tmem pools.  On KVM, we could do
something more clever.

The big advantage of keeping the tmem pool part of the normal set of
guest memory is that you don't introduce new challenges with respect to
memory accounting.  Whether or not tmem is directly accessible from the
guest, it is another memory resource.  I'm certain that you'll want to
do accounting of how much tmem is being consumed by each guest, and I
strongly suspect that you'll want to do tmem accounting on a per-process
basis.  I also suspect that doing tmem limiting for things like cgroups
would be desirable.

That all points to making tmem normal memory so that all that
infrastructure can be reused.  I'm not sure how well this maps to Xen
guests, but it works out fine when the VMM is capable of presenting
memory to the guest without actually allocating it (via overcommit).

Regards,

Anthony Liguori

RE: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

by Dan Magenheimer-2

> > But IMHO this is a corollary of the fundamental difference.  CMM2's
> > is more the "VMware" approach which is that OS's should never have
> > to be modified to run in a virtual environment.  (Oh, but maybe
> > modified just slightly to make the hypervisor a little less
> > clueless about the OS's resource utilization.)
>
> While I always enjoy a good holy war, I'd like to avoid one
> here because
> I want to stay on the topic at hand.

Oops, sorry, I guess that was a bit inflammatory.  What I meant to
say is that inferring resource utilization efficiency is a very
hard problem and VMware (and I'm sure IBM too) has done a fine job
with it; CMM2 explicitly provides some very useful information from
within the OS to the hypervisor so that it doesn't have to infer
that information; but tmem is trying to go a step further by making
the cooperation between the OS and hypervisor more explicit
and directly beneficial to the OS.

> If there was one change to tmem that would make it more
> palatable, for
> me it would be changing the way pools are "allocated".  Instead of
> getting an opaque handle from the hypervisor, I would force
> the guest to
> allocate its own memory and to tell the hypervisor that it's a tmem
> pool.

An interesting idea but one of the nice advantages of tmem being
completely external to the OS is that the tmem pool may be much
larger than the total memory available to the OS.  As an extreme
example, assume you have one 1GB guest on a physical machine that
has 64GB physical RAM.  The guest now has 1GB of directly-addressable
memory and 63GB of indirectly-addressable memory through tmem.
That 63GB requires no page structs or other data structures in the
guest.  And in the current (external) implementation, the size
of each pool is constantly changing, sometimes dramatically so
the guest would have to be prepared to handle this.  I also wonder
if this would make shared-tmem-pools more difficult.
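
For a sense of scale, here is the arithmetic for what page structs
would cost if that 63GB were ordinary guest memory instead.  The 4 KiB
page size and the 64 bytes per page struct are assumptions for the
sketch (the real size depends on architecture and kernel
configuration):

```python
GIB = 1 << 30
MIB = 1 << 20
PAGE_SIZE = 4096          # assumption: 4 KiB pages
STRUCT_PAGE_BYTES = 64    # assumption: size of one page struct


def page_struct_overhead(mem_bytes, page_size=PAGE_SIZE,
                         struct_bytes=STRUCT_PAGE_BYTES):
    """Return (number of pages, bytes of page structs needed to track them)."""
    pages = mem_bytes // page_size
    return pages, pages * struct_bytes


pages, overhead = page_struct_overhead(63 * GIB)
print(f"{pages} pages, {overhead // MIB} MiB of page structs")
# prints "16515072 pages, 1008 MiB of page structs"
```

Under those assumptions the bookkeeping alone would consume roughly
the guest's entire 1GB of directly-addressable memory.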

I can see how it might be useful for KVM though.  Once the
core API and all the hooks are in place, a KVM implementation of
tmem could attempt something like this.

> The big advantage of keeping the tmem pool part of the normal set of
> guest memory is that you don't introduce new challenges with
> respect to memory accounting.  Whether or not tmem is directly
> accessible from the guest, it is another memory resource.  I'm
> certain that you'll want to do accounting of how much tmem is being
> consumed by each guest

Yes, the Xen implementation of tmem does accounting on a per-pool
and a per-guest basis and exposes the data via a privileged
"tmem control" hypercall.

> and I strongly suspect that you'll want to do tmem accounting on a
> per-process
> basis.  I also suspect that doing tmem limiting for things
> like cgroups would be desirable.

This can be done now if each process or cgroup creates a different
tmem pool.  The proposed patch doesn't do this, but it certainly
seems possible.
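
A toy sketch of what one-pool-per-cgroup accounting could look like.
ToyPoolAccounting and its methods are hypothetical and do not reflect
the Xen "tmem control" interface; each pool is simply tagged with a
(guest, group) owner:

```python
from collections import defaultdict


class ToyPoolAccounting:
    """Hypothetical per-pool page counts, rolled up per guest or per group."""

    def __init__(self):
        self.owner = {}                # pool_id -> (guest, group)
        self.pages = defaultdict(int)  # pool_id -> pages currently held

    def new_pool(self, guest, group):
        pool_id = len(self.owner)
        self.owner[pool_id] = (guest, group)
        return pool_id

    def put(self, pool_id):
        self.pages[pool_id] += 1

    def flush(self, pool_id):
        if self.pages[pool_id] > 0:
            self.pages[pool_id] -= 1

    def per_guest(self):
        totals = defaultdict(int)
        for pool_id, (guest, _group) in self.owner.items():
            totals[guest] += self.pages[pool_id]
        return dict(totals)

    def per_group(self, guest):
        totals = defaultdict(int)
        for pool_id, (owner_guest, group) in self.owner.items():
            if owner_guest == guest:
                totals[group] += self.pages[pool_id]
        return dict(totals)


acct = ToyPoolAccounting()
web = acct.new_pool("guest1", "web")
db = acct.new_pool("guest1", "db")
for _ in range(3):
    acct.put(web)
acct.put(db)
acct.flush(web)
```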

Dan

Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

by Avi Kivity-4

On 07/10/2009 06:23 PM, Dan Magenheimer wrote:

>> If there was one change to tmem that would make it more
>> palatable, for
>> me it would be changing the way pools are "allocated".  Instead of
>> getting an opaque handle from the hypervisor, I would force
>> the guest to
>> allocate its own memory and to tell the hypervisor that it's a tmem
>> pool.
>>      
>
> An interesting idea but one of the nice advantages of tmem being
> completely external to the OS is that the tmem pool may be much
> larger than the total memory available to the OS.  As an extreme
> example, assume you have one 1GB guest on a physical machine that
> has 64GB physical RAM.  The guest now has 1GB of directly-addressable
> memory and 63GB of indirectly-addressable memory through tmem.
> That 63GB requires no page structs or other data structures in the
> guest.  And in the current (external) implementation, the size
> of each pool is constantly changing, sometimes dramatically so
> the guest would have to be prepared to handle this.  I also wonder
> if this would make shared-tmem-pools more difficult.
>    

Having no struct pages is also a downside; for example this guest cannot
have more than 1GB of anonymous memory without swapping like mad.  
Swapping to tmem is fast but still a lot slower than having the memory
available.

tmem makes life a lot easier for the hypervisor and for the guest, but
also gives up a lot of flexibility.  There's a difference between memory
and a very fast synchronous backing store.

> I can see how it might be useful for KVM though.  Once the
> core API and all the hooks are in place, a KVM implementation of
> tmem could attempt something like this.
>    

My worry is that tmem for kvm leaves a lot of niftiness on the table,
since it was designed for a hypervisor with much simpler memory
management.  kvm can already use spare memory for backing guest swap,
and can already convert unused guest memory to free memory (by swapping
it).  tmem doesn't really integrate well with these capabilities.


--
error compiling committee.c: too many arguments to function


Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

by Anthony Liguori-2

Dan Magenheimer wrote:
> Oops, sorry, I guess that was a bit inflammatory.  What I meant to
> say is that inferring resource utilization efficiency is a very
> hard problem and VMware (and I'm sure IBM too) has done a fine job
> with it; CMM2 explicitly provides some very useful information from
> within the OS to the hypervisor so that it doesn't have to infer
> that information; but tmem is trying to go a step further by making
> the cooperation between the OS and hypervisor more explicit
> and directly beneficial to the OS.
>  

KVM definitely falls into the camp of trying to minimize modification to
the guest.

>> If there was one change to tmem that would make it more
>> palatable, for
>> me it would be changing the way pools are "allocated".  Instead of
>> getting an opaque handle from the hypervisor, I would force
>> the guest to
>> allocate its own memory and to tell the hypervisor that it's a tmem
>> pool.
>>    
>
> An interesting idea but one of the nice advantages of tmem being
> completely external to the OS is that the tmem pool may be much
> larger than the total memory available to the OS.  As an extreme
> example, assume you have one 1GB guest on a physical machine that
> has 64GB physical RAM.  The guest now has 1GB of directly-addressable
> memory and 63GB of indirectly-addressable memory through tmem.
> That 63GB requires no page structs or other data structures in the
> guest.  And in the current (external) implementation, the size
> of each pool is constantly changing, sometimes dramatically so
> the guest would have to be prepared to handle this.  I also wonder
> if this would make shared-tmem-pools more difficult.
>
> I can see how it might be useful for KVM though.  Once the
> core API and all the hooks are in place, a KVM implementation of
> tmem could attempt something like this.
>  

It's the core API that is really the issue.  The semantics of tmem
(external memory pool with copy interface) is really what is problematic.

The basic concept, notifying the VMM about memory that can be recreated
by the guest to avoid the VMM having to swap before reclaim, is great
and I'd love to see Linux support it in some way.

>> The big advantage of keeping the tmem pool part of the normal set of
>> guest memory is that you don't introduce new challenges with
>> respect to memory accounting.  Whether or not tmem is directly
>> accessible from the guest, it is another memory resource.  I'm
>> certain that you'll want to do accounting of how much tmem is being
>> consumed by each guest
>>    
>
> Yes, the Xen implementation of tmem does accounting on a per-pool
> and a per-guest basis and exposes the data via a privileged
> "tmem control" hypercall.
>  

I was talking about accounting within the guest.  It's not just a matter
of accounting within the mm, it's also about accounting in userspace.  A
lot of software out there depends on getting detailed statistics from
Linux about how much memory is in use in order to determine things like
memory pressure.  If you introduce a new class of memory, you need a new
class of statistics to expose to userspace and all those tools need
updating.

Regards,

Anthony Liguori

RE: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux

by Dan Magenheimer-2

> > that information; but tmem is trying to go a step further by making
> > the cooperation between the OS and hypervisor more explicit
> > and directly beneficial to the OS.
>
> KVM definitely falls into the camp of trying to minimize
> modification to the guest.

No argument there.  Well, maybe one :-) Yes, but KVM
also heavily encourages unmodified guests.  Tmem is
philosophically in favor of finding a balance between
things that work well with no changes to any OS (and
thus work just fine regardless of whether the OS is
running in a virtual environment or not), and things
that could work better if the OS is knowledgeable that
it is running in a virtual environment.

For those that believe virtualization is a flash-in-
the-pan, no modifications to the OS is the right answer.
For those that believe it will be pervasive in the
future, finding the right balance is a critical step
in operating system evolution.

(Sorry for the Sunday morning evangelizing :-)

> >> If there was one change to tmem that would make it more
> >> palatable, for
> >> me it would be changing the way pools are "allocated".  Instead of
> >> getting an opaque handle from the hypervisor, I would force
> >> the guest to
> >> allocate its own memory and to tell the hypervisor that it's a tmem
> >> pool.
> >
> > I can see how it might be useful for KVM though.  Once the
> > core API and all the hooks are in place, a KVM implementation of
> > tmem could attempt something like this.
>
> It's the core API that is really the issue.  The semantics of tmem
> (external memory pool with copy interface) is really what is
> problematic.
> The basic concept, notifying the VMM about memory that can be
> recreated
> by the guest to avoid the VMM having to swap before reclaim, is great
> and I'd love to see Linux support it in some way.

Is it the tmem API or the precache/preswap API layered on
top of it that is problematic?  Both currently assume copying
but perhaps the precache/preswap API could, with minor
modifications, meet KVM's needs better?

> > Yes, the Xen implementation of tmem does accounting on a per-pool
> > and a per-guest basis and exposes the data via a privileged
> > "tmem control" hypercall.
>
> I was talking about accounting within the guest.  It's not
> just a matter
> of accounting within the mm, it's also about accounting in
> userspace.  A
> lot of software out there depends on getting detailed statistics from
> Linux about how much memory is in use in order to determine
> things like
> memory pressure.  If you introduce a new class of memory, you
> need a new
> class of statistics to expose to userspace and all those tools need
> updating.

OK, I see.

Well, first, tmem's very name means memory that is "beyond the
range of normal perception".  This is certainly not the first class
of memory in use in data centers that can't be accounted at
process granularity.  I'm thinking disk array caches as the
primary example.  Also lots of tools that work great in a
non-virtualized OS are worthless or misleading in a virtual
environment.

Second, CPUs are getting much more complicated with massive
pipelines, many layers of caches each with different characteristics,
etc, and it's getting increasingly impossible to accurately and
reproducibly measure performance at a very fine granularity.
One could only expect that other resources, such as memory,
would move in that direction.
