
Re: PNFS Lustre layout justification discussion pre-meeting

by faibish_sorin



Sent from my iPad

On Mar 28, 2012, at 4:13 PM, "Myklebust, Trond" <Trond.Myklebust@...> wrote:

> On Wed, 2012-03-28 at 01:51 -0700, Mike Eisler wrote:
>> As David Black said in the WG meeting, this is the crux of the issues
>> around this proposal.
>>
>> It seems to me that a pNFS layout type for a clustered file system is
>> going to "reach" into the file system's internals, using internal
>> interfaces that might not be stable today. For example, when the Lustre
>> file system software is revised, is it required that each node in the
>> cluster be upgraded all at once, or is it a rolling upgrade? If the
>> former, that implies unstable internal interfaces. Supporting a Lustre
>> layout type would require the Lustre community to either commit to
>> stabilizing internal interfaces, or commit to supporting a data protocol
>> specifically for pNFS that is guaranteed to be forward compatible with
>> future revisions of the Lustre file system.
>>
>> One thing that has not been clear to me is why the files layout
>> (LAYOUT4_NFSV4_1_FILES) can't be used for Lustre.
>
> Or the objects layout, which should be a closer fit to the Lustre
> OSD-based model...
If you read my presentation, you will see that the Lustre layout model is further from the OSD (objects) model than from the files model: it has more in common with the files layout than with the objects layout.

/Sorin


>
>> On Tue, March 27, 2012 2:08 pm, Rick Macklem wrote:
>>> Spencer Shepler wrote:
>>>> Thanks for raising the issue, Brent. I am not sure that we defined
>>>> such a requirement when allowing for separately defined NFSv4.1 layout
>>>> types.
>>>> There is a requirement that the layout type be captured in a
>>>> standards-track RFC that is approved/reviewed by the working group.
>>>> However, I believe there is a lack of definition about how "deep" that
>>>> definition requirement is to be followed.
>>>>
>>> From my point of view, it would be nice if the RFC went deep enough
>>> that a client implementation could be done without needing to look at
>>> Lustre sources. (Reusing sources would be nice, but in the FreeBSD
>>> world, anything GPL'd is a "no go".)
>>>
>>> I would also hope that the client->DS protocol would not be a "moving
>>> target", as can easily happen in an "our sources only" situation.
>>>
>>> At some point, having a pNFS client able to use Lustre servers would
>>> seem to me to be a useful goal, imho.
>>>
>>> rick
>>>
>>>> So, good question. As said, we should discuss this and determine what
>>>> would work best for the community.
>>>>
>>>> Spencer
>>>>
>>>>> -----Original Message-----
>>>>> From: nfsv4-bounces@... [mailto:nfsv4-bounces@...] On Behalf Of
>>>>> Welch, Brent
>>>>> Sent: Tuesday, March 27, 2012 8:50 AM
>>>>> To: faibish_sorin@...; sshepler@...
>>>>> Cc: nfsv4@...; sfaibish@...
>>>>> Subject: Re: [nfsv4] PNFS Lustre layout justification discussion
>>>>> pre-meeting
>>>>>
>>>>> Can you guys put a "Standards" agenda item into the Lustre layout
>>>>> discussion?
>>>>>
>>>>> I think it's fine for the Lustre community to define a pNFS layout
>>>>> backend, but I'm not sure it fits into an IETF RFC because no other
>>>>> part of the Lustre protocols is defined anywhere, except in the code.
>>>>> In contrast, with the NFSv4, SCSI/block, and iSCSI/OSDv2 backends,
>>>>> those interfaces are covered by a standard and there is an implied
>>>>> commitment to adhere to them. As an example, Panasas has had to do real
>>>>> work to support the OSDv2 standard, which was codified several years
>>>>> after we started shipping an iSCSI/OSD component in our system. We put
>>>>> in the effort to get OSD standardized, not to mention pNFS.
>>>>>
>>>>> So, to put it precisely, what is the normative reference to the
>>>>> Lustre back end? Version 1.8.1? Version 2.1?
>>>>>
>>>>> --
>>>>> Brent
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: nfsv4-bounces@... [mailto:nfsv4-bounces@...] On Behalf Of
>>>>> faibish_sorin@...
>>>>> Sent: Tuesday, March 27, 2012 5:52 AM
>>>>> To: faibish_sorin@...; sshepler@...
>>>>> Cc: nfsv4@...; sfaibish@...
>>>>> Subject: Re: [nfsv4] PNFS Lustre layout justification discussion
>>>>> pre-meeting
>>>>>
>>>>> Slides for tomorrow's discussion are attached.
>>>>>
>>>>> /Sorin
>>>>>
>>>>> -----Original Message-----
>>>>> From: nfsv4-bounces@... [mailto:nfsv4-bounces@...] On Behalf Of
>>>>> faibish_sorin@...
>>>>> Sent: Monday, March 26, 2012 1:59 PM
>>>>> To: sshepler@...
>>>>> Cc: nfsv4@...
>>>>> Subject: [nfsv4] PNFS Lustre layout justification discussion
>>>>> pre-meeting
>>>>>
>>>>> In preparation for the discussion on a new layout for Lustre, I want
>>>>> to bring some pertinent discussion points to the nfsv4 list.
>>>>>
>>>>> (i) The Lustre layout is similar to, yet different from, the pNFS
>>>>> object layout in RFC 5664:
>>>>>
>>>>> From RFC 5664, I think the most important differences are in how the
>>>>> object layout maps the file to pNFS objects and how it makes use of
>>>>> standard SCSI OSD/OSD-2 commands. As is known from RFC 5664, the
>>>>> workflow of the pNFS object layout is as follows:
>>>>> - send LAYOUTGET to the MDS to retrieve the pnfs_osd_layout4 structure.
>>>>> - map I/O pages to different objects according to pnfs_osd_data_map4.
>>>>> - for each related object, send the proper OSD/OSD-2 commands to
>>>>> accomplish the I/O requests.
>>>>>
>>>>> In summary, if we compare the two layouts, Lustre and the pNFS object
>>>>> layout, we can see similarities and differences that would justify a
>>>>> new layout.
>>>>>
>>>>> Similarities:
>>>>> 1. Both Lustre and the pNFS object layout use layout information to map
>>>>> a large file onto several object files residing on OSTs/OSDs.
>>>>> 2. By design, they both support several RAID algorithms for data
>>>>> redundancy.
>>>>>
>>>>> Differences:
>>>>> 1. They use different data path protocols. Lustre uses PtlRPC and the
>>>>> Lustre protocol to send/receive data, while the object layout is tied
>>>>> to OSD/OSD-2 commands.
>>>>> 2. Lustre file extent locks are decoupled and managed by the OSTs,
>>>>> while pNFS uses layouts to manage read/write permissions, and they are
>>>>> managed solely by the MDS.
>>>>>
>>>>> (ii) The Lustre layout is similar to, yet different from, the pNFS file
>>>>> layout in RFC 5661:
>>>>>
>>>>> As is known from RFC 5661, the workflow of the pNFS file layout is as
>>>>> follows:
>>>>>
>>>>> - LAYOUTGET to retrieve layout information from the server, and possibly
>>>>> GETDEVICEINFO to retrieve the ds_list for the file
>>>>> - when a page needs to be read or written, calculate the corresponding
>>>>> data server from the striping and ds_list information
>>>>> - use the corresponding file handle to read/write to the data servers
>>>>> via NFSv4.1
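The second step above (picking the data server for a given byte offset from the striping and ds_list information) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name select_ds and the flattened parameter names are hypothetical, not RFC 5661 identifiers, and the arithmetic follows one reading of the files-layout striping rules (stripe unit number plus first stripe index, modulo the stripe count). Dense versus sparse packing and the choice among multipath addresses are left out.

```python
from typing import List, Sequence


def select_ds(offset: int,
              stripe_unit: int,
              first_stripe_index: int,
              stripe_indices: Sequence[int],
              ds_list: Sequence[List[str]]) -> List[str]:
    """Return the equivalent addresses of the data server holding `offset`.

    All parameter names are illustrative assumptions:
      stripe_unit        - size of one stripe unit in bytes
      first_stripe_index - rotation applied to the first stripe unit
      stripe_indices     - per-stripe indices into ds_list
      ds_list            - one list of equivalent addresses per data server
    """
    if offset < 0 or stripe_unit <= 0 or not stripe_indices:
        raise ValueError("invalid offset or striping parameters")
    stripe_count = len(stripe_indices)
    unit_number = offset // stripe_unit            # which stripe unit
    slot = (unit_number + first_stripe_index) % stripe_count
    return ds_list[stripe_indices[slot]]           # each unit maps to one DS entry


# Hypothetical three-server layout with 64 KiB stripe units.
KIB = 1024
example_ds_list = [["ds-a"], ["ds-b"], ["ds-c"]]
example_indices = [0, 1, 2]
```

For instance, select_ds(64 * KIB, 64 * KIB, 0, example_indices, example_ds_list) picks the second entry. Note that every stripe unit resolves to exactly one ds_list entry, which is the property the comparison in this message relies on.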
>>>>>
>>>>> According to RFC 5661, the pNFS file layout supports byte-range locks.
>>>>> It's just that POSIX semantics are not guaranteed because of NFS
>>>>> close-to-open semantics. This is left as an implementation choice,
>>>>> because protocol-wise RFC 5661 allows for POSIX-like shared access.
>>>>>
>>>>> Similarities:
>>>>> 1. Both Lustre and the pNFS file layout maintain file layout
>>>>> information on the MDS and use it to map file data to DSs (OSTs for
>>>>> Lustre).
>>>>>
>>>>> Differences:
>>>>> 1. The Lustre layout can support OST-level data redundancy such as
>>>>> RAID, but the pNFS file layout can't because, per RFC 5661, one unit
>>>>> can be mapped to only one DS list.
>>>>> 2. The data path protocol is different, as is the control protocol
>>>>> between the MDS and OSS/OST.
>>>>> 3. Implementation-wise, the Lustre layout supports POSIX semantics
>>>>> while Linux pNFS only supports close-to-open semantics. (This applies
>>>>> to the object layout comparison as well.)
>>>>>
>>>>>
>>>>> In summary, the pertinent points for the introduction of a new Lustre
>>>>> pNFS layout:
>>>>>
>>>>> 1. Similar architectures are used (same as in the comparison with the
>>>>> object layout).
>>>>> 2. Lustre decouples extent I/O permissions and manages them on the DS
>>>>> (OST), while pNFS controls them in the MDS (same as in the comparison
>>>>> with the object layout). If only close-to-open semantics are required
>>>>> (POSIX may break), the pNFS file and object layouts can both allow
>>>>> shared I/O without regard to data loss (e.g., write on one client and
>>>>> read on another).
>>>>> 3. The pNFS file layout supports MDS/DS multipathing via NFSv4.1
>>>>> trunking. Similarly, Lustre supports failover pairs of MDS/OSS.
>>>>> 4. Per nfsv4_1_file_layout_ds_addr4, each unit is mapped to one DS (if
>>>>> there is no failure). Therefore the pNFS file layout cannot support any
>>>>> DS-level data redundancy (such as RAID1, RAID5, etc.), although RAID0
>>>>> can be supported and a DS can have built-in RAID. On the other hand,
>>>>> both Lustre and the object layout can support different RAID algorithms
>>>>> at the DS level, but the client is involved in the RAID in the case of
>>>>> objects.
>>>>>
>>>>> This is my opinion, shared with Peng Tao, after a lot of reading on the
>>>>> Lustre layout and locking semantics. To help with your understanding, I
>>>>> include an example of a Lustre layout.
>>>>>
>>>>> In Lustre, each file is composed of multiple data objects striped over
>>>>> one or more OSTs (data containers, as distinct from OSSs, which are
>>>>> object servers). A file's layout information is stored in an extended
>>>>> attribute (EA) of the inode that describes the mapping between the
>>>>> file's object ids and their corresponding OSTs. This information is
>>>>> also known as the striping EA.
>>>>>
>>>>> For example, if file A has a stripe count of three, then its EA might
>>>>> look like:
>>>>>
>>>>> EA ---> <obj id x, ost p>
>>>>>         <obj id y, ost q>
>>>>>         <obj id z, ost r>
>>>>>         stripe size and stripe width
>>>>>
>>>>> So if the stripe size is 1MB, then this would mean that [0,1M),
>>>>> [3M,4M) … are stored as object x, which is on OST p; [1M,2M), [4M,5M) …
>>>>> are stored as object y, which is on OST q; [2M,3M), [5M,6M) … are
>>>>> stored as object z, which is on OST r.
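As a rough illustration of the striping EA example, the mapping from a file offset to an object and OST can be sketched in Python. This assumes plain round-robin (RAID0) striping over the three objects, so each object recurs every stripe count × stripe size = 3MB of file data. The names Stripe and locate are hypothetical and are not taken from the Lustre sources.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Stripe:
    """One <obj id, ost> entry of the striping EA (illustrative type)."""
    obj_id: str
    ost: str


def locate(offset: int, stripe_size: int,
           stripes: Sequence[Stripe]) -> Tuple[Stripe, int]:
    """Map a file byte offset to (stripe entry, byte offset inside that object).

    Assumes round-robin (RAID0) placement of stripe-size chunks.
    """
    if offset < 0 or stripe_size <= 0 or not stripes:
        raise ValueError("invalid offset or striping parameters")
    chunk = offset // stripe_size                 # index of the 1MB chunk
    stripe = stripes[chunk % len(stripes)]        # round-robin over objects
    # Complete rounds before this chunk, plus the position inside the chunk.
    obj_offset = (chunk // len(stripes)) * stripe_size + offset % stripe_size
    return stripe, obj_offset


MB = 1 << 20
# The EA from the example: stripe count of three, stripe size of 1MB.
file_a = [Stripe("x", "p"), Stripe("y", "q"), Stripe("z", "r")]
```

Under these assumptions, locate(0, MB, file_a) and locate(3 * MB, MB, file_a) both land on object x on OST p, at object offsets 0 and 1MB respectively.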
>>>>>
>>>>> This is all I have for the discussion on Wednesday. Thank you for your
>>>>> patience and understanding.
>>>>>
>>>>>
>>>>>
>>>>> Sent from my iPad
>>>>>
>>>>> On Mar 20, 2012, at 2:24 AM, "Spencer Shepler" <sshepler@...> wrote:
>>>>>
>>>>>>
>>>>>> Yes, we are meeting. Agenda query was long enough ago that we have
>>>>>> collectively forgotten. :-)
>>>>>>
>>>>>> Any others I need to know about? I will be posting a draft agenda in
>>>>>> a couple of days...
>>>>>>
>>>>>> Spencer
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: faibish_sorin@... [mailto:faibish_sorin@...]
>>>>>> Sent: Monday, March 19, 2012 6:15 PM
>>>>>> To: Spencer Shepler
>>>>>> Cc: nfsv4@...
>>>>>> Subject: Agenda items for the Paris meeting
>>>>>>
>>>>>> Spencer,
>>>>>>
>>>>>> I don't think I saw any call for agenda items for IETF 83 in Paris,
>>>>>> France. Are we going to have a meeting? Is there anybody attending?
>>>>>>
>>>>>> I have 1 agenda item: New pNFS layout for Lustre.
>>>>>> I would also like to discuss the status of the LNFS draft in 4.2 if
>>>>>> there are any interested parties attending.
>>>>>>
>>>>>> Thank you for your patience and understanding
>>>>>>
>>>>>> /Sorin
>>>>>>
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> nfsv4 mailing list
>>>>> nfsv4@...
>>>>> https://www.ietf.org/mailman/listinfo/nfsv4
>>>>
>>>> _______________________________________________
>>>> nfsv4 mailing list
>>>> nfsv4@...
>>>> https://www.ietf.org/mailman/listinfo/nfsv4
>>> _______________________________________________
>>> nfsv4 mailing list
>>> nfsv4@...
>>> https://www.ietf.org/mailman/listinfo/nfsv4
>>>
>>
>>
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> Trond.Myklebust@...
> www.netapp.com
>
> _______________________________________________
> nfsv4 mailing list
> nfsv4@...
> https://www.ietf.org/mailman/listinfo/nfsv4
_______________________________________________
nfsv4 mailing list
nfsv4@...
https://www.ietf.org/mailman/listinfo/nfsv4
