IDNA and IRI document way forward


IDNA and IRI document way forward

by Larry Masinter-3

A note about the direction for URI and IRI:

 

I’d like to make progress on two tasks

1)      resolve any issues with regard to IDNA standards and difficulties with internationalized domain names

2)      resolve any conflicts with the requirements for documenting current browser behavior for compatibility with HTML5

 

For (1) I think the main changes are:

 

a)      CHANGE (clarify, update, whatever necessary) URI definitions so that no percent-hex-encoded %XX values are allowed in the “HOST NAME” section of any URI.  URIs that wish to refer to internationalized domain names may only use “A-label” domain names.  My understanding is that this is necessary for security reasons, and may require an update to some existing software (http://tools.ietf.org/html/draft-ietf-idnabis-defs-09#section-2.3.2.1). It may also require an update to the HTTP URI scheme and the Mailto: URI scheme.

b)      Update the IRI definition so that non-ASCII domain names, either in the generic syntax or the alternate syntax, are *not* mapped to URIs by a string algorithm, but rather are parsed, and any IRI -> URI transformation handled by

1.       parsing according to the scheme definition

2.        mapping the parsed components based on whether they are host names, query strings (for HTML5), or non-hostname components

3.       reassembling the URI components after mapping
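
The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the algorithm of any draft: `iri_to_uri` is a hypothetical helper, it handles only the generic scheme://host/path syntax, it drops userinfo and IPv6 brackets for brevity, it always uses UTF-8 (so the HTML5 query-encoding case is not modelled), and it relies on Python's built-in "idna" codec, which implements IDNA2003 rather than the IDNA revision under discussion.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched when escaping; "%" is kept so that
# existing %XX escapes are not escaped a second time.
_SAFE = "/?:@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: parse, map each component, reassemble."""
    # Step 1: parse according to the generic syntax.
    parts = urlsplit(iri)

    # Step 2a: the host is mapped label by label to A-labels,
    # never to %XX escapes.
    host = parts.hostname or ""
    if host:
        host = host.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"

    # Step 2b: all other components get UTF-8 percent-encoding.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE)
    fragment = quote(parts.fragment, safe=_SAFE)

    # Step 3: reassemble the mapped components.
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://example.org/caf\u00e9?q=\u00e9"))
```

A scheme-specific variant would replace the `urlsplit` call with a parser chosen by scheme, which is where the review of registered schemes mentioned below comes in.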

 

We will need to review existing registered (and perhaps unregistered?) URI schemes to see if there are other places where host names appear than using the generic scheme://host/path  or scheme:local@remote syntax.  IRI->URI transformation will necessarily be scheme specific, and we’ll need to basically define that new URI schemes must either

(a)    not allow host names anywhere in them

(b)   use the scheme://host/path syntax

(c)    be a special exception  -- I think this may be limited to ...

 

There are other uses of domain names in URIs currently; for example, cid: (content-ID) strings often contain domain names.  I’m not sure but it may be reasonable to *not* allow IRI forms, e.g., require that all URIs not using scheme://host/path syntax not allow hex-encoded octets above %7F, for example.

 

I think this is unfortunate and a pretty drastic change to the IRI document, but I don’t think we’re going to make progress if we don’t take the bull by the horns.

 

Larry

--

http://larry.masinter.net

 


Re: IDNA and IRI document way forward

by "Martin J. Dürst"

Hello Larry,

Many thanks for this mail.

On 2009/07/29 13:52, Larry Masinter wrote:
> A note about the direction for URI and IRI:
>
> I'd like to make progress on two tasks
> 1)      resolve any issues with regard to IDNA standards and difficulties with internationalized domain names

There are definitely quite some issues that need to be resolved.
One that is for sure is that we have to adapt the BIDI stuff in the IRI
draft to what IDNA does.

> 2)      resolve any conflicts with the requirements for documenting current browser behavior for compatibility with HTML5

Conflicts between *what* and the current browser behavior?

> For (1) I think the main changes are:
>
> a)      CHANGE (clarify, update, whatever necessary) URI definitions so that no percent-hex-encoded %XX values are allowed in the "HOST NAME" section of any URI.  URIs that wish to refer to internationalized domain names may only use "A-label" domain names.  My understanding is that this is necessary for security reasons, and may require an update to some existing software (http://tools.ietf.org/html/draft-ietf-idnabis-defs-09#section-2.3.2.1). It may also require an update to the HTTP URI scheme and the Mailto: URI scheme.

Do you mean you propose to update RFC 3986 (STD 66)? Can you give more
details on the "security reasons"? The section you cite does contain a
lot of definitions, but nothing about security issues or %-encoding
(which isn't surprising, because %-encoding is URI/IRI-specific, and
gets resolved before the data goes to the DNS (or IDNA)).

BTW, the issue for allowing %-escaping in the reg-name (NOT host name)
part of a URI can be found at
http://labs.apache.org/webarch/uri/rev-2002/issues.html#036-host-escaping.

> b)      Update the IRI definition so that non-ASCII domain names, either in the generic syntax or the alternate syntax, are *not* mapped to URIs by a string algorithm, but rather are parsed, and any IRI ->  URI transformation handled by
> 1.       parsing according to the scheme definition
> 2.        mapping the parsed components based on whether they are host names, query strings (for HTML5), or non-hostname components
> 3.       reassembling the URI components after mapping

Treating host parts specially (i.e. converting them to punycode rather
than to %-encoding) is already allowed in RFC 3987.

> We will need to review existing registered (and perhaps unregistered?) URI schemes to see if there are other places where host names appear than using the generic scheme://host/path  or scheme:local@remote syntax.  IRI->URI transformation will necessarily be scheme specific, and we'll need to basically define that new URI schemes must either
> (a)    not allow host names anywhere in them
> (b)   use the scheme://host/path syntax
> (c)    be a special exception  -- I think this may be limited to mailto:
>
> There are other uses of domain names in URIs currently; for example, cid: (content-ID) strings often contain domain names.  I'm not sure but it may be reasonable to *not* allow IRI forms, e.g., require that all URIs not using scheme://host/path syntax not allow hex-encoded octets above %7F, for example.

There are many other places where domain names appear in URIs and IRIs.
Take http://validator.w3.org/check?uri=http://www.ietf.org/ as a simple
example (OT: would be nice if the IETF site validated). Or then take a
site such as http://恵比寿駅.jp/. That would be
http://validator.w3.org/check?uri=http://恵比寿駅.jp/ or some such for the
validator (unfortunately, this is currently not supported). It's very
clearly impossible to rule this out. But even before that, doing
scheme-wise processing kills the U in URIs.

> I think this is unfortunate and a pretty drastic change to the IRI document, but I don't think we're going to make progress if we don't take the bull by the horns.

Before taking anything by the horns (or the tail, or whatever) I'd like
to know in great detail what exactly the actual (or pretended) bull is.

Regards,    Martin.

> Larry
> --
> http://larry.masinter.net
>
>

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@...


RE: IDNA and IRI document way forward

by Larry Masinter-3

I confess that I'm just coming back up to speed on the
issues, and hope you'll forgive me for missing some of
the history.

It seems there are at least two communities (IDN/IDNA and
IRI/WEB) which should have been working together for
the past many years, haven't been, and we're now facing
some difficulties in bringing their perspectives together,
especially when those perspectives have been built
into long-standing and finely argued documents.

I'm not entirely sure of the use case and difficulties,
which I will try to track down in more detail.

Just as personal speculation, however,
I could easily imagine some problems if it were
possible to register domain names which actually
contained percent-hex-hex sequences.

www.%77%33.org vs www.w3.org?

Perhaps that would be a problem not just for IRIs
but for other kinds of processing too.  Can this
be disallowed at the URI parsing level? Only at
the IRI level?
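
As an illustration of the concern, the following Python sketch shows that the two spellings collapse to the same name once %XX escapes are decoded, and how a parser-level check might flag escapes in the host. `host_uses_percent_escapes` is a hypothetical helper, not part of any specification, and it would also flag a percent-encoded IPv6 zone identifier.

```python
from urllib.parse import unquote, urlsplit


def host_uses_percent_escapes(uri: str) -> bool:
    """Hypothetical check: does the host part contain a '%' escape?"""
    netloc = urlsplit(uri).netloc
    host = netloc.rpartition("@")[2]  # drop any userinfo before the host
    return "%" in host


# 0x77 is "w" and 0x33 is "3", so decoding yields the familiar name.
print(unquote("www.%77%33.org"))
print(host_uses_percent_escapes("http://www.%77%33.org/"))
print(host_uses_percent_escapes("http://www.w3.org/a%20b"))
```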

I see the difficulties of creating a provision for
scheme-specific parsing and restrictions on host names
containing %xx hex-encoded bytes in URIs are even
greater than what I imagined.


> That would be
> http://validator.w3.org/check?uri=http://恵比寿駅.jp/

I'm sure there are difficulties even in circumstances that
don't use "?", but this is especially difficult since the
HTML-URL/HREF/WebAddress handling of non-ASCII query parameters
adds some ambiguity to the translation of this into URI space.

> It's very clearly impossible to rule this out.

Difficult, but not impossible.

> But even before that, doing scheme-wise processing
>  kills the U in URIs.

And the I in Internationalized and several other things. Let's
stick to identifying issues and alternatives.

>> I think this is unfortunate and a pretty drastic change
>> to the IRI document, but I don't think we're going to make
>> progress if we don't take the bull by the horns.
>
> Before taking anything by the horns (or the tail, or whatever) I'd like
> to know in great detail what exactly the actual (or pretended) bull is.

And if there are two bulls there would be
four horns, a Tetralemma, and quite a bit of BS.

Larry
--
http://larry.masinter.net


Re: IDNA and IRI document way forward

by "Martin J. Dürst"

Hello Larry,

On 2009/07/29 16:07, Larry Masinter wrote:
> I confess that I'm just coming back up to speed on the
> issues, and hope you'll forgive me for missing some of
> the history,

No problem with that.

> It seems there are at least two communities (IDN/IDNA and
> IRI/WEB) which should have been working together for
> the past many years, haven't been, and we're now facing
> some difficulties in bringing their perspectives together,
> especially when those perspectives have been built
> into long-standing and finely argued documents.

I guess it's possible to see the situation this way, but I think it's
also possible to see the glass as (at least) half full. Browser vendors
have been implementing both IDNs and IRIs. Michel Suignard in particular
was involved in both efforts. I also was and am on both lists, and tried
to notice any relevant issues (which doesn't mean that they all made it
into the current draft).

> I'm not entirely sure of the use case and difficulties,
> which I will try to track down in more detail.

Great. Very much looking forward to that.

> Just as personal speculation, however,
> I could easily imagine some problems if it were
> possible to register domain names which actually
> contained percent-hex-hex sequences.
>
> www.%77%33.org vs www.w3.org?

That was the direction that my speculation would go too.
It is certainly true that the DNS as such easily accepts any byte
sequence in its labels. As far as I understand, that even includes null
bytes, because the packets that the DNS sends use an initial byte to
indicate the length of a label (taking two bits for other purposes, that
results in the label length limit of 63 bytes).
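
The length-prefixed layout described above can be sketched as follows. This is a minimal illustration, assuming uncompressed names; `encode_dns_name` is a hypothetical helper. The top two bits of the length byte are reserved (the value 0b11 marks a compression pointer), which leaves six bits and hence the 63-byte limit.

```python
def encode_dns_name(labels: list[bytes]) -> bytes:
    """Hypothetical helper: encode labels in uncompressed DNS wire format."""
    out = bytearray()
    for label in labels:
        # Six usable bits in the length byte: 1..63 bytes per label.
        if not 1 <= len(label) <= 63:
            raise ValueError("label must be 1 to 63 bytes long")
        out.append(len(label))  # length prefix, so no byte value is special
        out += label
    out.append(0)  # zero-length root label terminates the name
    return bytes(out)


print(encode_dns_name([b"www", b"w3", b"org"]))
# A NUL byte inside a label is carried like any other byte.
print(encode_dns_name([b"a\x00b", b"org"]))
```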

However, "%77%33" never has been legal in URIs before RFC 3986 made it
mean the same as "w3". Also, "%77%33" is definitely not something any
DNS registrar would allow for registration, nor is it something that
other IETF protocols would accept as a host name if they put any
restrictions on it at all.

> Perhaps that would be a problem not just for IRIs
> but for other kinds of processing too.  Can this
> be disallowed at the URI parsing level? Only at
> the IRI level?

Well, "%77%33" as such would not need to be disallowed.
One would just have to write "%2577%2533" if one really had a need to
express it (which I guess would be the case extremely rarely).
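
A short Python check of this double-escaping point, using only the standard library: escaping the "%" itself as "%25" protects the literal characters, so one round of decoding gives back "%77%33" rather than "w3".

```python
from urllib.parse import quote, unquote

literal = "%77%33"

# With no characters marked safe, "%" itself is escaped as "%25".
escaped = quote(literal, safe="")
print(escaped)

# One round of decoding restores the literal text, not "w3".
print(unquote(escaped))

# Decoding the unprotected form yields "w3".
print(unquote(literal))
```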

> I see the difficulties of creating a provision for
> scheme-specific parsing and restrictions on host names
> containing %xx hex-encoded bytes in URIs are even
> greater than what I imagined.

I very much have to agree that we have to weigh the difficulties
against the benefits.


>> That would be
>> http://validator.w3.org/check?uri=http://恵比寿駅.jp/

Or http://%E6%81%B5%E6%AF%94%E5%AF%BF%E9%A7%85.jp/ when escaped
(using UTF-8, as prescribed by RFC 3986/7).
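
The escaped form above can be reproduced with Python's standard library, which percent-encodes the UTF-8 bytes of each non-ASCII character. This is just a verification sketch; the variable names are illustrative.

```python
from urllib.parse import quote, unquote

host = "恵比寿駅.jp"

# quote() encodes str input as UTF-8 by default and leaves "." alone.
escaped = quote(host)
print(escaped)

# Decoding (also UTF-8 by default) restores the original characters.
print(unquote(escaped) == host)
```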

> I'm sure there are difficulties even in circumstances that
> don't use "?", but this is especially difficult since the
> HTML-URL/HREF/WebAddress handling of non-ASCII query parameters
> adds some ambiguity to the translation of this into URI space.

Yes. But we should try to have these localized to those contexts,
if possible.

>> It's very clearly impossible to rule this out.
>
> Difficult, but not impossible.
>
>> But even before that, doing scheme-wise processing
>>   kills the U in URIs.
>
> And the I in Internationalized and several other things. Let's
> stick to identifying issues and alternatives.

I agree.

Regards,   Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@...


Re: IDNA and IRI document way forward

by Erik van der Poel

Hi all,

First of all, I am glad that this discussion is happening, and
especially happy to see references to "current browser behavior". :-)

I am a bit concerned about a couple of things in the new IRI bis draft:

http://www.ietf.org/id/draft-duerst-iri-bis-06.txt

There are a couple of new terms in this document: Legacy Extended IRIs
(LEIRIs) and Hypertext References. The difference between LEIRIs and
IRIs appears to be whether or not you %-escape certain characters,
such as Space and Unicode Private Use characters. However, Space is an
ASCII character, so one would expect this to be more related to URIs,
which begs the question of why we don't have the term LEURI as well.

Instead of defining new terms like LEIRI and Hypertext References, I
wonder if it might be better to have Standards Track RFCs on URIs and
IRIs and a BCP (Best Current Practice) on the contexts in which these
*RIs occur. The BCP would cover such issues as whether or not to
%-escape Space and Private Use.

The BCP could also cover HTML-specific issues such as the character
encoding of the query part (after the question mark), or the
HTML-specific material could be moved back to W3C.

Erik



Re: IDNA and IRI document way forward

by John Cowan

Erik van der Poel scripsit:

> There are a couple of new terms in this document: Legacy Extended IRIs
> (LEIRIs) and Hypertext References. The difference between LEIRIs and
> IRIs appears to be whether or not you %-escape certain characters,
> such as Space and Unicode Private Use characters. However, Space is an
> ASCII character, so one would expect this to be more related to URIs,
> which begs the question of why we don't have the term LEURI as well.

No need for it.  "LEIRI" is a new name for something that's been around
since the First Edition of XML 1.0.  Until now, the numerous W3C and
non-W3C specs have referred to one another to define it, and it's never
had a name of its own.  This has caused uncertainty in definition and
pointless dependencies.  In any case, nobody proposes using non-IRI
LEIRIs for anything; it's just that they are a physical possibility
in various places within XML, and getting the definition standardized
and in one place will enable a lot of standards rectification.

--
John Cowan  cowan@...   http://ccil.org/~cowan
Consider the matter of Analytic Philosophy.  Dennett and Bennett are well-known.
Dennett rarely or never cites Bennett, so Bennett rarely or never cites Dennett.
There is also one Dummett.  By their works shall ye know them.  However, just as
no trinities have fourth persons (Zeppo Marx notwithstanding), Bummett is hardly
known by his works.  Indeed, Bummett does not exist.  It is part of the function
of this and other e-mail messages, therefore, to do what they can to create him.


RE: IDNA and IRI document way forward

by Larry Masinter-3


I'd like to focus on the technical issues remaining --
what should we say about "best practice", whether we've
adequately captured "current browser behavior" before
taking on the editorial issues around whether this all
can appear in the IRI document or needs to be split
out into a BCP etc.

At first glance, the problem we started out with was that
there were 4-5 different documents with conflicting
advice, terminology and introductory material, and
so I'd like to continue working on a combined document
until we get everything all assembled.

My hope is that the deployment advice is brief enough
that a single specification can address both the allowed
syntax and semantics, processing rules, and deployment
advice.

Larry
--
http://larry.masinter.net


-----Original Message-----
From: Erik van der Poel [mailto:erikv@...]
Sent: Wednesday, July 29, 2009 12:39 PM
To: Martin J. Dürst
Cc: Larry Masinter; PUBLIC-IRI@...; URI; John Klensin (klensin@...); Vint Cerf
Subject: Re: IDNA and IRI document way forward

Hi all,

First of all, I am glad that this discussion is happening, and
especially happy to see references to "current browser behavior". :-)

I am a bit concerned about a couple of things in the new IRI bis draft:

http://www.ietf.org/id/draft-duerst-iri-bis-06.txt

There are a couple of new terms in this document: Legacy Extended IRIs
(LEIRIs) and Hypertext References. The difference between LEIRIs and
IRIs appears to be whether or not you %-escape certain characters,
such as Space and Unicode Private Use characters. However, Space is an
ASCII character, so one would expect this to be more related to URIs,
which begs the question of why we don't have the term LEURI as well.

Instead of defining new terms like LEIRI and Hypertext References, I
wonder if it might be better to have Standards Track RFCs on URIs and
IRIs and a BCP (Best Current Practice) on the contexts in which these
*RIs occur. The BCP would cover such issues as whether or not to
%-escape Space and Private Use.

The BCP could also cover HTML-specific issues such as the character
encoding of the query part (after the question mark), or the
HTML-specific material could be moved back to W3C.

Erik


Re: IDNA and IRI document way forward

by "Martin J. Dürst"

I very much agree with Larry. Having definitions with minor differences
in different places is not a good idea.

LEIRIs are not only used widely in XML specs, but are also what was in
some early Internet-Drafts that led up to RFC 3987.

Also, for both LEIRIs and Hypertext References, they are clearly minor
variations on IRIs, and they are clearly *tolerated* variations for
backwards compatibility and continuity rather than recommended best
practice. Using the "Best Practice" label for either of these would
detract from the fact that the relevant technologies (a group of XML
specs, and HTML5, respectively) allow acceptance of these variants, but
authors/generators/creators should avoid them wherever possible.

There is currently clear advice on avoiding variants outside the IRI
syntax for LEIRIs, but not yet for Hypertext References. I wrote the
advice on avoiding variants outside the IRI syntax for LEIRIs, and after
coming back from vacation (the next three weeks, will have email
access), I can work on adapting this advice for Hypertext References, too.

Regards,    Martin.

On 2009/07/30 19:06, Larry Masinter wrote:

> I'd like to focus on the technical issues remaining --
> what should we say about "best practice", whether we've
> adequately captured "current browser behavior" before
> taking on the editorial issues around whether this all
> can appear in the IRI document or needs to be split
> out into a BCP etc.
>
> At first glance, the problem we started out with was that
> there were 4-5 different documents with conflicting
> advice, terminology and introductory material, and
> so I'd like to continue working on a combined document
> until we get everything all assembled.
>
> My hope is that the deployment advice is brief enough
> that a single specification can address both the allowed
> syntax and semantics, processing rules, and deployment
> advice.
>
> Larry
> --
> http://larry.masinter.net
>


--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@...


Re: IDNA and IRI document way forward

by Anne van Kesteren-2

On Thu, 30 Jul 2009 12:59:38 +0200, Martin J. Dürst <duerst@...> wrote:
> Also, for both LEIRIs and Hypertext References, they are clearly minor  
> variations on IRIs, and they are clearly *tolerated* variations for  
> backwards compatibility and continuity rather than recommended best  
> practice. Using the "Best Practice" label for either of these would  
> detract from the fact that the relevant technologies (a group of XML  
> specs, and HTML5, respectively) allow acceptance of these variants, but  
> authors/generators/creators should avoid them wherever possible.

I think the continued repetition of "HTML5" here is confusing. HTML5 was the first to define the algorithm, but it is used by Web browsers for HTTP, CSS, HTML, XMLHttpRequest, etc. and for all these there is content out there that depends on it. Having said that, only HTML has use for the URL character encoding parameter (as it was originally called). For the other contexts this is always UTF-8 (or not applicable, in case of HTTP).
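
To illustrate why the URL character encoding matters for the query part, the following Python sketch escapes the same query under two encodings. `escape_query` is a hypothetical helper and windows-1252 is just an example of a non-UTF-8 document encoding; this is not the algorithm from the HTML 5 spec.

```python
from urllib.parse import quote


def escape_query(query: str, document_encoding: str = "utf-8") -> str:
    """Hypothetical helper: percent-encode a query in the given encoding."""
    # "=" and "&" are kept so that the key/value structure survives.
    return quote(query, safe="=&", encoding=document_encoding)


# Same characters, different bytes on the wire.
print(escape_query("q=caf\u00e9"))
print(escape_query("q=caf\u00e9", "windows-1252"))
```

A server that assumes UTF-8 would not decode the second form back to the original text, which is the ambiguity at issue.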

Also, it makes much more sense (due to the URL character encoding) to map Hypertext References directly to URIs rather than frame them as some type of IRI. That's the way they are implemented and used in the wild (even though they arguably shouldn't be). (See also my other comments.)


--
Anne van Kesteren
http://annevankesteren.nl/


RE: IDNA and IRI document way forward

by Larry Masinter-3

> I think the continued repetition of "HTML5" here
> is confusing. HTML5 was the first to define the
> algorithm, but it is used by Web browsers for HTTP,
> CSS, HTML, XMLHttpRequest, etc. and for all these
>  there is content out there that depends on it.

"HTML5" refers to the HTML 5 spec, not the Hypertext
Markup Language that is among the things the spec
attempts to define.

> Also, it makes much more sense (due to the URL
>  character encoding) to map Hypertext References
> directly to URIs rather than frame them as some
>  type of IRI.

I don't understand this: the IRI document currently
describes a HREF -> URI mapping; is that OK
with you?



Re: IDNA and IRI document way forward

by Anne van Kesteren-2

On Thu, 30 Jul 2009 15:00:24 +0200, Larry Masinter <masinter@...> wrote:
>> I think the continued repetition of "HTML5" here
>> is confusing. HTML5 was the first to define the
>> algorithm, but it is used by Web browsers for HTTP,
>> CSS, HTML, XMLHttpRequest, etc. and for all these
>>  there is content out there that depends on it.
>
> "HTML5" refers to the HTML 5 spec, not the Hypertext
> Markup Language that is among the things the spec
> attempts to define.

Sure, but HTTP, XMLHttpRequest, CSS, etc. are not part of the HTML 5 spec yet still need to use the same algorithm in the end. That is part of the reason we defined it separately, so the continued repetition that this is just for HTML5 is confusing, in my opinion.


>> Also, it makes much more sense (due to the URL
>>  character encoding) to map Hypertext References
>> directly to URIs rather than frame them as some
>>  type of IRI.
>
> I don't understand this: the IRI document currently
> describes a HREF -> URI mapping; is that OK
> with you?

My bad, yes that is perfectly OK with me. (I guess I'm getting brainwashed by suggestions that Hypertext References can simply be mapped to IRIs.)


--
Anne van Kesteren
http://annevankesteren.nl/


Re: IDNA and IRI document way forward

by Erik van der Poel

Thanks, all, for the info about LEIRI, XML, etc.

> I don't understand this: the IRI document currently
> describes a HREF -> URI mapping; is that OK
> with you?

There are also issues with the URI -> DNS/HTTP mapping, but I suppose
that discussion belongs in uri@....

There are many small issues with IDNA and IRI, but I suppose one of
the bigger issues is the question of mapping. HTML implementers will
be trying to decide what to do about IDNA2003/2008 co-existence and/or
migration. We currently have two drafts that are quite different:

http://tools.ietf.org/html/draft-ietf-idnabis-mappings-01
http://www.unicode.org/reports/tr46/tr46-1.html

Erik


RE: IDNA and IRI document way forward

by Larry Masinter-3

I've continued to mull over the contradictory requirements
for IDNA and IRI parsing.

1) IDNA requires (requests, demands, whatever) that there be no
  way in which a %xx percent-hex-encoded version of an Internationalized
  Domain Name ever be presented to a DNS resolver.

2) Traditionally, though, IRIs have been defined as requiring
  a scheme-independent (and syntax-independent) translation from IRI
  (or IRI-like-thing) to URI.

3) URI schemes have host names in many places, not just one:
   mailto:person@host,   ftp://user@host/path, http://host1/path?location=http://host2/path2

I don't think these three things are compatible. If IRIs are defined
by mapping to URIs using (2), then Internationalized Domain Names
in different schemes (3) will translate to percent-hex-encoding
domain names in their corresponding URIs, violating the requirement
for (2).

I can't see any way around (1) or (3), so this leaves me with the
uncomfortable choice of abandoning (2), performing major violence
on the IRI spec.

So here's a swipe at how this might work (please don't shoot me yet):

NO LONGER define an IRI by a generic IRI -> URI mapping.
INSTEAD, IRI parsing is *scheme specific*. "Internationalized"
  (IRI) versions of URI schemes are defined as:
   

   For each URI scheme, there is a corresponding IRI scheme.
   The grammar for the IRI scheme *MUST* be exactly the same
    as the grammar for the URI scheme of the same name, except
   that (a) every syntactic component in the URI scheme that
   allows "unreserved" characters from the URI spec should,
   in the IRI form, allow "Unreserved" characters from the
   IRI repertoire.

   In general, the mapping for handling IRIs and interpreting
   them CAN be defined using a generic IRIstring to URIstring
   component using percent-encoding, with the exception that
   host names are translated to the right IDNA format string.

URI schemes do not AUTOMATICALLY get equivalent IRI schemes.
So, data:, cid:, mid:, tag:, etc. etc. do *Not* have IRI
equivalents automatically (if needed, someone can support
them.)

Instead, we define IRI versions of "http:" and "https:"
and (maybe) "file:" and "ftp:" and "mailto:" using the
new generic IRI definition.

This is painful, of course, but at least it seems to be more
consistent with what's implemented.
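
As a rough illustration of the proposal above (it is not part of the original message), a scheme-specific mapping might be sketched in Python as follows. The registry `IRI_SCHEMES`, the helper names, and the use of Python's built-in "idna" codec (which implements IDNA2003 ToASCII) are all assumptions made for the example. It further assumes no IPv6 literals and a mailto: with a single address and no header fields.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# ASCII characters left untouched; everything else becomes %XX (UTF-8).
SAFE = "%/:?#[]@!$&'()*+,;=-._~"


def to_a_label(host: str) -> str:
    """Convert a host name to its ASCII (A-label) form, IDNA2003 rules."""
    return host.encode("idna").decode("ascii") if host else ""


def map_authority_scheme(iri: str) -> str:
    """Hypothetical mapper for scheme://[user@]host[:port]/path schemes."""
    parts = urlsplit(iri)
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, colon, port = hostport.partition(":")
    netloc = quote(userinfo, safe=SAFE) + at + to_a_label(host) + colon + port
    # Path, query and fragment are treated as data and percent-encoded.
    rest = [quote(c, safe=SAFE) for c in (parts.path, parts.query, parts.fragment)]
    return urlunsplit((parts.scheme, netloc, *rest))


def map_mailto(iri: str) -> str:
    """Hypothetical mapper for mailto:local@domain (no header fields)."""
    scheme, _, rest = iri.partition(":")
    local, at, domain = rest.rpartition("@")
    return f"{scheme}:{quote(local, safe=SAFE)}{at}{to_a_label(domain)}"


# Only schemes listed here get an IRI form; others are rejected.
IRI_SCHEMES = {
    "http": map_authority_scheme,
    "https": map_authority_scheme,
    "ftp": map_authority_scheme,
    "mailto": map_mailto,
}


def iri_to_uri(iri: str) -> str:
    scheme = iri.partition(":")[0].lower()
    if scheme not in IRI_SCHEMES:
        raise ValueError(f"no IRI form defined for scheme {scheme!r}")
    return IRI_SCHEMES[scheme](iri)
```

In this sketch a host name embedded in a query string (the host2 case from point 3) is simply percent-encoded as data; only the component that the scheme itself identifies as a host is converted to an A-label.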

Larry



--
http://larry.masinter.net

Re: IDNA and IRI document way forward

by "Martin J. Dürst"

Hello Larry,

On 2009/08/25 11:27, Larry Masinter wrote:
> I've continued to mull over the contradictory requirements
> for IDNA and IRI parsing.
>
> 1) IDNA requires (requests, demands, whatever) that there be no
>    way in which a %xx percent-hex-encoded version of an Internationalized
>    Domain Name ever be presented to a DNS resolver.

I fully agree that it's not a good idea to send a %-encoded version of
an Internationalized Domain Name to a DNS resolver. But I fail to
understand the "never ever" aspect of the above sentence. It sounds even
stronger than a MUST, maybe MUST NEVER EVER? Can you or anybody else
give more details on this?

In my understanding, except for some really, really rare weird cases,
what happens is simply "host not found". So we have "incomplete
implementation" + "edge case data" => "null result". Nobody losing big
money, nobody hurt, no hardware destroyed, etc. Similar things happen in
many IETF protocols, even in DNS.


> 2) Traditionally, though, IRIs have been defined as requiring
>    a scheme-independent (and syntax-independent) translation from IRI
>    (or IRI-like-thing) to URI.

Yes, that's the principle. But please note that even now, RFC 3987 says:

    Systems accepting IRIs MAY convert the ireg-name component of an IRI
    as follows (before step 2 above) for schemes known to use domain
    names in ireg-name, if the scheme definition does not allow
    percent-encoding for ireg-name:

[see http://tools.ietf.org/html/rfc3987#section-3.1 for the details]

I think it would easily be possible to extend this to "scheme
definitions which allow percent-encoding for ireg-name" (there are
probably not yet that many of these currently, anyway) and "other
scheme-specific components that represent domain names" (which would
cover cases such as mailto).

> 3) URI schemes have host names in many places, not just one:
>     mailto:person@host,   ftp://user@host/path, http://host1/path?location=http://host2/path2
>
> I don't think these three things are compatible. If IRIs are defined
> by mapping to URIs using (2), then Internationalized Domain Names
> in different schemes (3) will translate to percent-hex-encoding
> domain names in their corresponding URIs, violating the requirement
> for (2).

I guess you wanted to say "violating the requirement for (1)". And that
needs to assume that a domain name with %-encoding in an URI is handed
to a DNS resolver without any additional processing. While this is
probably what's usually happening, especially if IRI->URI conversion and
URI resolution are completely independent, it's not a given. There is
also the possibility of an URI resolver that is compliant with that part
of RFC 3986, and there is the possibility that the %-encoding will be
resolved as part of general URI resolution (leading to raw UTF-8 being
passed to the DNS resolver), and there are other possibilities.
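
One of the "other possibilities" mentioned above can be sketched in a few lines of Python: a resolver that percent-decodes the host of a URI and then converts it to ASCII before it ever reaches DNS. This is only an illustration under stated assumptions: the function name `host_for_dns` is made up for the example, the decoded octets are assumed to be UTF-8, and Python's built-in "idna" codec implements IDNA2003 rather than IDNA2008.

```python
from urllib.parse import unquote, urlsplit


def host_for_dns(uri: str) -> str:
    """Return the host name in the ASCII form a DNS resolver expects."""
    host = urlsplit(uri).hostname or ""
    # Undo %XX escapes in the host, interpreting the octets as UTF-8.
    decoded = unquote(host, encoding="utf-8", errors="strict")
    # Apply ToASCII (IDNA2003) so no raw UTF-8 and no '%' reaches DNS.
    return decoded.encode("idna").decode("ascii")
```

With such a step in place, a %-encoded host in a URI and the corresponding non-ASCII host in an IRI lead to the same DNS lookup.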

> I can't see any way around (1) or (3), so this leaves me with the
> uncomfortable choice of abandoning (2), performing major violence
> on the IRI spec.
>
> So here's a swipe at how this might work (please don't shoot me yet):
>
> NO LONGER define an IRI by a generic IRI ->  URI mapping.
> INSTEAD, IRI parsing is *scheme specific*. "Internationalized"
>    (IRI) versions of URI schemes are defined as:

How would *scheme specific* work for host2 in
    http://host1/path?location=http://host2/path2 ?
(I agree with Erik that we should think about that as just being data.)

>     For each URI scheme, there is a corresponding IRI scheme.
>     The grammar for the IRI scheme *MUST* be exactly the same
>      as the grammar for the URI scheme of the same name, except
>     that (a) every syntactic component in the URI scheme that
>     allows "unreserved" characters from the URI spec should,
>     in the IRI form, allow "Unreserved" characters from the
>     IRI repertoire.
>
>     In general, the mapping for handling IRIs and interpreting
>     them CAN be defined using a generic IRIstring to URIstring
>     component using percent-encoding, with the exception that
>     host names are translated to the right IDNA format string.
>
> URI schemes do not AUTOMATICALLY get equivalent IRI schemes.
> So, data:, cid:, mid:, tag:, etc. etc. do *Not* have IRI
> equivalents automatically (if needed, someone can support
> them.)

For many schemes, it is indeed the case currently that they do not have
an equivalent "IRI scheme". mailto would be the classical example. The
mailto scheme, as of RFC 2368, doesn't allow any %-encoding, and
therefore doesn't allow any IRIs except for those that are trivially
identical to URIs. This is being worked on with
http://tools.ietf.org/html/draft-duerst-mailto-bis-06.

On the other hand, I think it would be a huge overkill to require that
every scheme be defined twice (once for URIs and once for IRIs). It's
not rocket science to find the components that are domain names in a
scheme definition, so generic language for this in the IRI spec should
do the job.

> Instead, we define IRI versions of "http:" and "https:"
> and (maybe) "file:" and "ftp:" and "mailto:" using the
> new generic IRI definition.
>
> This is painful, of course, but at least it seems to be more
> consistent with what's implemented.

Looking at Erik's mail
(http://lists.w3.org/Archives/Public/public-iri/2009Aug/0012.html),
implementations seem to be anything but consistent. Why not have
them move in the right direction?

In summary, (1) is not in the MUST NEVER EVER category, just in the
"shit happens, but it's mostly harmless" category. For (2), RFC 3987
already sins a bit with regards to absolute scheme-independency, and we
can sin a bit more in iri-bis if that's deemed necessary.

Regards,    Martin.
--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@...


RE: IDNA and IRI document way forward

by Larry Masinter-3

One way to think about what I'm talking about:
RFC 3987 and the current draft do

  IRI --translating--> URI

and then (as necessary):

  URI --parsing--> parsed-URI-component(s)

but most deployed browsers actually do

  IRI --parsing--> parsed-IRI-component(s)

and then (as necessary):

  parsed-IRI-component --translating--> parsed-URI-component

where 'translating' might actually be different
for different components (hostname, form query
parameters).
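
The difference between the two orderings can be made concrete with a small Python sketch. It is illustrative only: the function names are invented for the example, the IRI is assumed to carry no userinfo, and Python's built-in "idna" codec (IDNA2003) stands in for whatever host translation is eventually specified.

```python
from urllib.parse import quote, urlsplit

# ASCII characters left untouched by the generic string translation.
SAFE = "%/:?#[]@!$&'()*+,;=-._~"


def host_via_translate_then_parse(iri: str) -> str:
    """Percent-encode the whole string first, then pick out the host."""
    uri = quote(iri, safe=SAFE)  # generic, structure-blind translation
    return urlsplit(uri).hostname or ""


def host_via_parse_then_translate(iri: str) -> str:
    """Pick out the host first, then translate it as a host name."""
    host = urlsplit(iri).hostname or ""
    return host.encode("idna").decode("ascii")
```

For an IRI with a non-ASCII host, the first function yields a host containing %XX escapes, while the second yields an "xn--" A-label.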

> Yes, that's the principle. But please note that even now, RFC 3987 says:
>    Systems accepting IRIs MAY convert the ireg-name component of an IRI
>    as follows (before step 2 above) for schemes known to use domain
>    names in ireg-name, if the scheme definition does not allow
>    percent-encoding for ireg-name:

I think this should be a MUST rather than a MAY.

> On the other hand, I think it would be a huge overkill to require that
> every scheme be defined twice (once for URIs and once for IRIs).

New schemes should be defined as IRIs if that's applicable.

The old schemes mainly need a general update based on the new
IRI generic syntax. There are a few special cases, but they
should be addressed specially.


> Looking at Erik's mail
> (http://lists.w3.org/Archives/Public/public-iri/2009Aug/0012.html),
> implementations seem to be anything but consistent. Why not have
> them move in the right direction?

I think "parse then escape" is more common than "escape then parse"
so I think this is the "right direction".


> For (2), RFC 3987
> already sins a bit with regards to absolute scheme-independency, and we
> can sin a bit more in iri-bis if that's deemed necessary.

I think we should just go the whole way, or at least look at
what the spec looks like when it proposes that.

At this point, I'm thinking of updating RFC 4395 also
 http://www.rfc-editor.org/rfc/rfc4395.txt 
"Guidelines and Registration Procedures for New URI Schemes"

to encourage scheme definitions to

*  be explicit about the applicability or processing methods
   for Unicode strings (default: not allowed)
*  be explicit about HTTP-like "operations" like GET and POST
  (default: not defined)
*  

and starting a review of registered schemes
  http://www.iana.org/assignments/uri-schemes.html
to update any that need IRI definitions.

I think for consistency that the IRI document
should acknowledge that these are often
popularly called "URLs" but that term is used only loosely,
and that formal specifications should distinguish between
URL, URI, IRI, LEIRI, HREF and the various other non-terminals.

The HTML document attempts to be precise in so many places that
using a precise term rather than a loose one seems
more appropriate, but I hope to push off that
discussion until I have at least rough drafts of the updated
IRI document and a new registry doc.

Larry
--
http://larry.masinter.net





Re: IRI update for "parse then (if needed, translate)" vs "translate, then parse"

by "Martin J. Dürst"

[I have added public-iri@... to the cc list.]

On 2009/09/03 9:12, Larry Masinter wrote:
> This took a while, but here's the next cut at recasting IRIs and dealing with "web address":
>
>
> http://larry.masinter.net/iribis-hack.html
> http://larry.masinter.net/iribis-hack.txt
> http://larry.masinter.net/iribis-hack.xml
>
> http://tools.ietf.org/rfcdiff?url1=draft-duerst-iri-bis.txt&url2=http://larry.masinter.net/iribis-hack.txt


I have read through this new draft, trying to concentrate on the changed
pieces.

Overall, I'm scared about the tendency to use MUST without much more
careful examination.

I have nothing against *allowing* scheme-specific short-cuts,
optimizations, or short-time backwards compatibility variants, but in
the long term, UTF-8 is much more important than punycode, and
scheme-independent processing isn't something to be thrown away easily.

But I'm quite confident that such a result can be obtained with a new
version of the draft which is much closer to the current -06.txt than
the one discussed here.

In some more detail:

Title: I strongly suggest removing "URI" from the title and limiting the
current effort to the "allow scheme-specific conversion" part and the
"LEIRI/Web address" part. These alone are quite serious, and we can
always make another rewrite effort once these have been addressed really
successfully, although I can't see the need for getting too much
involved in URIs at all at the moment.

Abstract: "TO ALLOW RECONCILIATION WITH CURRENT PRACTICE": This is too
strong. At least change to "SOME CURRENT PRACTICE", as there are
definitely implementations that can handle %-encoding in regnames, and
there will be more as time goes on. (see Roy's point about the Host
header in HTTP).

Abstract, shortest para: 'addition' ... 'additional': reword to avoid
repetition.

Document structure: Section 5.5 is the wrong place for LEIRIs and
friends (I'll call these legacy addresses from now on). What we need is
a short notice in section 5 (Normalization/Comparison) about legacy
addresses, but legacy addresses in and by themselves need a separate
section (the more I think about it, the more my conclusion is that an
appendix is the best place).

Introduction: "increasing numbers of protocols" -> "an increasing number
of protocols" (one number, many protocols) (that may have been in there
for ages)

Definitions: "parsed IRI component": Don't start a definition with
"similarly". (definitions should be reasonably usable outside of context)

3., before 3.1: This clearly needs more text talking about the overall
choices and procedures.

3.1 "Convert to UCS" -> "Converting to UCS" (some other titles have the
same problem; verbs don't work well in titles)

3.1, first para: Remove "or octet stream...". Of course the "sequence of
Unicode characters" will be represented somehow, but that's not relevant
here.

3.12 para 2: "benormalized" -> "be normalized"

3.2, para 1: "IRI. this" -> "IRI. This"

3.2, para 1: Is the intent to say that for relative URIs, they should be
absolutized first, and then parsed? If yes, then say so. If no, say what
else. I'm absolutely not sure that this will work; we have to very
carefully check all kinds of interactions (relative -> absolute does
some parsing as far as I understand, and HTML5 tries to convert '\' to
'/' in paths, which probably also interacts).

3.2, para 2: What about unknown schemes? Simply give up, or what?

3.2, para 3: Needs much more care and detail, and can't stay a Note.

3.2, para 4: "Subseqent processing rules may be used to define other
syntactic components.": What exactly is this supposed to mean???

3.3, para 3 (NOTE): Why is a MAY harmful? IRIs are well-defined, and we
have to allow implementations to process only valid ones, and not other
garbage.
"The non-printable characters should be stripped by most software, so by
the time you get here...": This reads like a "survival of the fittest"
for control characters.

3.3: "Hex encode" -> %-encode (That's what both RFC 3986 and 3987 have
used, and even if many people (incl. me) don't like it, there's no
reason to change it just to create even more confusion.)

3.4, para 1: Again an unjustified MUST. There are implementations that
don't do this, for good reasons, and they work and shouldn't be made
nonconformant. Also, we have to work on what to do for IDNA2008 here.

3.4, para 2: If ToASCII fails, then it fails. End of story. That's
another reason why converting to %-encoding makes sense; IRIs/URIs
cannot and shouldn't be concerned with the details of the various
namespaces that they contain or grandfather.

3.4, Note 1: "The server side implementation would be responsible":
"would be" -> "is".

3.4, Note 2: What about e.g. http://r%C3%A9sum%C3%A9.example.org in an
IRI? Will that get converted to punycode, or not?

3.4, Note 3: This needs to go somewhere else, it doesn't fit here.

3.5: This is webaddress-specific, needs to be moved.

3.6: Now we suddenly have a SHOULD. Does this trump all the MUSTs in the
details, or what?

3.7.1, last example: There's some inconsistency re. "natto" (maybe from
a long time ago)


7.: Clarifying that URI schemes are also IRI schemes is a good idea. But
this does it the wrong way: It separates URI schemes and IRI schemes,
and claims that only four schemes (ftp, http, https, imap) can be used
with IRIs when actually there are quite a few more. (what was the
criterion for obtaining the above small list?)


That's what I have for the moment.


Regards,   Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@...


Re: IRI update for "parse then (if needed, translate)" vs "translate, then parse"

by Erik van der Poel

Yes, a relative must definitely be parsed before absolutizing. Two
backslashes after http: are also treated as (forward) slashes by major
browsers.
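
A minimal Python sketch of that backslash handling, for illustration only: the function name `resolve_legacy_reference` is invented here, and the rule applied (rewrite every backslash that appears before any "?" or "#" as a forward slash, then resolve against the base) is a simplification that is not claimed to match any particular browser exactly.

```python
from urllib.parse import urljoin


def resolve_legacy_reference(base: str, ref: str) -> str:
    """Normalize backslashes in a reference, then resolve it against base."""
    # Leave the query and fragment alone; only the part before them
    # has its backslashes rewritten.
    cut = len(ref)
    for delim in "?#":
        pos = ref.find(delim)
        if pos != -1:
            cut = min(cut, pos)
    head, tail = ref[:cut], ref[cut:]
    return urljoin(base, head.replace("\\", "/") + tail)
```

The point of the sketch is the ordering: the reference has to be examined (here, split at "?" and "#" and rewritten) before it can be combined with the base.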

Erik

On Mon, Sep 14, 2009 at 8:32 PM, "Martin J. Dürst"
<duerst@...> wrote:

> 3.2, para 1: Is the intent to say that for relative URIs, they should be
> absolutized first, and then parsed? If yes, then say so. If no, say what
> else. I'm absolutely not sure that this will work; we have to very carefully
> check all kinds of interactions (relative -> absolute does some parsing as
> far as I understand, and HTML5 tries to convert '\' to '/' in paths, which
> probably also interacts).