IDNA and IRI document way forward

IDNA and IRI document way forward

by Larry Masinter-3

A note about the direction for URI and IRI:

 

I’d like to make progress on two tasks

1)      resolve any issues with regard to IDNA standards and difficulties with internationalized domain names

2)      resolve any conflicts with the requirements for documenting current browser behavior for compatibility with HTML5

 

For (1) I think the main changes are:

 

a)      CHANGE (clarify, update, whatever necessary) URI definitions so that no percent-hex-encoded %XX values are allowed in the “HOST NAME” section of any URI.  URIs that wish to refer to internationalized domain names may only use “A-label” domain names.  My understanding is that this is necessary for security reasons, and may require an update to some existing software (http://tools.ietf.org/html/draft-ietf-idnabis-defs-09#section-2.3.2.1). It may also require an update to the HTTP URI scheme and the Mailto: URI scheme.

b)      Update the IRI definition so that non-ASCII domain names, either in the generic syntax or the alternate syntax, are *not* mapped to URIs by a string algorithm, but rather are parsed, and any IRI -> URI transformation handled by

1.       parsing according to the scheme definition

2.        mapping the parsed components based on whether they are host names, query strings (for HTML5), or non-hostname components

3.       reassembling the URI components after mapping
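
The three steps above can be sketched roughly as follows. This is only an illustration, not the procedure from RFC 3987 or from any draft: the function name iri_to_uri is hypothetical, Python's built-in "idna" codec (which implements IDNA2003) stands in for whatever A-label conversion would be specified, and IP literals and scheme-specific syntaxes such as mailto: are not handled.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched outside the host; "%" is kept so that
# existing percent-escapes are not encoded a second time.
SAFE = "/%:@!$&'()*+,;="


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: map a generic-syntax IRI to a URI by component."""
    # Step 1: parse (generic scheme://host/path syntax only).
    parts = urlsplit(iri)

    # Step 2a: map the host name to its A-label (ASCII) form.
    host = parts.hostname or ""
    if host:
        host = host.encode("idna").decode("ascii")

    # Rebuild the authority, keeping userinfo and port if present.
    userinfo, at, _ = parts.netloc.rpartition("@")
    netloc = quote(userinfo, safe=SAFE) + at + host
    if parts.port is not None:
        netloc += f":{parts.port}"

    # Step 2b: percent-encode non-ASCII characters elsewhere as UTF-8.
    path = quote(parts.path, safe=SAFE)
    query = quote(parts.query, safe=SAFE + "?")
    fragment = quote(parts.fragment, safe=SAFE + "?")

    # Step 3: reassemble the mapped components.
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://bücher.example/päth?q=ü#frag"))
```

The host is the only component that goes through IDNA here; everything else is treated as opaque text and percent-encoded, which is exactly the per-component split the list describes.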

 

We will need to review existing registered (and perhaps unregistered?) URI schemes to see if there are other places where host names appear than using the generic scheme://host/path  or scheme:local@remote syntax.  IRI->URI transformation will necessarily be scheme specific, and we’ll need to basically define that new URI schemes must either

(a)    not allow host names anywhere in them

(b)   use the scheme://host/path syntax

(c)    be a special exception  -- I think this may be limited to ...

 

There are other uses of domain names in URIs currently; for example, cid: (content-ID) strings often contain domain names.  I’m not sure but it may be reasonable to *not* allow IRI forms, e.g., require that all URIs not using scheme://host/path syntax not allow hex-encoded octets above %7F, for example.

 

I think this is unfortunate and a pretty drastic change to the IRI document, but I don’t think we’re going to make progress if we don’t take the bull by the horns.

 

Larry

--

http://larry.masinter.net

 


Re: IDNA and IRI document way forward

by "Martin J. Dürst"

Hello Larry,

Many thanks for this mail.

On 2009/07/29 13:52, Larry Masinter wrote:
> A note about the direction for URI and IRI:
>
> I'd like to make progress on two tasks
> 1)      resolve any issues with regard to IDNA standards and difficulties with internationalized domain names

There are definitely quite a few issues that need to be resolved.
One thing that is certain is that we have to adapt the BIDI stuff in the
IRI draft to what IDNA does.

> 2)      resolve any conflicts with the requirements for documenting current browser behavior for compatibility with HTML5

Conflicts between *what* and the current browser behavior?

> For (1) I think the main changes are:
>
> a)      CHANGE (clarify, update, whatever necessary) URI definitions so that no percent-hex-encoded %XX values are allowed in the "HOST NAME" section of any URI.  URIs that wish to refer to internationalized domain names may only use "A-label" domain names.  My understanding is that this is necessary for security reasons, and may require an update to some existing software (http://tools.ietf.org/html/draft-ietf-idnabis-defs-09#section-2.3.2.1). It may also require an update to the HTTP URI scheme and the Mailto: URI scheme.

Do you mean you propose to update RFC 3986 (STD 66)? Can you give more
details on the "security reasons"? The section you cite does contain a
lot of definitions, but nothing about security issues or %-encoding
(which isn't surprising, because %-encoding is URI/IRI-specific and
gets resolved before the data goes to the DNS or to IDNA).

BTW, the issue of allowing %-escaping in the reg-name (NOT host name)
part of a URI can be found at
http://labs.apache.org/webarch/uri/rev-2002/issues.html#036-host-escaping.

> b)      Update the IRI definition so that non-ASCII domain names, either in the generic syntax or the alternate syntax, are *not* mapped to URIs by a string algorithm, but rather are parsed, and any IRI ->  URI transformation handled by
> 1.       parsing according to the scheme definition
> 2.        mapping the parsed components based on whether they are host names, query strings (for HTML5), or non-hostname components
> 3.       reassembling the URI components after mapping

Treating host parts specially (i.e. converting them to punycode rather
than to %-encoding) is already allowed in RFC 3987.
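
As a rough illustration of the two host conversions being contrasted here, the following sketch converts one host name both ways. The host bücher.example is an assumption chosen for the example, and Python's built-in "idna" codec implements IDNA2003, so it only approximates what an IDNA2008-based conversion would do.

```python
from urllib.parse import quote

host = "bücher.example"  # example host, not taken from the thread

# Conversion via IDNA: each non-ASCII label becomes an "xn--" A-label.
a_label_host = host.encode("idna").decode("ascii")

# Conversion via UTF-8 percent-encoding, as used for other components.
percent_host = quote(host)

print(a_label_host)   # xn--bcher-kva.example
print(percent_host)   # b%C3%BCcher.example
```

Both results are pure ASCII, but only the first is a name that can be handed to the DNS unchanged.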

> We will need to review existing registered (and perhaps unregistered?) URI schemes to see if there are other places where host names appear than using the generic scheme://host/path  or scheme:local@remote syntax.  IRI->URI transformation will necessarily be scheme specific, and we'll need to basically define that new URI schemes must either
> (a)    not allow host names anywhere in them
> (b)   use the scheme://host/path syntax
> (c)    be a special exception  -- I think this may be limited to mailto:
>
> There are other uses of domain names in URIs currently; for example, cid: (content-ID) strings often contain domain names.  I'm not sure but it may be reasonable to *not* allow IRI forms, e.g., require that all URIs not using scheme://host/path syntax not allow hex-encoded octets above %7F, for example.

There are many other places where domain names appear in URIs and IRIs.
Take http://validator.w3.org/check?uri=http://www.ietf.org/ as a simple
example (OT: it would be nice if the IETF site validated). Or take a
site such as http://恵比寿駅.jp/. For the validator, that would be
http://validator.w3.org/check?uri=http://恵比寿駅.jp/ or some such
(unfortunately, this is currently not supported). It's very clearly
impossible to rule this out. But even before that, doing scheme-wise
processing kills the U in URIs.
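
A small sketch of why this case is hard to catch with a rule keyed on the scheme://host/path position: a generic parser sees only the outer host, and the second domain name is ordinary query data. The code uses Python's standard urllib.parse and the validator address from the paragraph above; it is an illustration, not a description of how the validator works.

```python
from urllib.parse import parse_qs, urlsplit

iri = "http://validator.w3.org/check?uri=http://恵比寿駅.jp/"
parts = urlsplit(iri)

# A generic parser only identifies the outer host ...
print(parts.hostname)                    # validator.w3.org

# ... while the second domain name is just data inside the query string.
inner = parse_qs(parts.query)["uri"][0]
print(inner)                             # http://恵比寿駅.jp/
print(urlsplit(inner).hostname)          # 恵比寿駅.jp
```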

> I think this is unfortunate and a pretty drastic change to the IRI document, but I don't think we're going to make progress if we don't take the bull by the horns.

Before taking anything by the horns (or the tail, or whatever) I'd like
to know in great detail what exactly the actual (or pretended) bull is.

Regards,    Martin.

> Larry
> --
> http://larry.masinter.net
>
>

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@...


RE: IDNA and IRI document way forward

by Larry Masinter-3

I confess that I'm just coming back up to speed on the
issues, and hope you'll forgive me for missing some of
the history.

It seems there are at least two communities (IDN/IDNA and
IRI/WEB) which should have been working together for
the past many years, haven't been, and we're now facing
some difficulties in bringing their perspectives together,
especially when those perspectives have been built
into long-standing and finely argued documents.

I'm not entirely sure of the use case and difficulties,
which I will try to track down in more detail.

Just as personal speculation, however,
I could easily imagine some problems if it were
possible to register domain names which actually
contained percent-hex-hex sequences.

www.%77%33.org vs www.w3.org?
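
To make the comparison concrete, here is a minimal sketch using Python's standard urllib.parse. It only shows that the two spellings differ as strings and coincide after percent-decoding; it says nothing about what any registry or resolver actually accepts.

```python
from urllib.parse import unquote, urlsplit

escaped = "http://www.%77%33.org/"
plain = "http://www.w3.org/"

# urlsplit() does not percent-decode, so the two hosts differ as strings ...
print(urlsplit(escaped).hostname)             # www.%77%33.org

# ... but they are identical once the escapes are decoded.
print(unquote(urlsplit(escaped).hostname))    # www.w3.org
```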

Perhaps that would be a problem not just for IRIs
but for other kinds of processing too.  Can this
be disallowed at the URI parsing level? Only at
the IRI level?

I see the difficulties of creating a provision for
scheme-specific parsing and restrictions on host names
containing %xx hex-encoded bytes in URIs are even
greater than what I imagined.


> That would be
> http://validator.w3.org/check?uri=http://恵比寿駅.jp/

I'm sure there are difficulties even in circumstances that
don't use "?", but this is especially difficult since the
HTML-URL/HREF/WebAddress handling of non-ASCII query parameters
adds some ambiguity to the translation of this into URI space.

> It's very clearly impossible to rule this out.

Difficult, but not impossible.

> But even before that, doing scheme-wise processing
>  kills the U in URIs.

And the I in Internationalized and several other things. Let's
stick to identifying issues and alternatives.

>> I think this is unfortunate and a pretty drastic change
>> to the IRI document, but I don't think we're going to make
>> progress if we don't take the bull by the horns.
>
> Before taking anything by the horns (or the tail, or whatever) I'd like
> to know in great detail what exactly the actual (or pretended) bull is.

And if there are two bulls there would be
four horns, a Tetralemma, and quite a bit of BS.

Larry
--
http://larry.masinter.net


Re: IDNA and IRI document way forward

by Vint Cerf

i do not believe it is possible under IDNA2008 (nor under IDNA2003?)  
to register percent-escaped strings in the domain name universe.  
double checking.

we did not deserve that last comment - but it was funny anyway.

v



Re: IDNA and IRI document way forward

by Charles Lindsey

On Wed, 29 Jul 2009 05:52:58 +0100, Larry Masinter <masinter@...>  
wrote:

> A note about the direction for URI and IRI:
>
> There are other uses of domain names in URIs currently; for example,  
> cid: (content-ID) strings often contain domain names.  I'm not sure but  
> it may be reasonable to *not* allow IRI forms, e.g., require that all  
> URIs not using scheme://host/path syntax not allow hex-encoded octets  
> above %7F, for example.

That doesn't seem right. A domain name (usually after an '@') in a  
Content-ID or a Message-ID is just a convention (in fact any meaningless  
string would do, provided it is believed to be unique). In particular, it  
is NEVER required to submit that supposed domain name to a DNS query. What  
IS required is that the cid or mid should always compare equal to other  
copies of itself, however it or those copies may have been mangled during  
transmission. Hence encoding it in punycode is never likely to be helpful,  
but encoding it in hex, even for octets greater than %7F, should always be  
harmless and easily reversible. This situation could well arise in the  
case of the news URI scheme, for example.
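
A minimal sketch of the point about reversibility: if identifiers are compared after percent-decoding to raw octets, then hex-encoding (including octets above %7F) does not change the outcome. The helper name same_id and the sample identifiers are made up for illustration, and this comparison rule is an assumption rather than something taken from the cid, mid or news specifications.

```python
from urllib.parse import unquote_to_bytes


def same_id(a: str, b: str) -> bool:
    """Hypothetical comparison: equal if the decoded octets are equal."""
    return unquote_to_bytes(a) == unquote_to_bytes(b)


original = "part1@bücher.example"
escaped = "part1%40b%C3%BCcher.example"    # hex-encoded in transit
lowercase = "part1%40b%c3%bccher.example"  # same escapes, lowercase hex

print(same_id(original, escaped))    # True
print(same_id(escaped, lowercase))   # True
print(same_id(original, "part2@bücher.example"))  # False
```

No DNS lookup or punycode conversion is involved at any point; the "domain" part is treated as opaque octets.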

Fortunately, current I18N efforts such as EAI have chosen to retain a  
strict ASCII syntax for cids and mids, but that might not remain true  
indefinitely.

--
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131     Web: http://www.cs.man.ac.uk/~chl
Email: chl@...      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5


Re: IDNA and IRI document way forward

by John C Klensin-3



--On Wednesday, July 29, 2009 05:09 -0400 Vint Cerf
<vint@...> wrote:

> i do not believe it is possible under IDNA2008 (nor under
> IDNA2003?) to register percent-escaped strings in the domain
> name universe. double checking.

That is correct.   It is not permitted under either version.

   john



Re: IDNA and IRI document way forward

by "Martin J. Dürst"

Hello Larry,

On 2009/07/29 16:07, Larry Masinter wrote:
> I confess that I'm just coming back up to speed on the
> issues, and hope you'll forgive me for missing some of
> the history,

No problem with that.

> It seems there are at least two communities (IDN/IDNA and
> IRI/WEB) which should have been working together for
> the past many years, haven't been, and we're now facing
> some difficulties in bringing their perspectives together,
> especially when those perspectives have been built
> into long-standing and finely argued documents.

I guess it's possible to see the situation this way, but I think it's
also possible to see the glass as (at least) half full. Browser vendors
have been implementing both IDNs and IRIs. Michel Suignard in particular
was involved in both efforts. I also was and am on both lists, and tried
to notice any relevant issues (which doesn't mean that they all made it
into the current draft).

> I'm not entirely sure of the use case and difficulties,
> which I will try to track down in more detail.

Great. Very much looking forward to that.

> Just as personal speculation, however,
> I could easily imagine some problems if it were
> possible to register domain names which actually
> contained percent-hex-hex sequences.
>
> www.%77%33.org vs www.w3.org?

That was the direction that my speculation would go too.
It is certainly true that the DNS as such easily accepts any byte
sequence in its labels. As far as I understand, that even includes null
bytes, because the packets that the DNS sends use an initial byte to
indicate the length of a label (taking two bits for other purposes, that
results in the label length limit of 63 bytes).
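
The length-prefixed label format described above can be sketched as follows. The function name encode_name is hypothetical, name compression is not implemented, and the overall 255-octet limit on a full name is not checked; the sketch only shows why arbitrary bytes (including null) fit in a label and where the 63-byte limit comes from.

```python
from typing import Iterable

# Six bits of the length byte carry the length; the top two bits are
# used for other purposes, so a label holds at most 63 bytes.
MAX_LABEL_LENGTH = 63


def encode_name(labels: Iterable[bytes]) -> bytes:
    """Encode labels in length-prefixed DNS wire format (no compression)."""
    out = bytearray()
    for label in labels:
        if not 1 <= len(label) <= MAX_LABEL_LENGTH:
            raise ValueError(f"label length {len(label)} out of range")
        out.append(len(label))  # length byte, top two bits are 00
        out += label            # arbitrary bytes, including 0x00
    out.append(0)               # zero-length root label ends the name
    return bytes(out)


print(encode_name([b"www", b"w3", b"org"]))
print(encode_name([b"a\x00b", b"example"]))
```

Because each label carries its own length, no byte value needs to act as a terminator or separator inside a label.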

However, "%77%33" never has been legal in URIs before RFC 3986 made it
mean the same as "w3". Also, "%77%33" is definitely not something any
DNS registrar would allow for registration, nor is it something that
other IETF protocols would accept as a host name if they put any
restrictions on it at all.

> Perhaps that would be a problem not just for IRIs
> but for other kinds of processing too.  Can this
> be disallowed at the URI parsing level? Only at
> the IRI level?

Well, "%77%33" as such would not need to be disallowed.
One would just have to write "%2577%2533" if one really had a need to
express it (which I guess would be the case extremely rarely).
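
The double escaping mentioned here can be checked with Python's standard urllib.parse; this is just a demonstration of the encoding arithmetic, with the literal label taken from the example above.

```python
from urllib.parse import quote, unquote

literal = "%77%33"  # the six characters themselves, not "w3"

# Escaping each "%" as "%25" protects the literal text.
escaped = quote(literal, safe="")
print(escaped)             # %2577%2533

# One round of decoding gives back the literal six characters ...
print(unquote(escaped))    # %77%33
# ... whereas decoding the unprotected form yields "w3".
print(unquote(literal))    # w3
```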

> I see the difficulties of creating a provision for
> scheme-specific parsing and restrictions on host names
> containing %xx hex-encoded bytes in URIs are even
> greater than what I imagined.

I very much have to agree that we have to weigh the difficulties
against the benefits.


>> That would be
>> http://validator.w3.org/check?uri=http://恵比寿駅.jp/

Or http://%E6%81%B5%E6%AF%94%E5%AF%BF%E9%A7%85.jp/ when escaped
(using UTF-8, as prescribed by RFC 3986/7).
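
The escaped form above can be reproduced with a short sketch: each character is encoded as UTF-8 and each resulting octet is written as %XX. Python's urllib.parse.quote is used here as a convenient stand-in for that step.

```python
from urllib.parse import quote, unquote

label = "恵比寿駅"

# quote() encodes as UTF-8 by default and percent-escapes every octet.
escaped_label = quote(label)
escaped_url = f"http://{escaped_label}.jp/"
print(escaped_url)

# The escaping is reversible.
print(unquote(escaped_label) == label)
```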

> I'm sure there are difficulties even in circumstances that
> don't use "?", but this is especially difficult since the
> HTML-URL/HREF/WebAddress handling of non-ASCII query parameters
> adds some ambiguity to the translation of this into URI space.

Yes. But we should try to have these localized to those contexts,
if possible.

>> It's very clearly impossible to rule this out.
>
> Difficult, but not impossible.
>
>> But even before that, doing scheme-wise processing
>>   kills the U in URIs.
>
> And the I in Internationalized and several other things. Let's
> stick to identifying issues and alternatives.

I agree.

Regards,   Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@...


Re: IDNA and IRI document way forward

by Erik van der Poel

Hi all,

First of all, I am glad that this discussion is happening, and
especially happy to see references to "current browser behavior". :-)

I am a bit concerned about a couple of things in the new IRI bis draft:

http://www.ietf.org/id/draft-duerst-iri-bis-06.txt

There are a couple of new terms in this document: Legacy Extended IRIs
(LEIRIs) and Hypertext References. The difference between LEIRIs and
IRIs appears to be whether or not you %-escape certain characters,
such as Space and Unicode Private Use characters. However, Space is an
ASCII character, so one would expect this to be more related to URIs,
which begs the question of why we don't have the term LEURI as well.

Instead of defining new terms like LEIRI and Hypertext References, I
wonder if it might be better to have Standards Track RFCs on URIs and
IRIs and a BCP (Best Current Practice) on the contexts in which these
*RIs occur. The BCP would cover such issues as whether or not to
%-escape Space and Private Use.

The BCP could also cover HTML-specific issues such as the character
encoding of the query part (after the question mark), or the
HTML-specific material could be moved back to W3C.
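
To illustrate why the character encoding of the query part matters, the sketch below escapes the same character under two encodings. The sample character and the choice of encodings are assumptions for the example; which encoding a browser actually uses for a given page is exactly the kind of context-dependent behavior under discussion.

```python
from urllib.parse import quote

value = "é"  # sample non-ASCII query value

# The same character produces different octets, and therefore different
# percent-escapes, depending on the encoding applied first.
as_utf8 = quote(value, encoding="utf-8")
as_latin1 = quote(value, encoding="iso-8859-1")

print(as_utf8)     # %C3%A9
print(as_latin1)   # %E9
```

A receiver that sees only the escaped form cannot tell which encoding was used without outside knowledge.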

Erik


Re: IDNA and IRI document way forward

by John Cowan

Erik van der Poel scripsit:

> There are a couple of new terms in this document: Legacy Extended IRIs
> (LEIRIs) and Hypertext References. The difference between LEIRIs and
> IRIs appears to be whether or not you %-escape certain characters,
> such as Space and Unicode Private Use characters. However, Space is an
> ASCII character, so one would expect this to be more related to URIs,
> which begs the question of why we don't have the term LEURI as well.

No need for it.  "LEIRI" is a new name for something that's been around
since the First Edition of XML 1.0.  Until now, the numerous W3C and
non-W3C specs have referred to one another to define it, and it's never
had a name of its own.  This has caused uncertainty in definition and
pointless dependencies.  In any case, nobody proposes using non-IRI
LEIRIs for anything; it's just that they are a physical possibility
in various places within XML, and getting the definition standardized
and in one place will enable a lot of standards rectification.

--
John Cowan  cowan@...   http://ccil.org/~cowan
Consider the matter of Analytic Philosophy.  Dennett and Bennett are well-known.
Dennett rarely or never cites Bennett, so Bennett rarely or never cites Dennett.
There is also one Dummett.  By their works shall ye know them.  However, just as
no trinities have fourth persons (Zeppo Marx notwithstanding), Bummett is hardly
known by his works.  Indeed, Bummett does not exist.  It is part of the function
of this and other e-mail messages, therefore, to do what they can to create him.