Re: Definitions limit on label length in UTF-8
Some additional points below.
On 2009/09/12 12:14, Martin J. Dürst wrote:
> Hello John,
>
> On 2009/09/12 0:47, John C Klensin wrote:
>>
>> --On Friday, September 11, 2009 17:37 +0900 "Martin J. Dürst" <duerst@...> wrote:
>>
>> I note that, while I haven't had time to respond, some of the discussion on the IRI list has included an argument that domain names in URIs cannot be restricted to A-label forms but must include %-escaped UTF-8 simply because those strings might not be public-DNS domain names but references to some other database or DNS environment.
>
> It's not 'simply because'. It's first and foremost because of the syntactic uniformity of URIs, and the fact that it's impossible to identify all domain names in a URI (the usual slot after the '//' is easy, scheme-specific processing (which is not what URIs and IRIs are about) may be able to deal with some of 'mailto', but what do you do about domain names in query parts?). Also, this syntax is part of RFC 3986, STD 66, a full IETF Standard.

Also, consider EAI (email address internationalization) and mailto: (or something like 'imailto:' if we go with a separate scheme name for internationalized addresses). If we use scheme-specific processing, we can convert IDN labels to punycode, but for EAI, that would be useless overkill, because EAI uses UTF-8. This works much better if we use %-encoding for the whole IRI->URI conversion than if we try to be 'smart'.

> Overall, it's just a question of what escaping convention should be used. URIs have their specific escaping convention (%-encoding), and DNS has its specific escaping convention (punycode).
>
> Also please note that the IRI spec doesn't prohibit using punycode when converting to URIs.
>
> In addition, please note that at least my personal implementation experience (adding IDN support to Amaya) shows that the overhead of supporting %-encoding in domain names in URIs is minimal, and helps streamline the implementation.
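A small sketch of the two escaping conventions mentioned above, applied to the same host name. The host bücher.example is a made-up example and to_a_label is a hypothetical helper. The sketch skips nameprep and IDNA mapping entirely (it only applies raw Punycode plus the xn-- prefix), so treat it as an illustration rather than a conforming IDNA implementation.

```python
from urllib.parse import quote

# Hypothetical example host; not taken from the thread.
host = "b\u00fccher.example"  # "bücher.example"

# URI convention: percent-encode the UTF-8 octets of the non-ASCII characters.
percent_encoded = quote(host, safe=".")


def to_a_label(label: str) -> str:
    """Hypothetical helper: raw Punycode plus ACE prefix, no nameprep/mapping."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


# DNS convention: convert each non-ASCII label separately.
ace = ".".join(to_a_label(label) for label in host.split("."))

print(percent_encoded)  # b%C3%BCcher.example
print(ace)              # xn--bcher-kva.example
```

The %-encoded form needs no knowledge of where the labels are, which is the uniformity argument; the Punycode form requires finding and splitting the domain name first.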
>> It seems to me that one cannot have it both ways -- either the application knows whether a string is a public DNS reference that must conform _only_ to IDNA requirements (but then can be restricted to A-labels) or the application does not know and therefore must conform to DNS requirements for label lengths.
>
> There is absolutely no need to restrict *all* references just because *some of them* may use other resolver systems with other length restrictions (which may be "63 octets per label when measured in UTF-8" or something completely different). It would be very similar to saying "Some compilers/linkers can only deal with identifiers 6 characters or shorter, so all longer identifiers are prohibited."

In addition, for IDNA2003 (which we are using for implementation experience), a label being in UTF-8 means that it may not yet have been nameprepped. That in turn implies that it may contain non-NFKC characters, which may take more or less space in UTF-8 than the nameprepped version. If there were indeed implementations that did conversion to length-string pairs in UTF-8 and only later applied punycode, there could be cases where an IDN label may or may not resolve depending on whether input was normalized or not. So it could e.g. resolve on a Linux or Windows system (these use precomposed characters mostly identical to NFC), but not resolve on a Mac (which uses decomposed characters, taking more space). Weird and improbable.

Regards, Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@...
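The normalization effect described above can be seen with a short sketch. The label (30 repetitions of "é") is a made-up example, and the idea of checking a 63-octet limit on the UTF-8 form before Punycode conversion is the hypothetical implementation strategy the message argues against, not something any real resolver is claimed to do.

```python
import unicodedata

LIMIT = 63  # octets per label

# Hypothetical label: 30 x LATIN SMALL LETTER E WITH ACUTE.
label = "\u00e9" * 30

# Precomposed form (typical for Linux/Windows input) vs. decomposed form
# (the case attributed to the Mac in the message above).
nfc = unicodedata.normalize("NFC", label)
nfd = unicodedata.normalize("NFD", label)

nfc_octets = len(nfc.encode("utf-8"))  # 2 octets per precomposed character
nfd_octets = len(nfd.encode("utf-8"))  # 1 octet for "e" + 2 for U+0301

print(nfc_octets, nfc_octets <= LIMIT)  # 60 True
print(nfd_octets, nfd_octets <= LIMIT)  # 90 False
```

The same user-visible label passes or fails a UTF-8 octet check depending only on which normalization form the input arrived in.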
RE: Definitions limit on label length in UTF-8
Martin J. Dürst writes:
> Hello John,
>
> [Dave, this is Cc'ed to you because of some discussion relating to draft-iab-idn-encoding-00.txt.]
>
> [I'm also cc'ing public-iri@... because of the IRI-related issue at the end.]
>
> [Everybody, please remove the Cc fields when they are unnecessary.]
>
> Overall, I'm afraid that on this issue, more convoluted explanations won't convince me nor anybody else, but I'll nevertheless try to answer your discussion below point-by-point.
>
> What I (and I guess others on this list) really would like to know is whether you have any CONCRETE reports or evidence regarding problems with IDN labels that are longer than 63 octets when expressed in UTF-8.
>
> Otherwise, Michel has put it much better than me: "given the lack of issues with IDNA2003 on that specific topic there are no reasons to introduce an incompatible change".
>
> On 2009/09/12 0:47, John C Klensin wrote:
> >
> > --On Friday, September 11, 2009 17:37 +0900 "Martin J. Dürst" <duerst@...> wrote:
> >
> >>> (John claimed that the email context required such a rule, but I did not bother to confirm that.)
> >> Given dinosaur implementations such as sendmail, I can understand the concern that some SMTP implementations may not easily be upgradable to use domain names with more than 255 octets or labels with more than 63 octets. In that case, I would have expected at least a security warning at http://tools.ietf.org/html/rfc4952#section-9 (EAI is currently written in terms of IDNA2003, and so there are no length restrictions on U-labels).
> >
> > I obviously have not been explaining this very well. The problem is not "dinosaur implementations"
>
> Okay, good.
>
> > but a combination of two things (which interact):
> >
> > (1) Late resolution of strings, possibly through APIs that resolve names in places that may not be the public DNS.
> > Systems using those APIs may keep strings in UTF-8 until very late in the process, even passing the UTF-8 strings into the interface or converting them to ACE form just before calling the interface. Either way, because other systems have come to rely on the 63 octet limit, strings longer than 63 characters pose a risk of unexpected problems. The issues with this are better explained in draft-iab-idn-encoding-00.txt, which I would strongly encourage people in this WG to go read.

Actually, systems using those APIs, which are the "standard" (with a lower case s) APIs, may keep strings in UTF-8 (or even UTF-16 for common but non-"standard" variants) until very late, and may keep strings in UTF-8 without ever converting them for some protocols, e.g. mDNS, that are defined to use UTF-8.

> I have indeed read draft-iab-idn-encoding-00.txt (I sent comments to the author and the IAB and copied this list). That document mentions the length restrictions, as essentially the only restrictions in DNS itself, rather than in things on top of it. That document also (well, mainly) discusses the issue of names being handed down into APIs in various forms (UTF-8, UTF-16, punycode, legacy encodings,...), and being resolved by various mechanisms (DNS, NetBIOS, mDNS, hosts file,...), and the problem that these mechanisms may use and expect different encodings for non-ASCII characters.
>
> However, I haven't found any mention, nor even a hint, in that document, of a need to restrict punycode labels to less than 63 octets when expressed in UTF-8.

I agree with the above characterization.

> The document mentions (as something that might happen, but shouldn't) that an application may pass a UTF-8 string to something like getaddrinfo, and that string may be passed directly to the DNS. First, if this happens, IDNA has already lost.
I don't agree with the "shouldn't", and certainly it was not the intent of draft-iab-idn-encoding-00.txt to actually state whether this "shouldn't" happen, but that it "can" happen (and perhaps "does"). There's also a potential argument in the doc that this is not harmful (see 2nd paragraph of section 4 for instance, and extrapolate from there).

> Second, whether the string is UTF-8 or pure ASCII, if the API isn't prepared to handle labels longer than 63 octets and overall names longer than 255 octets defensively (i.e. return something like 'not found'), then the programmer should be fired. Anyway, in that case, the problem isn't with UTF-8.
>
> What draft-iab-idn-encoding-00.txt essentially points out is that different name resolution services use different encodings for non-ASCII characters, and that currently different users (meaning applications) of a name resolution API may assume different encodings for non-ASCII characters, which creates all kinds of chances for errors. Some heuristics may help in some cases, but the right solution (as with all cases where characters, and in particular non-ASCII ones, are involved) is to clearly say where which encoding is used. A very simple example for this is GetAddrInfoW, which assumes UTF-16.
>
> The only potential problem that I see from the discussion in draft-iab-idn-encoding-00.txt is the following: Some labels containing non-ASCII characters that fit into 63 octets in punycode and therefore can be resolved with the DNS may not be resolvable with some other resolution service because that service may use a different encoding (and may or may not have different length limits).
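To make the last quoted point concrete, here is a rough sketch that measures one label in the three encodings discussed. The label (50 repetitions of HIRAGANA LETTER A) is a made-up example chosen because repeated characters compress well in Punycode. The A-label is built from raw Punycode without nameprep, and applying the same 63-octet limit to UTF-8 or UTF-16 is only an assumption about what some other resolution service might do.

```python
LIMIT = 63  # octets per label in the DNS

# Hypothetical label: 50 x HIRAGANA LETTER A (U+3042).
label = "\u3042" * 50

# Raw Punycode plus ACE prefix; no nameprep or IDNA validity checks.
a_label = "xn--" + label.encode("punycode").decode("ascii")

sizes = {
    "A-label (ASCII)": len(a_label),
    "UTF-8": len(label.encode("utf-8")),
    "UTF-16": len(label.encode("utf-16-be")),
}

for encoding, octets in sizes.items():
    verdict = "fits" if octets <= LIMIT else "exceeds"
    print(f"{encoding}: {octets} octets, {verdict} a {LIMIT}-octet limit")
```

The A-label stays under 63 octets while the UTF-8 form (150 octets) and UTF-16 form (100 octets) do not, so whether the label is usable depends on the encoding and limits of the service doing the lookup.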
> I have absolutely nothing against some text in a Security Considerations section or in Rationale pointing out that if you want to set up some name or label for resolution via multiple different resolution services, you have to take care that you choose your names and labels so that they meet the length restrictions for all those services. But that doesn't imply at all that we have to artificially restrict the length of punycode labels by counting octets in UTF-8.

Completely agree with all of the above. I think a brief discussion of this issue may make sense in the next version of draft-iab-idn-encoding, if we can get IAB consensus on text.

> > (2) The "conversion of DNS name formats" issue that has been extensively discussed as part of the question of alternate label separators (sometimes described in our discussions as "dot-oids"). Applications that use domain names, including domain names that are not going to be resolved (or even looked up), must be able to freely and accurately convert between DNS-external (dot-separated labels) and DNS-internal (length-string pairs) formats _without_ knowing whether they are IDNs or not.
>
> I'm not exactly sure what you mean here. If you want to say "without checking whether they contain xn-- prefixes and punycode or not", then I can agree, but that cannot motivate a UTF-8 based length restriction.

Right. I'm not sure why most "applications" would care about DNS-internal (length-string pairs) formats, rather than just NULL-terminated strings (containing dot-separated labels) that get passed to getaddrinfo-like functions. Most applications are (and should be) oblivious to the fact that DNS or some other protocol is used for resolving names.
> If you say that applications, rather than first converting U-label -> A-label and then converting from dot-separated to length-string notation, have to be able to first convert to length-string notation and then convert U-labels to A-labels, then I contend that nobody in their right mind would do it that way, and even less if "dot-oids" are involved. For a start, U-labels don't have a fixed encoding.
>
> > As discussed earlier, one of several reasons for that requirement is that, in non-IDNA-aware contexts, labels in non-IDNA-aware applications or contexts may be perfectly valid as far as the DNS is concerned, because the only restriction the DNS (and the normal label type) imposes is "octets".
>
> If and where somebody has binary labels, of course these binary labels must not be longer than 63 octets. But IDNA doesn't use binary labels, and doesn't stuff UTF-8 into DNS protocol slots, so for IDNA, any length restrictions on UTF-8 are irrelevant.
>
> > That length-string format has a hard limit of 63 characters that can be exceeded only if one can figure out how to get a larger number into six bits (see RFC1035, first paragraph of Section 3.1, and elsewhere).
>
> I very well know that the 63 octets (not characters) limit is a hard one. In the long run, one might imagine an extension to DNS that uses another label format, without this limitation, but there is no need at all to go there for this discussion.
>
> > If we permit longer U-label strings on the theory that the only important restriction is on A-labels, we introduce new error states into the format conversion process.
>
> For IDNA, only A-labels get sent through the DNS protocol, so only there, the length restrictions for labels are relevant. If somebody gets this wrong in the format conversion process (we currently don't have any reports on that), then that's their problem (and we can point it out in a Security section or so).
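For readers unfamiliar with the "length-string pairs" format under discussion, here is a minimal sketch of the conversion from dot-separated to RFC 1035 wire form, with the defensive length checks mentioned earlier in the thread. The function name to_wire is hypothetical, and the sketch assumes its input is already ASCII (A-labels or plain LDH labels); it makes no attempt at compression pointers or escaped dots.

```python
MAX_LABEL = 63   # length octet has two reserved high bits, leaving six for the count
MAX_NAME = 255   # total octets of the encoded name, including length octets


def to_wire(name: str) -> bytes:
    """Hypothetical helper: dot-separated ASCII name -> length-string pairs."""
    out = bytearray()
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")  # non-ASCII input raises UnicodeEncodeError
        if not 1 <= len(raw) <= MAX_LABEL:
            raise ValueError(f"label length {len(raw)} outside 1..{MAX_LABEL}")
        out.append(len(raw))  # one length octet ...
        out += raw            # ... followed by that many octets
    out.append(0)             # zero-length root label terminates the name
    if len(out) > MAX_NAME:
        raise ValueError(f"name is {len(out)} octets, limit is {MAX_NAME}")
    return bytes(out)


print(to_wire("www.example.com"))  # b'\x03www\x07example\x03com\x00'
```

An over-long label fails cleanly with an error instead of producing a corrupt length octet, which is the kind of defensive behaviour the thread expects from any API, whether the input was ASCII or converted from a U-label first.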
> > If this needs more explanation somewhere (possibly in Rationale), I'm happy to try to do that. But I think eliminating the restriction would cause far more problems than it is worth.
>
> It hasn't caused ANY problems in IDNA2003. There is nothing new in IDNA2008 that would motivate a change. *Running code*, one of the guidelines of the IETF, shows that the restriction is unnecessary.
>
> > I note that, while I haven't had time to respond, some of the discussion on the IRI list has included an argument that domain names in URIs cannot be restricted to A-label forms but must include %-escaped UTF-8 simply because those strings might not be public-DNS domain names but references to some other database or DNS environment.
>
> It's not 'simply because'. It's first and foremost because of the syntactic uniformity of URIs, and the fact that it's impossible to identify all domain names in a URI (the usual slot after the '//' is easy, scheme-specific processing (which is not what URIs and IRIs are about) may be able to deal with some of 'mailto', but what do you do about domain names in query parts?). Also, this syntax is part of RFC 3986, STD 66, a full IETF Standard.
>
> Overall, it's just a question of what escaping convention should be used. URIs have their specific escaping convention (%-encoding), and DNS has its specific escaping convention (punycode).
>
> Also please note that the IRI spec doesn't prohibit using punycode when converting to URIs.
>
> In addition, please note that at least my personal implementation experience (adding IDN support to Amaya) shows that the overhead of supporting %-encoding in domain names in URIs is minimal, and helps streamline the implementation.
> > It seems to me that one cannot have it both ways -- either the application knows whether a string is a public DNS reference that must conform _only_ to IDNA requirements (but then can be restricted to A-labels) or the application does not know and therefore must conform to DNS requirements for label lengths.
>
> There is absolutely no need to restrict *all* references just because *some of them* may use other resolver systems with other length restrictions (which may be "63 octets per label when measured in UTF-8" or something completely different). It would be very similar to saying "Some compilers/linkers can only deal with identifiers 6 characters or shorter, so all longer identifiers are prohibited."

I agree with that.

> > For our purposes, the only sensible way, at least IMO, to deal with this is to require conformance to both sets of rules, i.e., 63 character maximum for A-labels and 63 character maximum for U-labels.
>
> As far as I understand punycode, it's impossible to encode a Unicode character in less than one octet. This means that a maximum of 63 *characters* for U-labels is automatically guaranteed by a maximum of 63 characters/octets for A-labels.
>
> However, Defs clearly says "length in octets of the UTF-8 form", so I guess this was just a slip of your fingers.
>
> Regards, Martin.

-Dave
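The observation that Punycode never spends less than one octet per character can be checked on a few samples, and it yields simple (loose) bounds. The sample labels and the helper punycode_of are made up for illustration. The sketch uses raw Punycode without nameprep, and the octet bound relies only on the fact that UTF-8 uses at most 4 octets per code point; the real maximum is presumably smaller.

```python
ACE_PREFIX = "xn--"
MAX_A_LABEL_OCTETS = 63


def punycode_of(u_label: str) -> str:
    """Hypothetical helper: raw Punycode of a label, without the ACE prefix."""
    return u_label.encode("punycode").decode("ascii")


# Hypothetical sample labels: "bücher", "日本語", and 50 x U+3042.
samples = ["b\u00fccher", "\u65e5\u672c\u8a9e", "\u3042" * 50]

for u_label in samples:
    encoded = punycode_of(u_label)
    # Each code point costs at least one ASCII octet in the Punycode form.
    assert len(u_label) <= len(encoded)
    print(len(u_label), "code points ->", len(ACE_PREFIX + encoded), "octets as A-label")

# Bounds implied by a 63-octet A-label:
max_code_points = MAX_A_LABEL_OCTETS - len(ACE_PREFIX)  # 59
max_utf8_octets = 4 * max_code_points                   # 236 (loose upper bound)
print(max_code_points, max_utf8_octets)
```

So a limit on the A-label already bounds the U-label, both in characters and, more loosely, in UTF-8 octets, without a separate rule that counts UTF-8 octets.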
Re: Definitions limit on label length in UTF-8
Martin,
First of all, please understand that I'm much more agnostic on this issue than I think you assume. I'm trying to reflect what I believe I've been told by the WG and by various other communities on the subject but, if the WG says "change it", I will do so as editor and lose very little sleep about the subject.

I'll let Dave and Stuart address the API and eventual migration to pure UTF-8 issues.

I've been told that the ability to convert to length-value form (with a six-bit length) _before_ Punycode conversion (or in an IDNA-unaware, "octets only" implementation) is critical for the DNS community and for some security-related applications which store DNS-based identifiers in that form. But I have no personal implementation experience in either area, so perhaps Andrew and Paul can either speak to those issues or point us to someone who can.

As a sometime-implementer, I'm nervous about unlimited-length strings (as, based on recent interactions, are Stuart and Vint). But it seems to me that the string length here is bounded in any event -- with 59 characters of Punycode in an A-label, the upper limit on a UTF-8 or UTF-32 string cannot be over 236 characters and, I assume, would be considerably smaller. Especially if we can pin that number down (Adam?), I'd be a lot happier with text that said, essentially, "the limit is on the A-label string, but implementations should be aware that a maximum-length A-label can convert to a U-label of up to NNN characters" than saying "unlimited" and I think some others would be too.

All of that said, I'm not persuaded by the "there have been no issues raised, therefore there is no problem" argument. The reality is that, for mnemonic and typing convenience, people generally prefer shorter labels to longer ones. Other than in test demonstrations and as part of efforts to encode other types of information in DNS labels, I don't believe I've ever seen a 60+ character ASCII label in the wild.
Regardless of script, a few such labels in the same FQDN would not only be nearly impossible for most people to enter correctly but also would guarantee line-wrapping of DNS names in most screen-layout and documentation arrangements... never an ideal situation. That isn't an argument for banning labels of that length or longer; it does suggest a reason why no problems have been identified other than "people have been using this for years with no difficulty".

regards,
john

--On Saturday, September 12, 2009 12:14 +0900 "Martin J. Dürst" <duerst@...> wrote:

> Hello John,
>
> [Dave, this is Cc'ed to you because of some discussion relating to draft-iab-idn-encoding-00.txt.]
>
> [I'm also cc'ing public-iri@... because of the IRI-related issue at the end.]
>
> [Everybody, please remove the Cc fields when they are unnecessary.]
>
> Overall, I'm afraid that on this issue, more convoluted explanations won't convince me nor anybody else, but I'll nevertheless try to answer your discussion below point-by-point.
>
> What I (and I guess others on this list) really would like to know is whether you have any CONCRETE reports or evidence regarding problems with IDN labels that are longer than 63 octets when expressed in UTF-8.
>
> Otherwise, Michel has put it much better than me: "given the lack of issues with IDNA2003 on that specific topic there are no reasons to introduce an incompatible change".