Re: Changes to DOM3 Events Key Identifiers


by Mark Davis ☕

I want to point out that Unicode code points can go up to hex 10FFFF. The standard for \u is exactly 4 digits, so that one can intermix with characters and know where it terminates. There are a couple of schemes that are used to extend this to up to 6 digits, and still know where to terminate.

\UXXXXXXXX - C++, ICU
\UXXXXXX - C#
\u{xxxxxx} - Ruby

There needs to be some mechanism for extending to 6 digits. It would be best to use one of the above rather than a new one. (My personal favorite is Ruby's.)
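
To make the difference between the notations concrete, here is a minimal sketch that renders one code point in the 8-digit and braced forms listed above, next to the plain 4-digit form. The helper name escapeForms and its property names are hypothetical and not part of any spec or library.

```javascript
// Hypothetical helper: render one code point in some of the notations above.
function escapeForms(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("not a Unicode code point: " + codePoint);
  }
  const hex = codePoint.toString(16).toUpperCase();
  return {
    fixed8: "\\U" + hex.padStart(8, "0"), // \UXXXXXXXX (C++, ICU)
    braced: "\\u{" + hex + "}",           // \u{xxxxxx} (Ruby)
    // A single 4-digit \uXXXX escape only fits code points up to FFFF.
    fixed4: codePoint <= 0xFFFF ? "\\u" + hex.padStart(4, "0") : null,
  };
}

console.log(escapeForms(0x1D11E));
// { fixed8: '\\U0001D11E', braced: '\\u{1D11E}', fixed4: null }
```

The fixed-width form terminates after a known number of digits and the braced form terminates at the closing brace, which is why both can be mixed with ordinary characters.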

Mark


On Fri, Oct 30, 2009 at 00:32, Doug Schepers <schepers@...> wrote:
Hi, Folks-

(BCC to potentially affected groups: w3c-html-cg, public-webapps, public-i18n-core, wai-xtech, www-svg, public-forms, public-xhtml2, public-html@..., www-voice... please forward on to any relevant groups or individuals I may have missed, especially outside W3C.)

As editor of the DOM3 Events specification, I made what some may consider to be drastic changes in the most recent drafts:
 * I changed the syntax of the key identifier strings from "U+xxxx" (a plain string representing the Unicode code point) to "\uxxxx" (an escaped UTF-16 character string), based on content author and implementer feedback.
 * I renamed the "key identifier(s)" feature to "key value(s)".

I've mentioned these ideas before in DOM3 Events telcons, and finally decided to do it, after first consulting with the I18n WG, who generally approved of the scheme (though not without some comments about details that will need to be addressed and resolved).

The new string format should be easier to deal with for developers, and the new name reflects some confusion I've encountered when explaining what "key identifiers" are... the word "identifier" seems to evoke the concept of a unique identifier for a key, when in fact what the feature does is provide the most appropriate value given the state of keyboard modifiers and modes.  I have tried also to clarify this in the prose of the spec.

We are aware that there may already be implementations and specifications that rely on the previous string format and name (as well as links), back from when this was a W3C Note, and we do not make this decision lightly, but we do believe this is the right decision for a stable and internationalized keyboard interface going forward.  For those implementations and specifications that need the previous functionality and name, you may be able to reference the SVG Tiny 1.2 specification [2] instead, which does include the old Key Identifiers feature more or less intact from the previous definition, and is a stable W3C Recommendation.

You can review the changes in the most recent Editor's Draft [1].  The WebApps WG welcomes your feedback to the www-dom@... list.  This specification is still a work in progress, though we do hope to go to Last Call soon, so we are open to suggestions. (Note that the spec is mostly feature-complete, so new event types and other changes may have to wait for the next version, but send them on anyway.)

[1] http://dev.w3.org/2006/webapi/DOM-Level-3-Events/html/DOM3-Events.html#keyset
[2] http://www.w3.org/TR/SVGTiny12/svgudom.html#KeyIdentifiersSet


Regards-
-Doug Schepers, on behalf of the WebApps WG
Editor, DOM Level 3 Events
W3C Team Contact, SVG and WebApps WGs



Re: Changes to DOM3 Events Key Identifiers

by Doug Schepers-3

Hi, Mark-

Mark Davis ☕ wrote (on 10/30/09 12:22 PM):

> I want to point out that Unicode code points can go up to hex 10FFFF.
> The standard for \u is exactly 4 digits, so that one can intermix with
> characters and know where it terminates. There are a couple of schemes
> that are used to extend this to up to 6 digits, and still know where to
> terminate.
>
> \UXXXXXXXX - C++, ICU
> \UXXXXXX - C#
> \u{xxxxxx} - Ruby
>
> There needs to be some mechanism for extending to 6 digits. It would be
> best to use one of the above rather than a new one. (My personal
> favorite is Ruby's.)

The reason the "\u" escaped character sequence was chosen was that it is
the native ECMAScript escape notation, which is easy for browser-based
applications to use directly (i.e. they can inject it directly into the
markup as a character).

But, yes, this does have the cap of 4 digits, and I personally would
prefer to use a different escape mechanism... but only if one or both of
these 2 conditions obtains:

1) DOM3 Events implementations also update their Javascript engines to
be able to process the additional escape sequence (e.g. one of the ones
you mention above) in the same way they process the "\u" escape
sequence.  This is the better long-term solution, and I'd hope ECMA TC39
could be persuaded to add this to future ECMAScript specs.

2) Script authors could use a normalizing method (cf. convertKeyValue)
to "dumb down" the 6-digit escape sequence into the 4-digit format (by
converting to surrogate pairs when necessary).
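
The normalization described in option 2 can be sketched roughly as follows, assuming the Ruby-style \u{...} notation as input. The function names hex4 and toFourDigitEscapes are hypothetical; this is not the convertKeyValue method mentioned above, only an illustration of the surrogate-pair arithmetic.

```javascript
// Hypothetical normalizer: rewrite \u{X...} escapes as 4-digit \uXXXX escapes,
// splitting supplementary code points into a surrogate pair.
function hex4(unit) {
  return "\\u" + unit.toString(16).toUpperCase().padStart(4, "0");
}

function toFourDigitEscapes(text) {
  return text.replace(/\\u\{([0-9A-Fa-f]{1,6})\}/g, (match, digits) => {
    const cp = parseInt(digits, 16);
    if (cp > 0x10FFFF) return match;       // not a code point: leave untouched
    if (cp <= 0xFFFF) return hex4(cp);     // BMP: a single escape is enough
    const offset = cp - 0x10000;           // 20-bit value to split
    const high = 0xD800 + (offset >> 10);  // top 10 bits
    const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
    return hex4(high) + hex4(low);
  });
}

console.log(toFourDigitEscapes("\\u{1D11E}")); // \uD834\uDD1E
console.log(toFourDigitEscapes("\\u{41}"));    // \u0041
```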

Javascript is becoming increasingly important, and so is the need for
internationalized and localized language support.  With the new
font-linking enablers (including my favorite, WOFF [1]), and i18n domain
extension policy [2], we're going to see more use of languages I have no
chance of ever understanding, and I want DOM3 Events and ECMAScript to
be part of that.  I'd rather not introduce a not-very-good solution
(UTF-16) that we know would not meet all the needs of the world
community, just because of a (temporary?) circumstance with a vagary of
Javascript.

But, I also want this spec interoperably implemented... so, any solution
needs the buy-in of the implementers.  Any arguments on either side of
the coin would help make a more informed decision.

BTW, you stated a preference for the Ruby-style delimited escaped
characters... could you say why you prefer that?

[1] http://people.mozilla.com/~jkew/woff/woff-2009-09-16.html
[2] http://www.icann.org/en/announcements/announcement-30oct09-en.htm

Regards-
-Doug Schepers
W3C Team Contact, SVG and WebApps WGs


Re: Changes to DOM3 Events Key Identifiers

by Mark Davis ☕

If the target of this is JavaScript, then the alternative (which Java has also chosen) is to use the UTF-16 representation, wherein a pair of \u escapes represents each supplementary character (above FFFF). It just needs to be carefully documented.
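
As a small illustration of that representation, the sketch below writes one supplementary character as a pair of \u escapes and recombines the two code units into a code point. The helper name fromSurrogates is hypothetical.

```javascript
// Hypothetical helper: combine a high/low surrogate pair into one code point.
function fromSurrogates(high, low) {
  if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
    throw new RangeError("not a surrogate pair");
  }
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

const clef = "\uD834\uDD1E"; // U+1D11E written as two \u escapes
console.log(clef.length);    // 2 (code units, not characters)
console.log(fromSurrogates(clef.charCodeAt(0), clef.charCodeAt(1)).toString(16)); // 1d11e
```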

Mark




Re: Changes to DOM3 Events Key Identifiers

by John Cowan

Doug Schepers scripsit:

> 1) DOM3 Events implementations also update their Javascript engines to
> be able to process the additional escape sequence (e.g. one of the ones
> you mention above) in the same way they process the "\u" escape
> sequence.  This is the better long-term solution, and I'd hope ECMA TC39
> could be persuaded to add this to future ECMAScript specs.

I doubt it, given that such escapes are usually programmatically generated.
In any case, ECMAScript is firmly committed to a 16-bit character model.

--
A rabbi whose congregation doesn't want         John Cowan
to drive him out of town isn't a rabbi,         http://www.ccil.org/~cowan
and a rabbi who lets them do it                 cowan@...
isn't a man.    --Jewish saying


Re: Changes to DOM3 Events Key Identifiers

by Mark Davis ☕

Java is committed to 16-bit code units as well, but a relatively small number of additions enabled effective handling of UTF-16 text.

Mark




Re: Changes to DOM3 Events Key Identifiers

by John Cowan

Mark Davis ☕ scripsit:

> Java is committed to 16-bit code units as well, but a relatively small
> number of additions enabled effective handling of UTF-16 text.

Good luck getting any such additions into JavaScript, except as a library.

--
They tried to pierce your heart                 John Cowan
with a Morgul-knife that remains in the         http://www.ccil.org/~cowan
wound.  If they had succeeded, you would
become a wraith under the domination of the Dark Lord.         --Gandalf


RE: Changes to DOM3 Events Key Identifiers

by Phillips, Addison

> Doug Schepers scripsit:
>
> > 1) DOM3 Events implementations also update their Javascript engines to
> > be able to process the additional escape sequence (e.g. one of the ones
> > you mention above) in the same way they process the "\u" escape
> > sequence.  This is the better long-term solution, and I'd hope ECMA TC39
> > could be persuaded to add this to future ECMAScript specs.
>
> I doubt it, given that such escapes are usually programmatically generated.
> In any case, ECMAScript is firmly committed to a 16-bit character model.

ECMAScript's "firm commitment" to a 16-bit character model (i.e. UTF-16) is not the problem. Lack of support for supplementary characters (that is, those above 0xFFFF in Unicode), however, is a very real problem. No UTF-16 process can escape the fact that, even if one applies a short-sighted limit to BMP characters, a character may require more than one code point to encode. As long as it is clear that DOM3 Events key identifiers are a string containing possibly more than one code point (and potentially more than one character), the escaping syntax is just a detail of the language.
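
A short sketch of the distinction between code units and code points, assuming an engine whose string iterator walks code points (ECMAScript 2015 or later). The function name describe and the sample strings are illustrative only and are not taken from the spec.

```javascript
// Hypothetical helper: count 16-bit code units and code points in a string.
function describe(value) {
  // Array.from uses the string iterator, which yields whole code points.
  const codePoints = Array.from(value, (ch) => ch.codePointAt(0));
  return { codeUnits: value.length, codePoints: codePoints.length };
}

console.log(describe("a"));            // { codeUnits: 1, codePoints: 1 }
console.log(describe("\uD834\uDD1E")); // { codeUnits: 2, codePoints: 1 }
console.log(describe("e\u0301"));      // { codeUnits: 2, codePoints: 2 }
```

The last string is an "e" followed by a combining acute accent: one character as a user perceives it, but two code points, even though both are in the BMP.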

Addison

Re: Changes to DOM3 Events Key Identifiers

by John Cowan

Phillips, Addison scripsit:

> ECMAScript's "firm commitment" to a 16-bit character model (i.e. UTF-16)

If only.

JavaScript and JSON strings aren't sequences of characters, they are
sequences of 16-bit unsigned integers.  If you happen to want to interpret
them as UTF-16, you are free to do so, but there is not and never will
be any guarantee that all strings are well-formed UTF-16.  What's more,
the built-in JSON serializer provided by ECMAScript 5th edition does
not generate escape sequences for isolated surrogate codepoints, so that
some strings will be written out in CESU-8 rather than UTF-8.
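
The first point can be seen directly in a few lines. This is only an illustration of strings as sequences of 16-bit values; the variable names are arbitrary.

```javascript
// A string may hold any sequence of 16-bit values, including lone surrogates.
const lone = "\uD834"; // a high surrogate with no low surrogate after it
console.log(lone.length);                     // 1
console.log(lone.charCodeAt(0).toString(16)); // d834

// Built from raw code units: 'A', an unpaired low surrogate, 'B'.
const units = String.fromCharCode(0x0041, 0xDC00, 0x0042);
console.log(units.length);                    // 3
console.log(units.charCodeAt(1) === 0xDC00);  // true
```

Neither string is well-formed UTF-16, yet both are ordinary string values. Later editions of ECMAScript changed JSON.stringify to emit \u escapes for lone surrogates, so the serializer behavior described above applies to the 5th edition.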

Worse yet, the JSON RFC is self-contradictory, with the result that it's
not even clear that CESU-8-encoded JSON is illegal.

--
Let's face it: software is crap. Feature-laden and bloated, written under
tremendous time-pressure, often by incapable coders, using dangerous
languages and inadequate tools, trying to connect to heaps of broken or
obsolete protocols, implemented equally insufficiently, running on
unpredictable hardware -- we are all more than used to brokenness.
                   --Felix Winkelmann


Re: Changes to DOM3 Events Key Identifiers

by Mark Davis ☕

> If you happen to want to interpret
> them as UTF-16, you are free to do so, but there is not and never will
> be any guarantee that all strings are well-formed UTF-16.

You never have that guarantee, any more than you have the guarantee that a source purporting to be UTF-8 is in fact well formed. All conscientious recipients need to check the data -- if they are sensitive to ill-formed text. Luckily, the impact of ill-formed UTF-16 is vastly less than that of ill-formed UTF-8.
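
A recipient-side check of the kind described here might look like the following sketch. The function name isWellFormedUtf16 is hypothetical; it only verifies that every surrogate code unit is part of a correctly ordered pair.

```javascript
// Hypothetical check: true when every surrogate in the string is properly paired.
function isWellFormedUtf16(text) {
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // A high surrogate must be followed immediately by a low surrogate.
      const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return false;
      i++; // skip the low surrogate that completes the pair
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return false; // low surrogate with no preceding high surrogate
    }
  }
  return true;
}

console.log(isWellFormedUtf16("\uD834\uDD1E")); // true
console.log(isWellFormedUtf16("\uD834"));       // false
```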

Mark

