StringEncoding open issues

by Joshua Bell-3

Regarding the API proposal at: http://wiki.whatwg.org/wiki/StringEncoding

It looks like we've got some developer interest in implementing this, and
need to nail down the open issues. I encourage folks to look over the
"Resolved" issues in the wiki page and make sure the resolutions - gathered
from loose consensus here and offline discussion - are truly resolved or if
anything is not future-proof and should block implementations from
proceeding. Also, look at the "Notes to Implementers" section; this should
be non-controversial but may be non-obvious.

This leaves two open issues: behavior on encoding error, and handling of
Byte Order Marks (BOMs).

== Encoding Errors ==

The proposal builds on Anne's
http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html encoding spec,
which defines when encodings should emit an encoder error. In that spec
(which describes the existing behavior of Web browsers) encoders are used
in a limited fashion, e.g. for encoding form results before submission via
HTTP, and hence the cases are much more restricted than the errors
encountered when browsers are asked to decode content from the wild. As
noted, the encoding process could terminate when an error is emitted.
Alternately (and as is necessary for forms, etc) there is a
use-case-specific escaping mechanism for non-encodable code points.

The proposed TextDecoder object takes a TextDecoderOptions options with a
|fatal| flag that controls the decode behavior in case of error - if
|fatal| is unset (default) a decode error produces a fallback character
(U+FFFD); if |fatal| is set then a DOMException is raised instead.
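
To make the two modes concrete, here is a minimal sketch. It assumes the TextDecoder that later shipped in browsers and Node.js, which may differ from the proposal as written here; in particular, shipped implementations throw a TypeError rather than a DOMException when |fatal| is set.

```javascript
// Bytes for "a" followed by 0xFF, which is never valid in UTF-8.
const bytes = new Uint8Array([0x61, 0xff]);

// Default (fatal unset): the bad byte becomes the fallback character U+FFFD.
const lenient = new TextDecoder('utf-8').decode(bytes);

// fatal set: decode() throws instead of substituting.
let threw = false;
try {
  new TextDecoder('utf-8', { fatal: true }).decode(bytes);
} catch (e) {
  threw = true; // a TypeError in shipped implementations (assumption noted above)
}
```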

No such option is currently proposed for the TextEncoder object; the
proposal dictates that a DOMException is thrown if the encoder emits an
error. I believe this is sufficient for V1, but want feedback. For V2 (or
now, if desired), the API could be extended to accept an options object
allowing for some/all of these cases:

* Don't throw, instead emit a standard/encoding-specific replacement
character (e.g. '?')
* Don't throw, instead emit a fixed placeholder character (byte?) sequence
* Don't throw, instead call a user-defined callback and allow it to produce
a replacement "escaped" character sequence, e.g. "&#xXXXX;"

The latter seems the most flexible (superset of the rest) but is probably
overkill for now. Since it can be added in easily later, can we defer until
we have implementer and user feedback?
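
A rough sketch of the callback option, to show why it is a superset of the other two. Everything here is hypothetical: encodeAscii and its onUnencodable parameter are names made up for illustration, not part of the proposal, and plain 7-bit ASCII stands in for a legacy encoding.

```javascript
// Hypothetical helper, not part of the proposal: encodes to 7-bit ASCII and
// asks a callback for a replacement string when a code point does not fit.
function encodeAscii(str, onUnencodable = () => '?') {
  const out = [];
  for (const ch of str) {              // iterates by code point, not code unit
    const cp = ch.codePointAt(0);
    if (cp < 0x80) {
      out.push(cp);
      continue;
    }
    // The replacement itself is assumed to be pure ASCII.
    for (const r of onUnencodable(cp)) out.push(r.charCodeAt(0) & 0x7f);
  }
  return Uint8Array.from(out);
}

// Escaping callback in the "&#xXXXX;" style mentioned above.
const escapeAsNcr = (cp) => `&#x${cp.toString(16).toUpperCase()};`;

const plain = encodeAscii('h\u00e9');                // default: fixed '?' replacement
const escaped = encodeAscii('h\u00e9', escapeAsNcr); // callback: escaped sequence
```

The default argument covers the "standard replacement character" case, and a callback returning a constant string covers the "fixed placeholder sequence" case.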


== Byte Order Marks (BOMs) ==

Once again, the proposal builds on Anne's
http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html encoding spec,
which describes the existing behavior of Web browsers. In the wild,
browsers deal with a variety of mechanisms for indicating the encoding of
documents (server headers, meta tags, XML preludes, etc), many of which are
blatantly incorrect or contradictory. One form is fortunately rarely wrong
- if the document is encoded in UTF-8, UTF-16LE or UTF-16BE and includes
the byte order mark (the encoding-specific serialization of U+FEFF). This
is built into the Encoding spec - given a byte sequence to decode and an
encoding label, the label is ignored if the sequence starts with one of the
three UTF BOMs, and the BOM-indicated encoding is used to decode the rest
of the stream.
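
For reference, the BOM check itself is small. A sketch, with sniffBOM being a made-up name and the returned strings chosen for illustration rather than taken from the Encoding spec:

```javascript
// Returns the encoding implied by a leading BOM, or null if there is none.
function sniffBOM(bytes) {
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return 'utf-8';
  }
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) return 'utf-16le';
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) return 'utf-16be';
  return null;
}
```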

The proposed API will have different uses, so it is unclear that this is
necessary or desirable.

At a minimum, it is clear that:

* If one of the UTF encodings is specified AND the BOM matches then the
leading BOM character (U+FEFF) MUST NOT be emitted in the output character
sequence (i.e. it is silently consumed)

Less clear is the behavior in these two cases:

* If one of the UTF encodings is specified AND a different BOM is
present (e.g. UTF-16LE but a UTF-16BE BOM)
* If one of the non-UTF encodings is specified AND a UTF BOM is present

Options include:
* Nothing special - decoder does what it will with the bytes, possibly
emitting garbage, possibly throwing
* Raise a DOMException
* Switch the decoder from the user-specified encoding to the BOM-specified
encoding

The latter seems the most helpful when the proposed API is used as follows:

var s = TextDecoder().decode(bytes); // handles UTF-8 w/o BOM and any UTF w/ BOM

... but it does seem a little weird when used like this:

var d = TextDecoder('euc-jp');
assert(d.encoding === 'euc-jp');
var s = d.decode(new Uint8Array([0xFE]), {stream: true});
assert(d.encoding === 'euc-jp');
assert(s.length === 0); // can't emit anything until BOM is definitely passed
s += d.decode(new Uint8Array([0xFF]), {stream: true});
assert(d.encoding === 'utf-16be'); // really?
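
The "emit nothing until enough bytes have arrived" part of streaming can be tried without any BOM involved. A sketch assuming the TextDecoder that later shipped in browsers and Node.js, where (as far as I know) the encoding attribute is fixed at construction and never switches:

```javascript
const d = new TextDecoder('utf-8');

// U+20AC (euro sign) is E2 82 AC in UTF-8; split it across two chunks.
let s = d.decode(new Uint8Array([0xe2, 0x82]), { stream: true });
const afterFirstChunk = s.length; // nothing can be emitted yet

s += d.decode(new Uint8Array([0xac]), { stream: true });
s += d.decode(); // final call without {stream: true} flushes the decoder
```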

Re: StringEncoding open issues

by Jonas Sicking-2

On Mon, Aug 6, 2012 at 11:29 AM, Joshua Bell <jsbell@...> wrote:

> Regarding the API proposal at: http://wiki.whatwg.org/wiki/StringEncoding
>
> It looks like we've got some developer interest in implementing this, and
> need to nail down the open issues. I encourage folks to look over the
> "Resolved" issues in the wiki page and make sure the resolutions - gathered
> from loose consensus here and offline discussion - are truly resolved or if
> anything is not future-proof and should block implementations from
> proceeding. Also, look at the "Notes to Implementers" section; this should
> be non-controversial but may be non-obvious.
>
> This leaves two open issues: behavior on encoding error, and handling of
> Byte Order Marks (BOMs)
>
> == Encoding Errors ==
>
> The proposal builds on Anne's
> http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html encoding spec,
> which defines when encodings should emit an encoder error. In that spec
> (which describes the existing behavior of Web browsers) encoders are used
> in a limited fashion, e.g. for encoding form results before submission via
> HTTP, and hence the cases are much more restricted than the errors
> encountered when browsers are asked to decode content from the wild. As
> noted, the encoding process could terminate when an error is emitted.
> Alternately (and as is necessary for forms, etc) there is a
> use-case-specific escaping mechanism for non-encodable code points.
>
> The proposed TextDecoder object takes a TextDecoderOptions options with a
> |fatal| flag that controls the decode behavior in case of error - if
> |fatal| is unset (default) a decode error produces a fallback character
> (U+FFFD); if |fatal| is set then a DOMException is raised instead.
>
> No such option is currently proposed for the TextEncoder object; the
> proposal dictates that a DOMException is thrown if the encoder emits an
> error. I believe this is sufficient for V1, but want feedback. For V2 (or
> now, if desired), the API could be extended to accept an options object
> allowing for some/all of these cases;

Not introducing options for the encoder for V1 sounds like a good idea
to me. However I would definitely prefer if the default for encoding
matches the default for decoding and used replacement characters
rather than threw an exception.

This also matches the WebSocket spec, which recently changed from
throwing to using replacement characters for encoding.

The reason WebSocket was changed was because it's relatively easy to
make a mistake and cause a UTF-16 surrogate pair to be cut in two, which
results in an invalidly encoded DOMString. The problem with this is
that it's very data dependent and so might not happen on the
developer's computer, but only in the wild when people write text
which uses non-BMP characters. In such cases throwing an exception
will likely result in more breakage than using a replacement
character.
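
A small sketch of how that mistake happens, assuming the UTF-8 TextEncoder that later shipped in browsers and Node.js (which does use a replacement character here):

```javascript
const text = 'ok \u{1F600}';        // ends with a non-BMP character (two code units)
const truncated = text.slice(0, 4); // naive truncation cuts the surrogate pair in half

// The lone high surrogate is encoded as U+FFFD (EF BF BD) rather than throwing.
const bytes = new TextEncoder().encode(truncated);
```

With pure-ASCII test data the slice is harmless, which is why this tends to show up only in the wild.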

> * Don't throw, instead emit a standard/encoding-specific replacement
> character (e.g. '?')

Yes, using the replacement character sounds good to me.

> * Don't throw, instead emit a fixed placeholder character (byte?) sequence
> * Don't throw, instead call a user-defined callback and allow it to produce
> a replacement "escaped" character sequence, e.g. "&#xXXXX;"
>
> The latter seems the most flexible (superset of the rest) but is probably
> overkill for now. Since it can be added in easily later, can we defer until
> we have implementer and user feedback?

Indeed, we can explore these options if the need arises.

> == Byte Order Marks (BOMs) ==
>
> Once again, the proposal builds on Anne's
> http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html encoding spec,
> which describes the existing behavior of Web browsers. In the wild,
> browsers deal with a variety of mechanisms for indicating the encoding of
> documents (server headers, meta tags, XML preludes, etc), many of which are
> blatantly incorrect or contradictory. One form is fortunately rarely wrong
> - if the document is encoded in UTF-8, UTF-16LE or UTF-16BE and includes
> the byte order mark (the encoding-specific serialization of U+FEFF). This
> is built into the Encoding spec - given a byte sequence to decode and an
> encoding label, the label is ignored if the sequence starts with one of the
> three UTF BOMs, and the BOM-indicated encoding is used to decode the rest
> of the stream.
>
> The proposed API will have different uses, so it is unclear that this is
> necessary or desirable.
>
> At a minimum, it is clear that:
>
> * If one of the UTF encodings is specified AND the BOM matches then the
> leading BOM character (U+FEFF) MUST NOT be emitted in the output character
> sequence (i.e. it is silently consumed)

Agreed.

> Less clear is the behavior in these two cases:
>
> * If one of the UTF encodings is specified AND a different BOM is
> present (e.g. UTF-16LE but a UTF-16BE BOM)
> * If one of the non-UTF encodings is specified AND a UTF BOM is present
>
> Options include:
> * Nothing special - decoder does what it will with the bytes, possibly
> emitting garbage, possibly throwing
> * Raise a DOMException
> * Switch the decoder from the user-specified encoding to the BOM-specified
> encoding
>
> The latter seems the most helpful when the proposed API is used as follows:
>
> var s = TextDecoder().decode(bytes); // handles UTF-8 w/o BOM and any UTF w/ BOM
>
> ... but it does seem a little weird when used like this;
>
> var d = TextDecoder('euc-jp');
> assert(d.encoding === 'euc-jp');
> var s = d.decode(new Uint8Array([0xFE]), {stream: true});
> assert(d.encoding === 'euc-jp');
> assert(s.length === 0); // can't emit anything until BOM is definitely passed
> s += d.decode(new Uint8Array([0xFF]), {stream: true});
> assert(d.encoding === 'utf-16be'); // really?

I would add the case of "no encoding was specified and a BOM was
detected" to the "clear" category, i.e. I think we clearly should
honor the BOM here.

As for what to do if an encoding is specified (UTF or otherwise) and
a different BOM is detected, I agree that is much less clear. I don't
have a strong opinion on whether it should throw or attempt to decode
(possibly into garbage). But silently ignoring the specified encoding
feels strange. In Gecko we decided to throw since that would leave the
most options for adjusting to spec changes.

/ Jonas

Re: StringEncoding open issues

by Glenn Maynard

I agree with Jonas that encoding should just use a replacement character
(U+FFFD for Unicode encodings, '?' otherwise), and that we should put off
other modes (eg. exceptions and user-specified replacement characters)
until there's a clear need.

My intuition is that encoding DOMString to UTF-16 should never have errors;
if there are dangling surrogates, pass them through unchanged.  There's no
point in using a placeholder that says "an error occurred here", when the
error can be passed through in exactly the same form (not possible with eg.
DOMString->SJIS).  I don't feel strongly about this only because outputting
UTF-16 is so rare to begin with.

On Mon, Aug 6, 2012 at 1:29 PM, Joshua Bell <jsbell@...> wrote:

> - if the document is encoded in UTF-8, UTF-16LE or UTF-16BE and includes
> the byte order mark (the encoding-specific serialization of U+FEFF).


This rarely detects the wrong type, but that doesn't mean it's not the
wrong answer.  If my input is meant to be UTF-8, and someone hands me
BOM-marked UTF-16, I want it to fail in the same way it would if someone
passed in SJIS.  I don't want it silently translated.

On the other hand, it probably does make sense for UTF-16 to switch to
UTF-16BE, since that's by definition the original purpose of the BOM.

The convention iconv uses, which I think is a useful one, is decoding from
"UTF-16" means "try to figure out the encoding from the BOM, if any", and
"UTF-16LE" and "UTF-16BE" mean "always use this exact encoding".

> * If one of the UTF encodings is specified AND the BOM matches then the
> leading BOM character (U+FEFF) MUST NOT be emitted in the output character
> sequence (i.e. it is silently consumed)

It's a little weird that

data = readFile("user-supplied-file.txt"); // shortcutting for brevity
var s = new TextDecoder("utf-16").decode(data); // or utf-8
s = s.replace("a", "b");
var data2 = new TextEncoder("utf-16").encode(s);
writeFile("user-supplied-file.txt", data2);

causes the BOM to be quietly stripped away.  Normally if you're modifying a
file, you want to pass through the BOM (or lack thereof) untouched.

One way to deal with this could be:

var decoder = new TextDecoder("utf-16");
var s = decoder.decode(data);
s = s.replace("a", "b");
var data2 = new TextEncoder(decoder.encoding).encode(s);

where decoder.encoding is eg. "UTF-16LE-BOM" if a BOM was present, thus
preserving both the BOM and (for UTF-16) endianness.  I don't actually like
this, though, because I don't like the idea of decoder.encoding changing
after the decoder has already been constructed.

I think I agree with just stripping it, and people who want to preserve
BOMs on write-through can jump the hoops manually (which aren't terribly
hard).
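
For what it's worth, the manual hoops look roughly like this. A sketch for UTF-8 only, assuming the TextDecoder/TextEncoder that later shipped (where the encoder only produces UTF-8 and the decoder strips a leading BOM by default); rewriteUtf8 and hasUtf8BOM are made-up names:

```javascript
const UTF8_BOM = [0xef, 0xbb, 0xbf];

function hasUtf8BOM(bytes) {
  return bytes.length >= 3 && UTF8_BOM.every((b, i) => bytes[i] === b);
}

// Decode, edit, and re-encode UTF-8 while keeping the BOM (or lack of one).
function rewriteUtf8(data, edit) {
  const hadBOM = hasUtf8BOM(data);
  const s = new TextDecoder('utf-8').decode(data); // a leading BOM is stripped
  const body = new TextEncoder().encode(edit(s));
  if (!hadBOM) return body;
  const out = new Uint8Array(UTF8_BOM.length + body.length);
  out.set(UTF8_BOM, 0);
  out.set(body, UTF8_BOM.length);
  return out;
}
```

Usage would be something like rewriteUtf8(data, s => s.replace("a", "b")), with data being the file's bytes as a Uint8Array.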


Another issue is "new TextDecoder('ascii').encoding" (and ISO-8859-1)
giving .encoding = "windows-1252".  That's strange, even when you know why
it's happening.

Is there any reason to expose the actual "primary" names?  It's not clear
that the "name" column in the Encoding spec is even intended to be exposed
to APIs; they look more like labels for specs to refer to internally.
(Anne?)  If there's no pressing reason to expose this, I'd suggest that the
.encoding attribute simply return the name that was passed to the
constructor.

It's still not ideal (it's weird that asking for ASCII gives you something
other than ASCII in the first place), but it at least seems a bit less
strange.  The "nice" fix would be to implement actual ASCII, ISO-8859-1,
ISO-8859-9, etc. charsets, but that just means extra implementation work
(and some charset proliferation) without use cases.
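
A quick way to see what .encoding reports for a given label. This is a sketch against the TextDecoder that later shipped; to my knowledge those implementations report "windows-1252" for both 'ascii' and 'iso-8859-1', but the helper (reportedEncoding, a made-up name) returns null when a runtime does not support a label at all, so treat the exact results as runtime-dependent.

```javascript
// Returns the name a shipped TextDecoder reports for a label, or null if the
// runtime does not support that label (e.g. a build without full ICU data).
function reportedEncoding(label) {
  try {
    return new TextDecoder(label).encoding;
  } catch (e) {
    return null;
  }
}

const names = ['ascii', 'iso-8859-1', 'utf8'].map(reportedEncoding);
```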

--
Glenn Maynard

Re: StringEncoding open issues

by Joshua Bell-3

On Mon, Aug 6, 2012 at 5:06 PM, Glenn Maynard <glenn@...> wrote:

> I agree with Jonas that encoding should just use a replacement character
> (U+FFFD for Unicode encodings, '?' otherwise), and that we should put off
> other modes (eg. exceptions and user-specified replacement characters)
> until there's a clear need.
>
> My intuition is that encoding DOMString to UTF-16 should never have errors;
> if there are dangling surrogates, pass them through unchanged.  There's no
> point in using a placeholder that says "an error occurred here", when the
> error can be passed through in exactly the same form (not possible with eg.
> DOMString->SJIS).  I don't feel strongly about this only because outputting
> UTF-16 is so rare to begin with.
>
> On Mon, Aug 6, 2012 at 1:29 PM, Joshua Bell <jsbell@...> wrote:
>
> > - if the document is encoded in UTF-8, UTF-16LE or UTF-16BE and includes
> > the byte order mark (the encoding-specific serialization of U+FEFF).
>
>
> This rarely detects the wrong type, but that doesn't mean it's not the
> wrong answer.  If my input is meant to be UTF-8, and someone hands me
> BOM-marked UTF-16, I want it to fail in the same way it would if someone
> passed in SJIS.  I don't want it silently translated.
>
> On the other hand, it probably does make sense for UTF-16 to switch to
> UTF-16BE, since that's by definition the original purpose of the BOM.
>
> The convention iconv uses, which I think is a useful one, is decoding from
> "UTF-16" means "try to figure out the encoding from the BOM, if any", and
> "UTF-16LE" and "UTF-16BE" mean "always use this exact encoding".


Let me take a crack at making this into an algorithm:

In the TextDecoder constructor:

   - If encoding is not specified, set an internal useBOM flag
   - If encoding is specified and is a case insensitive match for "utf-16"
   set an internal useBOM flag.

NOTE: This means if "utf-8", "utf-16le" or "utf-16be" is explicitly
specified the flag is not set.

When decode() is called

   - If useBOM is set and the stream offset is 0, then
      - If there are not enough bytes to test for a BOM then return without
      emitting anything (NOTE: if not streaming an EOF byte would be present in
      the stream which would be a negative match for a BOM)
      - If encoding is "utf-16" and the first bytes match 0xFF 0xFE or 0xFE
      0xFF then set current encoding to "utf-16" or "utf-16be" respectively and
      advance the stream past the BOM. The current encoding is used until the
      stream is reset.
      - Otherwise, if the first bytes match 0xFF 0xFE, 0xFE 0xFF, or 0xEF
      0xBB 0xBF then set current encoding to "utf-16", "utf-16be" or "utf-8"
      respectively and advance the stream past the BOM. The current encoding is
      used until the stream is reset.
   - Otherwise, if useBOM is not set and the stream offset is 0, then if the
   encoding is "utf-8", "utf-16" or "utf-16be"
      - If the first bytes match 0xFF 0xFE, 0xFE 0xFF, or 0xEF 0xBB 0xBF
      then let detected encoding be "utf-16", "utf-16be" or "utf-8"
      respectively. If the detected encoding matches the object's encoding,
      advance the stream past the BOM. Otherwise, if the fatal flag is set
      then throw an "EncodingError" DOMException. Otherwise, the decoding
      algorithm proceeds.
      - If there are not enough bytes to test for a BOM then return without
      emitting anything (NOTE: if not streaming an EOF byte would be inserted
      which would be a negative match for a BOM)

Working the "current encoding" switcheroo into the spec will require some
refactoring, so trying to get consensus here first.
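
The decode() steps above can be sketched roughly as follows. This is a non-streaming simplification under a few assumptions: resolveEncoding is a made-up helper, the "not enough bytes" step is omitted, the little-endian encoding is called 'utf-16le' here to keep it distinct from the "utf-16" constructor label, the second "Otherwise" is read as applying only when no encoding was specified, and a plain Error named "EncodingError" stands in for the DOMException.

```javascript
const BOMS = [
  { name: 'utf-8', sig: [0xef, 0xbb, 0xbf] },
  { name: 'utf-16le', sig: [0xff, 0xfe] },
  { name: 'utf-16be', sig: [0xfe, 0xff] },
];

// Decides which encoding to decode with and how many BOM bytes to skip.
// `label` is the lower-cased constructor argument, or undefined if omitted.
function resolveEncoding(label, bytes, fatal = false) {
  const found = BOMS.find(({ sig }) => sig.every((b, i) => bytes[i] === b));
  const useBOM = label === undefined || label === 'utf-16';

  if (useBOM) {
    // "utf-16" only honors the two UTF-16 BOMs; no label honors all three.
    const honored = found && (label === undefined || found.name !== 'utf-8');
    if (honored) return { encoding: found.name, skip: found.sig.length };
    return { encoding: label === undefined ? 'utf-8' : 'utf-16le', skip: 0 };
  }

  if (found && BOMS.some(({ name }) => name === label)) {
    if (found.name === label) return { encoding: label, skip: found.sig.length };
    if (fatal) {
      const err = new Error('BOM does not match requested encoding');
      err.name = 'EncodingError'; // stand-in for the proposed DOMException
      throw err;
    }
  }
  // Mismatch without fatal, or a non-UTF label: decode the bytes as given.
  return { encoding: label, skip: 0 };
}
```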

In English:

   - Create a decoder with TextDecoder() and, if present, a BOM will be
   respected (and consumed); otherwise it defaults to UTF-8
   - Create a decoder with TextDecoder("utf-16") and either a UTF-16LE or
   UTF-16BE BOM will be respected (and consumed); otherwise it defaults to
   UTF-16LE (which may decode garbage if a UTF-8 BOM or other non-UTF-16 data
   is present)
   - Create a decoder with TextDecoder("utf-8",
   {fatal:true}), TextDecoder("utf-16le", {fatal:true}) or
   TextDecoder("utf-16be", {fatal:true}) and a matching BOM will be consumed;
   a mismatching BOM will throw an EncodingError
   - Create a decoder with TextDecoder("utf-8"), TextDecoder("utf-16le") or
   TextDecoder("utf-16be") and a matching BOM will be consumed; a mismatching
   BOM will be blithely decoded (probably giving you replacement characters)
   without throwing.

> > * If one of the UTF encodings is specified AND the BOM matches then the
> > leading BOM character (U+FEFF) MUST NOT be emitted in the output character
> > sequence (i.e. it is silently consumed)
> >
>
> It's a little weird that
>
> data = readFile("user-supplied-file.txt"); // shortcutting for brevity
> var s = new TextDecoder("utf-16").decode(data); // or utf-8
> s = s.replace("a", "b");
> var data2 = new TextEncoder("utf-16").encode(s);
> writeFile("user-supplied-file.txt", data2);
>
> causes the BOM to be quietly stripped away.  Normally if you're modifying a
> file, you want to pass through the BOM (or lack thereof) untouched.
>
> One way to deal with this could be:
>
> var decoder = new TextDecoder("utf-16");
> var s = decoder.decode(data);
> s = s.replace("a", "b");
> var data2 = new TextEncoder(decoder.encoding).encode(s);
>
> where decoder.encoding is eg. "UTF-16LE-BOM" if a BOM was present, thus
> preserving both the BOM and (for UTF-16) endianness.  I don't actually like
> this, though, because I don't like the idea of decoder.encoding changing
> after the decoder has already been constructed.
>
> I think I agree with just stripping it, and people who want to preserve
> BOMs on write-through can jump the hoops manually (which aren't terribly
> hard).
>

This gets easier if we restrict to encoding UTF-8 which typically doesn't
include BOMs. But it's looking like there's enough desire to keep UTF-16
encoding at the moment. Agree with just stripping it for now.


> Another issue is "new TextDecoder('ascii').encoding" (and ISO-8859-1)
> giving .encoding = "windows-1252".  That's strange, even when you know why
> it's happening.
>
> Is there any reason to expose the actual "primary" names?  It's not clear
> that the "name" column in the Encoding spec is even intended to be exposed
> to APIs; they look more like labels for specs to refer to internally.
> (Anne?)  If there's no pressing reason to expose this, I'd suggest that the
> .encoding attribute simply return the name that was passed to the
> constructor.
>
> It's still not ideal (it's weird that asking for ASCII gives you something
> other than ASCII in the first place), but it at least seems a bit less
> strange.  The "nice" fix would be to implement actual ASCII, ISO-8859-1,
> ISO-8859-9, etc. charsets, but that just means extra implementation work
> (and some charset proliferation) without use cases.
>

Leaning towards simply dropping the attribute. Does anyone advocate for
keeping it?

Re: StringEncoding open issues

by Glenn Maynard

On Tue, Aug 14, 2012 at 12:34 PM, Joshua Bell <jsbell@...> wrote:

>    - Create an encoder with TextDecoder() and if present a BOM will be
>    respected (and consumed) otherwise default to UTF-8
>

Let's not default to "autodetect Unicode formats".  It encourages people to
support UTF-16 when they may not mean to.  If BOM detection for both UTF-8
and UTF-16 is wanted, I'd suggest something explicit, like "utf-*".

If the argument to the ctor is optional, I think the default should be
purely UTF-8.


>  This gets easier if we restrict to encoding UTF-8 which typically doesn't
> include BOMs. But it's looking like there's enough desire to keep UTF-16
> encoding at the moment. Agree with just stripping it for now.
>

UTF-8 sometimes does have a BOM, especially in Windows where applications
sometimes use it to distinguish UTF-8 from ACP text files (which are just
as common as ever--Windows has made no motion away from legacy encodings
whatsoever).  Stripping the BOM can cause those applications to
misinterpret the files as ACP.

Anyway, even if the encoding API gives a "helper" for this, figuring out
how that works would probably be more effort for developers than just
peeking at the ArrayBuffer for the BOM and adding it back in manually.
(I'm pretty sure anybody who knows enough to pay attention to this in the
first place will have no trouble doing that.)  So, yeah, let's not worry
about this.

--
Glenn Maynard

Re: StringEncoding open issues

by Joshua Bell-3

On Wed, Aug 15, 2012 at 5:30 PM, Glenn Maynard <glenn@...> wrote:

> On Tue, Aug 14, 2012 at 12:34 PM, Joshua Bell <jsbell@...> wrote:
>
>>    - Create an encoder with TextDecoder() and if present a BOM will be
>>    respected (and consumed) otherwise default to UTF-8
>>
>
> Let's not default to "autodetect Unicode formats".  It encourages people
> to support UTF-16 when they may not mean to.  If BOM detection for both
> UTF-8 and UTF-16 is wanted, I'd suggest something explicit, like "utf-*".
>
> If the argument to the ctor is optional, I think the default should be
> purely UTF-8.
>

Works for me. In the algorithm specified in the email, this simply removes
the clause "If encoding is not specified, set an internal useBOM flag" -
namely, only "utf-16" gets the useBOM flag.

I'll attempt to wedge this into the spec soon.



>> This gets easier if we restrict to encoding UTF-8 which typically doesn't
>> include BOMs. But it's looking like there's enough desire to keep UTF-16
>> encoding at the moment. Agree with just stripping it for now.
>>
>
> UTF-8 sometimes does have a BOM, especially in Windows where applications
> sometimes use it to distinguish UTF-8 from ACP text files (which are just
> as common as ever--Windows has made no motion away from legacy encodings
> whatsoever).
>

Good point. Ah, Notepad, my old friend...


> Stripping the BOM can cause those applications to misinterpret the files
> as ACP.
>
> Anyway, even if the encoding API gives a "helper" for this, figuring out
> how that works would probably be more effort for developers than just
> peeking at the ArrayBuffer for the BOM and adding it back in manually.
> (I'm pretty sure anybody who knows enough to pay attention to this in the
> first place will have no trouble doing that.)  So, yeah, let's not worry
> about this.
>
> --
> Glenn Maynard
>
>

Re: StringEncoding open issues

by Jonas Sicking-2

On Tue, Aug 14, 2012 at 10:34 AM, Joshua Bell <jsbell@...> wrote:

> On Mon, Aug 6, 2012 at 5:06 PM, Glenn Maynard <glenn@...> wrote:
>
>> I agree with Jonas that encoding should just use a replacement character
>> (U+FFFD for Unicode encodings, '?' otherwise), and that we should put off
>> other modes (eg. exceptions and user-specified replacement characters)
>> until there's a clear need.
>>
>> My intuition is that encoding DOMString to UTF-16 should never have errors;
>> if there are dangling surrogates, pass them through unchanged.  There's no
>> point in using a placeholder that says "an error occurred here", when the
>> error can be passed through in exactly the same form (not possible with eg.
>> DOMString->SJIS).  I don't feel strongly about this only because outputting
>> UTF-16 is so rare to begin with.
>>
>> On Mon, Aug 6, 2012 at 1:29 PM, Joshua Bell <jsbell@...> wrote:
>>
>> > - if the document is encoded in UTF-8, UTF-16LE or UTF-16BE and includes
>> > the byte order mark (the encoding-specific serialization of U+FEFF).
>>
>>
>> This rarely detects the wrong type, but that doesn't mean it's not the
>> wrong answer.  If my input is meant to be UTF-8, and someone hands me
>> BOM-marked UTF-16, I want it to fail in the same way it would if someone
>> passed in SJIS.  I don't want it silently translated.
>>
>> On the other hand, it probably does make sense for UTF-16 to switch to
>> UTF-16BE, since that's by definition the original purpose of the BOM.
>>
>> The convention iconv uses, which I think is a useful one, is decoding from
>> "UTF-16" means "try to figure out the encoding from the BOM, if any", and
>> "UTF-16LE" and "UTF-16BE" mean "always use this exact encoding".
>
>
> Let me take a crack at making this into an algorithm:
>
> In the TextDecoder constructor:
>
>    - If encoding is not specified, set an internal useBOM flag
>    - If encoding is specified and is a case insensitive match for "utf-16"
>    set an internal useBOM flag.
>
> NOTE: This means if "utf-8", "utf-16le" or "utf-16be" is explicitly
> specified the flag is not set.
>
> When decode() is called
>
>    - If useBOM is set and the stream offset is 0, then
>       - If there are not enough bytes to test for a BOM then return without
>       emitting anything (NOTE: if not streaming an EOF byte would be present in
>       the stream which would be a negative match for a BOM)
>       - If encoding is "utf-16" and the first bytes match 0xFF 0xFE or 0xFE
>       0xFF then set current encoding to "utf-16" or "utf-16be" respectively and
>       advance the stream past the BOM. The current encoding is used until the
>       stream is reset.
>       - Otherwise, if the first bytes match 0xFF 0xFE, 0xFE 0xFF, or 0xEF
>       0xBB 0xBF then set current encoding to "utf-16", "utf-16be" or "utf-8"
>       respectively and advance the stream past the BOM. The current encoding is
>       used until the stream is reset.

This doesn't sound right. The effect of the rules so far would be that
if you create a decoder and specify "utf-16" as encoding, and the
first bytes in the stream are 0xEF 0xBB 0xBF you'd silently switch to
"utf-8" decoding.

/ Jonas

Re: StringEncoding open issues

by Glenn Maynard

On Fri, Aug 17, 2012 at 2:23 AM, Jonas Sicking <jonas@...> wrote:

> >       - If encoding is "utf-16" and the first bytes match 0xFF 0xFE or 0xFE
> >       0xFF then set current encoding to "utf-16" or "utf-16be" respectively and
> >       advance the stream past the BOM. The current encoding is used until the
> >       stream is reset.
> >       - Otherwise, if the first bytes match 0xFF 0xFE, 0xFE 0xFF, or 0xEF
> >       0xBB 0xBF then set current encoding to "utf-16", "utf-16be" or "utf-8"
> >       respectively and advance the stream past the BOM. The current encoding is
> >       used until the stream is reset.
>
> This doesn't sound right. The effect of the rules so far would be that
> if you create a decoder and specify "utf-16" as encoding, and the
> first bytes in the stream are 0xEF 0xBB 0xBF you'd silently switch to
> "utf-8" decoding.
>

I think the scope of the "otherwise" is unclear, and this is meant to be
"otherwise (if encoding is not "utf-16")".

--
Glenn Maynard

Re: StringEncoding open issues

by Jonas Sicking-2

On Fri, Aug 17, 2012 at 7:15 AM, Glenn Maynard <glenn@...> wrote:

> On Fri, Aug 17, 2012 at 2:23 AM, Jonas Sicking <jonas@...> wrote:
>>
>> >       - If encoding is "utf-16" and the first bytes match 0xFF 0xFE or 0xFE
>> >       0xFF then set current encoding to "utf-16" or "utf-16be" respectively and
>> >       advance the stream past the BOM. The current encoding is used until the
>> >       stream is reset.
>> >       - Otherwise, if the first bytes match 0xFF 0xFE, 0xFE 0xFF, or 0xEF
>> >       0xBB 0xBF then set current encoding to "utf-16", "utf-16be" or "utf-8"
>> >       respectively and advance the stream past the BOM. The current encoding is
>> >       used until the stream is reset.
>>
>> This doesn't sound right. The effect of the rules so far would be that
>> if you create a decoder and specify "utf-16" as encoding, and the
>> first bytes in the stream are 0xEF 0xBB 0xBF you'd silently switch to
>> "utf-8" decoding.
>
> I think the scope of the "otherwise" is unclear, and this is meant to be
> "otherwise (if encoding is not "utf-16")".

Ah, that would make sense. It effectively means "if encoding is not set".

/ Jonas

Re: StringEncoding open issues

by Joshua Bell-3

On Fri, Aug 17, 2012 at 5:19 PM, Jonas Sicking <jonas@...> wrote:

> On Fri, Aug 17, 2012 at 7:15 AM, Glenn Maynard <glenn@...> wrote:
> > On Fri, Aug 17, 2012 at 2:23 AM, Jonas Sicking <jonas@...> wrote:
> >>
> >> >       - If encoding is "utf-16" and the first bytes match 0xFF 0xFE or 0xFE
> >> >       0xFF then set current encoding to "utf-16" or "utf-16be" respectively and
> >> >       advance the stream past the BOM. The current encoding is used until the
> >> >       stream is reset.
> >> >       - Otherwise, if the first bytes match 0xFF 0xFE, 0xFE 0xFF, or 0xEF
> >> >       0xBB 0xBF then set current encoding to "utf-16", "utf-16be" or "utf-8"
> >> >       respectively and advance the stream past the BOM. The current encoding is
> >> >       used until the stream is reset.
> >>
> >> This doesn't sound right. The effect of the rules so far would be that
> >> if you create a decoder and specify "utf-16" as encoding, and the
> >> first bytes in the stream are 0xEF 0xBB 0xBF you'd silently switch to
> >> "utf-8" decoding.
> >
> > I think the scope of the "otherwise" is unclear, and this is meant to be
> > "otherwise (if encoding is not "utf-16")".
>
> Ah, that would make sense. It effectively means "if encoding is not set".
>
> / Jonas
>

I've attempted to distill the above into the spec in an algorithmic way:
http://wiki.whatwg.org/wiki/StringEncoding#TextDecoder

English version: If you specify "utf-16" you get endian-agnostic UTF-16
encoding support. Failing that, if your encoding matches your BOM it is
consumed. Failing *that*, you get whatever behavior falls out of the decode
algorithm (garbage, error, etc).
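
To make the three cases concrete, here is a rough Python sketch of that behavior. It is not the wiki algorithm or the JS shim. The `decode` function and `CODECS` table are hypothetical names, the little-endian fallback for BOM-less "utf-16" is an assumption, and `errors="replace"` stands in for the non-fatal default that substitutes U+FFFD.

```python
# Thread label -> (Python codec name, BOM). Plain "utf-16" is handled
# separately because its byte order is decided by the BOM.
CODECS = {
    "utf-8": ("utf-8", b"\xef\xbb\xbf"),
    "utf-16le": ("utf-16-le", b"\xff\xfe"),
    "utf-16be": ("utf-16-be", b"\xfe\xff"),
}


def decode(label: str, data: bytes) -> str:
    if label == "utf-16":
        # Case 1: endian-agnostic UTF-16; the BOM picks the byte order.
        for name in ("utf-16le", "utf-16be"):
            codec, bom = CODECS[name]
            if data.startswith(bom):
                return data[len(bom):].decode(codec, errors="replace")
        # Assumption: without a BOM, treat the input as little-endian.
        return data.decode("utf-16-le", errors="replace")
    codec, bom = CODECS.get(label, (label, b""))
    if bom and data.startswith(bom):
        # Case 2: the BOM matches the requested encoding, so it is consumed.
        data = data[len(bom):]
    # Case 3: decode as requested and accept whatever falls out.
    return data.decode(codec, errors="replace")
```

For instance, `decode("utf-8", b"\xff\xfeA\x00")` does not switch to UTF-16; the two BOM bytes come out as replacement characters, which is the "garbage" case.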

The JS shim has *not* been updated yet.

Only part of this edit has been live for the last few weeks - apologies to
the Moz folks who were trying to understand what the half-specified
internal useBOM flag was for. Any implementer feedback so far?

Re: StringEncoding open issues

by Anne van Kesteren-4

On Mon, Sep 17, 2012 at 11:13 PM, Joshua Bell <jsbell@...> wrote:
> I've attempted to distill the above into the spec in an algorithmic way:
> http://wiki.whatwg.org/wiki/StringEncoding#TextDecoder
>
> English version: If you specify "utf-16" you get endian-agnostic UTF-16
> encoding support. Failing that, if your encoding matches your BOM it is
> consumed. Failing *that*, you get whatever behavior falls out of the decode
> algorithm (garbage, error, etc).

Why would we want the API to work differently from how it works in
markup (with <meta charset> etc.)? Granted it's not super logical, but
I don't really see why we should make it inconsistent and more
complicated.


--
http://annevankesteren.nl/

Re: StringEncoding open issues

by Joshua Bell-3

On Mon, Sep 17, 2012 at 2:17 PM, Anne van Kesteren <annevk@...> wrote:

> On Mon, Sep 17, 2012 at 11:13 PM, Joshua Bell <jsbell@...> wrote:
> > I've attempted to distill the above into the spec in an algorithmic way:
> > http://wiki.whatwg.org/wiki/StringEncoding#TextDecoder
> >
> > English version: If you specify "utf-16" you get endian-agnostic UTF-16
> > encoding support. Failing that, if your encoding matches your BOM it is
> > consumed. Failing *that*, you get whatever behavior falls out of the decode
> > algorithm (garbage, error, etc).
>
> Why would we want the API to work differently from how it works in
> markup (with <meta charset> etc.)? Granted it's not super logical, but
> I don't really see why we should make it inconsistent and more
> complicated.
>

That's how the spec started out, so a recap of this thread would give you
the back-and-forth that led here. To summarize:

Having the BOM in the content be higher priority than the encoding selected
by the developer was not seen as desirable (see earlier in the thread), and
potentially a source of errors. Selecting encoding via BOM (in general, or
to emulate <meta charset>, etc) was seen as something that could be done in
user code if desired, but unexpected otherwise.
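
As a rough illustration of the "user code" option, a caller who does want the BOM to win could wrap the decode step along these lines. This is a Python sketch, not part of the proposal. The helper name `decode_bom_first` and its `fallback` parameter are hypothetical, and `errors="replace"` is chosen to mimic a non-fatal decoder.

```python
import codecs

# BOMs to check, each paired with the Python codec it selects.
BOM_TABLE = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_bom_first(data: bytes, fallback: str = "utf-8") -> str:
    """Let a leading BOM override the caller's fallback encoding."""
    for bom, codec in BOM_TABLE:
        if data.startswith(bom):
            # The BOM wins: strip it and decode with the codec it names.
            return data[len(bom):].decode(codec, errors="replace")
    # No BOM found, so the caller's choice applies.
    return data.decode(fallback, errors="replace")
```

Keeping this in a wrapper means the override is something the developer opts into, not something the decoder does on its own.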

Two desired behaviors remained: (1) developer need for BOM-specified
endian-agnostic UTF-16 encoding similar to ICU's handling that
distinguishes "utf-16" from "utf-16le", and (2) that matching BOMs should
be consumed and not appear in the decoded data.

Re: StringEncoding open issues

by Glenn Maynard

On Mon, Sep 17, 2012 at 4:17 PM, Anne van Kesteren <annevk@...> wrote:

> Why would we want the API to work differently from how it works in
> markup (with <meta charset> etc.)? Granted it's not super logical, but
> I don't really see why we should make it inconsistent and more
> complicated.
>

That algorithm always allows UTF-16, even when you explicitly say UTF-8.
That would cause everyone using this API for UTF-8 to inadvertently
perpetuate UTF-16 support.

I hope we wouldn't want to apply the heuristic (#8) or user-locale steps
(#9), and the rest (user overrides, prescanning) is HTML-specific.

--
Glenn Maynard