Encoding Standard (mostly complete)

by Anne van Kesteren-2

Hi,

Apart from big5 (which requires some more research) all encoders and  
decoders are now defined:

http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html

I think it is now ready to be integrated into HTML.

I also think as a next step we might want to define the encoding sniffing  
algorithm. Presumably the one used by Gecko, but input welcome.

Outstanding feedback on the Encoding Standard is recorded here:

https://www.w3.org/Bugs/Public/buglist.cgi?product=Web%20Platform%20%28other%29&component=Encoding&resolution=---

Feedback is appreciated; either email this list or file a new bug via:

https://www.w3.org/Bugs/Public/enter_bug.cgi?product=Web%20Platform%20%28other%29&component=Encoding

Links are also present in the standard.

Kind regards,


--
Anne van Kesteren
http://annevankesteren.nl/

Re: Encoding Standard (mostly complete)

by Julian Reschke

On 2012-04-17 11:30, Anne van Kesteren wrote:
> Hi,
>
> Apart from big5 (which requires some more research) all encoders and
> decoders are now defined:
>
> http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html
> ...

As a nit, I believe that "Character Encoding" would make a better title
than just "Encoding".

Best regards, Julian

Re: Encoding Standard (mostly complete)

by Glenn Maynard

"This is a decoder error" seems odd; it's descriptive language ("this thing
must be made true") rather than declarative ("do this thing").  I'd suggest
the declarative language "Emit a decoder error" and "Emit an encoder error".

"If code point is equal or greater than lower boundary" is more naturally
"greater than or equal to" (and "less than or equal to").  That said, this
would be much clearer with interval syntax:

"If code point is in the range [*lower boundary*, 0x10FFFF] and is not in
the range [0xD800, 0xDFFF], emit code point (and continue)."

which I think is easier to read, and also makes it clear that the "0xD800
to 0xDFFF" is a closed interval (0xD800 and 0xDFFF are included).
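
To illustrate the closed-interval reading, here is a minimal Python sketch. The function name `should_emit` and the parameter name `lower_boundary` are hypothetical and not taken from the standard; the sketch only shows that both endpoints of each range are included.

```python
def should_emit(code_point: int, lower_boundary: int) -> bool:
    """Return True if code_point is in [lower_boundary, 0x10FFFF]
    and is not in the surrogate range [0xD800, 0xDFFF]."""
    # Chained comparisons make both endpoints inclusive (closed intervals).
    in_range = lower_boundary <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_range and not is_surrogate
```

With this reading, 0xD800 and 0xDFFF are both rejected, while 0xD7FF and 0xE000 are accepted.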

> An encoder contains one or more encoder error points. Unless stated
otherwise the encoder is terminated at that point.

Encoding form data, at least, doesn't abort on the first error; any
unrepresentable codepoints are encoded as &x1234;.  (It would sure be
nice if encoding to non-Unicode-based encodings would just *always* use
that syntax for non-ASCII, so the encoders could be dropped, but I guess
that would trigger bugs in pages that are currently masked...)  Is there
any encoding path in browsers that does give up on the first error?
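
The difference between terminating at the first error and substituting a character reference can be sketched with Python's codec error handlers. This is only an analogy: it uses Python's `str.encode`, not any browser code path, and Python's `xmlcharrefreplace` handler happens to emit decimal references. The sample string is an assumption chosen for illustration.

```python
text = "caf\u00e9 \u4e2d"  # U+4E2D has no ISO-8859-1 representation

# Strict mode: the encoder gives up at the first unrepresentable code point.
try:
    text.encode("iso-8859-1", errors="strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Fallback mode: unrepresentable code points become decimal character
# references and encoding continues with the rest of the input.
escaped = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(strict_failed, escaped)
```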

--
Glenn Maynard

Re: Encoding Standard (mostly complete)

by Anne van Kesteren-2

On Tue, 17 Apr 2012 14:01:36 +0200, Julian Reschke <julian.reschke@...>  
wrote:
> As a nit, I believe that "Character Encoding" would make a better title  
> than just "Encoding".

I was thinking maybe "Text Encoding" given how we're using that for the  
API, but I like single-word specs so I'm quite reluctant to change it.


--
Anne van Kesteren
http://annevankesteren.nl/

Re: Encoding Standard (mostly complete)

by Anne van Kesteren-2

On Wed, 18 Apr 2012 15:40:33 +0200, Glenn Maynard <glenn@...> wrote:
> "This is a decoder error" seems odd; it's descriptive language ("this  
> thing must be made true") rather than declarative ("do this thing").  
> I'd suggest the declarative language "Emit a decoder error" and "Emit an  
> encoder error".

Yes. Awesome suggestion implemented.


> "If code point is equal or greater than lower boundary" is more naturally
> "greater than or equal to" (and "less than or equal to").  That said,  
> this would be much clearer with interval syntax:
>
> "If code point is in the range [*lower boundary*, 0x10FFFF] and is not in
> the range [0xD800, 0xDFFF], emit code point (and continue)."
>
> which I think is easier to read, and also makes it clear that the "0xD800
> to 0xDFFF" is a closed interval (0xD800 and 0xDFFF are included).

Then we'd first have to introduce interval syntax to the English language.  
We could do that I suppose in the Terminology section if you think it  
would be better.


>> An encoder contains one or more encoder error points. Unless stated
>> otherwise the encoder is terminated at that point.
>
> Encoding form data, at least, doesn't abort on the first error; any
> unrepresentable codepoints are encoded as &x1234;.  (It would sure be
> nice if encoding to non-Unicode-based encodings would just *always* use
> that syntax for non-ASCII, so the encoders could be dropped, but I guess
> that would trigger bugs in pages that are currently masked...)  Is there
> any encoding path in browsers that does give up on the first error?

It has been proposed for the API.

And in URLs you do not get "&#...;" (though in WebKit you do) but you get  
"?" (IE at the network layer, Opera earlier on) or the utf-8  
representation (Gecko is totally weird).
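
The three fallbacks mentioned above can be compared side by side in a rough Python sketch. Everything here is an assumption for illustration: the function name `encode_query_value`, the fallback labels, and the choice to percent-encode every resulting byte are not taken from any browser or from the standard. The per-character loop also assumes a stateless target encoding.

```python
from urllib.parse import quote


def encode_query_value(value: str, encoding: str, fallback: str) -> str:
    """Percent-encode value, substituting for code points that
    `encoding` cannot represent.

    fallback is one of "charref", "question-mark", or "utf-8"
    (hypothetical labels for the three behaviours described above).
    """
    out = bytearray()
    for ch in value:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            if fallback == "charref":
                # Decimal character reference, as in "&#...;".
                out += f"&#{ord(ch)};".encode("ascii")
            elif fallback == "question-mark":
                out += b"?"
            elif fallback == "utf-8":
                out += ch.encode("utf-8")
            else:
                raise ValueError(f"unknown fallback: {fallback}")
    # safe="" percent-encodes everything except unreserved characters.
    return quote(bytes(out), safe="")


for mode in ("charref", "question-mark", "utf-8"):
    print(mode, encode_query_value("\u4e2d", "iso-8859-1", mode))
```

The same input yields three different byte sequences on the wire, which is why a server cannot tell which behaviour the client applied.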

Maybe we should align URLs with <form> here and use "&#...;" throughout if  
that is compatible with content. Probably deserves a discussion in its  
own thread.

I do not know any cases beyond URLs, <form>, and the proposed API that  
require an encoder in the platform.


--
Anne van Kesteren
http://annevankesteren.nl/

Re: Encoding Standard (mostly complete)

by Glenn Maynard

On Wed, Apr 18, 2012 at 12:12 PM, Anne van Kesteren <annevk@...> wrote:

>> "If code point is equal or greater than lower boundary" is more naturally
>> "greater than or equal to" (and "less than or equal to").  That said,
>> this would be much clearer with interval syntax:
>>
>> "If code point is in the range [*lower boundary*, 0x10FFFF] and is not in
>> the range [0xD800, 0xDFFF], emit code point (and continue)."
>>
>> which I think is easier to read, and also makes it clear that the "0xD800
>> to 0xDFFF" is a closed interval (0xD800 and 0xDFFF are included).
>
> Then we'd first have to introduce interval syntax to the English language.
> We could do that I suppose in the Terminology section if you think it would
> be better.


It would also apply to
http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html#index-gb18030-code-point,
and it could apply to "select" ranges (eg. 7.1 step 5: "[0,0x7f]").  Maybe
it's not enough to be worth figuring out how to define it.

>> Encoding form data, at least, doesn't abort on the first error; any
>> unrepresentable codepoints are encoded as &x1234;.  (It would sure be
>> nice if encoding to non-Unicode-based encodings would just *always* use
>> that syntax for non-ASCII, so the encoders could be dropped, but I guess
>> that would trigger bugs in pages that are currently masked...)  Is there
>> any encoding path in browsers that does give up on the first error?
>
> It has been proposed for the API.
>
> And in URLs you do not get "&#...;" (though in WebKit you do) but you get
> "?" (IE at the network layer, Opera earlier on) or the utf-8 representation
> (Gecko is totally weird).

I was testing with POST, which (at least in Gecko) uses HTML escapes for
unrepresentable characters.

(It would be pretty neat if that could be changed to *always* using HTML
escapes for non-ASCII, except when encoding to UTF-8, since that's not
introducing anything new--you can already receive &x1234; escapes in POST
data--and it would alleviate the "form submit encoding depends on the
source page's encoding" problem.  I guess this must break pages somehow, or
vendors would have done this long ago.)

--
Glenn Maynard

Re: Encoding Standard (mostly complete)

by Mark Callow

On 19/04/2012 07:34, Glenn Maynard wrote:
> On Wed, Apr 18, 2012 at 12:12 PM, Anne van Kesteren <annevk@...> wrote:
>> Then we'd first have to introduce interval syntax to the English language.
>> We could do that I suppose in the Terminology section if you think it would
>> be better.
> It would also apply to
> http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html#index-gb18030-code-point,
> and it could apply to "select" ranges (eg. 7.1 step 5: "[0,0x7f]"). Maybe
> it's not enough to be worth figuring out how to define it.

All it takes is a couple of short sentences.

    "*Numeric Intervals.* Closed intervals are denoted with square
    brackets and open intervals with round brackets. For example, [0,
    10) denotes the values from zero to ten, including zero but not
    including ten."

I agree with Glenn that using intervals would be clearer as well as
being shorter.

Regards

    -Mark


Re: Encoding Standard (mostly complete)

by Mark Callow

I find it confusing that the steps incrementing the byte and code point
pointers come before the current byte or code point is processed (except
for the EOF check), but a way to make it less confusing is not obvious.

Regards

    -Mark


On 17/04/2012 18:30, Anne van Kesteren wrote:
> Hi,
>
> Apart from big5 (which requires some more research) all encoders and
> decoders are now defined:
> ...

Re: Encoding Standard (mostly complete)

by and-py

On 2012-04-18 22:34, Glenn Maynard wrote:
> (It would be pretty neat if that could be changed to *always* using HTML
> escapes for non-ASCII, except when encoding to UTF-8, since that's not
> introducing anything new--you can already receive &x1234; escapes in POST
> data--and it would alleviate the "form submit encoding depends on the
> source page's encoding" problem.  I guess this must break pages somehow, or
> vendors would have done this long ago.)

It naturally would break any page that's deliberately using a non-UTF
encoding. Web applications do not - and should not be -
HTML-character-reference-decoding their input because this would mangle
literal use of & characters (which are *not* escaped to &amp;). There is
no way to correctly recover a value that has been through this form of
lossy encoding.
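
The ambiguity can be demonstrated with a short Python sketch. It uses Python's `xmlcharrefreplace` error handler as a stand-in for the form-submission fallback; that substitution is an assumption for illustration, and the two sample inputs are made up.

```python
typed_literally = "&#1234;"   # the user typed these seven ASCII characters
unrepresentable = "\u04d2"    # U+04D2 (code point 1234), not in ISO-8859-1

a = typed_literally.encode("iso-8859-1", errors="xmlcharrefreplace")
b = unrepresentable.encode("iso-8859-1", errors="xmlcharrefreplace")

# Both inputs produce identical bytes, so the receiver cannot tell them apart.
print(a, b, a == b)
```

Because the literal `&` is passed through unescaped, decoding character references on the server would turn the first input into a character the user never typed.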

The charref-encoding-fallback is an ugly legacy hack that confuses web
authors and tempts them into using submitted strings directly without
HTML-escaping, resulting in security holes. Its use should be minimised
wherever possible.

--
And Clover
mailto:and@...
http://www.doxdesk.com/
gtalk:chat?jid=bobince@...