What makes illegal characters non-conformant


by Henry S. Thompson

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

validator.nu finds an error in

  http://www.ltg.ed.ac.uk/~ht/char_alias.html

I don't think I have a problem with that, I can imagine an argument
that it's broken (although http://www.ltg.ed.ac.uk/~ht/char_alias.xml
is _not_ broken per the XML specification. . .), but I can't find
anywhere in the HTML5 spec. which says so.  Does it/should it?

ht
- --
       Henry S. Thompson, School of Informatics, University of Edinburgh
                         Half-time member of W3C Team
      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
                Fax: (44) 131 651-1426, e-mail: ht@...
                       URL: http://www.ltg.ed.ac.uk/~ht/
[mail really from me _always_ has this .sig -- mail without it is forged spam]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)

iD8DBQFKuizlkjnJixAXWBoRAu+RAJ92Qgw1nFNt9DEcB8cAb3OVN11nDgCfZms5
uM5iIDb88zKefGCn93/Xg44=
=Gzwx
-----END PGP SIGNATURE-----


Re: What makes illegal characters non-conformant

by Anne van Kesteren

On Wed, 23 Sep 2009 16:12:53 +0200, Henry S. Thompson <ht@...> wrote:
> validator.nu finds an error in
>
>   http://www.ltg.ed.ac.uk/~ht/char_alias.html
>
> I don't think I have a problem with that, I can imagine an argument
> that it's broken (although http://www.ltg.ed.ac.uk/~ht/char_alias.xml
> is _not_ broken per the XML specification. . .), but I can't find
> anywhere in the HTML5 spec. which says so.  Does it/should it?

http://whatwg.org/html5#misinterpreted-for-compatibility


--
Anne van Kesteren
http://annevankesteren.nl/


Re: What makes illegal characters non-conformant

by Henry S. Thompson

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Anne van Kesteren writes:

> http://whatwg.org/html5#misinterpreted-for-compatibility

That's about agents, not documents.

ht
- --
       Henry S. Thompson, School of Informatics, University of Edinburgh
                         Half-time member of W3C Team
      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
                Fax: (44) 131 651-1426, e-mail: ht@...
                       URL: http://www.ltg.ed.ac.uk/~ht/
[mail really from me _always_ has this .sig -- mail without it is forged spam]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)

iD8DBQFKulwtkjnJixAXWBoRAujRAJ48mIC1P/wKZxHBn0OER0r14H2eQgCfYO08
y1Qi2uVmizIybucJbLUD44Y=
=/fZi
-----END PGP SIGNATURE-----


Re: What makes illegal characters non-conformant

by Henri Sivonen

On Sep 23, 2009, at 20:34, Henry S. Thompson wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Anne van Kesteren writes:
>
>> http://whatwg.org/html5#misinterpreted-for-compatibility
>
> That's about agents, not documents.


What happens here is that Validator.nu is out of date and doesn't
misinterpret US-ASCII for compatibility, so the US-ASCII decoder finds a
bad byte.

However, what makes the document non-conforming (but what isn't the  
reason why Validator.nu says it's non-conforming) is the sentence "The  
character encoding name given must be the name of the character  
encoding used to serialize the file." under http://www.whatwg.org/specs/web-apps/current-work/#charset

The byte 0x80 is not valid in US-ASCII. Thus, US-ASCII isn't the name  
of the encoding used.
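
As a rough illustration (not what Validator.nu actually runs), the
difference between a strict US-ASCII decoder and one that misinterprets
US-ASCII as windows-1252 for compatibility can be sketched in Python.
The sample bytes and the helper name try_decode are made up for this
example; they are not taken from char_alias.html.

```python
def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are not legal in the encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Hypothetical sample containing the byte 0x80, which US-ASCII does not define.
sample = b"price: \x80 5"

# A strict US-ASCII decoder rejects the byte outright.
ascii_result = try_decode(sample, "ascii")

# A decoder that treats the label as windows-1252 maps 0x80 to U+20AC.
cp1252_result = try_decode(sample, "cp1252")

print(ascii_result)
print(cp1252_result)
```

The strict decoder corresponds to the out-of-date behaviour described
above; the windows-1252 decoder corresponds to the "misinterpreted for
compatibility" behaviour.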

Note that for encodings that aren't "misinterpreted for compatibility"  
the reasoning would be that the normative requirements of the encoding  
become part of the conformance criteria by reference. Since  
Validator.nu is out of date and treats US-ASCII like any non-special  
encoding, this is the reason why it complains.

--
Henri Sivonen
hsivonen@...
http://hsivonen.iki.fi/




Re: What makes illegal characters non-conformant

by Geoffrey Sneddon


On 23 Sep 2009, at 15:12, Henry S. Thompson wrote:

> although http://www.ltg.ed.ac.uk/~ht/char_alias.xml
> is _not_ broken per the XML specification. . .

It should be, per:

> It is a fatal error if an XML entity is determined (via default,  
> encoding declaration, or higher-level protocol) to be in a certain  
> encoding but contains byte sequences that are not legal in that  
> encoding.

That said, though processors must throw a fatal error, I can't see  
anything saying the document isn't well-formed (bug?).
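
The processor side of this can be observed with a small sketch, assuming
Python's standard-library expat-based parser stands in for "an XML
processor". The two documents below are invented for the example (they
are not char_alias.xml), and the helper name parses is hypothetical. It
only shows that this particular processor refuses the input; it says
nothing about whether the spec calls the document well-formed.

```python
import xml.etree.ElementTree as ET


def parses(document: bytes) -> bool:
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Declared as US-ASCII and containing only ASCII bytes.
good = b"<?xml version='1.0' encoding='US-ASCII'?><x>abc</x>"

# Declared as US-ASCII but containing the byte 0x80.
bad = b"<?xml version='1.0' encoding='US-ASCII'?><x>\x80</x>"

print(parses(good))
print(parses(bad))
```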


--
Geoffrey Sneddon
<http://gsnedders.com/>



Re: What makes illegal characters non-conformant

by Henry S. Thompson

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Geoffrey Sneddon writes:

> On 23 Sep 2009, at 15:12, Henry S. Thompson wrote:
>
>> although http://www.ltg.ed.ac.uk/~ht/char_alias.xml
>> is _not_ broken per the XML specification. . .
>
> It should be, per:
>
>> It is a fatal error if an XML entity is determined (via default,
>> encoding declaration, or higher-level protocol) to be in a certain
>> encoding but contains byte sequences that are not legal in that
>> encoding.

You're right, I was mistaken.

> That said, though processors must throw a fatal error, I can't see
> anything saying the document isn't well-formed (bug?).

Hmm.

ht
- --
       Henry S. Thompson, School of Informatics, University of Edinburgh
                         Half-time member of W3C Team
      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
                Fax: (44) 131 651-1426, e-mail: ht@...
                       URL: http://www.ltg.ed.ac.uk/~ht/
[mail really from me _always_ has this .sig -- mail without it is forged spam]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)

iD8DBQFKum7QkjnJixAXWBoRAl7dAJ9YERQmccq5h1FQC+/y+8ya5DRfcwCghAT2
rfoIGs4VEOSoEQ8HKz23Yc8=
=MnAk
-----END PGP SIGNATURE-----


Re: What makes illegal characters non-conformant

by Bjoern Hoehrmann

* Henry S. Thompson wrote:
>I don't think I have a problem with that, I can imagine an argument
>that it's broken (although http://www.ltg.ed.ac.uk/~ht/char_alias.xml
>is _not_ broken per the XML specification. . .), but I can't find
>anywhere in the HTML5 spec. which says so.  Does it/should it?

It is not broken per the XML specification by the same reasoning that a
PNG image is not broken per the XML specification. Procedurally for both
cases the XML processor determines some character encoding and attempts
to decode the document, and then encounters byte sequences that do not
have a well-defined meaning according to the encoding's specification.
It is therefore not possible to restore the textual data the binary data
represents, and the XML specification only defines conformance for
processors and textual data objects.

Consider that the XML specification does not normatively define exactly
how to determine the character encoding (and I am ignoring that you've
used text/xml as media type for the document, which has other theoretical
considerations rarely met in practice), so you can easily define a new
character encoding very-bogus-encoding as "Any sequence of bytes stands
for the text <?xml version='1.0' encoding='very-bogus-encoding'?><x/>"
and your document would be perfectly conforming if the processor does
indeed support that encoding.

Cases like this do in fact exist in the real world, for example, with
UTF-32 encoded documents the processor may not support UTF-32 and may
instead detect UTF-16 or UTF-8 and encounter illegal byte sequences or
disallowed characters. The only difference is in perception as UTF-32
is widely recognized while very-bogus-encoding is not.
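
A minimal sketch of the UTF-32 case, using Python's codecs to stand in
for a processor that lacks UTF-32 support. The document <x/> and the
choice of little-endian byte order are assumptions made for the example.

```python
import codecs

# A tiny document serialized as little-endian UTF-32 with a byte order mark.
utf32_doc = codecs.BOM_UTF32_LE + "<x/>".encode("utf-32-le")

# The first two bytes (FF FE) look like a UTF-16 little-endian BOM, so a
# processor without UTF-32 support may decode the rest as UTF-16. The
# decode succeeds but yields U+0000, which XML does not allow.
as_utf16 = utf32_doc.decode("utf-16")
has_nul = "\x00" in as_utf16

# Decoding as UTF-8 fails immediately: 0xFF is never a legal UTF-8 byte.
try:
    utf32_doc.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(as_utf16))
print(has_nul, utf8_ok)
```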

It is ultimately entirely irrelevant whether your document is broken
per the XML specification, as it is, as far as common sense goes, broken
per the US-ASCII specification. You might just as well have your web
server send out malformed TCP datagrams or a malformed HTTP response
and muse how that is or is not broken per unrelated specifications.
Similarly, very-bogus-encoding is irrelevant because it violates what
is considered common sense. http://xkcd.com/468/ comes to mind.

(The XML specification actually considers your case a fatal error, and
those are errors which in turn are violations of the constraints of the
specification. I've argued unsuccessfully against that in the past, as
having specification violations dependent on processor capabilities is
a violation of common sense.)
--
Björn Höhrmann · mailto:bjoern@... · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/