[HTML5] 2.8 Character encodings


[HTML5] 2.8 Character encodings

by Dr. Olaf Hoffmann

Hello,

section 2.8 of the current draft
http://www.w3.org/TR/2009/WD-html5-20090423/infrastructure.html#character-encodings-0
mentions some 'willful' misinterpretations of encoding information, for
example interpreting a string like 'ISO-8859-1' as 'Windows-1252'.

1. Which string does an author have to use if he really wants to indicate
that the encoding is, for example, 'ISO-8859-1' and not 'Windows-1252'?

2. As far as I have seen, HTML5 has no version indication like previous
versions of HTML had and other popular formats like SVG have.
How can a browser identify that a document is really intended as
'HTML5', with the implied 'willful' misinterpretations of encoding
information, and not as another HTML version?
This assumes, of course, that for other versions and formats there
is no such 'willful' misinterpretation and that encodings are
identified as usual, by interpreting the string as
provided by the author.

Assuming that a viewer is somehow able to identify a document
as an HTML5 document after looking into the content, and a server
has, for example, sent 'ISO-8859-1' before, does this mean that
the viewer switches to 'Windows-1252' or reparses the document
with it?

Obviously it would be better to avoid such misinterpretation by using
an encoding like UTF-8 that is not affected by the current HTML5 draft.
However, due to the history of older projects or server configurations,
it might still be convenient for many authors to continue to use
'ISO-8859-1' instead of other encodings, even if they switch, for
example, from HTML4 to HTML5 for some documents.



Best wishes

Olaf




Re: [HTML5] 2.8 Character encodings

by Ian Hickson

On Mon, 6 Jul 2009, Dr. Olaf Hoffmann wrote:
>
> in the current draft are mentioned in 2.8
> http://www.w3.org/TR/2009/WD-html5-20090423/infrastructure.html#character-encodings-0
> some 'willful' misinterpretations of encoding information, for example
> to interprete a string like 'ISO-8859-1' as 'Windows-1252'.
>
> 1. Which string has an author to note, if he really wants to indicate, that
> the encoding is for example 'ISO-8859-1' and not 'Windows-1252'?

"ISO-8859-1". If the author has really used that encoding, then there is
no difference between them (1252 is a superset).
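
The superset claim can be checked with a short sketch. It assumes Python's built-in `latin-1` and `cp1252` codecs as stand-ins for ISO-8859-1 and Windows-1252; the function name `differing_bytes` is made up for illustration. Note that Python's `cp1252` codec leaves a handful of bytes undefined, which the sketch counts as "different".

```python
def differing_bytes():
    """Return the byte values that decode differently under the two codecs."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        as_latin1 = raw.decode("latin-1")  # defined for every byte value
        try:
            as_cp1252 = raw.decode("cp1252")
        except UnicodeDecodeError:
            as_cp1252 = None  # undefined in Python's cp1252 table
        if as_latin1 != as_cp1252:
            diffs.append(value)
    return diffs


if __name__ == "__main__":
    diffs = differing_bytes()
    print("bytes that differ:", ", ".join(f"0x{b:02X}" for b in diffs))
```

Every differing byte falls in the range 0x80 to 0x9F, where ISO-8859-1 has only control characters; letters such as ä, ö, ü and ß sit above that range and decode identically under both labels.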


> 2. As far as I have seen, HTML5 has no version indication like previous
> versions of HTML had and other popular formats like SVG have.
> How can a browser identify, that a document is really intended as
> 'HTML5' with the implicated  'willful' misinterpretations of encoding
> information and no other HTMLversion?

It doesn't matter, all versions of HTML are in practice processed with
these mappings. It is indeed why HTML5 has these mappings -- because
browsers already did this. We wouldn't add these mappings if we didn't
have to, in order to handle legacy content (content in previous versions
of HTML).


> Assuming that a viewer is able to identify a document somehow being a
> HTML5 document after looking into the content and for example a server
> sended 'ISO-8859-1' before, does this mean, that the viewer switches to
> or reparses the document with 'Windows-1252' again?

I don't understand the question.


> Obviously it would be better to avoid such misinterpretation by using an
> encoding like UTF-8 not confused by the current HTML5 draft, however due
> to the history of older projects or server configurations it might be
> still convenient for many authors to continue to use 'ISO-8859-1'
> instead of other encodings, even if they switch for example from HTML4
> to HTML5 for some documents.

Hopefully my answers above will reassure you that this is not in fact a
problem that authors will face.

--
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [HTML5] 2.8 Character encodings

by Dr. Olaf Hoffmann

Ian Hickson:

> On Mon, 6 Jul 2009, Dr. Olaf Hoffmann wrote:
> > in the current draft are mentioned in 2.8
> > http://www.w3.org/TR/2009/WD-html5-20090423/infrastructure.html#character
> >-encodings-0 some 'willful' misinterpretations of encoding information,
> > for example to interprete a string like 'ISO-8859-1' as 'Windows-1252'.
> >
> > 1. Which string has an author to note, if he really wants to indicate,
> > that the encoding is for example 'ISO-8859-1' and not 'Windows-1252'?
>
> "ISO-8859-1". If the author has really used that encoding, then there is
> no difference between them (1252 is a superset).
>

I know that there is only a difference for a few characters that are not
relevant for ISO-8859-1 usage (see the tables at Wikipedia, for example).
However, because I prefer to provide only well-defined documents
in publications, I want to be sure that if ISO-8859-1 is mentioned
within a document, this really means ISO-8859-1
and not something else. The current draft defines something else,
something I never used (my preferred editor, Kate from KDE,
is able to distinguish between 'ISO-8859-1' and 'cp 1252'; Windows-1252
seems to be an alias for this encoding).
The only way I can see with the current draft is not to use the string
'ISO-8859-1' at all for 'HTML5', because this format defines that
it is interpreted as 'Windows-1252' and not, as intended, as 'ISO-8859-1'.
This is a problem because 'ISO-8859-1' is, for example, the default
encoding for HTML4. Therefore, switching a project that still uses
HTML4 (and not yet XHTML) to 'HTML5' seems to require
switching to UTF-8 to avoid ambiguities with the encoding due to
the current draft.


> > 2. As far as I have seen, HTML5 has no version indication like previous
> > versions of HTML had and other popular formats like SVG have.
> > How can a browser identify, that a document is really intended as
> > 'HTML5' with the implicated  'willful' misinterpretations of encoding
> > information and no other HTMLversion?
>
> It doesn't matter, all versions of HTML are in practice processed with
> these mappings. It is indeed why HTML5 has these mappings -- because
> browsers already did this. We wouldn't add these mappings if we didn't
> have to to handle legacy content (content in previous versions of HTML).
>

Well, for HTML4, XHTML 1.x and all other XML formats this is simply
a bug of the browser, nothing for authors to worry about, because the
string 'ISO-8859-1' has a well-defined meaning in all these formats,
completely independent of the behaviour of current buggy browsers.
And if this bug leads some authors to label 'cp 1252' as
'ISO-8859-1', I think this is even more of an indication that the bug
has to be fixed in browsers, so that those authors learn to fix their
documents and get a well-defined encoding, instead of hiding
the bug and thereby preventing authors from fixing it.

Authors of 'ISO-8859-1' documents in particular currently do not have to
worry about the bug, because they (typically) do not use the characters
with different meanings in the two encodings.
With the current 'HTML5' draft, the label 'ISO-8859-1' itself
is ambiguous and has to be avoided to create a well-defined document
in the format 'HTML5' (or 'HTML5' has to be avoided to create a
well-defined document).

> > Assuming that a viewer is able to identify a document somehow being a
> > HTML5 document after looking into the content and for example a server
> > sended 'ISO-8859-1' before, does this mean, that the viewer switches to
> > or reparses the document with 'Windows-1252' again?
>
> I don't understand the question.

If a server, an XML processing instruction or maybe a meta element
indicates the encoding as 'ISO-8859-1', a proper browser has to encode
the document with 'ISO-8859-1' (with the implication that some characters
defined differently in 'Windows-1252' do not have a useful graphical or
acoustical representation in 'ISO-8859-1'). If, during processing, this
hypothetical proper browser is somehow able to detect that the current
document is 'HTML5', the browser has to switch the encoding, and some
characters may be interpreted differently (maybe including a useful
graphical or acoustical representation).
Obviously there are several options for such a proper browser as to when
and how to switch. Of course, this problem does not occur if a buggy
browser interprets the encoding wrongly for all formats, not only for
'HTML5' ;o) But 'HTML5' cannot redefine how to interpret encoding
information for other formats or versions (HTML4, XHTML1, SVG, MathML,
RDF, DAISY, FictionBook etc.).

>
> > Obviously it would be better to avoid such misinterpretation by using an
> > encoding like UTF-8 not confused by the current HTML5 draft, however due
> > to the history of older projects or server configurations it might be
> > still convenient for many authors to continue to use 'ISO-8859-1'
> > instead of other encodings, even if they switch for example from HTML4
> > to HTML5 for some documents.
>
> Hopefully my answers above will reassure you that this is not in fact a
> problem that authors will face.

Yes and no - the behaviour of current buggy browsers for other formats
is in practice not really a problem for authors of 'ISO-8859-1' documents,
because these documents typically will not contain the ambiguous
characters.
However, if it is specified that 'ISO-8859-1' is not 'ISO-8859-1' in 'HTML5',
this is a general problem for authors wanting to create well-defined
documents with this encoding, because there is currently no string for
'HTML5' to provide the proper encoding information: the
commonly used string 'ISO-8859-1' is tainted or corrupted by the
current 'HTML5' draft.

If it is known that many buggy browsers use the wrong encoding,
this can of course be mentioned in the 'HTML5' draft as an
(important) informational note for authors to be careful, but the
draft should not redefine the meaning of the string in a way that is
incompatible with every other format.
The 'HTML5' draft should not mix up reporting browser bugs
with proper definitions. If this happens more often, thoughtful
authors may get the impression that 'HTML5' is only about
the behaviour of current buggy browsers and not something that
defines a new version of (X)HTML - which is something completely
different. Even if there is no browser without bugs, an author
can still write well-defined documents in such a format, but
this is not possible if the format itself has major ambiguities
and contradictions.
Indeed, though there was never a browser interpreting
HTML4 completely and correctly, it is still pretty useful for writing
documents (maybe less useful than XHTML, but 'HTML5'
avoids this problem by already defining an XML variant too).
Containing too many ambiguities, bloomers and historical
stupidities due to browser bugs, 'HTML5' will never be useful
for creating well-defined documents.








Re: [HTML5] 2.8 Character encodings

by Ian Hickson

On Tue, 28 Jul 2009, Dr. Olaf Hoffmann wrote:

> > >
> > > 1. Which string has an author to note, if he really wants to
> > > indicate, that the encoding is for example 'ISO-8859-1' and not
> > > 'Windows-1252'?
> >
> > "ISO-8859-1". If the author has really used that encoding, then there
> > is no difference between them (1252 is a superset).
>
> I know, that there is only a difference for a few characters, not
> relevant for ISO-8859-1 usage (see tables at wikipedia for example).
> However, because I prefer to provide only well defined documents in
> publications, I want to be sure, that if ISO-8859-1 is mentioned within
> a document, that this really means ISO-8859-1 and not something else.

If you use ISO-8859-1 correctly (i.e. no control characters) then there is
no way to distinguish the behaviour you want from the behaviour the spec
describes if you label your document as ISO-8859-1.


> This is a problem, because 'ISO-8859-1' is the default encoding for
> HTML4 for example.

HTML4 has no default encoding.


> Therefore to switch from project still having HTML4 (and no XHTML
> already) to 'HTML5' seems to require to switch to UTF-8 to avoid
> plurivalences with the encoding due to the current draft.

If you're using ISO-8859-1 correctly, you can continue using it without
any change whatsoever.


> > > 2. As far as I have seen, HTML5 has no version indication like
> > > previous versions of HTML had and other popular formats like SVG
> > > have. How can a browser identify, that a document is really intended
> > > as 'HTML5' with the implicated 'willful' misinterpretations of
> > > encoding information and no other HTMLversion?
> >
> > It doesn't matter, all versions of HTML are in practice processed with
> > these mappings. It is indeed why HTML5 has these mappings -- because
> > browsers already did this. We wouldn't add these mappings if we didn't
> > have to to handle legacy content (content in previous versions of
> > HTML).
>
> Well for HTML4 and XHTML1.x and all other XML formats this is simple
> a bug of the browser, nothing to worry about for authors, because the
> string 'ISO-8859-1' has a well defined meaning in all these formats
> completely independent from the behaviour of current buggy browsers.

A bug that is done the same way by all browsers is a de facto standard,
and unless you don't really care about how your document is seen by your
readers, it is something you have to worry about.



> And if some authors are forced by this bug to indicate 'cp 1252' as
> 'ISO-8859-1' I think, this is even more and indication, that this bug
> has to be fixed in browsers to inform those authors to fix their
> documents to get a well defined encoding for their document instead of
> hiding such a bug to prevent authors to fix such a nasty bug.

That would be wonderful, but we can't get there from here since it would
cause pages to break and thus browsers refuse to do it.


> Especially 'ISO-8859-1' authors currently do not have to worry about
> the bug, because they (typically) do not use the characters with
> different meaning in both encodings.

Indeed, as they are control characters there is no reason that they
should ever use those characters at all.


> With the current 'HTML5' draft already the indication as 'ISO-8859-1' is
> plurivalent and has to be avoided to create a well defined document in
> the format 'HTML5' (or 'HTML5' has to be avoided to create a well
> defined document).

It's well-defined, just not defined the way you would like.


> If a server, an XML-processing instruction or maybe a meta-element
> indicates the encoding as 'ISO-8859-1' a proper browser has to encode
> the document with 'ISO-8859-1' (with the implication that some characters
> defined differently in 'Windows-1252' do not have a useful graphical or
> acoustical representation in 'ISO-8859-1'). If within the processing, these
> hypothetical proper browser is able to detect somehow, that the current
> document is 'HTML5', the browser has to switch the encoding and some
> characters may be interpreted differently (maybe including a useful
> graphical or acoustical representation).

I assume you mean "decoding", not "encoding".

A proper HTML5 browser will treat ISO-8859-1 as Windows-1252 for any
text/html processing. There is no need for a magical detection ability.
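
A minimal sketch of what "treat ISO-8859-1 as Windows-1252" amounts to in code, assuming a simple label table. Only the ISO-8859-1 entry is taken from this thread; the names `MISINTERPRETED_LABELS`, `PYTHON_CODECS`, `effective_encoding` and `decode_text_html` are hypothetical, and the table in the draft lists more labels than shown here.

```python
# Illustrative subset only: this thread confirms the ISO-8859-1 entry.
MISINTERPRETED_LABELS = {"iso-8859-1": "windows-1252"}

# Python codec names for the encodings used here (an assumption of this sketch).
PYTHON_CODECS = {"windows-1252": "cp1252", "utf-8": "utf-8"}


def effective_encoding(declared_label):
    """Map a declared label to the encoding actually used for text/html."""
    key = declared_label.strip().lower()
    return MISINTERPRETED_LABELS.get(key, key)


def decode_text_html(data, declared_label):
    """Decode bytes the way the mapping above prescribes."""
    encoding = effective_encoding(declared_label)
    codec = PYTHON_CODECS.get(encoding, encoding)
    # errors="replace" because Python's cp1252 codec leaves a few bytes undefined.
    return data.decode(codec, errors="replace")


if __name__ == "__main__":
    # Byte 0x80 is a control character in ISO-8859-1 but the Euro sign here.
    print(decode_text_html(b"5 \x80", "ISO-8859-1"))
```

The lookup depends only on the declared label, not on any property of the document, which is why no detection of an "HTML5 document" is involved.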


> But 'HTML5' cannot redefine how to interprete encoding information for
> other formats or versions (HTML4, XHTML1, SVG, MathML, RDF, DAISY,
> FictionBook etc)

HTML5 redefines how to interpret encoding information for HTML4, as it
replaces it. It does not affect processing of XML or other formats.

(In practice, what HTML5 requires is what was implemented anyway for
HTML4, so there is no actual difference.)


> If it is known, that many buggy browser use the wrong encoding, of
> course, this can be mentioned in the 'HTML5' draft as an (important)
> informational note for authors to be careful, but the draft should not
> redefine the meaning of the string incompatible to any other format.

Yes, it should. I'm documenting what is actually implemented, and making
sure we have interoperable behaviour, and making sure that new tools will
work with legacy content interoperably with legacy user agents and future
user agents. This requires a candid approach and cannot be achieved by
beating around the bush in a politically correct way about what should
ideally have happened vs what actually happens in the real world.

Yes, this means ISO-8859-1's control characters can't be expressed in
HTML documents. Tough. Deal with it. That's how implementations are,
that's what legacy content relies on, and pretending otherwise is a big
waste of everyone's time.


> Indeed, though there was never a browser interpreting
> HTML4 complete and correct

Exactly. There wasn't. There _will_ be one that implements HTML5
completely and correctly.


> Containing too many plurivalences, bloomers and historical stupidities
> due to browser bugs, 'HTML5' will never be useful to create well defined
> documents.

It's already useful for this purpose. Being honest about what happens
makes it _more_ useful.

--
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [HTML5] 2.8 Character encodings

by Dr. Olaf Hoffmann

Hello,

over the last ten or more years I have gained a lot
of experience, especially with German authors (they
typically want to use umlauts, the ß ligature and
sometimes the Euro sign), and their problems.
Every month I have to explain again how to
distinguish between UTF-8 and ISO-8859-1(5) and
that ISO-8859-1 has no BOM, or rather that
the BOM is displayed as visible characters in some
browser versions etc.
One has to explain that the indication from the
server (stupid or correct) is more relevant than
indications within the document, how to
get it right with PHP or with the .htaccess file
of the Apache web server, and that one cannot
change the encoding within one document.
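
The BOM problem mentioned above can be reproduced in a few lines. This is a sketch assuming Python's standard `codecs` module; the helper name `bom_seen_as_latin1` is made up for illustration.

```python
import codecs


def bom_seen_as_latin1():
    """Decode the three UTF-8 BOM bytes as if they were ISO-8859-1 text."""
    return codecs.BOM_UTF8.decode("latin-1")


if __name__ == "__main__":
    # The BOM bytes EF BB BF become three visible characters.
    print(repr(bom_seen_as_latin1()))
    # Decoding with "utf-8-sig" strips the BOM instead of displaying it.
    print(repr((codecs.BOM_UTF8 + b"abc").decode("utf-8-sig")))
```

A UTF-8 file with a BOM that is served or labelled as ISO-8859-1 therefore starts with three stray characters on the page.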

And of course I still remember the behaviour
of the browsers I used years ago, when the
Euro was introduced in Europe and some authors
tried to use the Euro sign without masking
within ISO-8859-1. There was a clear indication
that this did not work in these legacy browsers.
Therefore it is not true that the interpretation
of 'ISO-8859-1' as 'Windows-1252' is compatible
with older browsers - some of them had no bug and
indicated the unrepresentable character, for
example with a question mark, a box etc.
And indeed, it was simple to explain that the
author either has to use another encoding or
has to mask the character to fix his/her bug.
This simple approach may now be undermined
by bugs in current versions of browsers.
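
Both options for the Euro sign, another encoding or masking, can be sketched as follows. This assumes Python's `latin-1` codec as a stand-in for ISO-8859-1; the function names are hypothetical.

```python
def fits_latin1(text):
    """Return True if every character can be stored in ISO-8859-1."""
    try:
        text.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


def mask_for_latin1(text):
    """Encode as ISO-8859-1, masking other characters as numeric references."""
    return text.encode("latin-1", errors="xmlcharrefreplace")


if __name__ == "__main__":
    price = "Gr\u00fc\u00dfe, 5 \u20ac"
    print(fits_latin1(price))      # the Euro sign does not fit
    print(mask_for_latin1(price))  # the Euro sign becomes &#8364;
```

The umlauts and ß are stored as single bytes, while the Euro sign is written as the character reference `&#8364;`, so the resulting bytes stay inside genuine ISO-8859-1.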


'HTML5' seems to introduce a new rule for how to
identify the encoding. The encoding problem
is obviously already hard to understand
for many authors. Suddenly this new rule, with
some opaque method to identify 'HTML5'
documents, complicates the situation even more
and makes it much harder to explain what to
do to get a well-defined document or script
output and how to fix bugs.

Therefore the main questions remain open
so far:

1. How to indicate the 'ISO-8859-1' encoding
within an 'HTML5' document and not
'Windows-1252', if an author wants to specify
'ISO-8859-1' and nothing else?

2. How does a proper viewer/browser identify
that a document is 'HTML5' and that this
specific rule has to be applied if 'ISO-8859-1'
is indicated?

3. At which point does the encoding information
switch from the information given by the
server or the XML processing instruction
to the specific rule of 'HTML5' to interpret
the string 'ISO-8859-1' as an indication of
'Windows-1252'?

Indeed, up to here, this is all about encoding
information, not how a document is decoded
by the viewer (buggy or not).


Olaf




Re: [HTML5] 2.8 Character encodings

by bilcorry

Just so I'm clear, the problem you are defining is that HTML5 requires browsers to interpret ISO-8859-1 as Windows-1252 [1], and traditionally authors that used Windows-1252 encoding, but specified ISO-8859-1 for the character set, were given feedback of the mismatch because the browser would fail to render the correct glyph.  And because this feedback has been removed in HTML5 (the correct glyph is shown despite the charset mismatch), it goes unnoticed by the author and the page remains misidentified as ISO-8859-1 when it's really Windows-1252.  Is that correct?

It may be worthwhile to point out that Firefox 2 & 3 and Internet Explorer 7 & 8 all render content encoded as Windows-1252 but identified as ISO-8859-1 as Windows-1252 -- the exact behavior of HTML5.  And it's possible older versions of FF/IE and other browsers also have this behavior.  So the lack of feedback already exists, which may explain the preponderance of pages misidentified as ISO-8859-1.

Given the above, I'd argue that HTML5 is more compatible with existing behavior than it would be if it required strict rendering as ISO-8859-1.

If you are curious, you can test your browser to see how it renders UTF-8 and Windows-1252 byte streams given a particular encoding:

        http://www.corry.biz/charset_mismatch.lasso

For fun, try MacRoman encoding in Internet Explorer, you'll see it is rendered as Windows-1252.


- Bil

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html#misinterpreted-for-compatibility



Dr. Olaf Hoffmann wrote on 7/31/2009 5:53 AM:

> Hello,
>
> within the last ten or more years I have already a lot
> of experience especially with german authors (they
> typically want to use Umlaute and the ß-ligature and
> sometimes the Euro-sign) and their problems.
> Every month I have to explain again, how to
> distinguish between UTF-8 and ISO-8859-1(5) and
> that ISO-8859-1 has no BOM - respectively that
> this is displayed as visible characters in some
> browser versions etc.
> One has to explain, that the indication of the
> server (stupid or correct) is more relevant than
> indications within the document and how to
> get it correct with PHP or with the .htaccess file
> of the Apache web-server and that one cannot
> change the encoding within one document.
>
> And of course, I still have in mind the behaviour
> of the browsers I used years ago, when the
> Euro was introduced in europe and some authors
> tried to use the Euro-sign without masking
> within ISO-8859-1. There was a clear indication,
> that this does not work in these legacy browsers,
> therefore it is not true, that the interpretation
> of 'ISO-8859-1' as 'Windows-1252' is compatible
> with older browsers - they/some had no bug and
> indicated the not representable character for
> example with a question mark, a box etc.
> And indeed, it was simple to explain, that the
> author either has to use another encoding or
> has to mask the character to fix his/her bug.
> This simple approach is maybe corrupted
> now with bugs in current versions of browsers.
>
>
> 'HTML5' seems to introduce a new rule how to
> identify the encoding. The encoding problem
> is obviously already hardly understandable
> for many authors. Suddenly this new rule with
> some opaque method to identify 'HTML5'
> documents complicates the situation even more
> and makes it much harder to explain, what to
> do to get a well defined document or script
> output and how to fix bugs.
>
> Therefore the main questions remain open up to
> here:
>
> 1. How to indicate the 'ISO-8859-1' encoding
> within an 'HTML5' document and not
> 'Windows-1252', if an author wants to specify
> 'ISO-8859-1' and nothing else?
>
> 2. How does a proper viewer/browser identify,
> that a document is 'HTML5' and that this
> specific rule has to be applied, if 'ISO-8859-1'
> is indicated.
>
> 3. At which point the encoding information
> switches from the information given by the
> server or the XML processing instruction
> to the specific rule of 'HTML5' to interprete
> the string  'ISO-8859-1' as indication for
> 'Windows-1252'?
>
> Indeed, up to here, this is all about encoding
> information, not how a document is decoded
> by the viewer (buggy or not).
>
>
> Olaf
>



Re: [HTML5] 2.8 Character encodings

by Dr. Olaf Hoffmann

Bil Corry:
> Just so I'm clear, the problem you are defining is that HTML5 requires
> browsers to interpret ISO-8859-1 as Windows-1252 [1], and traditionally
> authors that used Windows-1252 encoding, but specified ISO-8859-1 for the
> character set, were given feedback of the mismatch because the browser
> would fail to render the correct glyph.  And because this feedback has been
> removed in HTML5 (the correct glyph is shown despite the charset mismatch),
> it goes unnoticed by the author and the page remains misidentified as
> ISO-8859-1 when it's really Windows-1252.  Is that correct?
>

With the still open questions I am mainly trying to find out whether it is
possible to specify that an 'HTML5' document has an encoding like
'ISO-8859-1' and not 'Windows-1252'.

As far as I understand the specification, this is not possible, but
then the other questions become interesting, because typically
what is relevant is what the server indicates, not what is mentioned
explicitly or implicitly in the document.
And if 'HTML5' changes the relevance of the encoding information
(which is not necessarily bad), I of course want to know
how a viewer/browser identifies a document as 'HTML5', because
for other formats or versions this behaviour is simply a bug and
not a feature. This is more a problem of the missing version
indication. For example, the newest XHTML variant has the
attribute version="XHTML+RDFa 1.0"; the formal identification
problem would be solved for 'HTML5' with, for example,
version="HTML5". Then the document is at least well defined
and specific rules are applicable if the specification is known
to the decoder.
A proper browser of course has to search for such a version
indication to know which encoding information applies before
the document is presented.
Even if current browsers do it differently (wrongly), this is no reason
not to have a rule that defines a correct way.


> It may be worthwhile to point out that Firefox 2 & 3 and Internet Explorer
> 7 & 8 all render content encoded as Windows-1252 but identified as
> ISO-8859-1 as Windows-1252 -- the exact behavior of HTML5.  And it's
> possible older versions of FF/IE and other browsers also have this
> behavior.  So the lack of feedback already exists, which may explain the
> preponderance of pages misidentified as ISO-8869-1.
>
> Given the above, I'd argue that HTML5 is more compatible with existing
> behavior than it would be if it required strictly render of ISO-8859-1.
>
> If you are curious, you can test your browser to see how it renders UTF-8
> and Windows-1252 byte streams given a particular encoding:
>
> http://www.corry.biz/charset_mismatch.lasso
>
> For fun, try MacRoman encoding in Internet Explorer, you'll see it is
> rendered as Windows-1252.
>
>

As already mentioned, years ago there were browsers/versions without this
bug, and for typical 'ISO-8859-1' documents a slightly wrong decoding is not
really a problem, as already discussed. A practical problem currently only
appears if a document has the wrong encoding information and a legacy browser
without the bug is used to present it. Because not all legacy browsers
have this bug, it misinforms authors to make them believe that
everything is solved for buggy documents because all browsers
compensate for this bug with yet another browser bug. However, authors
of documents with wrong encoding information are at fault anyway;
this is not really a problem of a specification. It is just more difficult to
teach them to write proper documents. Indifferent or ignorant
authors are not necessarily a problem for a specification; they are
mainly a problem for the general audience.

The questions are more about the problem of how to indicate
that an 'HTML5' document really has the encoding ISO-8859-1.
This can be important for long-lived documents and archival storage,
because in 50 or 100 or 1000 years one cannot rely on the behaviour
of browsers of the year 2009, but it might still be possible to decode
well-defined documents with completely different programs.
To simplify this, one should have simple and intuitive indications
and not such a bloomer as writing 'ISO-8859-1' if you mean
'Windows-1252'.


With the current draft, one can only recommend 'HTML5'+UTF-8
or another format/version like XHTML+RDFa for long-lived
documents and archival storage (which is not necessarily bad either,
just something interesting to know for some people).


Olaf



Re: [HTML5] 2.8 Character encodings

by bilcorry

Dr. Olaf Hoffmann wrote on 7/31/2009 1:10 PM:
> With the still open questions I mainly try to find out whether it is
> possible to specify, that a 'HTML5' document has an encoding like
> 'ISO-8859-1' and not 'Windows-1252'.

You do it the same way as you would for any character set, by specifying the content encoding as ISO-8859-1.  Typically this is done via the Content-Type header:

        Content-Type: text/html; charset=ISO-8859-1

That header means, "This HTML document is in the ISO-8859-1 character set."  By inference, it also means that it isn't Windows-1252, or UTF-8, etc.
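
For illustration, a consumer could read the charset out of such a header roughly as follows. This is a sketch using Python's standard `email.message.Message` class to parse the header value; the function name `charset_from_content_type` is made up for this example.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lower-cased charset parameter, or None if there is none."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


if __name__ == "__main__":
    print(charset_from_content_type("text/html; charset=ISO-8859-1"))
    print(charset_from_content_type("text/html"))
```

The header only carries the label; which decoder a user agent then picks for that label is a separate step.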

 
> As far as I understand the specification, this is not possible

I think what you mean is it isn't possible to force the UA to use the ISO-8859-1 charset when specified and you're right.  As I mentioned in my previous email, IE will display Windows-1252 when MacRoman is specified which is clearly wrong -- the glyphs don't even come close to matching each other.  As an author you have to work around that.  And as an author you must ensure the encoding is correct to get consistent results.


> but
> then the other questions become interesting, because typically
> it is relevant what the server indicates, not what is mentioned
> explictely or implicitely in the document.

I haven't ever tested what happens when the content-type header doesn't match the meta, but considering it's incorrect, an author can hardly expect positive results.  I wonder if HTML5 specifies the behavior in this case?


> As already mentioned, years ago there were browsers/versions without this
> bug and for typical 'ISO-8859-1' a slightly wrong decoding is not really a
> problem, as already discussed. A practical problem currently only appears,
> if a document has the wrong encoding information and a legacy browser
> without the bug is used to present it. Because not all legacy browsers
> have this bug, it is a misinformation of authors to make them believe, that
> everything is solved for buggy documents, because all browser
> compensate this bug with yet another browser bug. However, authors
> of documents with wrong encoding informations are guilty anyway,
> this is not really a problem of a specification. It is just more difficult to
> teach them to write proper documents. Indifferent or ignorant
> authors are not necessarily a problem for a specification, they are
> a problem mainly for the general audience.

This also describes the issue with browsers doing a best-guess with rendering HTML content that is malformed.  The solution to both malformed HTML and misidentified charsets is to run the page through validator.w3.org -- both get flagged if wrong.  If you want to see, try this:

        http://validator.w3.org/check?uri=http%3A%2F%2Fwww.corry.biz%2Fcharset_mismatch.lasso%3Fcharset%3DISO-8859-1

It (correctly) returns the error:

        Using windows-1252 instead of the declared encoding iso-8859-1.
        Line 22, Column 87: Unmappable byte sequence: 9d.

So if your authors care about checking their markup with validator.w3.org, they will also have their charset checked as well.
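
The "unmappable byte 9d" complaint can be reproduced outside a browser. Below is a minimal sketch in Python, assuming its built-in `cp1252` and `latin-1` codecs as stand-ins for Windows-1252 and ISO-8859-1. The sample bytes and the helper name `try_decode` are made up for illustration, and other decoders may treat 0x9D differently than Python does.

```python
# Hypothetical sample: ASCII text followed by the single byte 0x9D.
data = b"caf\x9d"

def try_decode(raw: bytes, encoding: str) -> str:
    """Return the decoded text, or a short note if a byte is unmappable."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"unmappable byte 0x{raw[exc.start]:02x} at offset {exc.start}"

# Python's cp1252 codec leaves 0x9D unassigned, so decoding fails there.
print(try_decode(data, "cp1252"))

# latin-1 maps every byte; 0x9D becomes the C1 control character U+009D.
print(repr(try_decode(data, "latin-1")))
```

The point is only that a checker can spot such a byte mechanically; it says nothing about what the author actually intended.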


> The questions are more about the problem, how to indicate,
> that a ' HTML5' document really has the encoding ISO-8859-1.
> This can be important for long living documents and archival storage.
> Because in 50 or 100 or 1000 years one cannot rely on the behaviour
> of browsers of the year 2009, but it might be still possible to decode
> well defined documents with completely different programs.
> To simplify this, one should have simple and intuitive indications
> and not such a bloomer like to write 'ISO-8859-1' if you mean
> 'Windows-1252'.

A thousand years from now, if they do what HTML5 does now and use Windows-1252 when ISO-8859-1 is specified, they'll be guaranteed to correctly view the document (assuming it's in either ISO-8859-1 or Windows-1252).  The same cannot be said for viewing Windows-1252 as ISO-8859-1.


> With the current draft, one can only recommend 'HTML5'+UTF-8
> or another format/version like XHTML+RDFa for long living
> documents and archival storage (what is not necessarily bad too,
> just something interesting to know for some people).

UTF-8 isn't free from issues either -- I've seen Windows-1252 served as UTF-8, which produces illegal byte sequences.  Or here's an example where the page (Windows-1252) doesn't specify a charset at all; in Firefox it's rendered as UTF-8 with broken bytes, and in IE it's rendered with the correct charset of Windows-1252:

        http://cspinet.org/new/200907301.html

Which browser do you think they test their site with?  Which browser do you think the end user thinks is broken?
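
To illustrate the "illegal byte sequences" case, here is a small sketch in Python, again assuming the built-in `cp1252` codec stands in for Windows-1252. The sample string is invented and is not taken from the page linked above.

```python
# Hypothetical sample text with curly quotes and an accented letter.
original = "\u201cquoted\u201d caf\u00e9"

# Bytes as a Windows-1252 editor would save them.
raw = original.encode("cp1252")

# A strict UTF-8 decoder rejects these bytes outright.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# A lenient decoder substitutes U+FFFD for each broken sequence,
# which is roughly what a reader sees as "broken bytes".
lenient = raw.decode("utf-8", errors="replace")

print(utf8_ok)
print(lenient)
print(raw.decode("cp1252") == original)
```

Decoding with the encoding that was actually used round-trips cleanly, so the damage comes from the wrong label, not from the bytes themselves.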


- Bil




Re: [HTML5] 2.8 Character encodings

by Dr. Olaf Hoffmann

Bil Corry:

> Dr. Olaf Hoffmann wrote on 7/31/2009 1:10 PM:
> > With the still open questions I mainly try to find out whether it is
> > possible to specify, that a 'HTML5' document has an encoding like
> > 'ISO-8859-1' and not 'Windows-1252'.
>
> You do it the same way as you would for any character set, by specifying
> the content encoding as ISO-8859-1.  Typically this is done via the
> Content-Type header:
>
> Content-Type: text/html; charset=ISO-8859-1
>
> That header means, "This HTML document is in the ISO-8859-1 character set."
>  By inference, it also means that it isn't Windows-1252, or UTF-8, etc.

I know this, and it is true for formats other than 'HTML5'. The current draft
of 'HTML5' has a specific rule that this means 'Windows-1252' and
not 'ISO-8859-1', and this seems to supersede what the server indicates
if a viewer is able to identify it as an 'HTML5' document.

>
> > As far as I understand the specification, this is not possible
>
> I think what you mean is it isn't possible to force the UA to use the
> ISO-8859-1 charset when specified and you're right.  

It is never possible for an author to force an arbitrary viewer to do
anything specific. A careful author can only write well-defined
and valid documents and test them with some currently
known viewers. The results of these tests are not necessarily
representative of other viewers that are not accessible to the author
(for example because they are not published yet).



> As I mentioned in my
> previous email, IE will display Windows-1252 when MacRoman is specified
> which is clearly wrong -- the glyphs don't even come close to matching each
> other.  As an author you have to work around that.  And as an author you
> must ensure the encoding is correct to get consistent results.
>

Maybe this browser does not know this encoding; in that case the user
should be warned that something stupid might happen.
The warning is clearly the responsibility of the implementor, not the author.
Of course, the author can avoid such problems by using more common
encodings like UTF-8 or ISO-8859-1.

> > but
> > then the other questions become interesting, because typically
> > it is relevant what the server indicates, not what is mentioned
> > explictely or implicitely in the document.
>
> I haven't ever tested what happens when the content-type header doesn't
> match the meta, but considering it's incorrect, an author can hardly expect
> positive results.  I wonder if HTML5 specifies the behavior in this case?
>

I think 'HTML5' only specifies a specific rule for these few strings:
'Windows-1252' is used instead of them. The indication is superseded
only if the author provides one of these strings, regardless of
how the encoding was indicated. If the server
sends UTF-8 in the Content-Type header, the specific 'HTML5' rule
does not apply and everything is interpreted as UTF-8, no matter
what is mentioned within the document.
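
The behaviour described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the draft's algorithm: the function names and the one-entry override table are hypothetical, only the ISO-8859-1 case is taken from this thread, and the real draft has further steps that are omitted here.

```python
from email.message import Message
from typing import Optional

# Assumed subset of the label overrides for text/html; only the
# ISO-8859-1 entry comes from the discussion above.
TEXT_HTML_OVERRIDES = {"iso-8859-1": "windows-1252"}

def header_charset(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, or None if absent

def effective_encoding(content_type: str,
                       meta_charset: Optional[str]) -> Optional[str]:
    """Header label wins over the in-document label; then apply overrides."""
    label = header_charset(content_type)
    if label is None and meta_charset is not None:
        label = meta_charset.lower()
    if label is None:
        return None
    return TEXT_HTML_OVERRIDES.get(label, label)

print(effective_encoding("text/html; charset=ISO-8859-1", None))
print(effective_encoding("text/html; charset=UTF-8", "ISO-8859-1"))
print(effective_encoding("text/html", "ISO-8859-1"))
```

In this sketch the override depends only on the label string, not on whether it arrived via the header or the document, which matches the reading given in the paragraph above.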


> > As already mentioned, years ago there were browsers/versions without this
> > bug and for typical 'ISO-8859-1' a slightly wrong decoding is not really
> > a problem, as already discussed. A practical problem currently only
> > appears, if a document has the wrong encoding information and a legacy
> > browser without the bug is used to present it. Because not all legacy
> > browsers have this bug, it is a misinformation of authors to make them
> > believe, that everything is solved for buggy documents, because all
> > browser
> > compensate this bug with yet another browser bug. However, authors
> > of documents with wrong encoding informations are guilty anyway,
> > this is not really a problem of a specification. It is just more
> > difficult to teach them to write proper documents. Indifferent or
> > ignorant
> > authors are not necessarily a problem for a specification, they are
> > a problem mainly for the general audience.
>
> This also describes the issue with browsers doing a best-guess with
> rendering HTML content that is malformed.  The solution to both malformed
> HTML and misidentified charsets is to run the page through validator.w3.org
> -- both get flagged if wrong.  If you want to see, try this:
>
> http://validator.w3.org/check?uri=http%3A%2F%2Fwww.corry.biz%2Fcharset_mismatch.lasso%3Fcharset%3DISO-8859-1
>
> It (correctly) returns the error:
>
> Using windows-1252 instead of the declared encoding iso-8859-1.
> Line 22, Column 87: Unmappable byte sequence: 9d.
>
> So if your authors care about checking their markup with validator.w3.org,
> they will also have their charset checked as well.
>

Well, the validator has some bugs too. If no encoding is specified for
text/*, ISO-8859-1 should be used (according to HTTP).
I noticed some months ago that the validator reported spurious
errors for XHTML+RDFa instead of indicating that this cannot be
validated yet (without an optional doctype within the document;
however, the version indication is done with the version attribute,
not a doctype). I noted errors for SVG documents too. The validator is
a pretty good help in many cases (less useful for SVG, because it
checks almost no attribute values), but an author still has to be
careful with the results...


> > The questions are more about the problem, how to indicate,
> > that a ' HTML5' document really has the encoding ISO-8859-1.
> > This can be important for long living documents and archival storage.
> > Because in 50 or 100 or 1000 years one cannot rely on the behaviour
> > of browsers of the year 2009, but it might be still possible to decode
> > well defined documents with completely different programs.
> > To simplify this, one should have simple and intuitive indications
> > and not such a bloomer like to write 'ISO-8859-1' if you mean
> > 'Windows-1252'.
>
> A 1000 years from now, if they do what HTML5 does now and use Windows-1252
> when ISO-8859-1 is specified, they'll be guaranteed to correctly view the
> document (assuming it's in either ISO-8859-1 or Windows-1252).  The same
> can not be said for viewing Windows-1252 as ISO-8859-1.
>
> > With the current draft, one can only recommend 'HTML5'+UTF-8
> > or another format/version like XHTML+RDFa for long living
> > documents and archival storage (what is not necessarily bad too,
> > just something interesting to know for some people).
>
> UTF-8 isn't free from issues either -- I've seen Windows-1252 served as
> UTF-8 which produces illegal byte sequences.  Or here's an example where
> the page (Windows-1252) doesn't specify a charset at all; in Firefox it's
> rendered as UTF-8 with broken bytes and IE it's rendered with the correct
> charset of Windows-1252:
>
> http://cspinet.org/new/200907301.html
>
> Which browser do you think they test their site with?  Which browser do you
> think the end user thinks is broken?
>

The Gecko I use indicates ISO-8859-1, as does Konqueror, not UTF-8
or Windows-1252. Opera notes a problem (unsupported) and also notes
that Windows-1252 is used.
Because this tag soup is served as text/html without encoding
information, I think Gecko and Konqueror are perfectly correct
with ISO-8859-1.
There is no indication that this might be 'HTML5'. Therefore no
specific rule from the 'HTML5' draft needs to be applied.
The document does not indicate at all which version of HTML it
uses; looking into the source code, I think it uses a proprietary
private slang of the author (even comments are written incorrectly ;o)
A program cannot be expected to derive any version information
from this tag soup.
The presentation cannot be wrong, because it is undefined
(a viewer may use any set of rules to interpret this or reject it
as nonsense; a viewer can use 'HTML5' to try to interpret this,
but can use any other set of rules as well, because the author
does not indicate any version information at all).
The encoding, however, seems to be defined (only) by the
HTTP protocol, not by the document itself.

Of course, ISO-8859-1 (and Windows-1252) are only compatible
with UTF-8 for a basic set of characters. This can be seen more
easily with documents not in English but in other languages like
German, French, or Spanish, which are covered pretty well by ISO-8859-1
but have several important characters encoded differently in
ISO-8859-1 and UTF-8. Because there is no indication that this
might be XML, there is no need to think that UTF-8 is a proper
choice for decoding.







Re: [HTML5] 2.8 Character encodings

by bilcorry

Dr. Olaf Hoffmann wrote on 8/1/2009 10:08 AM:

> Bil Corry:
>> Dr. Olaf Hoffmann wrote on 7/31/2009 1:10 PM:
>>> With the still open questions I mainly try to find out whether it is
>>> possible to specify, that a 'HTML5' document has an encoding like
>>> 'ISO-8859-1' and not 'Windows-1252'.
>> You do it the same way as you would for any character set, by specifying
>> the content encoding as ISO-8859-1.  Typically this is done via the
>> Content-Type header:
>>
>> Content-Type: text/html; charset=ISO-8859-1
>>
>> That header means, "This HTML document is in the ISO-8859-1 character set."
>>  By inference, it also means that it isn't Windows-1252, or UTF-8, etc.
>
> This I know and is true for other formats but 'HTML5', the current draft
> of 'HTML5' has a specific rule, that this means 'Windows-1252' and
> not 'ISO-8859-1' - and this seems to supersede what the server indicates,
> if a viewer is able to identify is as a 'HTML5' document.

I started to reply, but realized this thread is just going circular.

At issue, you are claiming the HTML5 charset rules will create problems for authors -- can you provide some real-world examples?  I would be very interested to see some of your documents where your ISO-8859-1 encoding is broken by the HTML5 charset rules.



- Bil




Re: [HTML5] 2.8 Character encodings

by Dr. Olaf Hoffmann

Bil Corry:

>
> I started to reply, but realized this thread is just going circular.

Indeed, the main questions remain open ;o)

>
> At issue, you are claiming the HTML5 charset rules will create problems for
> authors -- can you provide some real-world examples?  I would be very
> interested to some of your documents where your ISO-8859-1 encoding is
> broken by the HTML5 charset rules.
>

Well, because 'HTML5' is currently just a draft, I do not use it.
I think, up to now, it does not even have a version indication, therefore it
is not obvious to me how to indicate that a document is written in
'HTML5'. Until several of these issues are solved and 'HTML5'
is really stable, I surely will not use it. Currently I use XHTML+RDFa
for new projects and to fill semantic gaps of (X)HTML.

But as already mentioned, for an author of an 'ISO-8859-1'-'HTML5'
document, apart from the version indication, it is already a problem to
specify the encoding used properly. This problem appears while a
document is being written and has to be solved before publication; therefore
published documents are not broken, because they simply are not
published due to this problem.
Therefore, if I start to write some test documents and this problem is
not avoided and a version indication is possible, I think I will use
UTF-8 for those documents. Typically this means that they are
incompatible with my other documents and scripts and will appear
in another directory with an Apache .htaccess file indicating the
different encoding.
I think Apache also has an option based on specific file name extensions;
this could perhaps be used for directories with mixed encodings.
Surely I will not explain this to other authors if this question comes up,
because it is too complex for many authors.
This does not cause broken documents; the construct is just more fragile,
and one has to take more care where to put files and how to name them, and
one has to switch the encoding in the editor for different projects. This is
only more work and more sources of possible errors, not to be recommended
for every author.
Therefore maybe I will never create more than test documents for
'HTML5', just to avoid such complications.
With the new microdata section, 'HTML5' seemed to get more
interesting for authors (well, the CURIEs are still missing, but there
seems to be a workaround with entity definitions within the otherwise
almost empty DOCTYPE); therefore it would have been interesting
to test this or to include this in tutorials for other authors, because
it already has a few more semantically relevant elements than
HTML4/XHTML1.x.

Olaf




Re: [HTML5] 2.8 Character encodings

by Anne van Kesteren-2

On Tue, 04 Aug 2009 10:25:35 +0200, Dr. Olaf Hoffmann  
<Dr.O.Hoffmann@...> wrote:
> [snip] I think, up to know, it has not even a version indication,  
> therefore it
> is not obvious to me how to indicate, that a document is written in
> 'HTML5'.

This is by design. We're removing versioning from (X)HTML much like CSS  
does not have versioning. (To be clear, not everyone in the HTML WG agrees  
with this design choice.)


> But as already mentioned, for an author of an 'ISO-8859-1'-'HTML5'
> document apart from the version indication it is already a problem to
> specify the used encoding properly.

No, you can just specify it. Just like you can in HTML4.


> This problem appears while a document is written and has to be solved  
> before publication, therefore
> published documents are not broken, because they simply are not
> published due to this problem.

I do not follow this.


> Therefore if I start to write some test documents and this problem is
> not avoided and a version indication is possible, I think, I will use
> UTF-8 for those documents.

This seems like a good idea regardless.


> Typically this means, that they are
> incompatible with other of my documents and scripts and will appear
> in another directory with an Apache-.htaccess file indicating the
> different encoding.

That is one solution. You could also always indicate the encoding in the  
document instead and instruct Apache to not include the charset parameter.


> I think, the Apache has an option with specific file name extensions too,
> this can be used for directories with mixed encodings maybe.

That is an option too. You can also set headers on a per-file basis using  
the Files directive.


> Surely I will not explain this to other authors, if this question comes  
> up, because it is too complex for many authors.

Agreed. Encoding is largely misunderstood. It makes more sense for editors  
to start defaulting to UTF-8 going forward and have everyone use that, in  
my opinion.


> This does not cause broken documents, the construct is just more fragile
> and one has to care more, where to put and how to name files and one
> has to switch the encoding in the editor for different projects. This is
> only more work and more sources of possible errors, not recommendable
> for every author.

If you simply switch to UTF-8 for all future work this will become less
and less of a problem. And then you've also covered other scripts should the
need arise to use them.


> Therefore maybe I will never create more than test documents for
> 'HTML5' just to avoid such complications.

Ok.


> With the new microdata section, 'HTML5' seemed to get more
> interesting for authors (well, the CURIEs are still missing, but there
> seems to be a workaround with entitiy definitions within the else
> almost empty DOCTYPE), therefore it would have been interesting
> to test this or to include this in tutorials for other authors, because
> it has already a few more semantically relevant elements than
> HTML4/XHTML1.x.

Since HTML5 is no longer SGML-based, entity definitions there will not work
and are non-conforming. The reason we did this was that, other than the
validator, no software processed text/html resources in this way, leading to
a lot of author confusion because of the clear mismatch between the
validator and other software.


--
Anne van Kesteren
http://annevankesteren.nl/


Re: [HTML5] 2.8 Character encodings

by Dr. Olaf Hoffmann

Anne van Kesteren:

> On Tue, 04 Aug 2009 10:25:35 +0200, Dr. Olaf Hoffmann
>
> <Dr.O.Hoffmann@...> wrote:
> > [snip] I think, up to know, it has not even a version indication,
> > therefore it
> > is not obvious to me how to indicate, that a document is written in
> > 'HTML5'.
>
> This is by design. We're removing versioning from (X)HTML much like CSS
> does not have versioning. (To be clear, not everyone in the HTML WG agrees
> with this design choice.)

Well, this is a problem for CSS too, because some properties are
defined differently in CSS2.1 than in CSS2.
I discovered this some time ago, for example for clipping in
some SVG test documents, which appeared wrong in Opera.
SVG depends on CSS2, therefore these tests are still well
defined; applied to (X)HTML they are not testable anymore,
because CSS has no version indication.

For 'HTML5': as long as I cannot simply write version="HTML5",
I cannot start to write HTML5 documents. This alone is a
'show stopper' for 'HTML5' currently. One can still discuss the
current draft, but for formal reasons one cannot write an
'HTML5' document ;o)
There is no problem writing HTML3.2, HTML4, XHTML1.0,
XHTML1.1 or XHTML+RDFa, even if for some of them the
version indication is not very elegant and not very relevant
for typical user agents.


>
> > But as already mentioned, for an author of an 'ISO-8859-1'-'HTML5'
> > document apart from the version indication it is already a problem to
> > specify the used encoding properly.
>
> No, you can just specify it. Just like you can in HTML4.

I can write the string, but indeed, if I do it, it means 'Windows-1252'.
Therefore effectively, I cannot indicate, that something is
'ISO-8859-1' and not 'Windows-1252'.

>
> > This problem appears while a document is written and has to be solved
> > before publication, therefore
> > published documents are not broken, because they simply are not
> > published due to this problem.
>
> I do not follow this.
>
> > Therefore if I start to write some test documents and this problem is
> > not avoided and a version indication is possible, I think, I will use
> > UTF-8 for those documents.
>
> This seems like a good idea regardless.

Sure, if you have no history with thousands of documents or scripts.

>
> > Typically this means, that they are
> > incompatible with other of my documents and scripts and will appear
> > in another directory with an Apache-.htaccess file indicating the
> > different encoding.
>
> That is one solution. You could also always indicate the encoding in the
> document instead and instruct Apache to not include the charset parameter.

Of course, the document should contain it too. However, on many servers
authors have no direct control over the Apache defaults. Therefore it is
always a good idea to ensure that this works independently of the
administrator's whims.

>
> > I think, the Apache has an option with specific file name extensions too,
> > this can be used for directories with mixed encodings maybe.
>
> That is an option too. You can also set headers on a per-file basis using
> the Files directive.
>
> > Surely I will not explain this to other authors, if this question comes
> > up, because it is too complex for many authors.
>
> Agreed. Encoding is largely misunderstood. It makes more sense for editors
> to start defaulting to UTF-8 going forward and have everyone use that, in
> my opinion.
>
> > This does not cause broken documents, the construct is just more fragile
> > and one has to care more, where to put and how to name files and one
> > has to switch the encoding in the editor for different projects. This is
> > only more work and more sources of possible errors, not recommendable
> > for every author.
>
> If you simply switch to UTF-8 for all future work this will become less
> and less of a problem. And then you've also covered other scripts may the
> need arise to use them.
>

For some projects, it may take several years until I update them
completely. On one server I still found HTML3.2 documents this
year ;o)
More often content is just added or minor bugs are fixed.
I think this is the same for many authors who already have
thousands of documents around somewhere.

Of course, if you are 12 to 15 years old and starting your first
project, you could start right from the beginning with a currently
proper choice. However, when you start, you don't understand
what a proper choice for the future is. Especially for them we
have to keep things simple to get better quality
documents in the future. Because 'HTML5' works off a lot
of historical relics and browser bugs, it is not a good
option for a simple start anyway.


> > Therefore maybe I will never create more than test documents for
> > 'HTML5' just to avoid such complications.
>
> Ok.
>
> > With the new microdata section, 'HTML5' seemed to get more
> > interesting for authors (well, the CURIEs are still missing, but there
> > seems to be a workaround with entitiy definitions within the else
> > almost empty DOCTYPE), therefore it would have been interesting
> > to test this or to include this in tutorials for other authors, because
> > it has already a few more semantically relevant elements than
> > HTML4/XHTML1.x.
>

--- off topic ;o) ---

> Since HTML5 is no longer SGML based entity definitions there will not work
> and are non-conforming. The reason we did this was because other than the
> validator no software processed text/html resources in this way leading to
> a lot of author confusion because of the clear mismatch between the
> validator and other software.

SVG Tiny 1.2 documents typically have no doctype, but for this purpose
it is still pretty useful, and an SVG/XML parser interprets it.
Because it is possible to use an XML parser for 'HTML5', it should
be possible to use this important feature too. Of course, because
it is XML, one can simply start to mix in elements from other
languages too, if this appears to be more convenient than using
microdata to indicate that 'HTML5' elements or their content
represent the same meaning as those elements from other
namespaces.
One mainly has to process this microdata information
(like CURIEs in XHTML+RDFa) when the semantic meaning becomes
relevant, not for presentation in a simple browser. For that one
perhaps needs only the attribute values to identify the elements
for styling with CSS.
And if, for example, these constructions refer to a human-readable
specification (as I have currently created for literature), it is not
obvious what a simple browser should do with this information.
This is the general problem with this approach of filling semantic
gaps of languages with such extensions. For presentation
in simple browsers an author cannot derive much more than
some styling. But if a language like (X)HTML has these gaps,
this is currently the only way to combine the technical functionality
of elements with detailed semantic meanings.
And because the technical functionalities of (X)HTML and SVG
are already largely interpreted by common browsers,
formats/versions like XHTML+RDFa, SVG Tiny 1.2 and 'HTML5'
are interesting for exploring methods of providing both technical
functionality and semantic meaning.
But if there are problems indicating the encoding or the
version of the format used, this is counterproductive for authors
wanting to create long-lived, well-defined and semantically
rich documents, not because there are specific presentation
problems in current browsers, but because one simply
cannot indicate what one is currently doing, and oneself
in ten years, or others even later, will have to identify this.

Olaf





Re: [HTML5] 2.8 Character encodings

by Julian Reschke

Dr. Olaf Hoffmann wrote:
> ...
> I can write the string, but indeed, if I do it, it means 'Windows-1252'.
> Therefore effectively, I cannot indicate, that something is
> 'ISO-8859-1' and not 'Windows-1252'.
> ...

Olaf, from what you write it's not totally clear that you realize that
ISO-8859-1 is a proper subset of Windows-1252?

So the only difference would be the ability to diagnose problems in
documents that claim to be ISO-8859-1, but actually use C1 control codes.
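
A quick way to see where the two encodings agree and where they differ is sketched below in Python, assuming its built-in `latin-1` and `cp1252` codecs correspond to ISO-8859-1 (the IANA charset, with C1 controls at 0x80-0x9F) and Windows-1252. The variable names are illustrative only.

```python
# ASCII (0x00-0x7F) plus the printable Latin-1 range (0xA0-0xFF).
shared = bytes(range(0x00, 0x80)) + bytes(range(0xA0, 0x100))
same = shared.decode("latin-1") == shared.decode("cp1252")

# Collect the bytes in 0x80-0x9F where the two decoders disagree.
differing = []
for b in range(0x80, 0xA0):
    raw = bytes([b])
    latin = raw.decode("latin-1")      # always a C1 control character
    try:
        win = raw.decode("cp1252")     # usually a printable character
    except UnicodeDecodeError:
        win = None                     # unassigned in Python's cp1252
    if win != latin:
        differing.append(b)

print(same)
print(len(differing))
print(repr(b"\x93".decode("latin-1")), repr(b"\x93".decode("cp1252")))
```

So the two decoders only diverge on the 0x80-0x9F block, which is exactly the range a document labelled ISO-8859-1 would not normally use on purpose.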

That being said, I do agree with Larry that the spec should phrase it
differently.

BR, Julian


Re: [HTML5] 2.8 Character encodings

by Anne van Kesteren-2

On Tue, 04 Aug 2009 12:17:49 +0200, Dr. Olaf Hoffmann  
<Dr.O.Hoffmann@...> wrote:
> Well, this is a problem for CSS too, because some properties are
> defined differently in CSS2.1 than in CSS2.

Things change. This is not necessarily a problem in my experience.


> I discovered this some time ago for example for clipping for
> some SVG test documents, which appeared wrong in Opera.
> SVG depends on CSS2, therefore these tests are still well
> defined, applied to (X)HTML they are not testable anymore,
> because CSS has no version indication.

This is one way of looking at the problem. Another way of looking at the
problem is that specifications cannot incrementally evolve like
implementations do and are therefore not always accurate in what they say.
Just like normal languages, Web languages change now and then.


> For 'HTML5' - as long as I cannot simply write version="HTML5"
> I cannot start to write HTML5 documents.

In effect your documents will be treated as HTML5 regardless.


> Already this is a 'show stopper' for 'HTML5' currently. One can still  
> discuss the
> current draft, but for formal reasons one cannot write a
> 'HTML5' document ;o)

As far as I know there are no formal reasons why one cannot write HTML5  
documents and publish them and in fact many people are authoring HTML5  
documents and publishing them.


> There is no problem to write HTML3.2, HTML4, XHTML1.0,
> XHTML1.1 or XHTML+RDFa, even if for some of them the
> version indication is not very elegant and not very relevant
> for typical user agents.

This assumes versioning is necessary. Experience with Web browsers shows  
that this is not needed and avoids a lot of complexity.


>> No, you can just specify it. Just like you can in HTML4.
>
> I can write the string, but indeed, if I do it, it means 'Windows-1252'.

Not for your authoring tool or a conformance checker.


> Therefore effectively, I cannot indicate, that something is
> 'ISO-8859-1' and not 'Windows-1252'.

You cannot indicate that something needs to be decoded as ISO-8859-1 by  
Web browsers for text/html content. This has been the case for a long time  
and is nothing new.


>>> Therefore if I start to write some test documents and this problem is
>>> not avoided and a version indication is possible, I think, I will use
>>> UTF-8 for those documents.
>>
>> This seems like a good idea regardless.
>
> Sure, if you have no history with thousands of documents or scripts.

More and more programming languages work with Unicode internally. What  
scripts would act up?


>>> Typically this means, that they are
>>> incompatible with other of my documents and scripts and will appear
>>> in another directory with an Apache-.htaccess file indicating the
>>> different encoding.
>>
>> That is one solution. You could also always indicate the encoding in the
>> document instead and instruct Apache to not include the charset  
>> parameter.
>
> Of course, the document should contain it too. However on many servers
> authors have no direct control over the Apache defaults. Therefore it is
> always a good idea to ensure, that this works indepentendly from gags
> of the administrator.

That is not what I'm saying. What I'm saying is that you could instruct
Apache to not include the charset parameter so you do not have to maintain a
document/charset mapping within Apache. Just within the documents.


>> If you simply switch to UTF-8 for all future work this will become less
>> and less of a problem. And then you've also covered other scripts may  
>> the need arise to use them.
>
> For some projects, it may take several years, until I update them
> completely. On one server I still found HTML3.2 documents this
> year ;o)

Yeah, this is nothing new :-) Lots of legacy content out there.


> More often content is just added or minor bugs are fixed.
> I think, this is the same for many authors having already
> thousands of documents around somewhere.

If you are just fixing minor bugs I do not see what HTML5 has to do with  
this.


> Of course if you are 12 to 15 years old, starting your first
> project, you could start just from the beginning with a currently
> proper choice. However, when you start, you don't understand,
> what a proper choice for the future is. Especially for them we
> have to keep things simple to get some better quality of
> documents in the future. Because 'HTML5' works off a lot
> of historical relics and browser bugs, it is not a good
> options for a simple start anyway.

HTML5 also removes a lot of historical relics. E.g. SGML. This simplifies
things _a lot_ for authors in my opinion and experience in talking with
Web developers about this. (And feedback I get from colleagues who talk to
Web developers on a near full-time basis.)


> --- off topic ;o) ---
>
>> Since HTML5 is no longer SGML-based, entity definitions there will not
>> work and are non-conforming. The reason we did this was that, other
>> than the validator, no software processed text/html resources in this
>> way, leading to a lot of author confusion because of the clear mismatch
>> between the validator and other software.
>
> SVG tiny 1.2 documents have typically no doctype, but for this purpose,
> it is still pretty useful and an SVG/XML-parser interpretes this.
> Because for 'HTML5' it is possible to use an XML-parser, it should
> be possible to use this important feature too. Of course, because
> it is XML, one can simply start to mix with elements from other
> languages too, if this appears to be more convenient as to use
> microdata to indicate 'HTML5' elements or their content to
> represent the same meaning as those elements from other
> namespaces.

Ah. I did not realize you were talking about XHTML5 (HTML5 expressed in
XML). In XHTML5 ISO-8859-1 just means ISO-8859-1, not Windows-1252. The
remapping is only done for text/html documents. You can indeed use DOCTYPE
features in XHTML5, although I believe it is currently recommended that you
do not use them.
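
The distinction described above can be illustrated with a minimal Python
sketch. It is only an illustration, not how any browser is implemented: the
helper name effective_encoding and the one-row OVERRIDES table are
assumptions made for this example, and only the ISO-8859-1 to Windows-1252
row is taken from this thread.

```python
# Hypothetical one-row override table; the draft's real table has more rows.
OVERRIDES = {"iso-8859-1": "windows-1252"}


def effective_encoding(label: str, media_type: str) -> str:
    """Return the codec a consumer would use for a declared label.

    Assumption for this sketch: the override applies to text/html only,
    while XML-based media types take the label at face value.
    """
    normalized = label.strip().lower()
    if media_type == "text/html":
        return OVERRIDES.get(normalized, normalized)
    return normalized


raw = b"price: 5 \x80"  # byte 0x80 is the euro sign in Windows-1252

# Same bytes, same declared label, different media type.
html_text = raw.decode(effective_encoding("ISO-8859-1", "text/html"))
xml_text = raw.decode(effective_encoding("ISO-8859-1", "application/xhtml+xml"))

print(repr(html_text))  # ends with the euro sign, U+20AC
print(repr(xml_text))   # ends with the C1 control character U+0080
```

The label is the same in both cases; only the decoder chosen by the
consumer differs.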


--
Anne van Kesteren
http://annevankesteren.nl/


Re: [HTML5] 2.8 Character encodings

by Dr. Olaf Hoffmann

Julian Reschke:

> Dr. Olaf Hoffmann wrote:
> > ...
> > I can write the string, but indeed, if I do it, it means 'Windows-1252'.
> > Therefore effectively, I cannot indicate, that something is
> > 'ISO-8859-1' and not 'Windows-1252'.
> > ...
>
> Olaf, from what you write it's not totally clear that you realize that
> ISO-8859-1 is a proper subset of Windows-1252?
>

This seems to match what is noted, for example, at Wikipedia
for ISO/IEC 8859-1, but not for ISO-8859-1; the two
differ in some control characters.
Therefore ISO-8859-1 and Windows-1252 both seem
to be supersets of ISO/IEC 8859-1, and ISO-8859-1
is not a subset of Windows-1252. The conflicting
characters, however, are typically not used in documents with
correctly indicated ISO-8859-1 encoding.
http://en.wikipedia.org/wiki/ISO/IEC_8859-1
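
As a rough check of which bytes are actually in conflict, the following
sketch compares Python's two codecs byte by byte. It relies on Python's
codec tables, which are an assumption here and not necessarily identical to
what browsers do: Python treats five bytes in the 0x80-0x9F range as
undefined in Windows-1252, so they are decoded with errors="replace".

```python
def differing_bytes() -> list[int]:
    """List every byte value that the two codecs decode differently."""
    differing = []
    for value in range(256):
        raw = bytes([value])
        as_latin1 = raw.decode("iso-8859-1")
        # errors="replace" turns bytes undefined in Python's cp1252 into U+FFFD.
        as_cp1252 = raw.decode("windows-1252", errors="replace")
        if as_latin1 != as_cp1252:
            differing.append(value)
    return differing


diff = differing_bytes()
print(f"{len(diff)} bytes differ, from 0x{diff[0]:02X} to 0x{diff[-1]:02X}")
```

With these codecs, all differences fall in the C1 control range; the ASCII
range and the range 0xA0-0xFF decode identically.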

What I personally use in (X)HTML is typically the ASCII
subset, where I do not even have to care about differences between
'ISO-8859-1' and 'UTF-8' if a server administrator (or a
browser implementor) decides something surprising.
However, masking special characters like umlauts and the
ß-ligature is not always available in the (X)HTML style, and
for an XML parser like the one in Opera it seems to depend on
the XHTML version/doctype whether the predefined (X)HTML
entities are known, so I have to switch to more critical things.
Many other authors have relied for many years on unmasked
special characters. If they want to use and indicate
'ISO-8859-1', this should be possible. It is no problem either if
they want to use and indicate 'Windows-1252',
just as it is no problem to use and indicate 'UTF-8'. But
mixing these up in a specification basically means confusion
for some authors reading it, especially for those who
already have problems properly indicating the encoding they
used.
Therefore what browsers currently do when 'ISO-8859-1' is
specified is not a big practical problem (if 'ISO-8859-1'
is actually used). It is rather that readers of the draft are
confused by the wording.
It could be noted, for example: "If 'ISO-8859-1' etc. is
indicated as a document encoding, an HTML5 parser
will/may use 'Windows-1252' to decode the presentable
characters."
This would already be less confusing and does not change
the meaning of the string or the document; it describes
only the behaviour of the parser, which is a difference.
For more advanced authors one may add something
like "Due to this rule, wrong encoding information may
go unnoticed. Some non-presentable control
characters of 'ISO-8859-1' might be presented as
presentable characters according to Windows-1252."
This perhaps indicates the intended reason for this
behaviour, and it also indicates that such parsers should
not be used to check proper encoding/decoding
(which is nevertheless done by many authors, with
known consequences ;o)


> So the only difference would be the ability to diagnose problems in
> documents that claim to be ISO-8859-1, but actually use C1 control codes.
>
> That being said, I do agree with Larry that the spec should phrase it
> differently.
>
> BR, Julian
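
The diagnosis mentioned in the quoted text (documents that claim to be
ISO-8859-1 but contain bytes in the C1 control range) can be sketched in a
few lines of Python. The function name find_c1_bytes is hypothetical, and
the sketch only inspects raw bytes; it makes no claim about how a real
conformance checker works.

```python
def find_c1_bytes(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, byte value) for each byte in the range 0x80-0x9F."""
    return [(i, b) for i, b in enumerate(data) if 0x80 <= b <= 0x9F]


# A document saved as Windows-1252 but labelled ISO-8859-1:
# the euro sign is stored as the single byte 0x80.
sample = "10 \u20ac".encode("windows-1252")
hits = find_c1_bytes(sample)

for offset, value in hits:
    print(f"suspicious byte 0x{value:02X} at offset {offset}")
```

A non-empty result suggests that the label and the actual encoding disagree,
since genuine ISO-8859-1 text rarely contains C1 control characters.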





Re: [HTML5] 2.8 Character encodings

by Dr. Olaf Hoffmann

Anne van Kesteren:
> On Tue, 04 Aug 2009 12:17:49 +0200, Dr. Olaf Hoffmann
>
> <Dr.O.Hoffmann@...> wrote:
> > Well, this is a problem for CSS too, because some properties are
> > defined differently in CSS2.1 than in CSS2.
>
> Things change. This is not necessarily a problem in my experience.

That is why having different versions and a version indication is
useful. 'HTML5' also changes or redefines the meaning and partly the
functionality of some elements compared to HTML4 (or XHTML1.x).
There is a difference, for example, for elements like small or object.
Again a version indication is useful, but of course it does not solve the
problem completely (for example if the elements are used in
mixed XML documents, indicating only the namespace, not the version).

And it is of course a problem for the interpretation of older documents,
at least if the creation time is not preserved. Obviously a stylesheet
from 2000 does not mean CSS2.1; for 2007 no one knows whether 2.0 or 2.1.
Not every HTML document without a version indication is 'HTML5';
it could be old tag soup as well. Without a version indication it should
be no problem to interpret it with an arbitrary rule set, including
'HTML5', but this is no indication that such a tag soup document
has a well-defined meaning at all.

>
> > I discovered this some time ago for example for clipping for
> > some SVG test documents, which appeared wrong in Opera.
> > SVG depends on CSS2, therefore these tests are still well
> > defined, applied to (X)HTML they are not testable anymore,
> > because CSS has no version indication.
>
> This is one way of looking at the problem. Another way of looking at the
> problem is that specifications cannot incrementally evolve like
> implementations do and are therefore not always accurate in what they say.

For this example, both CSS2.0 and CSS2.1 have understandable
and testable definitions of the same property; they are just different
(by the way, the behaviour I found for Opera is different from both ;o)
Without a version indication this is no evolution; it simply means
that it has no reliable meaning (and a different interpretation in
different versions of browsers as well). Both meaning and
interpretation become unpredictable for authors, the second at least
for several years.


> Just like normal languages Web languages change now and then.
>

But not the versions and not the intended meaning of already
published documents. If that were the case, it would be
impossible to write well-defined documents/content in an electronic
format at all. Maybe the interpretation changes, because we
have other experiences now than when such a document was
written, but this does not change what the author had in mind
when the document was created.


> > For 'HTML5' - as long as I cannot simply write version="HTML5"
> > I cannot start to write HTML5 documents.
>
> In effect your documents will be treated as HTML5 regardless.
>

Here you again mix up interpretation and content.
If you like, you can interpret a 'Microsoft Word' document or
a PostScript or PDF file as 'HTML5' as well; this does not necessarily
mean that you get the intended meaning of the document with
such an interpretation, and it is not obvious how to interpret it
in a useful way if the author/server sends it as text/html.
This can already happen with translations of documents from one
language to another (Shakespeare in German, Goethe in English?);
it is obviously even worse if you try to interpret such documents
in the other language without a translation. But of course, you can
do it ;o)

A specification like that for 'HTML5' has more than the aspect
of interpretation.
For authors, it also has the aspect of indicating, specifying and
marking up the intended meaning. Once done, this does not change as
long as the author does not change the document. But the interpretation
can change at any time or can depend on the reader (or the tool the
reader uses to interpret the document).


> > Already this is a 'show stopper' for 'HTML5' currently. One can still
> > discuss the
> > current draft, but for formal reasons one cannot write a
> > 'HTML5' document ;o)
>
> As far as I know there are no formal reasons why one cannot write HTML5
> documents and publish them and in fact many people are authoring HTML5
> documents and publishing them.
>

Well, do you have samples?
How do they identify them as 'HTML5' and distinguish them from undefined
tag soup without a version indication?
Not how these documents are perhaps interpreted today; what does the
author indicate that they are?
Maybe in ten years they will be interpreted as 'HTML6'. Does that mean
that an author has written them as 'HTML6' today?
If I wrote some HTML tag soup in 1997, does it mean that this is
'HTML5', just because this tag soup has no version indication? Does it
mean in 10 years that I already wrote 'HTML6' in 1997?
No, if the creation date is known, it simply means that this document is
one of my first stupid attempts to write an HTML document, nothing more.
And a current browser will typically not interpret this old tag soup
like the Netscape 3.2 I used to 'check' the appearance of this tag soup,
which is no problem, because it is simply tag soup without
a requirement for a specific interpretation or a well-defined meaning.


> > There is no problem to write HTML3.2, HTML4, XHTML1.0,
> > XHTML1.1 or XHTML+RDFa, even if for some of them the
> > version indication is not very elegant and not very relevant
> > for typical user agents.
>
> This assumes versioning is necessary. Experience with Web browsers shows
> that this is not needed and avoids a lot of complexity.
>

Web browsers only interpret documents, and common experience shows
that they do it wrongly and incompletely. And of course, if the browser
fails, it helps a lot to look into the source code and find a
well-defined, structured document with a version indication for the
interpretation. One can try
to find the problem, which is more often a bug in the document than a real
problem with the browser. And one can try to find out what was intended
by the author. For flawed documents, the intention of the author is often
different from the interpretation of the browser, so it is basically a task
for the author to fix the bugs. One cannot expect that a simple program
like a browser can guess what was intended by the author of a flawed
document. And even if some error handling is defined in 'HTML5',
one cannot assume that the result is often close to the intention of the
author; it is just one interpretation of a flawed document.


> >> No, you can just specify it. Just like you can in HTML4.
> >
> > I can write the string, but indeed, if I do it, it means 'Windows-1252'.
>
> Not for your authoring tool or a conformance checker.
>
> > Therefore effectively, I cannot indicate, that something is
> > 'ISO-8859-1' and not 'Windows-1252'.
>
> You cannot indicate that something needs to be decoded as ISO-8859-1 by
> Web browsers for text/html content. This has been the case for a long time
> and is nothing new.

It is at least new in 'HTML5' that one cannot indicate 'ISO-8859-1'
as the encoding, which is a difference; again the problem is mixing up the
content of a document with its interpretation. There is no way for
an author to prevent any interpretation. But the author can try to indicate
what was intended.
And as I already mentioned, years ago I saw browsers
without this behaviour/bug, which was somehow interesting for
documents with wrong encoding information and unmasked euro signs ;o)





Re: [HTML5] 2.8 Character encodings

by Anne van Kesteren-2

On Tue, 04 Aug 2009 17:12:29 +0200, Dr. Olaf Hoffmann  
<Dr.O.Hoffmann@...> wrote:
> And it is of course a problem for the interpretation of older documents,
> at least, if the creation time is not conserved. Obviously a stylesheet
> from 2000 does not mean CSS2.1, for 2007 no one knows if 2.0 or 2.1.

Actually, it probably does. Because actual deployed style sheets and  
implementations in Web browsers of CSS 2.0 were what guided the  
development of CSS 2.1. This is the same for HTML 4 and HTML 5.


>> This is one way of looking at the problem. Another way of looking at the
>> problem is that specifications cannot incrementally evolve like
>> implementations do and are therefore not always accurate in what they  
>> say.
>
> For this example, both CSS2.0 and CSS2.1 have understandable
> and testable definitions of the same property, they are just different
> (by the way, the behaviour I found for Opera is different from both ;o)

Please file a bug if you have not already. Thanks! The CSS WG considers  
CSS 2.0 to be mostly obsolete by the way. You can see that the /TR/CSS2/  
recently changed to point to CSS 2.1 as well.

I am just trying to explain the state of affairs with respect to CSS here.  
While you can claim it is supposed to be different, that does not make it  
so.


> Without version indication this is no evolution, this simply means,
> that is has no relyable meaning (and a different interpretation in
> different versions of browsers as well). Both meaning and
> interpretation get unpredictable for authors, the second at least
> for several years.

I think this is mostly because both HTML 4 and CSS 2.0 lacked rigorous  
testing of implementations and taking implementation and author feedback  
into account. With HTML 5 and CSS 2.1 we are trying to fix this mistake.


>> Just like normal languages Web languages change now and then.
>
> But not the versions and not the intended meaning of already
> published documents. If this would be the case, it would be
> impossible to write well defined documents/content in an electronic
> format at all. Maybe the interpretations changes, because we
> have other experiences now than when such a document was
> written, but this does not change, what the author had in mind,
> when the document was created.

Actually, what we find again and again is that what the majority of  
authors has in mind is the browser they are rendering their document or  
style sheet in. Not what the specification said. The latter only affects  
test suites and it is not worth preserving compatibility with those.


>> > For 'HTML5' - as long as I cannot simply write version="HTML5"
>> > I cannot start to write HTML5 documents.
>>
>> In effect your documents will be treated as HTML5 regardless.
>
> Here you mix up again interpretation and content.
> If you like, you can interprete a 'microsoft word' document or
> a postscript or PDF as well as 'HTML5', this does not necessarily
> mean, that you get the intended meaning of the document with
> such an interpretation and it is not obvious, how to interprete this
> in a useful way, if author/server send them as text/html.
> This can already happen with translations of documents from one
> language to another (Shakespeare in german, Goethe in english?),
> it is obviously even worse, if you try to interprete such documents
> in the other language without a translation - but of course, you can
> do it ;o)

This is the same point as above. Authors do not write against  
specifications.


>> > Already this is a 'show stopper' for 'HTML5' currently. One can still
>> > discuss the
>> > current draft, but for formal reasons one cannot write a
>> > 'HTML5' document ;o)
>>
>> As far as I know there are no formal reasons why one cannot write HTML5
>> documents and publish them and in fact many people are authoring HTML5
>> documents and publishing them.
>
> Well do you have samples?

There is

   http://html5gallery.com/

for instance, which collects sites made in HTML5.


> How do they identify them as 'HTML5'? and distinguish from undefined
> tag soup without a version indication?

That is not needed.


> Not, how these document are maybe interpreted today, what does the
> author indicate, what they are?

I'm not sure what you mean.


> Maybe in ten years they are interpreted as 'HTML6' - does it mean, that
> an author has written them as 'HTML6' today?

I assume that once we get to HTML 6 we will make sure to do the same as we
did for HTML 5. That is, study existing content, implementations, etc. and
go from there. If we succeed in what we want with HTML 5, HTML 6 will not
have to make incompatible changes.


> If I have written some HTML tag soup in 1997, does it mean, that this is
> 'HTML5', just because this tag soup has no version indication? Does it
> mean in 10 years, that I have written 'HTML6' already in 1997?

Currently <!doctype html> is required for something to be considered HTML
5. However, all HTML is consumed using the algorithm defined in the
specification. (Implementations have always done this, though there are
differences between them because not everything was defined back in the
day.)
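
A deliberately simplified Python sketch of the doctype requirement
mentioned above follows. The function name starts_with_html5_doctype is
hypothetical, and the check is an assumption-laden approximation: it only
accepts the short doctype at the very start of the markup (after an
optional byte order mark and whitespace) and ignores other forms the
specification may permit, such as leading comments.

```python
import re

# Case-insensitive match for the short doctype, allowing leading whitespace.
DOCTYPE_RE = re.compile(r"\s*<!doctype\s+html\s*>", re.IGNORECASE)


def starts_with_html5_doctype(markup: str) -> bool:
    """Report whether the markup begins with <!doctype html>."""
    return DOCTYPE_RE.match(markup.lstrip("\ufeff")) is not None


html5_page = "<!DOCTYPE html>\n<title>Example</title>"
html4_page = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">\n<title>Old</title>'
tag_soup = "<html><body><font size=7>1997</font>"

for name, page in [("html5", html5_page), ("html4", html4_page), ("soup", tag_soup)]:
    print(name, starts_with_html5_doctype(page))
```

Note that this says nothing about how the document is processed; as stated
above, all HTML is consumed with the same algorithm regardless of the result.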


>> This assumes versioning is necessary. Experience with Web browsers shows
>> that this is not needed and avoids a lot of complexity.
>
> Web browser only interprete documents, and the common experience shows,
> that they do it wrong and incomplete. And of course, it helps a lot to  
> have a look into the source code, find a well defined, structured  
> document with
> version indication for the interpretation, if the browser fails. One can  
> try to find the problem - more often bugs in the document than a real
> problem with the browser. And one can try to find out, what was intended
> by the author. For flawed documents, the intention of the author is often
> different from the interpretation of the browser, what is basically a  
> task for the author to fix the bugs - one cannot expect, that a simple  
> program
> like a browser can guess, what was intented by the author of  a flawed
> document. And even if some error management is defined in 'HTML5',
> one cannot assume, that the result is often close to the intention of the
> author, it is just one interpretation of a flawed document.

I do not see how this refutes my point. Requiring browsers to do more
complex things will certainly not result in them getting fewer things
wrong or leaving fewer things incomplete.


>>>> No, you can just specify it. Just like you can in HTML4.
>>>
>>> I can write the string, but indeed, if I do it, it means
>> 'Windows-1252'.
>>
>> Not for your authoring tool or a conformance checker.
>>
>>> Therefore effectively, I cannot indicate, that something is
>>> 'ISO-8859-1' and not 'Windows-1252'.
>>
>> You cannot indicate that something needs to be decoded as ISO-8859-1 by
>> Web browsers for text/html content. This has been the case for a long  
>> time and is nothing new.
>
> It is at least new in 'HTML5', that one cannot indicate 'ISO-8859-1'
> as encoding, what is a difference, again the problem to mix up the
> content of a document with the interpretation of it.

I think you misunderstand the specification. It is an error if a document  
is labeled ISO-8859-1 and does not conform to ISO-8859-1. I said that  
above as well.


> There is no way for an author to prevent any interpretation. But the  
> author can try to indicate, what was intented.

Right.


> And as I already mentioned, I have seen years ago already browsers
> without this behaviour/bug, what was somehow interesting for
> documents with wrong encoding information and unmasked euro signs ;o)

There's a reason those browsers are no longer around.


--
Anne van Kesteren
http://annevankesteren.nl/


Re: [HTML5] 2.8 Character encodings

by Dr. Olaf Hoffmann

Anne van Kesteren:

....

> I am just trying to explain the state of affairs with respect to CSS here.
> While you can claim it is supposed to be different, that does not make it
> so.

No, I think both CSS and 'HTML5' share the same problem of lacking
a version indication. However, CSS is just about styling; if in doubt,
one can switch it off. (X)HTML is about content, and it is more important
to have a well-defined and unambiguous meaning of elements
(or encoding or other things related to content).


....



> >> > For 'HTML5' - as long as I cannot simply write version="HTML5"
> >> > I cannot start to write HTML5 documents.
> >>
> >> In effect your documents will be treated as HTML5 regardless.
> >
> > Here you mix up again interpretation and content.
> > If you like, you can interprete a 'microsoft word' document or
> > a postscript or PDF as well as 'HTML5', this does not necessarily
> > mean, that you get the intended meaning of the document with
> > such an interpretation and it is not obvious, how to interprete this
> > in a useful way, if author/server send them as text/html.
> > This can already happen with translations of documents from one
> > language to another (Shakespeare in german, Goethe in english?),
> > it is obviously even worse, if you try to interprete such documents
> > in the other language without a translation - but of course, you can
> > do it ;o)
>
> This is the same point as above. Authors do not write against
> specifications.

Well, not all authors ...
As an author I started to write a specification, because (X)HTML did
not cover what I wanted to mark up ;o)
And the people working with microformats and RDF(a) indicate that not
all authors want to write things that already appear somehow in
common browsers but have no semantically well-defined meaning ;o)

Those authors who care mainly about the appearance in current
browsers typically do not have long-lived documents in mind, even
if with some luck some of these documents remain for several years.
Or they do not have much experience and rely too much on the
behaviour of the current version of their preferred browser.
Within the last ten years such authors typically had a lot of work
updating a few documents to the current behaviour of the newest
browser version. I think this is not really a promising perspective
for electronic formats (just as for computer hardware, this permanent
handicraft work is no standard).


>
> >> > Already this is a 'show stopper' for 'HTML5' currently. One can still
> >> > discuss the
> >> > current draft, but for formal reasons one cannot write a
> >> > 'HTML5' document ;o)
> >>
> >> As far as I know there are no formal reasons why one cannot write HTML5
> >> documents and publish them and in fact many people are authoring HTML5
> >> documents and publishing them.
> >
> > Well do you have samples?
>
> There is
>
>    http://html5gallery.com/
>
> for instance, which collects sites made in HTML5.

Indeed, within the content, the meta description or the keywords it often
appears that those projects are intended to be HTML5. For several
others it can be guessed, because elements like header, footer and
section are used.
But the fact that this information is spread across the content attributes
of these meta elements or the content of the pages indicates even more
that a version indication like version="HTML5" is missing.
Is it expected that such information always appears within
the content or the description or keywords of an 'HTML5' document?
Or is it intended that the elements used are analysed to identify 'HTML5'?
(I think the gallery itself does not use elements specific to HTML5.)
Is it defined in the current draft how to indicate 'HTML5' with meta
elements?
This currently looks like poor design of 'HTML5', and the fact that many
of them indicate their use of 'HTML5' with such workarounds looks
pretty much like a gap in the current draft ;o) Why else note the version
used several times within the document in different ways?
Surely because they try to ensure that their documents are really
identified somehow as 'HTML5' ;o)


>
> > How do they identify them as 'HTML5'? and distinguish from undefined
> > tag soup without a version indication?
>
> That is not needed.

Well, this discussion and the samples you provided indicate that there
is at least a desire for a version indication, maybe to get well-defined
documents, maybe to show that the author is a cool and funky designer
already using languages which are still drafts, even if it is not known
how to indicate that they really use this cool new language version ;o)


>
> > Not, how these document are maybe interpreted today, what does the
> > author indicate, what they are?
>
> I'm not sure what you mean.
>

Because languages like XHTML+RDFa, HTML4, 'HTML5', SVG,
MathML, SMIL, RDF etc. define somehow what the meaning of the
content of an element is, and different versions define it (slightly)
differently, the meaning can only be derived by knowing the version.
Sometimes, if the functionality changes too, the intended behaviour
can only be derived by knowing the version.
To create more than currently cool and funky designer pages it is
therefore important for some authors to indicate what they really
mean, not just how things appear. And if elements are defined
in different language versions to have different meanings, a version
indication is required.
Without it, it is still possible to write those single-serving, pretty nice
and cool and funky designer pages that are pretty nice and cool
and funky for a month or a year, because the majority of authors
of those pages do not care about details. For more, you have to
care about details.

As already mentioned, 'HTML5' defines the meaning of elements
like small slightly differently from other (X)HTML versions
(for small it is different from my typical use cases, which are
currently excluded by the new definition). Therefore
a version indication for the complete document is important, or
an author has to use microdata on each element to indicate
to which definition it belongs. Maybe one can set microdata
information about the version on the root html element,
referencing the current draft; this may currently work as an
unambiguous version indication. This would imply that
the child elements belong to the same version, if not indicated
otherwise with even more microdata.



> > Maybe in ten years they are interpreted as 'HTML6' - does it mean, that
> > an author has written them as 'HTML6' today?
>
> I assume that once we'll get to HTML 6 we make sure to do the same we did
> for HTML 5. That is studying existing content, implementations, etc. and
> go from there. If we succeed in what we want with HTML 5, HTML 6 will not
> have to make incompatible changes.
>

Who knows?
The next generation may think that semantic details are important
and reject all this, creating a completely new language version
that avoids all these historical meanders, and it may turn out that every
document without an unambiguous indication should be rejected as
stupid tag soup ;o)


> > If I have written some HTML tag soup in 1997, does it mean, that this is
> > 'HTML5', just because this tag soup has no version indication? Does it
> > mean in 10 years, that I have written 'HTML6' already in 1997?
>
> Currently <!doctype html> is required for something to be considered HTML
> 5. However, all HTML is consumed using the algorithm defined in the
> specification. (Implementations have always done this, though have
> differences between them because not everything was defined back in the
> days.)
>

I think this is not completely wrong for several HTML versions, nor
for XHTML or XHTML+RDFa, and it is maybe the best choice too for
documents having an XHTML:html element as root element but
elements from several other namespaces as well, maybe including
entity definitions within the doctype.


> >> This assumes versioning is necessary. Experience with Web browsers shows
> >> that this is not needed and avoids a lot of complexity.
> >
> > Web browser only interprete documents, and the common experience shows,
> > that they do it wrong and incomplete. And of course, it helps a lot to
> > have a look into the source code, find a well defined, structured
> > document with
> > version indication for the interpretation, if the browser fails. One can
> > try to find the problem - more often bugs in the document than a real
> > problem with the browser. And one can try to find out, what was intended
> > by the author. For flawed documents, the intention of the author is often
> > different from the interpretation of the browser, what is basically a
> > task for the author to fix the bugs - one cannot expect, that a simple
> > program
> > like a browser can guess, what was intented by the author of  a flawed
> > document. And even if some error management is defined in 'HTML5',
> > one cannot assume, that the result is often close to the intention of the
> > author, it is just one interpretation of a flawed document.
>
> I do not see how this refutes my point. Requiring browsers to do more
> complex things will certainly not result in them doing less wrong or do
> things less incomplete.


No, but if the audience is able to study the source code and to derive
what the author did and intended, this often works much better, because
human intellectual capabilities are typically superior to those of a
simple browser (even if the history of HTML authors seems to imply
that one should not overestimate human capabilities ;o)
Well, unfortunately, due to the limitations of these capabilities I have
to look into the source code quite often, and when discussing with the
author how to mark up texts in a technically and semantically meaningful
way, it simply helps to know instead of guessing what the author
tried to create, and it shortens the discussion.
And maybe in the far future, if there is still some interest in my
documents, it will hopefully help the audience to understand
some of my intentions if I use proper semantic markup and
languages with well-defined versions, which can be indicated
within a document.

>
> >>>> No, you can just specify it. Just like you can in HTML4.
> >>>
> >>> I can write the string, but indeed, if I do it, it means
> >>
> >> 'Windows-1252'.
> >>
> >> Not for your authoring tool or a conformance checker.
> >>
> >>> Therefore effectively, I cannot indicate, that something is
> >>> 'ISO-8859-1' and not 'Windows-1252'.
> >>
> >> You cannot indicate that something needs to be decoded as ISO-8859-1 by
> >> Web browsers for text/html content. This has been the case for a long
> >> time and is nothing new.
> >
> > It is at least new in 'HTML5', that one cannot indicate 'ISO-8859-1'
> > as encoding, what is a difference, again the problem to mix up the
> > content of a document with the interpretation of it.
>
> I think you misunderstand the specification. It is an error if a document
> is labeled ISO-8859-1 and does not conform to ISO-8859-1. I said that
> above as well.
>


It is noted:
"
When a user agent would otherwise use an encoding given in the first column of
the following table to either convert content to Unicode characters or
convert Unicode characters to bytes, it must instead use the encoding given
in the cell in the second column of the same row.
"

This clearly indicates that the rule changes the encoding information
(not only the decoding).
Therefore, within the current 'HTML5' draft, specifying 'ISO-8859-1'
is equivalent to specifying 'Windows-1252'.

And there is no indication that this applies only to the text/html variant
of 'HTML5' and not to the application/xhtml+xml variant too
(you claimed in a previous mail that it only applies to text/html;
where did you find that?)


> > There is no way for an author to prevent any interpretation. But the
> > author can try to indicate, what was intented.
>
> Right.
>
> > And as I already mentioned, I have seen years ago already browsers
> > without this behaviour/bug, what was somehow interesting for
> > documents with wrong encoding information and unmasked euro signs ;o)
>
> There's a reason those browsers are no longer around.

I think, it was a Konqueror and a Mozilla-Suite on Debian, I don't
remember exactly which versions, but not so old as several
tutorials about HTML and CSS I found within the last
years including sophisticated hints about related issues.



Re: [HTML5] 2.8 Character encodings

by Anne van Kesteren-2


On Wed, 05 Aug 2009 10:58:43 +0200, Dr. Olaf Hoffmann <Dr.O.Hoffmann@...> wrote:
> No, I think, both CSS and 'HTML5' share the same problem without
> a version indication. However, CSS is just about styling, in doubt
> one can switch it off. (X)HTML is about content and it is more important
> to have a well defined and not plurivalent meaning of elements
> (or encoding or other things related to content).

If there is no versioning the latest version of the standard defines the meaning of elements. So that problem does not exist.


>> This is the same point as above. Authors do not write against
>> specifications.
>
> Well not all authors ...
> As an author I started to write a specification, because (X)HTML did
> not cover, what I wanted to markup ;o)

Writing _against_ a specification and writing _a_ specification are very different things.


> And those people with microformats and RDF(a) indicate, that not
> all authors want to write things, already appearing somehow in
> common browsers, but have no semantical well defined meaning ;o)

I do not understand this sentence.


> Those authors, who care mainly about the appearance in current
> browsers typically do not have long living documents in mind, even
> if with some luck some of these documents remain several years.

You make it sound like this is a fact, but this is not at all my experience.


> Or they do not have much experience and rely too much on the
> behaviour of the current version of their preferred browser.

Right, which is why better interoperability is important.


> Within the last ten years such authors typically had a lot of work
> updating a few documents to the current behaviour of the newest
> browser version. I think, this is not really a promising perspective
> for electronic formats (as for computer hardware this is no
> standard, this permanent handicraft work).

Do you have any evidence for this? Documents from 10 years ago typically still render fine.


>> There is
>>
>>    http://html5gallery.com/
>>
>> for instance, which collects sites made in HTML5.
>
> Indeed, within the content, the meta description or keywords it often
> appears, that those projects are intended to be HTML5. For several
> others it can be guessed, because elements like header, footer and
> section are used.

You mean "not used"? Those elements are part of HTML5.


> But that this information is spread along these meta element attributes
> content or the content of the pages indicates even more, that a
> version indication like version="HTML5" is missing.

I don't see why.


> Is it expected, that such information always appears within
> the content or the description or keywords of a 'HTML5' document?
> Or is it intended, that the used elements are analysed to identify
> 'HTML5'?

The version does not need to be identified. As I said before, HTML is versionless.


> (I think, the gallery itself does not use elements specific for HTML5).
> Is it defined in the current draft how to indicate 'HTML5' with meta
> elements?

That would introduce versioning, so no.


> Looks currently like poor design of 'HTML5' - and that many of them
> indicate the fact, that they use 'HTML5' with such workarounds looks
> pretty much like a gap in the current draft ;o) Why else to note several
> times within the document the used version in different ways,
> surely because they try to assure, that their documents are really
> identified somehow as 'HTML5' ;o)

Why would they try to assure that their documents are identified as HTML5? Browsers process HTML in the same way regardless, so it does not matter. I've said all this before though and I'm feeling this discussion is just going in circles.


>> > How do they identify them as 'HTML5'? and distinguish from undefined
>> > tag soup without a version indication?
>>
>> That is not needed.
>
> Well, this discussion and the samples you provided indicate, that there
> is at least a desire for a version indication - maybe to get well defined
> documents, maybe to show, that the author is a cool and funky designer
> using already languages, which are still drafts, even if it is not known,
> how to indicate, that they really use this cool new language version ;o)

Could you maybe indicate what you mean? I do not get the same impression as you looking at the source code of those sites.


>>> Not, how these document are maybe interpreted today, what does the
>>> author indicate, what they are?
>>
>> I'm not sure what you mean.
>
> Because languages like XHTML+RDFa, HTML4, 'HTML5', SVG,
> MathML, SMIL, RDF etc define somehow, what the meaning of the
> content of an element is, and different versions define it (slightly)
> different, the meaning can be only derived by knowing the version.

No, the meaning can only be derived by asking the author. The version does not have much to do with it. E.g. lots of authors abuse <blockquote> for indenting and others abuse longdesc="" for search engine spam. That is not the meaning of those HTML features.


> Sometimes, if the functionality changes too, the intended behaviour
> can only be derived by knowing the version.
> To create more than currently cool and funky designer pages it is
> therefore important for some authors to indicate, what they really
> mean, not just, how things appear. And if elements are defined
> in different language versions to have different meanings, a version
> indication is required.

Maybe in an ideal world this would be the case, but given that nobody wants to implement versioning, versioning makes things vastly more complex, and older specifications (e.g. HTML 4 and CSS 2.0) are very poorly written and ambiguous, this is unlikely to happen.

To take a step back, do you have an example of an HTML 4 page you once created that would get "weird meaning" in HTML 5?


> Without it is still possible to write those single-serving pretty nice
> and cool and funky designer pages being pretty nice and cool
> and funky for a month or a year, because the majority of authors
> of those pages do not care about details. For more, you have to
> care about details.

I care about details.


> As already mentioned, 'HTML5' defines the meaning of elements
> like small slightly different from other (X)HTML versions
> (for small it is different from my typical use cases, which are
> currently excluded by the new definition), therefore
> a version indication for the complete document is important or
> an author has to use the microdata on each element to indicate,
> to which definition it belongs. Maybe one can set a microdata
> information about the version on the root html element,
> referencing the current draft, this may work currently as an
> unambiguous version indication. This would implicate, that
> the child elements belong to the same version, if not indicated
> otherwise with even more microdata.

I suppose you could come up with a convention for yourself, yes.


>> I assume that once we'll get to HTML 6 we make sure to do the same we  
>> did for HTML 5. That is studying existing content, implementations, etc. and
>> go from there. If we succeed in what we want with HTML 5, HTML 6 will  
>> not have to make incompatible changes.
>
> Who knows?

Nobody, but that's the idea.


> The next generation may think, that semantical details are important
> and will reject all this creating a completely new language version
> avoiding all these historical meanders and it turns out, that every
> document without an unambiguous indication should be rejected as
> stupid tag soup ;o)

Sure. Hopefully they study the history (e.g. with DOCTYPE switching) and realize it works poorly.


>> Currently <!doctype html> is required for something to be considered  
>> HTML 5. However, all HTML is consumed using the algorithm defined in the
>> specification. (Implementations have always done this, though have
>> differences between them because not everything was defined back in the
>> days.)
>
> I think, this is not completely wrong for several HTML versions, and not
> for XHTML or XHTML+RDFa and maybe the best choice too for
> documents having an XHTML:html element as root element, but
> elements from several other namespaces as well, maybe including
> entity definitions within the doctype.

I do not understand this sentence.


>> I do not see how this refutes my point. Requiring browsers to do more
>> complex things will certainly not result in them doing less wrong or do
>> things less incomplete.
>
> No, but if the audience is able to study the source code and to derive,
> what the author did and intended, this works often much better, because
> the human intellectual capabilities are typically superior to that of a
> simple browser (even if the history of HTML authors seem to implicate,
> that one should not overestimate human capabilities ;o)
> Well, unfortunately due to the limitations of capabilities I have to look
> quite often in the source code - and if discussed with the author how
> to markup texts in a technical and semantical meaningful way,
> it helps simply to know instead of guessing what the author
> tried to create and it shortens the discussion.

In the case of HTML they simply tried to write HTML, and the latest specification of HTML gives the guidelines for that. Just like with CSS.


> And maybe in the far future, if there is still some interest in my
> documents it will hopefully help the audience to understand
> some of my intentions, if I use proper semantical markup and
> languages with well defined versions, which can be indicated
> within a document.

Since we want to remain backwards compatible with old sites I think that should be no problem.


>> I think you misunderstand the specification. It is an error if a  
>> document is labeled ISO-8859-1 and does not conform to ISO-8859-1. I said that
>> above as well.
>
> It is noted:
> "When a user agent would otherwise use an encoding given in the first  
> column of the following table to either convert content to Unicode characters or
> convert Unicode characters to bytes, it must instead use the encoding  
> given in the cell in the second column of the same row."
>
> This indicates clearly, that the encoding information (not only the  
> decoding) is changed by this rule.

This only applies to user agents as is clearly stated. Since you seem to care primarily about document meaning and conformance checkers (which have to report mismatches) I do not see how you consider that to be a problem.


> Therefore within the current 'HTML5' draft it is equivalent to
> specify 'ISO-8859-1' or 'Windows-1252'.

No. The former gives a character encoding error in conformance checkers if you use characters from the latter. (The former is a subset of the latter. I realize you said to Julian that this is not the case, but the preferred IANA name for the standard you pointed out _is_ ISO-8859-1.)
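
The distinction between a user agent and a conformance checker can be sketched in Python as follows. This is only an illustration, under the assumption that a checker flags bytes 0x80 to 0x9F in a document labelled ISO-8859-1, since those are the positions where Windows-1252 places printable characters such as the euro sign. How a real checker detects and reports the mismatch is not described in this thread, and the function names c1_offsets and check_label are hypothetical.

```python
def c1_offsets(data: bytes) -> list[int]:
    """Return offsets of bytes 0x80-0x9F, which ISO-8859-1 maps to C1 controls."""
    return [i for i, byte in enumerate(data) if 0x80 <= byte <= 0x9F]


def check_label(data: bytes, label: str) -> list[str]:
    """Report bytes that look like Windows-1252 content under an ISO-8859-1 label.

    Assumption: only the ISO-8859-1 label is checked in this sketch.
    """
    if label.strip().lower() != "iso-8859-1":
        return []
    return [
        f"offset {i}: byte 0x{data[i]:02X} is a C1 control in ISO-8859-1"
        for i in c1_offsets(data)
    ]


clean = b"caf\xe9"            # 0xE9 means the same in both encodings
suspect = b"price: \x80 5"    # 0x80 is the euro sign only in Windows-1252

print(check_label(clean, "ISO-8859-1"))
print(check_label(suspect, "ISO-8859-1"))
```

A browser following the override would display the second document with a euro sign without complaint; the sketch only models the reporting side.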


> And there is no indication, that this only applies to the text/html  
> variant of 'HTML5' and not to the application/xhtml+xml variant too
> (you claimed in a previous mail, that it only applies to text/html -
> where did you find that?)

Good point. That used to be the case. I think it got lost due to restructuring of sections. I filed a bug:

  http://www.w3.org/Bugs/Public/show_bug.cgi?id=7215


>>> And as I already mentioned, I have seen years ago already browsers
>>> without this behaviour/bug, what was somehow interesting for
>>> documents with wrong encoding information and unmasked euro signs ;o)
>>
>> There's a reason those browsers are no longer around.
>
> I think, it was a Konqueror and a Mozilla-Suite on Debian, I don't
> remember exactly which versions, but not so old as several
> tutorials about HTML and CSS I found within the last
> years including sophisticated hints about related issues.

If you can find pointers that'd be cool.


--
Anne van Kesteren
http://annevankesteren.nl/
