[HTML5] 2.8 Character encodings

Hello,

section 2.8 of the current draft
http://www.w3.org/TR/2009/WD-html5-20090423/infrastructure.html#character-encodings-0
mentions some 'willful' misinterpretations of encoding information, for example interpreting a string like 'ISO-8859-1' as 'Windows-1252'.

1. Which string does an author have to use if he really wants to indicate that the encoding is, for example, 'ISO-8859-1' and not 'Windows-1252'?

2. As far as I have seen, HTML5 has no version indication like previous versions of HTML had and other popular formats like SVG have. How can a browser identify that a document is really intended as 'HTML5', with the implied 'willful' misinterpretations of encoding information, and not as another HTML version? This assumes, of course, that for other versions and formats there is no such 'willful' misinterpretation and that the encoding is identified as usual, by interpreting the string as provided by the author.

Assuming that a viewer is somehow able to identify a document as an HTML5 document after looking into the content, and that a server, for example, sent 'ISO-8859-1' before, does this mean that the viewer switches to 'Windows-1252' or reparses the document with it?

Obviously it would be better to avoid such misinterpretation by using an encoding like UTF-8, which the current HTML5 draft does not confuse. However, due to the history of older projects or server configurations, it might still be convenient for many authors to continue to use 'ISO-8859-1' instead of other encodings, even if they switch, for example, from HTML4 to HTML5 for some documents.

Best wishes

Olaf
|
|
Re: [HTML5] 2.8 Character encodings

On Mon, 6 Jul 2009, Dr. Olaf Hoffmann wrote:
>
> in the current draft are mentioned in 2.8
> http://www.w3.org/TR/2009/WD-html5-20090423/infrastructure.html#character-encodings-0
> some 'willful' misinterpretations of encoding information, for example
> to interpret a string like 'ISO-8859-1' as 'Windows-1252'.
>
> 1. Which string has an author to note, if he really wants to indicate, that
> the encoding is for example 'ISO-8859-1' and not 'Windows-1252'?

"ISO-8859-1". If the author has really used that encoding, then there is no difference between them (1252 is a superset).

> 2. As far as I have seen, HTML5 has no version indication like previous
> versions of HTML had and other popular formats like SVG have.
> How can a browser identify, that a document is really intended as
> 'HTML5' with the implicated 'willful' misinterpretations of encoding
> information and no other HTML version?

It doesn't matter, all versions of HTML are in practice processed with these mappings. It is indeed why HTML5 has these mappings -- because browsers already did this. We wouldn't add these mappings if we didn't have to to handle legacy content (content in previous versions of HTML).

> Assuming that a viewer is able to identify a document somehow being a
> HTML5 document after looking into the content and for example a server
> sent 'ISO-8859-1' before, does this mean, that the viewer switches to
> or reparses the document with 'Windows-1252' again?

I don't understand the question.

> Obviously it would be better to avoid such misinterpretation by using an
> encoding like UTF-8 not confused by the current HTML5 draft, however due
> to the history of older projects or server configurations it might be
> still convenient for many authors to continue to use 'ISO-8859-1'
> instead of other encodings, even if they switch for example from HTML4
> to HTML5 for some documents.

Hopefully my answers above will reassure you that this is not in fact a problem that authors will face.

--
Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.
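The "superset" remark above can be checked byte by byte. The following is only a minimal sketch, assuming Python's built-in `iso-8859-1` and `cp1252` codecs are acceptable stand-ins for the two encodings; the function name `differing_bytes` is made up for illustration. Note that Python's `cp1252` codec leaves five byte values unassigned and raises an error for them, which is a property of that codec, not something stated in this thread.

```python
def differing_bytes():
    """Return the byte values that the two codecs decode differently."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        latin1 = raw.decode("iso-8859-1")  # always succeeds: byte N -> U+00NN
        try:
            cp1252 = raw.decode("cp1252")
        except UnicodeDecodeError:
            cp1252 = None  # unassigned in Python's cp1252 codec
        if latin1 != cp1252:
            diffs.append(value)
    return diffs


diffs = differing_bytes()
print(f"{len(diffs)} differing bytes, from 0x{diffs[0]:02X} to 0x{diffs[-1]:02X}")
# prints: 32 differing bytes, from 0x80 to 0x9F

# Byte 0x80 is a C1 control character in ISO-8859-1 but the Euro sign in cp1252.
print(repr(b"\x80".decode("iso-8859-1")), repr(b"\x80".decode("cp1252")))
```

Outside the range 0x80 to 0x9F the two codecs agree, so a document that only uses printable ISO-8859-1 characters decodes identically under either label.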
|
|
Re: [HTML5] 2.8 Character encodings

Ian Hickson:
> On Mon, 6 Jul 2009, Dr. Olaf Hoffmann wrote:
> > in the current draft are mentioned in 2.8
> > http://www.w3.org/TR/2009/WD-html5-20090423/infrastructure.html#character-encodings-0
> > some 'willful' misinterpretations of encoding information,
> > for example to interpret a string like 'ISO-8859-1' as 'Windows-1252'.
> >
> > 1. Which string has an author to note, if he really wants to indicate,
> > that the encoding is for example 'ISO-8859-1' and not 'Windows-1252'?
>
> "ISO-8859-1". If the author has really used that encoding, then there is
> no difference between them (1252 is a superset).

I know that there is only a difference for a few characters, not relevant for ISO-8859-1 usage (see the tables at Wikipedia, for example). However, because I prefer to provide only well defined documents in publications, I want to be sure that if ISO-8859-1 is mentioned within a document, this really means ISO-8859-1 and not something else. The current draft defines something else, something I never used (and my preferred editor, Kate from KDE, is able to distinguish between 'ISO-8859-1' and 'cp 1252'; Windows-1252 seems to be an alias for this encoding).

The only way I can see with the current draft is not to use the string 'ISO-8859-1' at all for 'HTML5', because this format defines that it is interpreted as 'Windows-1252' and not, as intended, as 'ISO-8859-1'. This is a problem, because 'ISO-8859-1' is the default encoding for HTML4, for example. Therefore, switching a project that still uses HTML4 (and not yet XHTML) to 'HTML5' seems to require switching to UTF-8 to avoid plurivalences with the encoding due to the current draft.

> > 2. As far as I have seen, HTML5 has no version indication like previous
> > versions of HTML had and other popular formats like SVG have.
> > How can a browser identify, that a document is really intended as
> > 'HTML5' with the implicated 'willful' misinterpretations of encoding
> > information and no other HTML version?
>
> It doesn't matter, all versions of HTML are in practice processed with
> these mappings. It is indeed why HTML5 has these mappings -- because
> browsers already did this. We wouldn't add these mappings if we didn't
> have to to handle legacy content (content in previous versions of HTML).

Well, for HTML4 and XHTML1.x and all other XML formats this is simply a bug of the browser, nothing for authors to worry about, because the string 'ISO-8859-1' has a well defined meaning in all these formats, completely independent of the behaviour of current buggy browsers. And if some authors are led by this bug to indicate 'cp 1252' as 'ISO-8859-1', I think this is even more of an indication that the bug has to be fixed in browsers, so that those authors are informed and fix their documents to get a well defined encoding, instead of hiding the bug and preventing authors from fixing it. 'ISO-8859-1' authors in particular currently do not have to worry about the bug, because they (typically) do not use the characters that have a different meaning in the two encodings. With the current 'HTML5' draft, the indication 'ISO-8859-1' is already plurivalent and has to be avoided to create a well defined document in the format 'HTML5' (or 'HTML5' has to be avoided to create a well defined document).

> > Assuming that a viewer is able to identify a document somehow being a
> > HTML5 document after looking into the content and for example a server
> > sent 'ISO-8859-1' before, does this mean, that the viewer switches to
> > or reparses the document with 'Windows-1252' again?
>
> I don't understand the question.

If a server, an XML processing instruction or maybe a meta element indicates the encoding as 'ISO-8859-1', a proper browser has to encode the document with 'ISO-8859-1' (with the implication that some characters defined differently in 'Windows-1252' do not have a useful graphical or acoustical representation in 'ISO-8859-1'). If, during processing, this hypothetical proper browser is somehow able to detect that the current document is 'HTML5', the browser has to switch the encoding, and some characters may be interpreted differently (maybe including a useful graphical or acoustical representation). Obviously there are several options for such a proper browser as to when and how to switch.

Of course, this problem does not occur if a buggy browser interprets the encoding wrongly for all formats, not only for 'HTML5' ;o)
But 'HTML5' cannot redefine how to interpret encoding information for other formats or versions (HTML4, XHTML1, SVG, MathML, RDF, DAISY, FictionBook etc).

> > Obviously it would be better to avoid such misinterpretation by using an
> > encoding like UTF-8 not confused by the current HTML5 draft, however due
> > to the history of older projects or server configurations it might be
> > still convenient for many authors to continue to use 'ISO-8859-1'
> > instead of other encodings, even if they switch for example from HTML4
> > to HTML5 for some documents.
>
> Hopefully my answers above will reassure you that this is not in fact a
> problem that authors will face.

Yes and no. The behaviour of current buggy browsers for other formats is in practice not really a problem for authors of 'ISO-8859-1' documents, because these documents typically will not contain the plurivalent characters. However, if it is specified that 'ISO-8859-1' is not 'ISO-8859-1' in 'HTML5', this is a general problem for authors wanting to create well defined documents with this encoding, because there is currently no string for 'HTML5' that provides the proper encoding information: the commonly used string 'ISO-8859-1' is tainted or corrupted by the current 'HTML5' draft.

If it is known that many buggy browsers use the wrong encoding, this can of course be mentioned in the 'HTML5' draft as an (important) informational note for authors to be careful, but the draft should not redefine the meaning of the string in a way that is incompatible with every other format. The 'HTML5' draft should not mix up reporting browser bugs with proper definitions. If this happens more often, thoughtful authors may get the impression that 'HTML5' is only about the behaviour of current buggy browsers and not something that defines a new version of (X)HTML, which is something completely different.

Even if there is no browser without bugs, an author can still write well defined documents in such a format, but this is not possible if the format itself has major plurivalences and contradictions. Indeed, though there was never a browser interpreting HTML4 completely and correctly, it is still pretty useful for writing documents (maybe less useful than XHTML, but 'HTML5' avoids this problem by already defining an XML variant too). Containing too many plurivalences, bloomers and historical stupidities due to browser bugs, 'HTML5' will never be useful to create well defined documents.
|
|
Re: [HTML5] 2.8 Character encodings

On Tue, 28 Jul 2009, Dr. Olaf Hoffmann wrote:
> > >
> > > 1. Which string has an author to note, if he really wants to
> > > indicate, that the encoding is for example 'ISO-8859-1' and not
> > > 'Windows-1252'?
> >
> > "ISO-8859-1". If the author has really used that encoding, then there
> > is no difference between them (1252 is a superset).
>
> I know, that there is only a difference for a few characters, not
> relevant for ISO-8859-1 usage (see tables at wikipedia for example).
> However, because I prefer to provide only well defined documents in
> publications, I want to be sure, that if ISO-8859-1 is mentioned within
> a document, that this really means ISO-8859-1 and not something else.

If you use ISO-8859-1 correctly (i.e. no control characters) then there is no way to distinguish the behaviour you want from the behaviour the spec describes if you label your document as ISO-8859-1.

> This is a problem, because 'ISO-8859-1' is the default encoding for
> HTML4 for example.

HTML4 has no default encoding.

> Therefore to switch from project still having HTML4 (and no XHTML
> already) to 'HTML5' seems to require to switch to UTF-8 to avoid
> plurivalences with the encoding due to the current draft.

If you're using ISO-8859-1 correctly, you can continue using it without any change whatsoever.

> > > 2. As far as I have seen, HTML5 has no version indication like
> > > previous versions of HTML had and other popular formats like SVG
> > > have. How can a browser identify, that a document is really intended
> > > as 'HTML5' with the implicated 'willful' misinterpretations of
> > > encoding information and no other HTML version?
> >
> > It doesn't matter, all versions of HTML are in practice processed with
> > these mappings. It is indeed why HTML5 has these mappings -- because
> > browsers already did this. We wouldn't add these mappings if we didn't
> > have to to handle legacy content (content in previous versions of
> > HTML).
>
> Well for HTML4 and XHTML1.x and all other XML formats this is simple
> a bug of the browser, nothing to worry about for authors, because the
> string 'ISO-8859-1' has a well defined meaning in all these formats
> completely independent from the behaviour of current buggy browsers.

A bug that is done the same way by all browsers is a de facto standard, and unless you don't really care about how your document is seen by your readers, it is something you have to worry about.

> And if some authors are forced by this bug to indicate 'cp 1252' as
> 'ISO-8859-1' I think, this is even more an indication, that this bug
> has to be fixed in browsers to inform those authors to fix their
> documents to get a well defined encoding for their document instead of
> hiding such a bug to prevent authors to fix such a nasty bug.

That would be wonderful, but we can't get there from here since it would cause pages to break and thus browsers refuse to do it.

> Especially 'ISO-8859-1' authors currently do not have to worry about
> the bug, because they (typically) do not use the characters with
> different meaning in both encodings.

Indeed, as they are control characters there is no reason that they should ever use those characters at all.

> With the current 'HTML5' draft already the indication as 'ISO-8859-1' is
> plurivalent and has to be avoided to create a well defined document in
> the format 'HTML5' (or 'HTML5' has to be avoided to create a well
> defined document).

It's well-defined, just not defined the way you would like.

> If a server, an XML-processing instruction or maybe a meta-element
> indicates the encoding as 'ISO-8859-1' a proper browser has to encode
> the document with 'ISO-8859-1' (with the implication that some characters
> defined differently in 'Windows-1252' do not have a useful graphical or
> acoustical representation in 'ISO-8859-1'). If within the processing, these
> hypothetical proper browser is able to detect somehow, that the current
> document is 'HTML5', the browser has to switch the encoding and some
> characters may be interpreted differently (maybe including a useful
> graphical or acoustical representation).

I assume you mean "decoding", not "encoding".

A proper HTML5 browser will treat ISO-8859-1 as Windows-1252 for any text/html processing. There is no need for a magical detection ability.

> But 'HTML5' cannot redefine how to interpret encoding information for
> other formats or versions (HTML4, XHTML1, SVG, MathML, RDF, DAISY,
> FictionBook etc)

HTML5 redefines how to interpret encoding information for HTML4, as it replaces it. It does not affect processing of XML or other formats.

(In practice, what HTML5 requires is what was implemented anyway for HTML4, so there is no actual difference.)

> If it is known, that many buggy browser use the wrong encoding, of
> course, this can be mentioned in the 'HTML5' draft as an (important)
> informational note for authors to be careful, but the draft should not
> redefine the meaning of the string incompatible to any other format.

Yes, it should. I'm documenting what is actually implemented, and making sure we have interoperable behaviour, and making sure that new tools will work with legacy content interoperably with legacy user agents and future user agents. This requires a candid approach and cannot be achieved by beating around the bush in a politically correct way about what should ideally have happened vs what actually happens in the real world.

Yes, this means ISO-8859-1's control characters can't be expressed in HTML documents. Tough. Deal with it. That's how implementations are, that's what legacy content relies on, and pretending otherwise is a big waste of everyone's time.

> Indeed, though there was never a browser interpreting
> HTML4 complete and correct

Exactly. There wasn't. There _will_ be one that implements HTML5 completely and correctly.

> Containing too many plurivalences, bloomers and historical stupidities
> due to browser bugs, 'HTML5' will never be useful to create well defined
> documents.

It's already useful for this purpose. Being honest about what happens makes it _more_ useful.

--
Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.
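The rule described above (treat the label ISO-8859-1 as Windows-1252, but only for text/html) needs no document inspection at all, since it is keyed on the media type. A minimal sketch of that lookup follows, with loud caveats: the names `TEXT_HTML_OVERRIDES` and `effective_encoding` are invented for illustration, the table contains only the one mapping mentioned in this thread, and this is not the draft's actual algorithm.

```python
# Hypothetical override table: only the ISO-8859-1 entry comes from this thread.
TEXT_HTML_OVERRIDES = {"iso-8859-1": "windows-1252"}


def effective_encoding(declared_label, media_type):
    """Return the codec name a decoder would use for the declared label.

    The override applies to text/html only; XML and other formats keep
    the label exactly as declared.
    """
    label = declared_label.strip().lower()
    if media_type.strip().lower() == "text/html":
        return TEXT_HTML_OVERRIDES.get(label, label)
    return label


payload = b"\x93quoted\x94"  # curly quotes as Windows-1252 bytes

# Served as text/html: the label is overridden, the quotes decode as intended.
html_codec = effective_encoding("ISO-8859-1", "text/html")
print(html_codec, repr(payload.decode(html_codec)))

# Served as XHTML: the label is honoured, the same bytes become C1 controls.
xml_codec = effective_encoding("ISO-8859-1", "application/xhtml+xml")
print(xml_codec, repr(payload.decode(xml_codec)))
```

The design point is that the switch depends on how the document is served, not on a version marker inside it.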
|
|
Re: [HTML5] 2.8 Character encodings

Hello,

over the last ten or more years I have gained a lot of experience, especially with German authors (they typically want to use umlauts and the ß ligature, and sometimes the Euro sign) and their problems. Every month I have to explain again how to distinguish between UTF-8 and ISO-8859-1(5), and that ISO-8859-1 has no BOM, or rather that a BOM is displayed as visible characters in some browser versions, etc. One has to explain that the indication of the server (stupid or correct) is more relevant than indications within the document, how to get it correct with PHP or with the .htaccess file of the Apache web server, and that one cannot change the encoding within one document.

And of course, I still have in mind the behaviour of the browsers I used years ago, when the Euro was introduced in Europe and some authors tried to use the Euro sign without masking within ISO-8859-1. There was a clear indication that this does not work in these legacy browsers. Therefore it is not true that the interpretation of 'ISO-8859-1' as 'Windows-1252' is compatible with older browsers: they (or some of them) had no bug and indicated the unrepresentable character with, for example, a question mark or a box. And indeed, it was simple to explain that the author either has to use another encoding or has to mask the character to fix his/her bug. This simple approach is maybe corrupted now by bugs in current versions of browsers.

'HTML5' seems to introduce a new rule for how to identify the encoding. The encoding problem is obviously already hard to understand for many authors. Suddenly this new rule, with some opaque method to identify 'HTML5' documents, complicates the situation even more and makes it much harder to explain what to do to get a well defined document or script output and how to fix bugs.

Therefore the main questions remain open up to here:

1. How to indicate the 'ISO-8859-1' encoding within an 'HTML5' document and not 'Windows-1252', if an author wants to specify 'ISO-8859-1' and nothing else?

2. How does a proper viewer/browser identify that a document is 'HTML5' and that this specific rule has to be applied if 'ISO-8859-1' is indicated?

3. At which point does the encoding information switch from the information given by the server or the XML processing instruction to the specific rule of 'HTML5' to interpret the string 'ISO-8859-1' as an indication for 'Windows-1252'?

Indeed, up to here, this is all about encoding information, not how a document is decoded by the viewer (buggy or not).

Olaf
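The "masking" of the Euro sign mentioned above can be illustrated with a small Python sketch. It assumes that masking means replacing an unrepresentable character with a numeric character reference; the helper name `encode_for_latin1` and the sample strings are invented for the example.

```python
def encode_for_latin1(text):
    """Encode text as ISO-8859-1, masking unrepresentable characters
    as numeric character references (for example &#8364; for the Euro sign)."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


# Umlauts and the sharp s exist in ISO-8859-1 and encode to single bytes.
print(encode_for_latin1("Grüße"))        # b'Gr\xfc\xdfe'

# The Euro sign does not exist in ISO-8859-1 ...
try:
    "Preis: 5 €".encode("iso-8859-1")
except UnicodeEncodeError as exc:
    print("not representable:", exc.reason)

# ... so it has to be masked (or another encoding has to be chosen).
print(encode_for_latin1("Preis: 5 €"))   # b'Preis: 5 &#8364;'
```

A document produced this way contains no bytes in the range 0x80 to 0x9F, so it decodes the same whether a viewer applies ISO-8859-1 or Windows-1252.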
|
|
Re: [HTML5] 2.8 Character encodings

Just so I'm clear, the problem you are defining is that HTML5 requires browsers to interpret ISO-8859-1 as Windows-1252 [1], and traditionally authors that used Windows-1252 encoding, but specified ISO-8859-1 for the character set, were given feedback of the mismatch because the browser would fail to render the correct glyph. And because this feedback has been removed in HTML5 (the correct glyph is shown despite the charset mismatch), it goes unnoticed by the author and the page remains misidentified as ISO-8859-1 when it's really Windows-1252. Is that correct?

It may be worthwhile to point out that Firefox 2 & 3 and Internet Explorer 7 & 8 all render content encoded as Windows-1252 but identified as ISO-8859-1 as Windows-1252 -- the exact behavior of HTML5. And it's possible older versions of FF/IE and other browsers also have this behavior. So the lack of feedback already exists, which may explain the preponderance of pages misidentified as ISO-8859-1.

Given the above, I'd argue that HTML5 is more compatible with existing behavior than it would be if it required strict rendering of ISO-8859-1.

If you are curious, you can test your browser to see how it renders UTF-8 and Windows-1252 byte streams given a particular encoding:

http://www.corry.biz/charset_mismatch.lasso

For fun, try MacRoman encoding in Internet Explorer, you'll see it is rendered as Windows-1252.

- Bil

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html#misinterpreted-for-compatibility

Dr. Olaf Hoffmann wrote on 7/31/2009 5:53 AM:
> Hello,
>
> within the last ten or more years I have already a lot
> of experience especially with german authors (they
> typically want to use Umlaute and the ß-ligature and
> sometimes the Euro-sign) and their problems.
> Every month I have to explain again, how to
> distinguish between UTF-8 and ISO-8859-1(5) and
> that ISO-8859-1 has no BOM - respectively that
> this is displayed as visible characters in some
> browser versions etc.
> One has to explain, that the indication of the
> server (stupid or correct) is more relevant than
> indications within the document and how to
> get it correct with PHP or with the .htaccess file
> of the Apache web-server and that one cannot
> change the encoding within one document.
>
> And of course, I still have in mind the behaviour
> of the browsers I used years ago, when the
> Euro was introduced in europe and some authors
> tried to use the Euro-sign without masking
> within ISO-8859-1. There was a clear indication,
> that this does not work in these legacy browsers,
> therefore it is not true, that the interpretation
> of 'ISO-8859-1' as 'Windows-1252' is compatible
> with older browsers - they/some had no bug and
> indicated the not representable character for
> example with a question mark, a box etc.
> And indeed, it was simple to explain, that the
> author either has to use another encoding or
> has to mask the character to fix his/her bug.
> This simple approach is maybe corrupted
> now with bugs in current versions of browsers.
>
>
> 'HTML5' seems to introduce a new rule how to
> identify the encoding. The encoding problem
> is obviously already hardly understandable
> for many authors. Suddenly this new rule with
> some opaque method to identify 'HTML5'
> documents complicates the situation even more
> and makes it much harder to explain, what to
> do to get a well defined document or script
> output and how to fix bugs.
>
> Therefore the main questions remain open up to
> here:
>
> 1. How to indicate the 'ISO-8859-1' encoding
> within an 'HTML5' document and not
> 'Windows-1252', if an author wants to specify
> 'ISO-8859-1' and nothing else?
>
> 2. How does a proper viewer/browser identify,
> that a document is 'HTML5' and that this
> specific rule has to be applied, if 'ISO-8859-1'
> is indicated.
>
> 3. At which point the encoding information
> switches from the information given by the
> server or the XML processing instruction
> to the specific rule of 'HTML5' to interpret
> the string 'ISO-8859-1' as indication for
> 'Windows-1252'?
>
> Indeed, up to here, this is all about encoding
> information, not how a document is decoded
> by the viewer (buggy or not).
>
>
> Olaf
>
|
|
Re: [HTML5] 2.8 Character encodings

Bil Corry:
> Just so I'm clear, the problem you are defining is that HTML5 requires
> browsers to interpret ISO-8859-1 as Windows-1252 [1], and traditionally
> authors that used Windows-1252 encoding, but specified ISO-8859-1 for the
> character set, were given feedback of the mismatch because the browser
> would fail to render the correct glyph. And because this feedback has been
> removed in HTML5 (the correct glyph is shown despite the charset mismatch),
> it goes unnoticed by the author and the page remains misidentified as
> ISO-8859-1 when it's really Windows-1252. Is that correct?

With the still open questions I mainly try to find out whether it is possible to specify that an 'HTML5' document has an encoding like 'ISO-8859-1' and not 'Windows-1252'. As far as I understand the specification, this is not possible, but then the other questions become interesting, because typically what is relevant is what the server indicates, not what is mentioned explicitly or implicitly in the document. And if 'HTML5' changes the relevance of the encoding information (which is not necessarily bad), I of course want to know how a viewer/browser identifies a document as 'HTML5', because for other formats or versions this behaviour is simply a bug and not a feature.

This is more a problem of the missing version indication. For example, the newest XHTML variant has the attribute version="XHTML+RDFa 1.0"; the formal identification problem would be solved for 'HTML5' with, for example, version="HTML5". Then the document is at least well defined and specific rules are applicable, if the specification is known to the decoder. A proper browser of course has to search for such a version indication to know which encoding information applies before the document is presented. Even if current browsers do it differently (wrongly), this is no reason not to have a rule that defines a correct way.

> It may be worthwhile to point out that Firefox 2 & 3 and Internet Explorer
> 7 & 8 all render content encoded as Windows-1252 but identified as
> ISO-8859-1 as Windows-1252 -- the exact behavior of HTML5. And it's
> possible older versions of FF/IE and other browsers also have this
> behavior. So the lack of feedback already exists, which may explain the
> preponderance of pages misidentified as ISO-8859-1.
>
> Given the above, I'd argue that HTML5 is more compatible with existing
> behavior than it would be if it required strictly render of ISO-8859-1.
>
> If you are curious, you can test your browser to see how it renders UTF-8
> and Windows-1252 byte streams given a particular encoding:
>
> http://www.corry.biz/charset_mismatch.lasso
>
> For fun, try MacRoman encoding in Internet Explorer, you'll see it is
> rendered as Windows-1252.

As already mentioned, years ago there were browsers/versions without this bug, and for typical 'ISO-8859-1' a slightly wrong decoding is not really a problem, as already discussed. A practical problem currently only appears if a document has the wrong encoding information and a legacy browser without the bug is used to present it. Because not all legacy browsers have this bug, it misinforms authors to make them believe that everything is solved for buggy documents because all browsers compensate for this bug with yet another browser bug. However, authors of documents with wrong encoding information are guilty anyway; this is not really a problem of a specification. It is just more difficult to teach them to write proper documents. Indifferent or ignorant authors are not necessarily a problem for a specification, they are a problem mainly for the general audience.

The questions are more about the problem of how to indicate that an 'HTML5' document really has the encoding ISO-8859-1. This can be important for long-lived documents and archival storage, because in 50 or 100 or 1000 years one cannot rely on the behaviour of browsers of the year 2009, but it might still be possible to decode well defined documents with completely different programs. To simplify this, one should have simple and intuitive indications and not such a bloomer as writing 'ISO-8859-1' if you mean 'Windows-1252'.

With the current draft, one can only recommend 'HTML5'+UTF-8 or another format/version like XHTML+RDFa for long-lived documents and archival storage (which is not necessarily bad either, just something interesting to know for some people).

Olaf
|
|
Re: [HTML5] 2.8 Character encodingsDr. Olaf Hoffmann wrote on 7/31/2009 1:10 PM:
> With the still open questions I mainly try to find out whether it is > possible to specify, that a 'HTML5' document has an encoding like > 'ISO-8859-1' and not 'Windows-1252'. You do it the same way as you would for any character set, by specifying the content encoding as ISO-8859-1. Typically this is done via the Content-Type header: Content-Type: text/html; charset=ISO-8859-1 That header means, "This HTML document is in the ISO-8859-1 character set." By inference, it also means that it isn't Windows-1252, or UTF-8, etc. > As far as I understand the specification, this is not possible I think what you mean is it isn't possible to force the UA to use the ISO-8859-1 charset when specified and you're right. As I mentioned in my previous email, IE will display Windows-1252 when MacRoman is specified which is clearly wrong -- the glyphs don't even come close to matching each other. As an author you have to work around that. And as an author you must ensure the encoding is correct to get consistent results. > but > then the other questions become interesting, because typically > it is relevant what the server indicates, not what is mentioned > explictely or implicitely in the document. I haven't ever tested what happens when the content-type header doesn't match the meta, but considering it's incorrect, an author can hardly expect positive results. I wonder if HTML5 specifies the behavior in this case? > As already mentioned, years ago there were browsers/versions without this > bug and for typical 'ISO-8859-1' a slightly wrong decoding is not really a > problem, as already discussed. A practical problem currently only appears, > if a document has the wrong encoding information and a legacy browser > without the bug is used to present it. Because not all legacy browsers > have this bug, it is a misinformation of authors to make them believe, that > everything is solved for buggy documents, because all browser > compensate this bug with yet another browser bug. 
However, authors > of documents with wrong encoding informations are guilty anyway, > this is not really a problem of a specification. It is just more difficult to > teach them to write proper documents. Indifferent or ignorant > authors are not necessarily a problem for a specification, they are > a problem mainly for the general audience. This also describes the issue with browsers doing a best-guess with rendering HTML content that is malformed. The solution to both malformed HTML and misidentified charsets is to run the page through validator.w3.org -- both get flagged if wrong. If you want to see, try this: http://validator.w3.org/check?uri=http%3A%2F%2Fwww.corry.biz%2Fcharset_mismatch.lasso%3Fcharset%3DISO-8859-1 It (correctly) returns the error: Using windows-1252 instead of the declared encoding iso-8859-1. Line 22, Column 87: Unmappable byte sequence: 9d. So if your authors care about checking their markup with validator.w3.org, they will also have their charset checked as well. > The questions are more about the problem, how to indicate, > that a ' HTML5' document really has the encoding ISO-8859-1. > This can be important for long living documents and archival storage. > Because in 50 or 100 or 1000 years one cannot rely on the behaviour > of browsers of the year 2009, but it might be still possible to decode > well defined documents with completely different programs. > To simplify this, one should have simple and intuitive indications > and not such a bloomer like to write 'ISO-8859-1' if you mean > 'Windows-1252'. A 1000 years from now, if they do what HTML5 does now and use Windows-1252 when ISO-8859-1 is specified, they'll be guaranteed to correctly view the document (assuming it's in either ISO-8859-1 or Windows-1252). The same can not be said for viewing Windows-1252 as ISO-8859-1. 
> With the current draft, one can only recommend 'HTML5'+UTF-8 > or another format/version like XHTML+RDFa for long living > documents and archival storage (what is not necessarily bad too, > just something interesting to know for some people). UTF-8 isn't free from issues either -- I've seen Windows-1252 served as UTF-8 which produces illegal byte sequences. Or here's an example where the page (Windows-1252) doesn't specify a charset at all; in Firefox it's rendered as UTF-8 with broken bytes and IE it's rendered with the correct charset of Windows-1252: http://cspinet.org/new/200907301.html Which browser do you think they test their site with? Which browser do you think the end user thinks is broken? - Bil |
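The claims in this message can be checked with a short Python sketch. It is only an illustration: it assumes Python's built-in `iso-8859-1`, `cp1252` and `utf-8` codecs are fair stand-ins for the encodings discussed, and the sample strings are invented. It suggests that text which really is ISO-8859-1 decodes identically as Windows-1252, that the reverse does not hold, and that Windows-1252 bytes labelled as UTF-8 are rejected.

```python
# Text that really is ISO-8859-1 (printable range only) decodes the same way
# under Windows-1252, because the two agree on 0x20-0x7E and 0xA0-0xFF.
latin1_bytes = "Größe, café, señor".encode("iso-8859-1")
same = latin1_bytes.decode("iso-8859-1") == latin1_bytes.decode("cp1252")

# Windows-1252 text using bytes from 0x80-0x9F (here: curly quotes, 0x93/0x94).
cp1252_bytes = "\u201cquoted\u201d".encode("cp1252")
as_cp1252 = cp1252_bytes.decode("cp1252")      # curly quotes survive
as_latin1 = cp1252_bytes.decode("iso-8859-1")  # 0x93/0x94 become C1 controls

# The same bytes labelled as UTF-8 form an illegal byte sequence.
try:
    cp1252_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Byte 0x9D is unassigned in Python's cp1252 table, which is consistent with
# the validator's "Unmappable byte sequence: 9d" message quoted above.
try:
    b"\x9d".decode("cp1252")
    byte_9d_mapped = True
except UnicodeDecodeError:
    byte_9d_mapped = False

print(same, repr(as_latin1), utf8_ok, byte_9d_mapped)
```

Python's codec is stricter than a browser here: a browser would typically still render something for 0x9D, while the codec raises an error.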
|
|
Re: [HTML5] 2.8 Character encodings
Bil Corry:
> Dr. Olaf Hoffmann wrote on 7/31/2009 1:10 PM: > > With the still open questions I mainly try to find out whether it is > > possible to specify, that a 'HTML5' document has an encoding like > > 'ISO-8859-1' and not 'Windows-1252'. > > You do it the same way as you would for any character set, by specifying > the content encoding as ISO-8859-1. Typically this is done via the > Content-Type header: > > Content-Type: text/html; charset=ISO-8859-1 > > That header means, "This HTML document is in the ISO-8859-1 character set." > By inference, it also means that it isn't Windows-1252, or UTF-8, etc. This I know and is true for other formats but 'HTML5', the current draft of 'HTML5' has a specific rule, that this means 'Windows-1252' and not 'ISO-8859-1' - and this seems to supersede what the server indicates, if a viewer is able to identify is as a 'HTML5' document. > > > As far as I understand the specification, this is not possible > > I think what you mean is it isn't possible to force the UA to use the > ISO-8859-1 charset when specified and you're right. It is never possible for an author to force an arbitrary viewer to do something (specific). An careful author can only write well defined and valid documents. And an author can test it with some currently known viewers. The results of these tests are not necessarily representative for other viewers, not accessible for the author (for example because they are not published yet). > As I mentioned in my > previous email, IE will display Windows-1252 when MacRoman is specified > which is clearly wrong -- the glyphs don't even come close to matching each > other. As an author you have to work around that. And as an author you > must ensure the encoding is correct to get consistent results. > Maybe this browser does not know this encoding, however then the user should be warned, that in such a case something stupid might happen. The warning is clearly the problem of the implementor, not the author. 
Of course, the author can avoid such problems, using more common encodings like UTF-8 or ISO-8859-1. > > but > > then the other questions become interesting, because typically > > it is relevant what the server indicates, not what is mentioned > > explictely or implicitely in the document. > > I haven't ever tested what happens when the content-type header doesn't > match the meta, but considering it's incorrect, an author can hardly expect > positive results. I wonder if HTML5 specifies the behavior in this case? > I think, 'HTML5' only specifies a specific rule for these few strings, that 'Windows-1252' is used instead of them. Only if the author provides these strings, this indication is superseded, independently from the question, how the encoding was indicated. If the server sends with the content-type header UTF-8, the specific 'HTML5' rule does not apply and everything is interpreted as UTF-8, no matter what is mentioned within the document. > > As already mentioned, years ago there were browsers/versions without this > > bug and for typical 'ISO-8859-1' a slightly wrong decoding is not really > > a problem, as already discussed. A practical problem currently only > > appears, if a document has the wrong encoding information and a legacy > > browser without the bug is used to present it. Because not all legacy > > browsers have this bug, it is a misinformation of authors to make them > > believe, that everything is solved for buggy documents, because all > > browser > > compensate this bug with yet another browser bug. However, authors > > of documents with wrong encoding informations are guilty anyway, > > this is not really a problem of a specification. It is just more > > difficult to teach them to write proper documents. Indifferent or > > ignorant > > authors are not necessarily a problem for a specification, they are > > a problem mainly for the general audience. 
> > This also describes the issue with browsers doing a best-guess with > rendering HTML content that is malformed. The solution to both malformed > HTML and misidentified charsets is to run the page through validator.w3.org > -- both get flagged if wrong. If you want to see, try this: > > http://validator.w3.org/check?uri=http%3A%2F%2Fwww.corry.biz%2Fcharset_mis >match.lasso%3Fcharset%3DISO-8859-1 > > It (correctly) returns the error: > > Using windows-1252 instead of the declared encoding iso-8859-1. > Line 22, Column 87: Unmappable byte sequence: 9d. > > So if your authors care about checking their markup with validator.w3.org, > they will also have their charset checked as well. > Well, the validator has some bugs too. If no encoding is specified for text/*, iso-8859-1 should be used (according to HTTP). I noticed already some months ago, that the validator noted wrong bugs for XHTML+RDFa instead of indicating, that this cannot be validated yet (without an optional doctype within the document, however the version indication is done with the version attribute, not a doctype). I noted errors for SVG documents too. The validator is a pretty good help in many cases (less useful for SVG, because it checks almost no attribute values), but an author still has to be careful with the results... > > The questions are more about the problem, how to indicate, > > that a ' HTML5' document really has the encoding ISO-8859-1. > > This can be important for long living documents and archival storage. > > Because in 50 or 100 or 1000 years one cannot rely on the behaviour > > of browsers of the year 2009, but it might be still possible to decode > > well defined documents with completely different programs. > > To simplify this, one should have simple and intuitive indications > > and not such a bloomer like to write 'ISO-8859-1' if you mean > > 'Windows-1252'. 
> > A 1000 years from now, if they do what HTML5 does now and use Windows-1252 > when ISO-8859-1 is specified, they'll be guaranteed to correctly view the > document (assuming it's in either ISO-8859-1 or Windows-1252). The same > can not be said for viewing Windows-1252 as ISO-8859-1. > > > With the current draft, one can only recommend 'HTML5'+UTF-8 > > or another format/version like XHTML+RDFa for long living > > documents and archival storage (what is not necessarily bad too, > > just something interesting to know for some people). > > UTF-8 isn't free from issues either -- I've seen Windows-1252 served as > UTF-8 which produces illegal byte sequences. Or here's an example where > the page (Windows-1252) doesn't specify a charset at all; in Firefox it's > rendered as UTF-8 with broken bytes and IE it's rendered with the correct > charset of Windows-1252: > > http://cspinet.org/new/200907301.html > > Which browser do you think they test their site with? Which browser do you > think the end user thinks is broken? > The Gecko I use indicated ISO-8859-1 as well as Konqueror, no UTF-8 or Windows-1252, Opera notes a problem (unsupported) and notes too, that Windows-1252 is used. Because this tag soup is served as text/html without encoding information, I think, the Gecko and Konqueror are perfectly correct with ISO-8859-1. There is no indication, that this might be 'HTML5'. Therefore no specific rule from the 'HTML5' draft needs to be applied. The document does not indicate at all, which version of HTML it uses; looking into the source code, I think it uses a proprietary private slang of the author (even comments are noted wrong ;o) It cannot be expected from a program, that any version information can be derived from this tag soup. 
The presentation cannot be wrong, because it is undefined (a viewer may use any set of rules to interpret this or reject it as nonsense; a viewer can use 'HTML5' to try to interpret this, but can use any other set of rules as well, because the author does not indicate any version information at all). The encoding, however, seems to be defined (only) by the HTTP protocol, not by the document itself.

Of course, ISO-8859-1 (and Windows-1252) are only compatible with UTF-8 for a basic set of characters (the ASCII range). This can be seen more easily with documents not in English but in other languages like German, French or Spanish, which are pretty well covered by ISO-8859-1 but contain several important characters that are encoded differently in ISO-8859-1 and UTF-8. Because there is no indication that this might be XML, there is no need to think that UTF-8 is a proper choice for decoding. |
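The compatibility point in the last paragraph can be sketched in Python. This is a hedged illustration: `same_bytes_in_both` is a hypothetical helper name and the sample words are invented. It compares the byte sequences that ISO-8859-1 and UTF-8 produce for the same text.

```python
def same_bytes_in_both(text: str) -> bool:
    """True if ISO-8859-1 and UTF-8 encode the text to identical bytes."""
    return text.encode("iso-8859-1") == text.encode("utf-8")

# Pure ASCII text is byte-identical in both encodings.
ascii_only = same_bytes_in_both("plain ASCII text")

# German text with an umlaut and the sharp s is not.
german = same_bytes_in_both("Größe")

# One character, two different byte sequences.
latin1_form = "ö".encode("iso-8859-1")  # single byte 0xF6
utf8_form = "ö".encode("utf-8")         # two bytes 0xC3 0xB6

# Reading UTF-8 bytes as ISO-8859-1 never fails; it silently yields mojibake.
mojibake = "Größe".encode("utf-8").decode("iso-8859-1")

# Reading ISO-8859-1 bytes as UTF-8 usually fails outright.
try:
    "Größe".encode("iso-8859-1").decode("utf-8")
    latin1_read_as_utf8 = True
except UnicodeDecodeError:
    latin1_read_as_utf8 = False
```

The asymmetry matters for the Firefox/IE example quoted above: a wrong guess of a single-byte encoding produces odd characters, while a wrong guess of UTF-8 produces decoding errors.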
|
|
Re: [HTML5] 2.8 Character encodings
Dr. Olaf Hoffmann wrote on 8/1/2009 10:08 AM:
> Bil Corry:
>> Dr. Olaf Hoffmann wrote on 7/31/2009 1:10 PM:
>>> With the still open questions I mainly try to find out whether it is
>>> possible to specify, that a 'HTML5' document has an encoding like
>>> 'ISO-8859-1' and not 'Windows-1252'.
>> You do it the same way as you would for any character set, by specifying
>> the content encoding as ISO-8859-1. Typically this is done via the
>> Content-Type header:
>>
>> Content-Type: text/html; charset=ISO-8859-1
>>
>> That header means, "This HTML document is in the ISO-8859-1 character set."
>> By inference, it also means that it isn't Windows-1252, or UTF-8, etc.
>
> This I know and is true for other formats but 'HTML5', the current draft
> of 'HTML5' has a specific rule, that this means 'Windows-1252' and
> not 'ISO-8859-1' - and this seems to supersede what the server indicates,
> if a viewer is able to identify it as a 'HTML5' document.

I started to reply, but realized this thread is just going circular.

At issue, you are claiming the HTML5 charset rules will create problems for authors -- can you provide some real-world examples? I would be very interested to see some of your documents where your ISO-8859-1 encoding is broken by the HTML5 charset rules.

- Bil |
|
|
Re: [HTML5] 2.8 Character encodings
Bil Corry:
> I started to reply, but realized this thread is just going circular.

Indeed, the main questions remain open ;o)

> At issue, you are claiming the HTML5 charset rules will create problems for
> authors -- can you provide some real-world examples? I would be very
> interested to see some of your documents where your ISO-8859-1 encoding is
> broken by the HTML5 charset rules.

Well, because 'HTML5' is currently just a draft, I do not use it. I think that, up to now, it does not even have a version indication, therefore it is not obvious to me how to indicate that a document is written in 'HTML5'. As long as several of these issues are not solved and 'HTML5' is not really stable, I surely will not use it. Currently I use XHTML+RDFa for new projects and to fill semantic gaps of (X)HTML.

But as already mentioned, for an author of an 'ISO-8859-1'-'HTML5' document, apart from the version indication, it is already a problem to specify the used encoding properly. This problem appears while a document is written and has to be solved before publication, therefore published documents are not broken, because they simply are not published due to this problem.

Therefore, if I start to write some test documents and this problem is not avoided and a version indication is possible, I think I will use UTF-8 for those documents. Typically this means that they are incompatible with my other documents and scripts and will appear in another directory with an Apache .htaccess file indicating the different encoding. I think Apache has an option based on specific file name extensions too; this can maybe be used for directories with mixed encodings. Surely I will not explain this to other authors if this question comes up, because it is too complex for many authors. This does not cause broken documents; the construct is just more fragile, and one has to take more care where to put files and how to name them, and one has to switch the encoding in the editor for different projects.
This only means more work and more sources of possible errors, which I cannot recommend to every author. Therefore maybe I will never create more than test documents for 'HTML5', just to avoid such complications.

With the new microdata section, 'HTML5' seemed to get more interesting for authors (well, the CURIEs are still missing, but there seems to be a workaround with entity definitions within the otherwise almost empty DOCTYPE), therefore it would have been interesting to test this or to include this in tutorials for other authors, because it already has a few more semantically relevant elements than HTML4/XHTML1.x.

Olaf |
|
|
Re: [HTML5] 2.8 Character encodings
On Tue, 04 Aug 2009 10:25:35 +0200, Dr. Olaf Hoffmann
<Dr.O.Hoffmann@...> wrote: > [snip] I think, up to know, it has not even a version indication, > therefore it > is not obvious to me how to indicate, that a document is written in > 'HTML5'. This is by design. We're removing versioning from (X)HTML much like CSS does not have versioning. (To be clear, not everyone in the HTML WG agrees with this design choice.) > But as already mentioned, for an author of an 'ISO-8859-1'-'HTML5' > document apart from the version indication it is already a problem to > specify the used encoding properly. No, you can just specify it. Just like you can in HTML4. > This problem appears while a document is written and has to be solved > before publication, therefore > published documents are not broken, because they simply are not > published due to this problem. I do not follow this. > Therefore if I start to write some test documents and this problem is > not avoided and a version indication is possible, I think, I will use > UTF-8 for those documents. This seems like a good idea regardless. > Typically this means, that they are > incompatible with other of my documents and scripts and will appear > in another directory with an Apache-.htaccess file indicating the > different encoding. That is one solution. You could also always indicate the encoding in the document instead and instruct Apache to not include the charset parameter. > I think, the Apache has an option with specific file name extensions too, > this can be used for directories with mixed encodings maybe. That is an option too. You can also set headers on a per-file basis using the Files directive. > Surely I will not explain this to other authors, if this question comes > up, because it is too complex for many authors. Agreed. Encoding is largely misunderstood. It makes more sense for editors to start defaulting to UTF-8 going forward and have everyone use that, in my opinion. 
> This does not cause broken documents, the construct is just more fragile > and one has to care more, where to put and how to name files and one > has to switch the encoding in the editor for different projects. This is > only more work and more sources of possible errors, not recommendable > for every author. If you simply switch to UTF-8 for all future work this will become less and less of a problem. And then you've also covered other scripts may the need arise to use them. > Therefore maybe I will never create more than test documents for > 'HTML5' just to avoid such complications. Ok. > With the new microdata section, 'HTML5' seemed to get more > interesting for authors (well, the CURIEs are still missing, but there > seems to be a workaround with entitiy definitions within the else > almost empty DOCTYPE), therefore it would have been interesting > to test this or to include this in tutorials for other authors, because > it has already a few more semantically relevant elements than > HTML4/XHTML1.x. Since HTML5 is no longer SGML based entity definitions there will not work and are non-conforming. The reason we did this was because other than the validator no software processed text/html resources in this way leading to a lot of author confusion because of the clear mismatch between the validator and other software. -- Anne van Kesteren http://annevankesteren.nl/ |
|
|
Re: [HTML5] 2.8 Character encodings
Anne van Kesteren:
> On Tue, 04 Aug 2009 10:25:35 +0200, Dr. Olaf Hoffmann > > <Dr.O.Hoffmann@...> wrote: > > [snip] I think, up to know, it has not even a version indication, > > therefore it > > is not obvious to me how to indicate, that a document is written in > > 'HTML5'. > > This is by design. We're removing versioning from (X)HTML much like CSS > does not have versioning. (To be clear, not everyone in the HTML WG agrees > with this design choice.) Well, this is a problem for CSS too, because some properties are defined differently in CSS2.1 than in CSS2. I discovered this some time ago for example for clipping for some SVG test documents, which appeared wrong in Opera. SVG depends on CSS2, therefore these tests are still well defined, applied to (X)HTML they are not testable anymore, because CSS has no version indication. For 'HTML5' - as long as I cannot simply write version="HTML5" I cannot start to write HTML5 documents. Already this is a 'show stopper' for 'HTML5' currently. One can still discuss the current draft, but for formal reasons one cannot write a 'HTML5' document ;o) There is no problem to write HTML3.2, HTML4, XHTML1.0, XHTML1.1 or XHTML+RDFa, even if for some of them the version indication is not very elegant and not very relevant for typical user agents. > > > But as already mentioned, for an author of an 'ISO-8859-1'-'HTML5' > > document apart from the version indication it is already a problem to > > specify the used encoding properly. > > No, you can just specify it. Just like you can in HTML4. I can write the string, but indeed, if I do it, it means 'Windows-1252'. Therefore effectively, I cannot indicate, that something is 'ISO-8859-1' and not 'Windows-1252'. > > > This problem appears while a document is written and has to be solved > > before publication, therefore > > published documents are not broken, because they simply are not > > published due to this problem. > > I do not follow this. 
> > > Therefore if I start to write some test documents and this problem is > > not avoided and a version indication is possible, I think, I will use > > UTF-8 for those documents. > > This seems like a good idea regardless. Sure, if you have no history with thousands of documents or scripts. > > > Typically this means, that they are > > incompatible with other of my documents and scripts and will appear > > in another directory with an Apache-.htaccess file indicating the > > different encoding. > > That is one solution. You could also always indicate the encoding in the > document instead and instruct Apache to not include the charset parameter. Of course, the document should contain it too. However on many servers authors have no direct control over the Apache defaults. Therefore it is always a good idea to ensure, that this works indepentendly from gags of the administrator. > > > I think, the Apache has an option with specific file name extensions too, > > this can be used for directories with mixed encodings maybe. > > That is an option too. You can also set headers on a per-file basis using > the Files directive. > > > Surely I will not explain this to other authors, if this question comes > > up, because it is too complex for many authors. > > Agreed. Encoding is largely misunderstood. It makes more sense for editors > to start defaulting to UTF-8 going forward and have everyone use that, in > my opinion. > > > This does not cause broken documents, the construct is just more fragile > > and one has to care more, where to put and how to name files and one > > has to switch the encoding in the editor for different projects. This is > > only more work and more sources of possible errors, not recommendable > > for every author. > > If you simply switch to UTF-8 for all future work this will become less > and less of a problem. And then you've also covered other scripts may the > need arise to use them. 
> For some projects, it may take several years, until I update them completely. On one server I still found HTML3.2 documents this year ;o) More often content is just added or minor bugs are fixed. I think, this is the same for many authors having already thousands of documents around somewhere. Of course if you are 12 to 15 years old, starting your first project, you could start just from the beginning with a currently proper choice. However, when you start, you don't understand, what a proper choice for the future is. Especially for them we have to keep things simple to get some better quality of documents in the future. Because 'HTML5' works off a lot of historical relics and browser bugs, it is not a good options for a simple start anyway. > > Therefore maybe I will never create more than test documents for > > 'HTML5' just to avoid such complications. > > Ok. > > > With the new microdata section, 'HTML5' seemed to get more > > interesting for authors (well, the CURIEs are still missing, but there > > seems to be a workaround with entitiy definitions within the else > > almost empty DOCTYPE), therefore it would have been interesting > > to test this or to include this in tutorials for other authors, because > > it has already a few more semantically relevant elements than > > HTML4/XHTML1.x. > --- off topic ;o) --- > Since HTML5 is no longer SGML based entity definitions there will not work > and are non-conforming. The reason we did this was because other than the > validator no software processed text/html resources in this way leading to > a lot of author confusion because of the clear mismatch between the > validator and other software. SVG tiny 1.2 documents have typically no doctype, but for this purpose, it is still pretty useful and an SVG/XML-parser interpretes this. Because for 'HTML5' it is possible to use an XML-parser, it should be possible to use this important feature too. 
Of course, because it is XML, one can simply start to mix with elements from other languages too, if this appears to be more convenient as to use microdata to indicate 'HTML5' elements or their content to represent the same meaning as those elements from other namespaces. For those microdata information one mainly has to process this (as CURIEs in XHTML+RDFa), if the semantical meaning gets relevant, not for presentation in a simple browser. For them one needs maybe only the attribute values to identify the elements for styling with CSS. And for example if these constructions refer to a human readable specification (as I have created for literature currently), it is not obvious, what a simple browser should do with this information. This is the general problem with this approach filling semantical gaps of languages with those extensions. For a presentation with simple browsers an author cannot derive much more than some styling. But if a language like (X)HTML has these gaps, this is currently the only way to combine the technical functionality of elements with detailed semantical meanings. And because the technical functionalities of (X)HTML and SVG are already interpreted in larger parts in common browsers, formats/versions like XHTML+RDFa, SVG tiny 1.2, 'HTML5' are interesting to explore methods how to provide both technical functionalities and semantical meanings. But if there are problems to indicate the encoding or the version of the used format, this is contraproductive for authors wanting to create long living and well defined and semantical rich documents, not because there are specific presentation problems in current browsers, but because one simply cannot indicate, what one is currently doing - and one self in ten years or others even later have to identify this. Olaf |
|
|
Re: [HTML5] 2.8 Character encodings
Dr. Olaf Hoffmann wrote:
> ...
> I can write the string, but indeed, if I do it, it means 'Windows-1252'.
> Therefore effectively, I cannot indicate, that something is
> 'ISO-8859-1' and not 'Windows-1252'.
> ...

Olaf, from what you write it's not totally clear that you realize that ISO-8859-1 is a proper subset of Windows-1252?

So the only difference would be the ability to diagnose problems in documents that claim to be ISO-8859-1, but actually use C1 control codes.

That being said, I do agree with Larry that the spec should phrase it differently.

BR, Julian |
|
|
Re: [HTML5] 2.8 Character encodings
On Tue, 04 Aug 2009 12:17:49 +0200, Dr. Olaf Hoffmann
<Dr.O.Hoffmann@...> wrote: > Well, this is a problem for CSS too, because some properties are > defined differently in CSS2.1 than in CSS2. Things change. This is not necessarily a problem in my experience. > I discovered this some time ago for example for clipping for > some SVG test documents, which appeared wrong in Opera. > SVG depends on CSS2, therefore these tests are still well > defined, applied to (X)HTML they are not testable anymore, > because CSS has no version indication. This is one way of looking at the problem. Another way of looking at the problem is that specifications cannot incrementally evolve like implementations do and are therefore not always accurate in what they say. Just like normal languages Web languages change now and then. > For 'HTML5' - as long as I cannot simply write version="HTML5" > I cannot start to write HTML5 documents. In effect your documents will be treated as HTML5 regardless. > Already this is a 'show stopper' for 'HTML5' currently. One can still > discuss the > current draft, but for formal reasons one cannot write a > 'HTML5' document ;o) As far as I know there are no formal reasons why one cannot write HTML5 documents and publish them and in fact many people are authoring HTML5 documents and publishing them. > There is no problem to write HTML3.2, HTML4, XHTML1.0, > XHTML1.1 or XHTML+RDFa, even if for some of them the > version indication is not very elegant and not very relevant > for typical user agents. This assumes versioning is necessary. Experience with Web browsers shows that this is not needed and avoids a lot of complexity. >> No, you can just specify it. Just like you can in HTML4. > > I can write the string, but indeed, if I do it, it means 'Windows-1252'. Not for your authoring tool or a conformance checker. > Therefore effectively, I cannot indicate, that something is > 'ISO-8859-1' and not 'Windows-1252'. 
You cannot indicate that something needs to be decoded as ISO-8859-1 by Web browsers for text/html content. This has been the case for a long time and is nothing new. >>> Therefore if I start to write some test documents and this problem is >>> not avoided and a version indication is possible, I think, I will use >>> UTF-8 for those documents. >> >> This seems like a good idea regardless. > > Sure, if you have no history with thousands of documents or scripts. More and more programming languages work with Unicode internally. What scripts would act up? >>> Typically this means, that they are >>> incompatible with other of my documents and scripts and will appear >>> in another directory with an Apache-.htaccess file indicating the >>> different encoding. >> >> That is one solution. You could also always indicate the encoding in the >> document instead and instruct Apache to not include the charset >> parameter. > > Of course, the document should contain it too. However on many servers > authors have no direct control over the Apache defaults. Therefore it is > always a good idea to ensure, that this works indepentendly from gags > of the administrator. That is not what I'm saying. What I'm saying you could instruct Apache to not include the charset parameter so you do not have to maintain a document/charset mapping within Apache. Just within the documents. >> If you simply switch to UTF-8 for all future work this will become less >> and less of a problem. And then you've also covered other scripts may >> the need arise to use them. > > For some projects, it may take several years, until I update them > completely. On one server I still found HTML3.2 documents this > year ;o) Yeah, this is nothing new :-) Lots of legacy content out there. > More often content is just added or minor bugs are fixed. > I think, this is the same for many authors having already > thousands of documents around somewhere. 
If you are just fixing minor bugs I do not see what HTML5 has to do with this.

> Of course if you are 12 to 15 years old, starting your first
> project, you could start just from the beginning with a currently
> proper choice. However, when you start, you don't understand,
> what a proper choice for the future is. Especially for them we
> have to keep things simple to get some better quality of
> documents in the future. Because 'HTML5' works off a lot
> of historical relics and browser bugs, it is not a good
> option for a simple start anyway.

HTML5 also removes a lot of historical relics, e.g. SGML. This simplifies things _a lot_ for authors, in my opinion and in my experience of talking with Web developers about this. (And according to feedback I get from colleagues who talk to Web developers on a near full-time basis.)

> --- off topic ;o) ---
>
>> Since HTML5 is no longer SGML based entity definitions there will not
>> work and are non-conforming. The reason we did this was because other
>> than the validator no software processed text/html resources in this
>> way leading to a lot of author confusion because of the clear mismatch
>> between the validator and other software.
>
> SVG tiny 1.2 documents have typically no doctype, but for this purpose,
> it is still pretty useful and an SVG/XML-parser interprets this.
> Because for 'HTML5' it is possible to use an XML-parser, it should
> be possible to use this important feature too. Of course, because
> it is XML, one can simply start to mix with elements from other
> languages too, if this appears to be more convenient as to use
> microdata to indicate 'HTML5' elements or their content to
> represent the same meaning as those elements from other
> namespaces.

Ah. I did not realize you were talking about XHTML5 (HTML5 expressed in XML). In XHTML5 ISO-8859-1 just means ISO-8859-1, not Windows-1252. The Windows-1252 mapping is only applied to text/html documents.
You can indeed use DOCTYPE features in XHTML5 although I believe it is currently recommended that you do not use them. -- Anne van Kesteren http://annevankesteren.nl/ |
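The distinction drawn in this message (the mapping applies to text/html, not to the XML serialization) can be sketched roughly as follows. This is a simplified illustration, not the draft's algorithm: `effective_encoding` and `TEXT_HTML_OVERRIDES` are hypothetical names, the table holds only the one mapping named in this thread, and the sketch looks at the Content-Type header alone, ignoring `<meta>` declarations and byte order marks.

```python
from email.message import Message
from typing import Optional

# Only the mapping discussed in this thread; the draft's own table is longer.
TEXT_HTML_OVERRIDES = {"iso-8859-1": "windows-1252"}

def effective_encoding(content_type: str) -> Optional[str]:
    """Return the decoder a browser-like consumer might use for this header."""
    msg = Message()
    msg["Content-Type"] = content_type
    declared = msg.get_content_charset()  # lower-cased, or None if absent
    if declared is None:
        return None
    if msg.get_content_type() == "text/html":
        # Legacy label is kept as written by the author, decoded as its superset.
        return TEXT_HTML_OVERRIDES.get(declared, declared)
    return declared  # e.g. application/xhtml+xml: the label means what it says

html = effective_encoding("text/html; charset=ISO-8859-1")
xhtml = effective_encoding("application/xhtml+xml; charset=ISO-8859-1")
utf8 = effective_encoding("text/html; charset=UTF-8")
print(html, xhtml, utf8)
```

Note that the author-visible label never changes in this sketch; only the decoder chosen by the consumer does, which is the point Anne makes about authoring tools and conformance checkers.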
|
|
Re: [HTML5] 2.8 Character encodings
Julian Reschke:
> Dr. Olaf Hoffmann wrote: > > ... > > I can write the string, but indeed, if I do it, it means 'Windows-1252'. > > Therefore effectively, I cannot indicate, that something is > > 'ISO-8859-1' and not 'Windows-1252'. > > ... > > Olaf, from what you write it's not totally clear that you realize that > ISO-8859-1 is a proper subset of Windows-1252? > This seems to fit to what is noted for example at wikipedia for ISO/IEC 8859-1, not for ISO-8859-1, these are different for some control characters. Therefore ISO-8859-1 and Windows-1252 seem to be a superset of ISO/IEC 8859-1, but ISO-8859-1 is not a subset of Windows-1252, but the conflicting characters are typically not used in documents with correctly indicated ISO-8859-1 encoding. http://en.wikipedia.org/wiki/ISO/IEC_8859-1. What I personally use in (X)HTML is typically the ASCII subset, where I do not even have to care about differences between 'ISO-8859-1' and 'UTF-8' - if a server administrator (or a browser implementor) decides something surprising. However, because masking of special characters like Umlaute and the ß-ligature is not always available in the (X)HTML style and for example for an XML parser like Opera it seems to depend on the XHTML-version/doctype, if the (X)HTML predefined entities are known, I have to switch to more critical things. Many other others rely already for many years on unmasked special characters. And if they want to use and indicate 'ISO-8859-1', this should be possible. No problem too, if they want to use and indicate 'Windows-1252' as it is not problem to use and indicate 'UTF-8' - but mixing up this in a specification means basically confusion for some authors reading this, especially for those, who already have problems to indicate the encoding they used properly. Therefore it is not a big practical problem in what browsers currently do, if 'ISO-8859-1' is specified (if 'ISO-8859-1' is used). It is more, that readers of the draft are confused by the wording. 
If the draft said something like:

"If 'ISO-8859-1' etc. is indicated as a document encoding, an HTML5 parser will/may use 'Windows-1252' to decode the presentable characters."

this would already be less confusing. It does not change the meaning of the string or the document; it only describes the behaviour of the parser, which is a difference. For more advanced authors one could add something like:

"Due to this rule, wrong encoding information may go unnoticed. Some non-presentable control characters of 'ISO-8859-1' might be presented as presentable characters according to Windows-1252."

This perhaps indicates the intended reason for the behaviour, and it also indicates that such parsers should not be used to check proper encoding/decoding (which many authors nevertheless do, with known consequences ;o)

> So the only difference would be the ability to diagnose problems in
> documents that claim to be ISO-8859-1, but actually use C1 control codes.
>
> That being said, I do agree with Larry that the spec should phrase it
> differently.
>
> BR, Julian
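The subset question raised in this message can be checked byte by byte. The following is only a rough sketch, assuming Python's built-in `latin-1` and `cp1252` codecs are acceptable stand-ins for 'ISO-8859-1' and 'Windows-1252'; the helper name `same_outside_c1` is made up for illustration.

```python
# Every byte value outside the C1 range 0x80-0x9F.
SHARED = list(range(0x00, 0x80)) + list(range(0xA0, 0x100))


def same_outside_c1() -> bool:
    """True if both codecs decode every byte outside 0x80-0x9F identically."""
    return all(
        bytes([b]).decode("latin-1") == bytes([b]).decode("cp1252")
        for b in SHARED
    )


# Inside the C1 range the two codecs disagree, for example at byte 0x80.
euro_byte = b"\x80"
as_latin1 = euro_byte.decode("latin-1")  # control character U+0080
as_cp1252 = euro_byte.decode("cp1252")   # EURO SIGN U+20AC
```

Under these assumptions, a document that really contains only ISO-8859-1 text without C1 control codes decodes to the same characters either way; only bytes in 0x80-0x9F are affected.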
|
|
Re: [HTML5] 2.8 Character encodings
Anne van Kesteren:
> On Tue, 04 Aug 2009 12:17:49 +0200, Dr. Olaf Hoffmann > > <Dr.O.Hoffmann@...> wrote: > > Well, this is a problem for CSS too, because some properties are > > defined differently in CSS2.1 than in CSS2. > > Things change. This is not necessarily a problem in my experience. That is, why to have different versions and a version indication is useful. 'HTML5' changes or redefines the meaning and partly the functionality of some elements too, compared to HTML4 (or XHTML1.x). There is a difference for example for elements like small or object. Again a version indication is useful, but of course does not solve the problem completely (for example if the elements are used in mixed XML-documents, indicating only the namespace, not the version). And it is of course a problem for the interpretation of older documents, at least, if the creation time is not conserved. Obviously a stylesheet from 2000 does not mean CSS2.1, for 2007 no one knows if 2.0 or 2.1. Not every HTML document without a version indication is 'HTML5' - could be old tag soup as well. Without a version indication it should be no problem, to interprete it with an arbitrary rule set, including 'HTML5', but this is no indication, that such a tag soup document has a well defined meaning at all. > > > I discovered this some time ago for example for clipping for > > some SVG test documents, which appeared wrong in Opera. > > SVG depends on CSS2, therefore these tests are still well > > defined, applied to (X)HTML they are not testable anymore, > > because CSS has no version indication. > > This is one way of looking at the problem. Another way of looking at the > problem is that specifications cannot incrementally evolve like > implementations do and are therefore not always accurate in what they say. 
For this example, both CSS2.0 and CSS2.1 have understandable and testable definitions of the same property, they are just different (by the way, the behaviour I found for Opera is different from both ;o) Without version indication this is no evolution, this simply means, that is has no relyable meaning (and a different interpretation in different versions of browsers as well). Both meaning and interpretation get unpredictable for authors, the second at least for several years. > Just like normal languages Web languages change now and then. > But not the versions and not the intended meaning of already published documents. If this would be the case, it would be impossible to write well defined documents/content in an electronic format at all. Maybe the interpretations changes, because we have other experiences now than when such a document was written, but this does not change, what the author had in mind, when the document was created. > > For 'HTML5' - as long as I cannot simply write version="HTML5" > > I cannot start to write HTML5 documents. > > In effect your documents will be treated as HTML5 regardless. > Here you mix up again interpretation and content. If you like, you can interprete a 'microsoft word' document or a postscript or PDF as well as 'HTML5', this does not necessarily mean, that you get the intended meaning of the document with such an interpretation and it is not obvious, how to interprete this in a useful way, if author/server send them as text/html. This can already happen with translations of documents from one language to another (Shakespeare in german, Goethe in english?), it is obviously even worse, if you try to interprete such documents in the other language without a translation - but of course, you can do it ;o) A specification like that for 'HTML5' has more than the aspect of interpretation. For authors, it has the aspect too to indicate, specifiy, markup the intended meaning. 
Once done, this does not change, if the author does not change the document. But the interpretation can change any time or can depend on the reader (or tool the reader uses to interprete the document). > > Already this is a 'show stopper' for 'HTML5' currently. One can still > > discuss the > > current draft, but for formal reasons one cannot write a > > 'HTML5' document ;o) > > As far as I know there are no formal reasons why one cannot write HTML5 > documents and publish them and in fact many people are authoring HTML5 > documents and publishing them. > Well do you have samples? How do they identify them as 'HTML5'? and distinguish from undefined tag soup without a version indication? Not, how these document are maybe interpreted today, what does the author indicate, what they are? Maybe in ten years they are interpreted as 'HTML6' - does it mean, that an author has written them as 'HTML6' today? If I have written some HTML tag soup in 1997, does it mean, that this is 'HTML5', just because this tag soup has no version indication? Does it mean in 10 years, that I have written 'HTML6' already in 1997? No, if the creation date is known, it simply means, that this document is one of my first stupid attempts to write an HTML document, not more. And a current browser will typically not interprete this old tag soup like the netscape3.2 I used to 'check' the appearance of this tag soup, what is no problem, because it is simply tag soup without a requirement for a specific interpretation or a well defined meaning. > > There is no problem to write HTML3.2, HTML4, XHTML1.0, > > XHTML1.1 or XHTML+RDFa, even if for some of them the > > version indication is not very elegant and not very relevant > > for typical user agents. > > This assumes versioning is necessary. Experience with Web browsers shows > that this is not needed and avoids a lot of complexity. > Web browser only interprete documents, and the common experience shows, that they do it wrong and incomplete. 
And of course, it helps a lot to have a look into the source code, find a well defined, structured document with version indication for the interpretation, if the browser fails. One can try to find the problem - more often bugs in the document than a real problem with the browser. And one can try to find out, what was intended by the author. For flawed documents, the intention of the author is often different from the interpretation of the browser, what is basically a task for the author to fix the bugs - one cannot expect, that a simple program like a browser can guess, what was intented by the author of a flawed document. And even if some error management is defined in 'HTML5', one cannot assume, that the result is often close to the intention of the author, it is just one interpretation of a flawed document. > >> No, you can just specify it. Just like you can in HTML4. > > > > I can write the string, but indeed, if I do it, it means 'Windows-1252'. > > Not for your authoring tool or a conformance checker. > > > Therefore effectively, I cannot indicate, that something is > > 'ISO-8859-1' and not 'Windows-1252'. > > You cannot indicate that something needs to be decoded as ISO-8859-1 by > Web browsers for text/html content. This has been the case for a long time > and is nothing new. It is at least new in 'HTML5', that one cannot indicate 'ISO-8859-1' as encoding, what is a difference, again the problem to mix up the content of a document with the interpretation of it. There is no way for an author to prevent any interpretation. But the author can try to indicate, what was intented. And as I already mentioned, I have seen years ago already browsers without this behaviour/bug, what was somehow interesting for documents with wrong encoding information and unmasked euro signs ;o) |
|
|
Re: [HTML5] 2.8 Character encodings
On Tue, 04 Aug 2009 17:12:29 +0200, Dr. Olaf Hoffmann
<Dr.O.Hoffmann@...> wrote: > And it is of course a problem for the interpretation of older documents, > at least, if the creation time is not conserved. Obviously a stylesheet > from 2000 does not mean CSS2.1, for 2007 no one knows if 2.0 or 2.1. Actually, it probably does. Because actual deployed style sheets and implementations in Web browsers of CSS 2.0 were what guided the development of CSS 2.1. This is the same for HTML 4 and HTML 5. >> This is one way of looking at the problem. Another way of looking at the >> problem is that specifications cannot incrementally evolve like >> implementations do and are therefore not always accurate in what they >> say. > > For this example, both CSS2.0 and CSS2.1 have understandable > and testable definitions of the same property, they are just different > (by the way, the behaviour I found for Opera is different from both ;o) Please file a bug if you have not already. Thanks! The CSS WG considers CSS 2.0 to be mostly obsolete by the way. You can see that the /TR/CSS2/ recently changed to point to CSS 2.1 as well. I am just trying to explain the state of affairs with respect to CSS here. While you can claim it is supposed to be different, that does not make it so. > Without version indication this is no evolution, this simply means, > that is has no relyable meaning (and a different interpretation in > different versions of browsers as well). Both meaning and > interpretation get unpredictable for authors, the second at least > for several years. I think this is mostly because both HTML 4 and CSS 2.0 lacked rigorous testing of implementations and taking implementation and author feedback into account. With HTML 5 and CSS 2.1 we are trying to fix this mistake. >> Just like normal languages Web languages change now and then. > > But not the versions and not the intended meaning of already > published documents. 
If this would be the case, it would be > impossible to write well defined documents/content in an electronic > format at all. Maybe the interpretations changes, because we > have other experiences now than when such a document was > written, but this does not change, what the author had in mind, > when the document was created. Actually, what we find again and again is that what the majority of authors has in mind is the browser they are rendering their document or style sheet in. Not what the specification said. The latter only affects test suites and it is not worth preserving compatibility with those. >> > For 'HTML5' - as long as I cannot simply write version="HTML5" >> > I cannot start to write HTML5 documents. >> >> In effect your documents will be treated as HTML5 regardless. > > Here you mix up again interpretation and content. > If you like, you can interprete a 'microsoft word' document or > a postscript or PDF as well as 'HTML5', this does not necessarily > mean, that you get the intended meaning of the document with > such an interpretation and it is not obvious, how to interprete this > in a useful way, if author/server send them as text/html. > This can already happen with translations of documents from one > language to another (Shakespeare in german, Goethe in english?), > it is obviously even worse, if you try to interprete such documents > in the other language without a translation - but of course, you can > do it ;o) This is the same point as above. Authors do not write against specifications. >> > Already this is a 'show stopper' for 'HTML5' currently. One can still >> > discuss the >> > current draft, but for formal reasons one cannot write a >> > 'HTML5' document ;o) >> >> As far as I know there are no formal reasons why one cannot write HTML5 >> documents and publish them and in fact many people are authoring HTML5 >> documents and publishing them. > > Well do you have samples? 
There is http://html5gallery.com/ for instance, which collects sites made in HTML5.

> How do they identify them as 'HTML5'? and distinguish from undefined
> tag soup without a version indication?

That is not needed.

> Not, how these document are maybe interpreted today, what does the
> author indicate, what they are?

I'm not sure what you mean.

> Maybe in ten years they are interpreted as 'HTML6' - does it mean, that
> an author has written them as 'HTML6' today?

I assume that once we get to HTML 6 we will make sure to do the same as we did for HTML 5. That is, study existing content, implementations, etc. and go from there. If we succeed in what we want with HTML 5, HTML 6 will not have to make incompatible changes.

> If I have written some HTML tag soup in 1997, does it mean, that this is
> 'HTML5', just because this tag soup has no version indication? Does it
> mean in 10 years, that I have written 'HTML6' already in 1997?

Currently <!doctype html> is required for something to be considered HTML 5. However, all HTML is consumed using the algorithm defined in the specification. (Implementations have always done this, though there are differences between them because not everything was defined back in the day.)

>> This assumes versioning is necessary. Experience with Web browsers shows
>> that this is not needed and avoids a lot of complexity.
>
> Web browser only interpret documents, and the common experience shows,
> that they do it wrong and incomplete. And of course, it helps a lot to
> have a look into the source code, find a well defined, structured
> document with version indication for the interpretation, if the browser
> fails. One can try to find the problem - more often bugs in the document
> than a real problem with the browser. And one can try to find out, what
> was intended by the author. For flawed documents, the intention of the
> author is often different from the interpretation of the browser, what is
> basically a task for the author to fix the bugs - one cannot expect, that
> a simple program like a browser can guess, what was intended by the
> author of a flawed document. And even if some error management is defined
> in 'HTML5', one cannot assume, that the result is often close to the
> intention of the author, it is just one interpretation of a flawed
> document.

I do not see how this refutes my point. Requiring browsers to do more complex things will certainly not result in them doing fewer things wrong or leaving fewer things incomplete.

>>>> No, you can just specify it. Just like you can in HTML4.
>>>
>>> I can write the string, but indeed, if I do it, it means
>>> 'Windows-1252'.
>>
>> Not for your authoring tool or a conformance checker.
>>
>>> Therefore effectively, I cannot indicate, that something is
>>> 'ISO-8859-1' and not 'Windows-1252'.
>>
>> You cannot indicate that something needs to be decoded as ISO-8859-1 by
>> Web browsers for text/html content. This has been the case for a long
>> time and is nothing new.
>
> It is at least new in 'HTML5', that one cannot indicate 'ISO-8859-1'
> as encoding, what is a difference, again the problem to mix up the
> content of a document with the interpretation of it.

I think you misunderstand the specification. It is an error if a document is labeled ISO-8859-1 and does not conform to ISO-8859-1. I said that above as well.

> There is no way for an author to prevent any interpretation. But the
> author can try to indicate, what was intended.

Right.

> And as I already mentioned, I have seen years ago already browsers
> without this behaviour/bug, what was somehow interesting for
> documents with wrong encoding information and unmasked euro signs ;o)

There's a reason those browsers are no longer around.

--
Anne van Kesteren
http://annevankesteren.nl/
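The statement that a document labeled ISO-8859-1 is in error if it does not conform to ISO-8859-1 suggests the kind of check a conformance checker could run. This is only a minimal sketch of that idea, not what any real checker does: it assumes that "does not conform" can be approximated by "contains bytes in the C1 range 0x80-0x9F", and the function name `c1_offsets` is hypothetical.

```python
def c1_offsets(data: bytes) -> list:
    """Return the offsets of all bytes in the C1 range 0x80-0x9F."""
    return [i for i, b in enumerate(data) if 0x80 <= b <= 0x9F]


# A euro sign saved as Windows-1252 becomes byte 0x80, which would be
# suspicious in a document labeled ISO-8859-1.
mislabeled = "price: 5 €".encode("cp1252")
suspicious = c1_offsets(mislabeled)

# Umlauts and the ß-ligature sit above 0x9F, so they raise no flag.
clean = c1_offsets("Grüße".encode("latin-1"))
```

A browser following the draft would silently show the euro sign, while a check like this one could still report the mislabeled byte to the author.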
|
|
Re: [HTML5] 2.8 Character encodings
Anne van Kesteren:
.... > I am just trying to explain the state of affairs with respect to CSS here. > While you can claim it is supposed to be different, that does not make it > so. No, I think, both CSS and 'HTML5' share the same problem without a version indication. However, CSS is just about styling, in doubt one can switch it off. (X)HTML is about content and it is more important to have a well defined and not plurivalent meaning of elements (or encoding or other things related to content). .... > >> > For 'HTML5' - as long as I cannot simply write version="HTML5" > >> > I cannot start to write HTML5 documents. > >> > >> In effect your documents will be treated as HTML5 regardless. > > > > Here you mix up again interpretation and content. > > If you like, you can interprete a 'microsoft word' document or > > a postscript or PDF as well as 'HTML5', this does not necessarily > > mean, that you get the intended meaning of the document with > > such an interpretation and it is not obvious, how to interprete this > > in a useful way, if author/server send them as text/html. > > This can already happen with translations of documents from one > > language to another (Shakespeare in german, Goethe in english?), > > it is obviously even worse, if you try to interprete such documents > > in the other language without a translation - but of course, you can > > do it ;o) > > This is the same point as above. Authors do not write against > specifications. Well not all authors ... As an author I started to write a specification, because (X)HTML did not cover, what I wanted to markup ;o) And those people with microformats and RDF(a) indicate, that not all authors want to write things, already appearing somehow in common browsers, but have no semantical well defined meaning ;o) Those authors, who care mainly about the appearance in current browsers typically do not have long living documents in mind, even if with some luck some of these documents remain several years. 
Or they do not have much experience and rely to much on the behaviour of the current version of their preferred browser. Within the last ten years such authors typically had a lot of work updating a few documents to the current behaviour of the newest browser version. I think, this is not really a promising perspective for electronic formats (as for computer hardware this is no standard, this permanent handicraft work). > > >> > Already this is a 'show stopper' for 'HTML5' currently. One can still > >> > discuss the > >> > current draft, but for formal reasons one cannot write a > >> > 'HTML5' document ;o) > >> > >> As far as I know there are no formal reasons why one cannot write HTML5 > >> documents and publish them and in fact many people are authoring HTML5 > >> documents and publishing them. > > > > Well do you have samples? > > There is > > http://html5gallery.com/ > > for instance, which collects sites made in HTML5. Indeed, within the content, the meta description or keywords it often appears, that those projects are intended to be HTML5. For several others it can be guessed, because elements like header, footer and section are used. But that this information is spread along these meta element attributes content or the content of the pages indicates even more, that a version indication like version="HTML5" is missing. Is it expected, that such an informations always appears within the content or the description or keywords of a 'HTML5' document? Or is it intended, that the used elements are analysed to identify 'HTML5'? (I think, the gallery itself does not use elements specific for HTML5). Is it defined in the current draft how to indicate 'HTML5' with meta elements? 
Looks currently like poor design of 'HTML5' - and that many of them indicate the fact, that they use 'HTML5' with such workarounds looks pretty much like a gap in the current draft ;o) Why else to note several times within the document the used version in different ways, surely because they try to assure, that their documents are really identified somehow as 'HTML5' ;o) > > > How do they identify them as 'HTML5'? and distinguish from undefined > > tag soup without a version indication? > > That is not needed. Well, this dicussion and the samples you provided indicate, that there is at least a desire for a version indication - maybe to get well defined documents, maybe to show, that the author is a cool and funky designer using already languages, which are still drafts, even if it is not known, how to indicate, that they really use this cool new language version ;o) > > > Not, how these document are maybe interpreted today, what does the > > author indicate, what they are? > > I'm not sure what you mean. > Because languages like XHTML+RDFa, HTML4, 'HTML5', SVG, MathML, SMIL, RDF etc define somehow, what the meaning of the content of an element is, and different versions define it (slightly) different, the meaning can be only derived by knowing the version. Sometimes, if the functionality changes too, the intended behaviour can only be derived by knowing the version. To create more than currently cool and funky designer pages it is therefore important for some authors to indicate, what they really mean, not just, how things appear. And if elements are defined in different language versions to have different meanings, a version indication is required. Without it is still possible to write those single-serving pretty nice and cool and funky designer pages beeing pretty nice and cool and funky for a month or a year, because the majority of author of those pages do not care about details. For more, you have to care about details. 
As already mentioned, 'HTML5' defines the meaning of elements like small slightly different from other (X)HTML versions (for small it is different from my typical use cases, which are currently excluded by the new definition), therefore a version indication for the complete document is important or an author has to use the microdata on each element to indicate, to which definition it belongs. Maybe one can set a microdata information about the version on the root html element, referencing the current draft, this may work currently as an unambiguos version indication. This would implicate, that the child elements belong to the same version, if not indicated otherwise with even more microdata. > > Maybe in ten years they are interpreted as 'HTML6' - does it mean, that > > an author has written them as 'HTML6' today? > > I assume that once we'll get to HTML 6 we make sure to do the same we did > for HTML 5. That is studying existing content, implementations, etc. and > go from there. If we succeed in what we want with HTML 5, HTML 6 will not > have to make incompatible changes. > Who knows? The next generation may think, that semantical details are important and will reject all this creating a completely new language version avoiding all these historical meanders and it turns out, that every document without an unambiguous indication should be rejected as stupid tag soup ;o) > > If I have written some HTML tag soup in 1997, does it mean, that this is > > 'HTML5', just because this tag soup has no version indication? Does it > > mean in 10 years, that I have written 'HTML6' already in 1997? > > Currently <!doctype html> is required for something to be considered HTML > 5. However, all HTML is consumed using the algorithm defined in the > specification. (Implementations have always done this, though have > differences between them because not everything was defined back in the > days.) 
> I think, this is not completely wrong for several HTML versions, and not for XHTML or XHTML+RDFa and maybe the best choice too for documents having an XHTML:html element as root element, but elements from several other namespaces as well, maybe including entity definitions within the doctype. > >> This assumes versioning is necessary. Experience with Web browsers shows > >> that this is not needed and avoids a lot of complexity. > > > > Web browser only interprete documents, and the common experience shows, > > that they do it wrong and incomplete. And of course, it helps a lot to > > have a look into the source code, find a well defined, structured > > document with > > version indication for the interpretation, if the browser fails. One can > > try to find the problem - more often bugs in the document than a real > > problem with the browser. And one can try to find out, what was intended > > by the author. For flawed documents, the intention of the author is often > > different from the interpretation of the browser, what is basically a > > task for the author to fix the bugs - one cannot expect, that a simple > > program > > like a browser can guess, what was intented by the author of a flawed > > document. And even if some error management is defined in 'HTML5', > > one cannot assume, that the result is often close to the intention of the > > author, it is just one interpretation of a flawed document. > > I do not see how this refutes my point. Requiring browsers to do more > complex things will certainly not result in them doing less wrong or do > things less incomplete. 
No, but if the audience is able to study the souce code and to derive, what the author did and intended, this works often much better, because the human intellectual capabilities are typically superior to that of a simple browser (even if the history of HTML authors seem to implicate, that one should not overestimate human capabilities ;o) Well, unfortunately due to the limitations of capabilities I have to look quite often in the source code - and if discussed with the author how to markup texts in a technical and semantical meaningful way, it helps simply to know instead of guessing what the author tried to create and it shortens the discussion. And maybe in the far future, if there is still some interest in my documents it will hopefully help the audience to understand some of my intentions, if I use proper semantical markup and languages with well defined versions, which can be indicated within a document. > > >>>> No, you can just specify it. Just like you can in HTML4. > >>> > >>> I can write the string, but indeed, if I do it, it means > >> > >> 'Windows-1252'. > >> > >> Not for your authoring tool or a conformance checker. > >> > >>> Therefore effectively, I cannot indicate, that something is > >>> 'ISO-8859-1' and not 'Windows-1252'. > >> > >> You cannot indicate that something needs to be decoded as ISO-8859-1 by > >> Web browsers for text/html content. This has been the case for a long > >> time and is nothing new. > > > > It is at least new in 'HTML5', that one cannot indicate 'ISO-8859-1' > > as encoding, what is a difference, again the problem to mix up the > > content of a document with the interpretation of it. > > I think you misunderstand the specification. It is an error if a document > is labeled ISO-8859-1 and does not conform to ISO-8859-1. I said that > above as well. 
>

The draft says:

"When a user agent would otherwise use an encoding given in the first column of the following table to either convert content to Unicode characters or convert Unicode characters to bytes, it must instead use the encoding given in the cell in the second column of the same row."

This clearly indicates that the encoding information (not only the decoding) is changed by this rule. Therefore, within the current 'HTML5' draft it is equivalent to specify 'ISO-8859-1' or 'Windows-1252'. And there is no indication that this applies only to the text/html variant of 'HTML5' and not to the application/xhtml+xml variant as well (you claimed in a previous mail that it only applies to text/html - where did you find that?)

> > There is no way for an author to prevent any interpretation. But the
> > author can try to indicate, what was intended.
>
> Right.
>
> > And as I already mentioned, I have seen years ago already browsers
> > without this behaviour/bug, what was somehow interesting for
> > documents with wrong encoding information and unmasked euro signs ;o)
>
> There's a reason those browsers are no longer around.

I think it was a Konqueror and a Mozilla Suite on Debian. I don't remember exactly which versions, but they were not as old as several tutorials about HTML and CSS that I found within the last years, which include sophisticated hints about related issues.
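The table rule quoted from the draft can be pictured as a lookup that a user agent applies before choosing a decoder. The sketch below makes several assumptions: it contains only the 'ISO-8859-1' row discussed in this thread (the draft's table has further rows that are not reproduced here), it relies on Python's codec aliases, and the names `OVERRIDES`, `effective_encoding` and `decode_as_user_agent` are invented for illustration.

```python
# Only the row mentioned in this thread; the draft's table is longer.
OVERRIDES = {"iso-8859-1": "windows-1252"}


def effective_encoding(label: str) -> str:
    """Map a declared encoding label to the encoding actually used."""
    name = label.strip().lower()
    return OVERRIDES.get(name, name)


def decode_as_user_agent(data: bytes, label: str) -> str:
    """Decode bytes the way the quoted rule asks a user agent to."""
    return data.decode(effective_encoding(label))
```

With this sketch, `decode_as_user_agent(b"\x80", "ISO-8859-1")` yields a euro sign, while other labels such as "UTF-8" pass through unchanged. The declared label in the document stays the same; only the decoder chosen by the user agent differs.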
|
|
Re: [HTML5] 2.8 Character encodings
On Wed, 05 Aug 2009 10:58:43 +0200, Dr. Olaf Hoffmann <Dr.O.Hoffmann@...> wrote:
> No, I think, both CSS and 'HTML5' share the same problem without > a version indication. However, CSS is just about styling, in doubt > one can switch it off. (X)HTML is about content and it is more important > to have a well defined and not plurivalent meaning of elements > (or encoding or other things related to content). If there is no versioning the latest version of the standard defines the meaning of elements. So that problem does not exist. >> This is the same point as above. Authors do not write against >> specifications. > > Well not all authors ... > As an author I started to write a specification, because (X)HTML did > not cover, what I wanted to markup ;o) Writing _against_ a specification and writing _a_ specification are very different things. > And those people with microformats and RDF(a) indicate, that not > all authors want to write things, already appearing somehow in > common browsers, but have no semantical well defined meaning ;o) I do not understand this sentence. > Those authors, who care mainly about the appearance in current > browsers typically do not have long living documents in mind, even > if with some luck some of these documents remain several years. You make it sound like this is a fact, but this is not at all my experience. > Or they do not have much experience and rely to much on the > behaviour of the current version of their preferred browser. Right, which is why better interoperability is important. > Within the last ten years such authors typically had a lot of work > updating a few documents to the current behaviour of the newest > browser version. I think, this is not really a promising perspective > for electronic formats (as for computer hardware this is no > standard, this permanent handicraft work). Do you have any evidence for this? Documents from 10 years ago typically still render fine. >> There is >> >> http://html5gallery.com/ >> >> for instance, which collects sites made in HTML5. 
> > Indeed, within the content, the meta description or keywords it often > appears, that those projects are intended to be HTML5. For several > others it can be guessed, because elements like header, footer and > section are used. You mean "not used"? Those elements are part of HTML5. > But that this information is spread along these meta element attributes > content or the content of the pages indicates even more, that a > version indication like version="HTML5" is missing. I don't see why. > Is it expected, that such an informations always appears within > the content or the description or keywords of a 'HTML5' document? > Or is it intended, that the used elements are analysed to identify > 'HTML5'? The version does not need to be identified. As I said before, HTML is versionless. > (I think, the gallery itself does not use elements specific for HTML5). > Is it defined in the current draft how to indicate 'HTML5' with meta > elements? That would introduce versioning, so no. > Looks currently like poor design of 'HTML5' - and that many of them > indicate the fact, that they use 'HTML5' with such workarounds looks > pretty much like a gap in the current draft ;o) Why else to note several > times within the document the used version in different ways, > surely because they try to assure, that their documents are really > identified somehow as 'HTML5' ;o) Why would they try to assure that their documents are identified as HTML5? Browsers process HTML in the same way regardless, so it does not matter. I've said all this before though and I'm feeling this discussion is just going in circles. >> > How do they identify them as 'HTML5'? and distinguish from undefined >> > tag soup without a version indication? >> >> That is not needed. 
>
> Well, this discussion and the samples you provided indicate, that there is at least a desire for a version indication - maybe to get well defined documents, maybe to show, that the author is a cool and funky designer using already languages, which are still drafts, even if it is not known, how to indicate, that they really use this cool new language version ;o)

Could you maybe indicate what you mean? I do not get the same impression as you looking at the source code of those sites.

>>> Not, how these documents are maybe interpreted today, what does the author indicate, what they are?
>>
>> I'm not sure what you mean.
>
> Because languages like XHTML+RDFa, HTML4, 'HTML5', SVG, MathML, SMIL, RDF etc define somehow, what the meaning of the content of an element is, and different versions define it (slightly) different, the meaning can be only derived by knowing the version.

No, the meaning can only be derived by asking the author. The version does not have much to do with it. E.g. lots of authors abuse <blockquote> for indenting and others abuse longdesc="" for search engine spam. That is not the meaning of those HTML features.

> Sometimes, if the functionality changes too, the intended behaviour can only be derived by knowing the version.
> To create more than currently cool and funky designer pages it is therefore important for some authors to indicate, what they really mean, not just, how things appear. And if elements are defined in different language versions to have different meanings, a version indication is required.

Maybe in an ideal world this would be the case, but given that nobody wants to implement versioning, versioning makes things vastly more complex, and older specifications (e.g. HTML 4 and CSS 2.0) are very poorly written and ambiguous, this is unlikely to happen.

To take a step back, do you have an example of an HTML 4 page you once created that would get "weird meaning" in HTML 5?
> Without it is still possible to write those single-serving pretty nice and cool and funky designer pages being pretty nice and cool and funky for a month or a year, because the majority of authors of those pages do not care about details. For more, you have to care about details.

I care about details.

> As already mentioned, 'HTML5' defines the meaning of elements like small slightly different from other (X)HTML versions (for small it is different from my typical use cases, which are currently excluded by the new definition), therefore a version indication for the complete document is important or an author has to use the microdata on each element to indicate, to which definition it belongs. Maybe one can set a microdata information about the version on the root html element, referencing the current draft, this may work currently as an unambiguous version indication. This would implicate, that the child elements belong to the same version, if not indicated otherwise with even more microdata.

I suppose you could come up with a convention for yourself, yes.

>> I assume that once we'll get to HTML 6 we make sure to do the same we did for HTML 5. That is studying existing content, implementations, etc. and go from there. If we succeed in what we want with HTML 5, HTML 6 will not have to make incompatible changes.
>
> Who knows?

Nobody, but that's the idea.

> The next generation may think, that semantical details are important and will reject all this creating a completely new language version avoiding all these historical meanders and it turns out, that every document without an unambiguous indication should be rejected as stupid tag soup ;o)

Sure. Hopefully they study the history (e.g. with DOCTYPE switching) and realize it works poorly.

>> Currently <!doctype html> is required for something to be considered HTML 5. However, all HTML is consumed using the algorithm defined in the specification.
>> (Implementations have always done this, though have differences between them because not everything was defined back in the days.)
>
> I think, this is not completely wrong for several HTML versions, and not for XHTML or XHTML+RDFa and maybe the best choice too for documents having an XHTML:html element as root element, but elements from several other namespaces as well, maybe including entity definitions within the doctype.

I do not understand this sentence.

>> I do not see how this refutes my point. Requiring browsers to do more complex things will certainly not result in them doing less wrong or do things less incomplete.
>
> No, but if the audience is able to study the source code and to derive, what the author did and intended, this works often much better, because the human intellectual capabilities are typically superior to that of a simple browser (even if the history of HTML authors seem to implicate, that one should not overestimate human capabilities ;o)
> Well, unfortunately due to the limitations of capabilities I have to look quite often in the source code - and if discussed with the author how to markup texts in a technical and semantical meaningful way, it helps simply to know instead of guessing what the author tried to create and it shortens the discussion.

In case of HTML they simply tried to write HTML and the latest specification of HTML gives the guidelines for that. Just like with CSS.

> And maybe in the far future, if there is still some interest in my documents it will hopefully help the audience to understand some of my intentions, if I use proper semantical markup and languages with well defined versions, which can be indicated within a document.

Since we want to remain backwards compatible with old sites I think that should be no problem.

>> I think you misunderstand the specification. It is an error if a document is labeled ISO-8859-1 and does not conform to ISO-8859-1.
>> I said that above as well.
>
> It is noted:
> "When a user agent would otherwise use an encoding given in the first column of the following table to either convert content to Unicode characters or convert Unicode characters to bytes, it must instead use the encoding given in the cell in the second column of the same row."
>
> This indicates clearly, that the encoding information (not only the decoding) is changed by this rule.

This only applies to user agents, as is clearly stated. Since you seem to care primarily about document meaning and conformance checkers (which have to report mismatches), I do not see how you consider that to be a problem.

> Therefore within the current 'HTML5' draft it is equivalent to specify 'ISO-8859-1' or 'Windows-1252'.

No. The former gives a character encoding error in conformance checkers if you use characters from the latter. (The former is a subset of the latter. I realize you said to Julian that this is not the case, but the preferred IANA name for the standard you pointed out _is_ ISO-8859-1.)

> And there is no indication, that this only applies to the text/html variant of 'HTML5' and not to the application/xhtml+xml variant too (you claimed in a previous mail, that it only applies to text/html - where did you find that?)

Good point. That used to be the case. I think it got lost due to restructuring of sections. I filed a bug:

http://www.w3.org/Bugs/Public/show_bug.cgi?id=7215

>>> And as I already mentioned, I have seen years ago already browsers without this behaviour/bug, what was somehow interesting for documents with wrong encoding information and unmasked euro signs ;o)
>>
>> There's a reason those browsers are no longer around.
>
> I think, it was a Konqueror and a Mozilla-Suite on Debian, I don't remember exactly which versions, but not so old as several tutorials about HTML and CSS I found within the last years including sophisticated hints about related issues.
If you can find pointers that'd be cool.

--
Anne van Kesteren
http://annevankesteren.nl/
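
As a rough illustration of the ISO-8859-1 versus Windows-1252 point in the message above, here is a minimal Python sketch. It is not the specification's algorithm and not how any real conformance checker works: the names `ENCODING_OVERRIDES`, `decode_like_user_agent` and `c1_range_offsets` are invented for this example, the override table is reduced to the single row discussed in this thread, and Python's `cp1252` codec is only assumed to be a close enough stand-in for the Windows-1252 mapping (it leaves a few bytes undefined). The idea is that a user agent decodes with the overriding encoding, while a checker can still report bytes that are not sensible ISO-8859-1 text.

```python
# Hypothetical one-row excerpt of the override table discussed in the thread.
# "cp1252" is Python's codec name for Windows-1252.
ENCODING_OVERRIDES = {"iso-8859-1": "cp1252"}


def decode_like_user_agent(data: bytes, label: str) -> str:
    """Decode bytes the way a user agent applying the override would."""
    normalized = label.strip().lower()
    effective = ENCODING_OVERRIDES.get(normalized, normalized)
    return data.decode(effective)


def c1_range_offsets(data: bytes) -> list:
    """Return offsets of bytes in 0x80-0x9F.

    In ISO-8859-1 these bytes are C1 control characters; in Windows-1252
    most of them are printable (0x80 is the euro sign). A checker could
    treat them as a hint that the ISO-8859-1 label is wrong.
    """
    return [i for i, b in enumerate(data) if 0x80 <= b <= 0x9F]


sample = b"5 \x80"  # a byte string containing a Windows-1252 euro sign

as_user_agent = decode_like_user_agent(sample, "ISO-8859-1")
as_strict_latin1 = sample.decode("iso-8859-1")
suspicious = c1_range_offsets(sample)

print(repr(as_user_agent))     # decoded with the override applied
print(repr(as_strict_latin1))  # decoded strictly as ISO-8859-1
print(suspicious)              # offsets a checker might report
```

For content that really is ISO-8859-1 (no bytes in 0x80-0x9F), both decodings give the same text, which is the "subset" argument made in the thread.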