document.write("\r"): the spec doesn't say how to handle it.

The spec for document.write()
http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#dom-document-write
says: "... have the tokenizer process the characters that were inserted, one at a time, processing resulting tokens as they are emitted, and stopping when the tokenizer reaches the insertion point..."

But what happens if the last character written by document.write() is a carriage return?

The HTML parsing spec says that CR followed by LF is ignored but CR followed by anything else is converted to LF. So if the last character is CR, then the tokenizer can't process all characters up to the insertion point because it needs to lookahead at the next character, right?

Firefox, Chrome and Safari all seem to do the right thing: wait for the next character before tokenizing the CR. And I think this means that the description of document.write needs to be changed. (Opera, on the other hand, just gets this wrong and emits a CR character).

Similarly, what should the tokenizer do if the document.write emits half of a UTF-16 surrogate pair as the last character?

David

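Editorial sketch of the "wait for the next character" behavior described in the message above. This is only an illustration, not code from any browser: the class name `WaitingNewlineNormalizer` and its `write` method are hypothetical. It shows why a trailing CR cannot be handed to the tokenizer right away under this approach, so not every written character is processed before the insertion point is reached.

```python
class WaitingNewlineNormalizer:
    """Hypothetical preprocessor that holds back a trailing CR until more input arrives."""

    def __init__(self):
        self.pending_cr = False  # True while a CR is waiting for its lookahead character

    def write(self, text):
        """Return the characters that can be passed to the tokenizer right now."""
        if self.pending_cr:
            # Re-examine the held-back CR together with the new input.
            text = "\r" + text
            self.pending_cr = False
        out = []
        i = 0
        while i < len(text):
            ch = text[i]
            if ch == "\r":
                if i + 1 == len(text):
                    # CR is the last character: no lookahead available, so keep waiting.
                    self.pending_cr = True
                    break
                out.append("\n")
                if text[i + 1] == "\n":
                    i += 1  # CR LF collapses into the single LF already appended
            else:
                out.append(ch)
            i += 1
        return "".join(out)
```

With this sketch, writing "a\r" yields only "a", and the line break appears once a later write supplies the next character.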
Re: document.write("\r"): the spec doesn't say how to handle it.

On Thu, Nov 3, 2011 at 1:57 AM, David Flanagan <dflanagan@...> wrote:
> Firefox, Chrome and Safari all seem to do the right thing: wait for the next
> character before tokenizing the CR.

See http://software.hixie.ch/utilities/js/live-dom-viewer/saved/1247

Firefox tokenizes the CR immediately, emits an LF and then skips over the next character if it is an LF. When I designed the solution Firefox uses, I believed it was more correct and more compatible with legacy than whatever the spec said at the time.

Chrome seems to wait for the next character before tokenizing the CR.

> And I think this means that the description of document.write needs to be changed.

All along, I've felt that having U+0000 and CRLF handling as a stream preprocessing step was bogus and both should happen upon tokenization. So far, I've managed to convince Hixie about U+0000 handling.

> Similarly, what should the tokenizer do if the document.write emits half of
> a UTF-16 surrogate pair as the last character?

The parser operates on UTF-16 code units, so a lone surrogate is emitted.

--
Henri Sivonen
hsivonen@...
http://hsivonen.iki.fi/

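Editorial sketch of the surrogate point made above. It assumes a tokenizer that consumes UTF-16 code units without any pairing check. The names `to_utf16_code_units`, `is_surrogate` and `CodeUnitSink` are hypothetical and only stand in for a real implementation. The astral character U+1047E is split across two writes, so the first write ends in a lone high surrogate that is passed through unchanged.

```python
def to_utf16_code_units(text):
    """Encode a string as a list of 16-bit UTF-16 code units."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))    # high (lead) surrogate
            units.append(0xDC00 + (cp & 0x3FF))  # low (trail) surrogate
        else:
            units.append(cp)
    return units


def is_surrogate(unit):
    """True for any code unit in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= unit <= 0xDFFF


class CodeUnitSink:
    """Stand-in for a tokenizer that accepts code units as they arrive."""

    def __init__(self):
        self.emitted = []

    def write(self, units):
        # No lone-surrogate preprocessing: every code unit is emitted as-is.
        self.emitted.extend(units)


units = to_utf16_code_units("\U0001047E")
sink = CodeUnitSink()
sink.write(units[:1])                  # first write ends with half of the pair
after_first_write = list(sink.emitted)
sink.write(units[1:])                  # second write supplies the other half
```

After both writes the emitted sequence is the complete surrogate pair again, which is why a code-unit-based tokenizer has no need to wait or substitute anything at the write boundary.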
Re: document.write("\r"): the spec doesn't say how to handle it.

On 11/3/11 4:21 AM, Henri Sivonen wrote:
> On Thu, Nov 3, 2011 at 1:57 AM, David Flanagan <dflanagan@...> wrote:
>> Firefox, Chrome and Safari all seem to do the right thing: wait for the next
>> character before tokenizing the CR.
> See http://software.hixie.ch/utilities/js/live-dom-viewer/saved/1247

I hadn't used the live dom viewer before. That's really useful!

> Firefox tokenizes the CR immediately, emits an LF and then skips over
> the next character if it is an LF. When I designed the solution
> Firefox uses, I believed it was more correct and more compatible with
> legacy than whatever the spec said at the time.

I'm having a Duh! moment... I currently wait for the next character, but what you describe also works, and allows the document.write() spec to make sense.

> Chrome seems to wait for the next character before tokenizing the CR.
>
>> And I think this means that the description of document.write needs to be changed.
> All along, I've felt that having U+0000 and CRLF handling as a
> stream preprocessing step was bogus and both should happen upon
> tokenization. So far, I've managed to convince Hixie about U+0000
> handling.

Each tokenizer state would have to add a rule for CR that said "emit LF, save the current tokenizer state, and set the tokenizer state to 'after CR state'". Actually, tokenizer states that already have a rule for LF or whitespace would have to integrate this CR rule into that rule. The new after CR state would have two rules. On LF it would skip the character and restore the saved state. On anything else it would push the character back and restore the saved state.

>> Similarly, what should the tokenizer do if the document.write emits half of
>> a UTF-16 surrogate pair as the last character?
> The parser operates on UTF-16 code units, so a lone surrogate is emitted.

The spec seems pretty unambiguous that it operates on codepoints (though I implemented mine using 16-bit code units). §13.2.1: "The input to the HTML parsing process consists of a stream of Unicode code points". Also §13.2.2.3 includes a list of codepoints beyond the BMP that are parse errors. And finally, the tests in http://code.google.com/p/html5lib/source/browse/testdata/tokenizer/unicodeCharsProblematic.test require unpaired surrogates to be converted to the U+FFFD replacement character. (Though my experience is that modifying my tokenizer to pass those tests causes other tests to fail, which makes me wonder whether unpaired surrogates are only supposed to be replaced in some but not all tokenizer states)

Thanks, Henri!

David

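Editorial sketch of the "after CR state" idea from the message above, reduced to a toy tokenizer with a single ordinary state. The names `TinyTokenizer`, `DATA` and `AFTER_CR` are hypothetical. A real HTML tokenizer has many states, each of which would need the CR rule. "Pushing the character back" is modeled here by not advancing the input index.

```python
DATA = "data"
AFTER_CR = "after CR"


class TinyTokenizer:
    """Toy tokenizer: one ordinary state plus the proposed after-CR state."""

    def __init__(self):
        self.state = DATA
        self.saved_state = None
        self.emitted = []

    def feed(self, text):
        i = 0
        while i < len(text):
            ch = text[i]
            if self.state == AFTER_CR:
                # Both after-CR rules restore the saved state first.
                self.state = self.saved_state
                self.saved_state = None
                if ch == "\n":
                    i += 1  # rule 1: skip the LF that completes CR LF
                # rule 2: anything else is pushed back (index not advanced)
                continue
            if ch == "\r":
                # CR rule: emit LF at once, remember where we were, await next char.
                self.emitted.append("\n")
                self.saved_state = self.state
                self.state = AFTER_CR
            else:
                self.emitted.append(ch)
            i += 1
```

Because the LF is emitted as soon as the CR is seen, a write that ends in CR is fully processed, and the pending decision lives in the tokenizer state instead of in unconsumed input.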
Re: document.write("\r"): the spec doesn't say how to handle it.

On Thu, Nov 3, 2011 at 8:13 PM, David Flanagan <dflanagan@...> wrote:
> Each tokenizer state would have to add a rule for CR that said "emit LF,
> save the current tokenizer state, and set the tokenizer state to 'after CR
> state'".

The Validator.nu/Gecko tokenizer returns a "last input code unit processed was CR" flag to the caller. If the tokenizer sees a CR, the tokenizer processes it and returns to the caller immediately with the flag set to true. The caller is responsible for checking if the next input code unit is an LF, skipping over it and calling the tokenizer again. This way, the tokenizer itself does not need to have the capability of skipping over a character, and the same capabilities that are normally used for dealing with arbitrary buffer boundaries and early returns after script end tags (or timers before the parser moved off the main thread) work here too.

>> The parser operates on UTF-16 code units, so a lone surrogate is emitted.
>
> The spec seems pretty unambiguous that it operates on codepoints

The spec is empirically wrong. The wrongness has been reported. The spec tries to retrofit Unicode theoretical purity onto legacy where no purity existed.

The tokenizer operates on UTF-16 code units. document.write() feeds UTF-16 code units to the tokenizer without lone surrogate preprocessing. Neither the tokenizer nor the tree builder does anything about lone surrogates. When consuming a byte stream, the converter that converts the (potentially unaligned and potentially foreign byte-order) UTF-16-encoded byte stream into a stream of UTF-16 code units is responsible for treating unpaired surrogates as conversion errors.

Sorry about not mentioning earlier that the problematic tests are also problematic in this sense.

--
Henri Sivonen
hsivonen@...
http://hsivonen.iki.fi/

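Editorial sketch of the caller-driven approach described above, in which the tokenizer returns early after a CR and reports a flag, and the caller skips a following LF. This is not the actual Validator.nu or Gecko code. `tokenize_chunk` and `Driver` are hypothetical names, and the "tokenizer" here just copies characters.

```python
def tokenize_chunk(buffer, start, emitted):
    """Process buffer[start:] and return (next_index, last_was_cr).

    On CR the function emits LF and returns immediately with the flag set,
    so it never has to skip a character itself.
    """
    i = start
    while i < len(buffer):
        ch = buffer[i]
        i += 1
        if ch == "\r":
            emitted.append("\n")
            return i, True  # early return: caller decides about a following LF
        emitted.append(ch)
    return i, False


class Driver:
    """Caller that owns the buffers and remembers the flag across writes."""

    def __init__(self):
        self.emitted = []
        self.last_was_cr = False

    def write(self, buffer):
        i = 0
        while i < len(buffer):
            if self.last_was_cr:
                self.last_was_cr = False
                if buffer[i] == "\n":
                    i += 1  # skip the LF of a CR LF pair, even across writes
                    continue
            i, self.last_was_cr = tokenize_chunk(buffer, i, self.emitted)
```

Keeping the flag in the caller means a CR at the end of one buffer and an LF at the start of the next are handled by the same path as a CR LF inside a single buffer.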
Re: document.write("\r"): the spec doesn't say how to handle it.

On Wed, 2 Nov 2011, David Flanagan wrote:
>
> The spec for document.write()
> http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#dom-document-write
> says: "... have the tokenizer process the characters that were inserted, one
> at a time, processing resulting tokens as they are emitted, and stopping when
> the tokenizer reaches the insertion point..."
>
> But what happens if the last character written by document.write() is a
> carriage return?
>
> The HTML parsing spec says that CR followed by LF is ignored but CR
> followed by anything else is converted to LF. So if the last character
> is CR, then the tokenizer can't process all characters up to the
> insertion point because it needs to lookahead at the next character,
> right?

I can remove the text "one at a time", if you like. Would that be satisfactory? Or I guess I could change the spec to say that the parser should process the characters, rather than the tokenizer, since really it's the whole shebang that needs to be involved (stream preprocessor and everything). Any opinions on what the right text is here?

> Similarly, what should the tokenizer do if the document.write emits half
> of a UTF-16 surrogate pair as the last character?

Can you elaborate on what difficulty this would present?

--
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Re: document.write("\r"): the spec doesn't say how to handle it.

On Wed, Dec 14, 2011 at 2:00 AM, Ian Hickson <ian@...> wrote:
> I can remove the text "one at a time", if you like. Would that be
> satisfactory? Or I guess I could change the spec to say that the parser
> should process the characters, rather than the tokenizer, since really
> it's the whole shebang that needs to be involved (stream preprocessor and
> everything). Any opinions on what the right text is here?

I'd like the CRLF preprocessing to be defined as an eager stateful operation so that there's one bit of state: "last was CR". Then, input is handled as follows:

If the input character is CR, set "last was CR" to true and emit LF.
If the input character is LF and "last was CR" is true, don't emit anything and set "last was CR" to false.
If the input character is LF and "last was CR" is false, emit LF.
Else set "last was CR" to false and emit the input character.

Where "emit" feeds into the tokenizer.

By "eager", I mean that the operation described above doesn't buffer. I.e. the first case emits an LF upon seeing a CR without waiting for an LF also to appear in the input.

--
Henri Sivonen
hsivonen@...
http://hsivonen.iki.fi/

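Editorial sketch of the four rules listed in the message above, written as a direct transcription with one bit of state. The names `EagerNewlinePreprocessor` and `normalize` are hypothetical, and the `emit` callback stands in for "feeds into the tokenizer".

```python
class EagerNewlinePreprocessor:
    """One-bit, non-buffering CR/LF normalizer following the four rules above."""

    def __init__(self, emit):
        self.emit = emit          # callback standing in for the tokenizer input
        self.last_was_cr = False  # the single bit of state

    def feed(self, ch):
        if ch == "\r":
            # Rule 1: emit LF immediately, without waiting for a possible LF.
            self.last_was_cr = True
            self.emit("\n")
        elif ch == "\n":
            if self.last_was_cr:
                # Rule 2: this LF completes a CR LF pair, so emit nothing.
                self.last_was_cr = False
            else:
                # Rule 3: a plain LF passes through.
                self.emit("\n")
        else:
            # Rule 4: any other character clears the bit and passes through.
            self.last_was_cr = False
            self.emit(ch)


def normalize(chunks):
    """Feed several writes through one preprocessor and collect the output."""
    out = []
    pre = EagerNewlinePreprocessor(out.append)
    for chunk in chunks:
        for ch in chunk:
            pre.feed(ch)
    return "".join(out)
```

For example, `normalize(["a\r", "\nb"])` gives the same result as `normalize(["a\r\nb"])`, so splitting a CR LF pair across two writes makes no difference, and a write ending in CR is fully consumed.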
Re: document.write("\r"): the spec doesn't say how to handle it.

On Mon, 19 Dec 2011, Henri Sivonen wrote:
> On Wed, Dec 14, 2011 at 2:00 AM, Ian Hickson <ian@...> wrote:
> > I can remove the text "one at a time", if you like. Would that be
> > satisfactory? Or I guess I could change the spec to say that the
> > parser should process the characters, rather than the tokenizer, since
> > really it's the whole shebang that needs to be involved (stream
> > preprocessor and everything). Any opinions on what the right text is
> > here?
>
> I'd like the CRLF preprocessing to be defined as an eager stateful
> operation so that there's one bit of state: "last was CR". Then, input
> is handled as follows:
> If the input character is CR, set "last was CR" to true and emit LF.
> If the input character is LF and "last was CR" is true, don't emit
> anything and set "last was CR" to false.
> If the input character is LF and "last was CR" is false, emit LF.
> Else set "last was CR" to false and emit the input character.

I've also done the second change I suggest above.

On Thu, 3 Nov 2011, David Flanagan wrote:
>
> The spec seems pretty unambiguous that it operates on codepoints (though
> I implemented mine using 16-bit code units). §13.2.1: "The input to
> the HTML parsing process consists of a stream of Unicode code points".
> Also §13.2.2.3 includes a list of codepoints beyond the BMP that are
> parse errors. And finally, the tests in
> http://code.google.com/p/html5lib/source/browse/testdata/tokenizer/unicodeCharsProblematic.test
> require unpaired surrogates to be converted to the U+FFFD replacement
> character. (Though my experience is that modifying my tokenizer to pass
> those tests causes other tests to fail, which makes me wonder whether
> unpaired surrogates are only supposed to be replaced in some but not all
> tokenizer states)

defined in a way that is (in theory) black-box indistinguishable from UTF-16 handling, but without making "astral characters" into second-class citizens.

--
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'