document.write("\r"): the spec doesn't say how to handle it.

by David Flanagan-2

The spec for document.write()
http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#dom-document-write 
says: "... have the tokenizer process the characters that were inserted,
one at a time, processing resulting tokens as they are emitted, and
stopping when the tokenizer reaches the insertion point..."

But what happens if the last character written by document.write() is a
carriage return?

The HTML parsing spec says that CR followed by LF is ignored but CR
followed by anything else is converted to LF.  So if the last character
is CR, then the tokenizer can't process all characters up to the
insertion point because it needs to look ahead at the next character, right?

Firefox, Chrome and Safari all seem to do the right thing: wait for the
next character before tokenizing the CR.  And I think this means that
the description of document.write needs to be changed.  (Opera, on the
other hand, just gets this wrong and emits a CR character).
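
A minimal Python sketch of the "wait for the next character" strategy
described above may make the problem concrete. The class and method names
are hypothetical and this is not any browser's code; it only shows that a
trailing CR has to be held back until more input arrives:

```python
class LookaheadNewlineNormalizer:
    """Hypothetical preprocessor that defers a CR until the next character is known."""

    def __init__(self):
        self.pending_cr = False  # a CR has been seen but nothing was emitted for it yet

    def feed(self, chunk: str) -> str:
        out = []
        for ch in chunk:
            if self.pending_cr:
                self.pending_cr = False
                out.append("\n")        # the held CR finally becomes an LF
                if ch == "\n":
                    continue            # CR LF collapses into that single LF
            if ch == "\r":
                self.pending_cr = True  # hold it: we cannot decide yet
            else:
                out.append(ch)
        return "".join(out)


normalizer = LookaheadNewlineNormalizer()
first = normalizer.feed("a\r")    # the CR is withheld, only "a" comes out
second = normalizer.feed("\nb")   # now the CR LF pair is resolved to one LF
```

With this approach the first call cannot process every character up to the
insertion point, which is exactly the conflict with the document.write()
text quoted above.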

Similarly, what should the tokenizer do if the document.write emits half
of a UTF-16 surrogate pair as the last character?

     David

Re: document.write("\r"): the spec doesn't say how to handle it.

by Henri Sivonen

On Thu, Nov 3, 2011 at 1:57 AM, David Flanagan <dflanagan@...> wrote:
> Firefox, Chrome and Safari all seem to do the right thing: wait for the next
> character before tokenizing the CR.

See http://software.hixie.ch/utilities/js/live-dom-viewer/saved/1247

Firefox tokenizes the CR immediately, emits an LF and then skips over
the next character if it is an LF. When I designed the solution
Firefox uses, I believed it was more correct and more compatible with
legacy than whatever the spec said at the time.

Chrome seems to wait for the next character before tokenizing the CR.

> And I think this means that the description of document.write needs to be changed.

All along, I've thought that having U+0000 and CRLF handling as a
stream preprocessing step was bogus and both should happen upon
tokenization. So far, I've managed to convince Hixie about U+0000
handling.

> Similarly, what should the tokenizer do if the document.write emits half of
> a UTF-16 surrogate pair as the last character?

The parser operates on UTF-16 code units, so a lone surrogate is emitted.

--
Henri Sivonen
hsivonen@...
http://hsivonen.iki.fi/

Re: document.write("\r"): the spec doesn't say how to handle it.

by David Flanagan-2

On 11/3/11 4:21 AM, Henri Sivonen wrote:
> On Thu, Nov 3, 2011 at 1:57 AM, David Flanagan<dflanagan@...>  wrote:
>> Firefox, Chrome and Safari all seem to do the right thing: wait for the next
>> character before tokenizing the CR.
> See http://software.hixie.ch/utilities/js/live-dom-viewer/saved/1247
I hadn't used the live dom viewer before.  That's really useful!

> Firefox tokenizes the CR immediately, emits an LF and then skips over
> the next character if it is an LF. When I designed the solution
> Firefox uses, I believed it was more correct and more compatible with
> legacy than whatever the spec said at the time.
I'm having a Duh! moment... I currently wait for the next character, but
what you describe also works, and allows the document.write() spec to
make sense.

> Chrome seems to wait for the next character before tokenizing the CR.
>
>> And I think this means that the description of document.write needs to be changed.
> All along, I've thought that having U+0000 and CRLF handling as a
> stream preprocessing step was bogus and both should happen upon
> tokenization. So far, I've managed to convince Hixie about U+0000
> handling.
Each tokenizer state would have to add a rule for CR that said "emit
LF, save the current tokenizer state, and set the tokenizer state to
the 'after CR' state".  Actually, tokenizer states that already have a rule
for LF or whitespace would have to integrate this CR rule into that
rule.  The new "after CR" state would then have two rules. On LF it would skip
the character and restore the saved state.  On anything else it would
push the character back and restore the saved state.
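
As a rough illustration of that idea, here is a toy tokenizer in Python
with only a data state and a tag state. Everything in it (the state names,
the token tuples, the simplified tag handling) is invented for the sketch
and is far smaller than the real HTML tokenizer:

```python
from enum import Enum, auto


class State(Enum):
    DATA = auto()
    TAG = auto()
    AFTER_CR = auto()


def tokenize(text: str) -> list[tuple[str, str]]:
    """Toy tokenizer: ("char", c) for text, ("tag", name) for <name>."""
    tokens = []
    state = State.DATA
    saved = State.DATA   # state to restore when leaving AFTER_CR
    tag = []
    i = 0
    while i < len(text):
        ch = text[i]
        if state is State.AFTER_CR:
            state = saved
            if ch == "\n":
                i += 1   # skip the LF half of a CR LF pair
            continue     # otherwise "push back": reprocess ch in the restored state
        if ch == "\r":
            # The CR rule each state carries: behave as if LF had been seen,
            # save the current state, and switch to the after-CR state.
            if state is State.DATA:
                tokens.append(("char", "\n"))
            # In TAG, LF counts as whitespace, which this toy simply drops.
            saved, state = state, State.AFTER_CR
        elif state is State.DATA:
            if ch == "<":
                state, tag = State.TAG, []
            else:
                tokens.append(("char", ch))
        else:  # State.TAG
            if ch == ">":
                tokens.append(("tag", "".join(tag)))
                state = State.DATA
            elif ch not in " \t\n":
                tag.append(ch)
        i += 1
    return tokens
```

Note that a CR at the very end of the input is fully processed here: the LF
is emitted right away and the tokenizer just stays in the after-CR state
until more input arrives.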

>> Similarly, what should the tokenizer do if the document.write emits half of
>> a UTF-16 surrogate pair as the last character?
> The parser operates on UTF-16 code units, so a lone surrogate is emitted.

The spec seems pretty unambiguous that it operates on codepoints (though
I implemented mine using 16-bit code units). §13.2.1: "The input to the
HTML parsing process consists of a stream of Unicode code points".  Also
§13.2.2.3 includes a list of codepoints beyond the BMP that are parse
errors.  And finally, the tests in
http://code.google.com/p/html5lib/source/browse/testdata/tokenizer/unicodeCharsProblematic.test 
require unpaired surrogates to be converted to the U+FFFD replacement
character.  (Though my experience is that modifying my tokenizer to pass
those tests causes other tests to fail, which makes me wonder whether
unpaired surrogates are only supposed to be replaced in some but not all
tokenizer states.)

Thanks, Henri!

     David

Re: document.write("\r"): the spec doesn't say how to handle it.

by Henri Sivonen

On Thu, Nov 3, 2011 at 8:13 PM, David Flanagan <dflanagan@...> wrote:
> Each tokenizer state would have to add a rule for CR that said "emit LF,
> save the current tokenizer state, and set the tokenizer state to the
> 'after CR' state".

The Validator.nu/Gecko tokenizer returns a "last input code unit
processed was CR" flag to the caller. If the tokenizer sees a CR, the
tokenizer processes it and returns to the caller immediately with the
flag set to true. The caller is responsible for checking if the next
input code unit is an LF, skipping over it and calling the tokenizer
again. This way, the tokenizer itself does not need to be able to skip
over a character, and the mechanisms that are already used for dealing
with arbitrary buffer boundaries and early returns after script end
tags (or timers, before the parser moved off the main thread) handle
this case too.
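
A small Python sketch of this division of labour, for readers who want to
see the control flow. The names tokenize_some and Driver are hypothetical
and do not correspond to the actual Validator.nu or Gecko API; the
"tokenizer" here just copies characters so that the CR handling stands out:

```python
def tokenize_some(buf: str, start: int, out: list[str]) -> tuple[int, bool]:
    """Hypothetical tokenizer entry point.

    Processes buf from index start, appending output to out. Returns the
    index of the next unprocessed character and a "last input code unit
    processed was CR" flag.
    """
    i = start
    while i < len(buf):
        ch = buf[i]
        i += 1
        if ch == "\r":
            out.append("\n")   # the CR is processed immediately as an LF
            return i, True     # early return so the caller can check for LF
        out.append(ch)
    return i, False


class Driver:
    """Hypothetical caller that owns the buffers and the CR flag."""

    def __init__(self):
        self.last_was_cr = False
        self.out: list[str] = []

    def write(self, chunk: str) -> None:
        i = 0
        while i < len(chunk):
            if self.last_was_cr:
                self.last_was_cr = False
                if chunk[i] == "\n":
                    i += 1     # the caller, not the tokenizer, skips the LF
                    continue
            i, self.last_was_cr = tokenize_some(chunk, i, self.out)
```

Because the flag lives in the caller, a CR at the end of one write() call
and an LF at the start of the next are handled the same way as a CR LF pair
inside a single buffer.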

>> The parser operates on UTF-16 code units, so a lone surrogate is emitted.
>
> The spec seems pretty unambiguous that it operates on codepoints

The spec is empirically wrong. The wrongness has been reported. The
spec tries to retrofit Unicode theoretical purity onto legacy where no
purity existed.

The tokenizer operates on UTF-16 code units. document.write() feeds
UTF-16 code units to the tokenizer without lone surrogate
preprocessing. Neither the tokenizer nor the tree builder does anything
about lone surrogates. When consuming a byte stream, the converter
that turns the UTF-16-encoded byte stream (potentially unaligned and
potentially in foreign byte order) into a stream of UTF-16 code units
is responsible for treating unpaired surrogates as conversion errors.
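
To illustrate the two input paths, here is a rough Python sketch that
models code units as integers. The function names are made up for the
example, and replacing an unpaired surrogate with U+FFFD is an assumption
about how a converter might report the conversion error; it is not the
Gecko code:

```python
REPLACEMENT = 0xFFFD  # assumed way of signalling a conversion error


def sanitize_code_units(units: list[int]) -> list[int]:
    """Replace unpaired surrogates; keep well-formed pairs untouched."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        next_is_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and next_is_low:
            out += [u, units[i + 1]]   # valid surrogate pair
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            out.append(REPLACEMENT)    # lone surrogate: conversion error
            i += 1
        else:
            out.append(u)
            i += 1
    return out


def from_byte_stream(data: bytes, little_endian: bool = True) -> list[int]:
    """Network path: the byte-to-code-unit converter checks surrogates."""
    order = "little" if little_endian else "big"
    # A trailing odd byte is ignored in this sketch.
    units = [int.from_bytes(data[i:i + 2], order) for i in range(0, len(data) - 1, 2)]
    return sanitize_code_units(units)


def from_document_write(units: list[int]) -> list[int]:
    """Script path: code units reach the tokenizer unmodified."""
    return list(units)
```

Under these assumptions a lone high surrogate arriving from the network is
replaced before tokenization, while the same code unit passed through
document.write() reaches the tokenizer as is.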

Sorry about not mentioning earlier that the problematic tests are also
problematic in this sense.

--
Henri Sivonen
hsivonen@...
http://hsivonen.iki.fi/

Re: document.write("\r"): the spec doesn't say how to handle it.

by Ian Hickson

On Wed, 2 Nov 2011, David Flanagan wrote:

>
> The spec for document.write()
> http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#dom-document-write
> says: "... have the tokenizer process the characters that were inserted, one
> at a time, processing resulting tokens as they are emitted, and stopping when
> the tokenizer reaches the insertion point..."
>
> But what happens if the last character written by document.write() is a
> carriage return?
>
> The HTML parsing spec says that CR followed by LF is ignored but CR
> followed by anything else is converted to LF.  So if the last character
> is CR, then the tokenizer can't process all characters up to the
> insertion point because it needs to lookahead at the next character,
> right?

I can remove the text "one at a time", if you like. Would that be
satisfactory? Or I guess I could change the spec to say that the parser
should process the characters, rather than the tokenizer, since really
it's the whole shebang that needs to be involved (stream preprocessor and
everything). Any opinions on what the right text is here?


> Similarly, what should the tokenizer do if the document.write emits half
> of a UTF-16 surrogate pair as the last character?

Can you elaborate on what difficulty this would present?

--
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Re: document.write("\r"): the spec doesn't say how to handle it.

by Henri Sivonen

On Wed, Dec 14, 2011 at 2:00 AM, Ian Hickson <ian@...> wrote:
> I can remove the text "one at a time", if you like. Would that be
> satisfactory? Or I guess I could change the spec to say that the parser
> should process the characters, rather than the tokenizer, since really
> it's the whole shebang that needs to be involved (stream preprocessor and
> everything). Any opinions on what the right text is here?

I'd like the CRLF preprocessing to be defined as an eager stateful
operation so that there's one bit of state: "last was CR". Then, input
is handled as follows:
If the input character is CR, set "last was CR" to true and emit LF.
If the input character is LF and "last was CR" is true, don't emit
anything and set "last was CR" to false.
If the input character is LF and "last was CR" is false, emit LF.
Else set "last was CR" to false and emit the input character.

Where "emit" feeds into the tokenizer. By "eager", I mean that the
operation described above doesn't buffer. I.e. the first case emits an
LF upon seeing a CR without waiting for an LF also to appear in the
input.
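
The four rules above translate almost line for line into code. The
following Python sketch uses a hypothetical class name and returns the
emitted characters as a string instead of feeding a real tokenizer:

```python
class EagerNewlineNormalizer:
    """Eager CRLF preprocessing with a single bit of state."""

    def __init__(self):
        self.last_was_cr = False

    def feed(self, chunk: str) -> str:
        out = []
        for ch in chunk:
            if ch == "\r":
                self.last_was_cr = True
                out.append("\n")          # emit LF right away, no buffering
            elif ch == "\n":
                if not self.last_was_cr:
                    out.append("\n")      # a plain LF passes through
                self.last_was_cr = False  # an LF after CR is swallowed
            else:
                self.last_was_cr = False
                out.append(ch)
        return "".join(out)


normalizer = EagerNewlineNormalizer()
first = normalizer.feed("a\r")    # "a" plus an LF are emitted immediately
second = normalizer.feed("\nb")   # the LF is dropped, only "b" is emitted
```

Every call consumes its whole input, so nothing is left pending at the
insertion point even when the input ends in a CR.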

--
Henri Sivonen
hsivonen@...
http://hsivonen.iki.fi/

Re: document.write("\r"): the spec doesn't say how to handle it.

by Ian Hickson

On Mon, 19 Dec 2011, Henri Sivonen wrote:

> On Wed, Dec 14, 2011 at 2:00 AM, Ian Hickson <ian@...> wrote:
> > I can remove the text "one at a time", if you like. Would that be
> > satisfactory? Or I guess I could change the spec to say that the
> > parser should process the characters, rather than the tokenizer, since
> > really it's the whole shebang that needs to be involved (stream
> > preprocessor and everything). Any opinions on what the right text is
> > here?
>
> I'd like the CRLF preprocessing to be defined as an eager stateful
> operation so that there's one bit of state: "last was CR". Then, input
> is handled as follows:
> If the input character is CR, set "last was CR" to true and emit LF.
> If the input character is LF and "last was CR" is true, don't emit
> anything and set "last was CR" to false.
> If the input character is LF and "last was CR" is false, emit LF.
> Else set "last was CR" to false and emit the input character.
I've done something like this (but simpler to spec).

I've also made the second change I suggested above.


On Thu, 3 Nov 2011, David Flanagan wrote:

>
> The spec seems pretty unambiguous that it operates on codepoints (though
> I implemented mine using 16-bit code units). §13.2.1: "The input to
> the HTML parsing process consists of a stream of Unicode code points".  
> Also §13.2.2.3 includes a list of codepoints beyond the BMP that are
> parse errors.  And finally, the tests in
> http://code.google.com/p/html5lib/source/browse/testdata/tokenizer/unicodeCharsProblematic.test 
> require unpaired surrogates to be converted to the U+FFFD replacement
> character.  (Though my experience is that modifying my tokenizer to pass
> those tests causes other tests to fail, which makes me wonder whether
> unpaired surrogates are only supposed to be replaced in some but not all
> tokenizer states)
This has changed a bit. In particular, "Unicode code point" is currently
defined in a way that is (in theory) black-box indistinguishable from
UTF-16 handling, but without making "astral characters" into second-class
citizens.

--
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'