URL query component

by Anne van Kesteren-2

The URL query component for URLs found in HTML (exact set still to be
defined, I think) uses the page encoding, unless the page encoding is
utf-8 or utf-16, in which case it uses utf-8.

E.g. "?€" maps to "?%80" in a windows-1252 encoded page.
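
A minimal sketch of that mapping, assuming Python's codecs behave like the browser encoders for this character. The helper name encode_query is hypothetical and only illustrates the rule described above; it is not how any browser implements it.

```python
from urllib.parse import quote


def encode_query(query: str, page_encoding: str) -> str:
    """Percent-encode a query string using the page encoding (hypothetical helper)."""
    # Per the description above, utf-8 and utf-16 pages both end up using utf-8.
    if page_encoding.lower() in ("utf-8", "utf-16"):
        page_encoding = "utf-8"
    # safe="" percent-encodes every byte outside the unreserved set.
    return "?" + quote(query, safe="", encoding=page_encoding)


print(encode_query("€", "windows-1252"))  # ?%80
print(encode_query("€", "utf-8"))         # ?%E2%82%AC
print(encode_query("€", "utf-16"))        # ?%E2%82%AC
```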

Currently browsers differ in what happens when the code point cannot be  
encoded. E.g. "?€"

Opera uses "?". Internet Explorer uses "?" (but only once the URL hits the
network layer, not when you inspect it via script). WebKit uses "&#...;".
Gecko encodes it using utf-8.

What Gecko does makes the resulting data impossible to interpret.

What WebKit does is consistent with form submission. I like it.
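
The three outcomes can be sketched side by side. This is an illustration under assumptions, not browser code: the page encoding is taken to be iso-8859-2 (which has no "£"), the function and strategy names are hypothetical, and "?" is left unescaped on the assumption that the substituted question mark is sent literally.

```python
from urllib.parse import quote_from_bytes


def encode_unencodable(text: str, page_encoding: str, strategy: str) -> str:
    """Percent-encode text, handling unencodable code points per strategy."""
    if strategy == "question-mark":
        # Opera, and Internet Explorer at the network layer: substitute "?".
        raw = text.encode(page_encoding, errors="replace")
    elif strategy == "char-ref":
        # WebKit: substitute a decimal character reference "&#...;".
        raw = text.encode(page_encoding, errors="xmlcharrefreplace")
    elif strategy == "utf-8-fallback":
        # Gecko: emit utf-8 bytes for code points the page encoding lacks.
        parts = []
        for ch in text:
            try:
                parts.append(ch.encode(page_encoding))
            except UnicodeEncodeError:
                parts.append(ch.encode("utf-8"))
        raw = b"".join(parts)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return "?" + quote_from_bytes(raw, safe="?")


for name in ("question-mark", "char-ref", "utf-8-fallback"):
    print(name, encode_unencodable("£", "iso-8859-2", name))
# question-mark ??
# char-ref ?%26%23163%3B
# utf-8-fallback ?%C2%A3
```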


Also, given that encoding behavior is not exposed besides form submission  
and URLs, consistently using "&#...;" for code points not represented in  
legacy encodings makes sense to me. Am I missing something?


--
Anne van Kesteren
http://annevankesteren.nl/

Re: URL query component

by and-py

On 2012-04-20 09:15, Anne van Kesteren wrote:
> Currently browsers differ for what happens when the code point cannot be encoded.
> What Gecko does [?%C2%A3] makes the resulting data impossible to interpret.
> What WebKit does [?%26%23163%3B] is consistent with form submission. I like it.

I do not! It makes the data impossible to recover just as Gecko does...
in fact worse, because at least Gecko preserves ASCII. With the WebKit
behaviour it becomes impossible to determine from a pure ASCII string
'&#163;' whether the user really typed '&#163;' or '£' into the input field.
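
The ambiguity is easy to reproduce. In this sketch the page encoding iso-8859-2 and the helper name webkit_style_bytes are assumptions chosen for illustration; Python's xmlcharrefreplace error handler stands in for the "&#...;" behaviour.

```python
PAGE_ENCODING = "iso-8859-2"  # assumption: a legacy encoding without "£"


def webkit_style_bytes(text: str) -> bytes:
    """Encode text, replacing unencodable code points with "&#...;" (hypothetical helper)."""
    return text.encode(PAGE_ENCODING, errors="xmlcharrefreplace")


typed_character = "£"        # the user typed the pound sign
typed_reference = "&#163;"   # the user typed the six ASCII characters

# Both inputs produce identical bytes, so the receiver cannot tell them apart.
print(webkit_style_bytes(typed_character))   # b'&#163;'
print(webkit_style_bytes(typed_reference))   # b'&#163;'
```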

It has the advantage of consistency with the POST behaviour, but that
behaviour is an unpleasant legacy hack which encourages a
misunderstanding of HTML-escaping that promotes XSS vulns. I would not
like to see it spread any further than it already has.

cheers,

--
And Clover
mailto:and@...
http://www.doxdesk.com/
gtalk:chat?jid=bobince@...

Re: URL query component

by Anne van Kesteren-2

On Fri, 20 Apr 2012 14:37:10 +0200, And Clover <and-py@...> wrote:

> On 2012-04-20 09:15, Anne van Kesteren wrote:
>> Currently browsers differ for what happens when the code point cannot  
>> be encoded.
>> What Gecko does [?%C2%A3] makes the resulting data impossible to  
>> interpret.
>> What WebKit does [?%26%23163%3B] is consistent with form submission. I  
>> like it.
>
> I do not! It makes the data impossible to recover just as Gecko does...  
> in fact worse, because at least Gecko preserves ASCII. With the WebKit  
> behaviour it becomes impossible to determine from a pure ASCII string
> '&#163;' whether the user really typed '&#163;' or '£' into the input
> field.

You have the same problem with Gecko's behavior and multi-byte encodings.  
That's actually worse, since an erroneous three byte sequence will put the  
multi-byte decoders off.
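
A rough sketch of that failure mode, assuming a Shift_JIS page and a server that decodes the whole query as Shift_JIS. The helper name gecko_style_bytes is hypothetical; the exact characters the decoder produces are not the point, only that the original text does not come back.

```python
PAGE_ENCODING = "shift_jis"  # assumption: any multi-byte legacy encoding would do


def gecko_style_bytes(text: str) -> bytes:
    """Encode text, falling back to utf-8 bytes for unencodable code points."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(PAGE_ENCODING)
        except UnicodeEncodeError:
            out += ch.encode("utf-8")  # three bytes for "€": E2 82 AC
    return bytes(out)


original = "a€b"
wire = gecko_style_bytes(original)

# The server only knows the page encoding, so it decodes everything with it.
decoded = wire.decode(PAGE_ENCODING, errors="replace")

print(wire)                 # b'a\xe2\x82\xacb'
print(decoded == original)  # False: the utf-8 bytes were read as Shift_JIS
```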


> It has the advantage of consistency with the POST behaviour, but that  
> behaviour is an unpleasant legacy hack which encourages a  
> misunderstanding of HTML-escaping that promotes XSS vulns. I would not  
> like to see it spread any further than it already has.

It's both GET and POST. So really the only difference here is manually  
constructed URLs.

Also, I think we should flag all non-utf-8 usage. This is mostly about  
deciding behavior for legacy content, which will already be broken if it  
runs into this minor edge case.


--
Anne van Kesteren
http://annevankesteren.nl/

Re: URL query component

by Julian Reschke

On 2012-04-20 14:37, And Clover wrote:

> On 2012-04-20 09:15, Anne van Kesteren wrote:
>> Currently browsers differ for what happens when the code point cannot
>> be encoded.
>> What Gecko does [?%C2%A3] makes the resulting data impossible to
>> interpret.
>> What WebKit does [?%26%23163%3B] is consistent with form submission. I
>> like it.
>
> I do not! It makes the data impossible to recover just as Gecko does...
> in fact worse, because at least Gecko preserves ASCII. With the WebKit
> behaviour it becomes impossible to determine from a pure ASCII string
> '&#163;' whether the user really typed '&#163;' or '£' into the input
> field.
>
> It has the advantage of consistency with the POST behaviour, but that
> behaviour is an unpleasant legacy hack which encourages a
> misunderstanding of HTML-escaping that promotes XSS vulns. I would not
> like to see it spread any further than it already has.

+1

Indeed.

I think this is a case where you want to fail early (for some value of
"fail"); so maybe substituting with "?" makes most sense.

Do any servers *expect* the WebKit behavior? If they do so, why don't
they just fix the pages they serve to use UTF-8 to get consistent
behavior throughout?

Best regards, Julian

Re: URL query component

by Anne van Kesteren-2

On Fri, 20 Apr 2012 15:19:10 +0200, Julian Reschke <julian.reschke@...>  
wrote:
> I think this is a case where you want to fail early (for some value of  
> "fail"); so maybe substituting with "?" makes most sense.
>
> Do any servers *expect* the WebKit behavior? If they do so, why don't  
> they just fix the pages they serve to use UTF-8 to get consistent  
> behavior throughout?

Given that every browser does something different I doubt anyone expects
anything to work here. Note that this is an edge case: form submission,
both GET and POST, uses the "&#...;" pattern whenever an encoder error is
emitted. This is solely about URL query parameters appearing in the string
value of HTML attributes that take URLs. Given that it is such an edge
case, using the same encoder behavior seems nice as it means one code path
less.

Having said that, if there are other places where we expose the encoder
and where something other than "&#...;" or a fatal error is required, that
would be very interesting to know. I have not been able to think of
anything myself thus far.


--
Anne van Kesteren
http://annevankesteren.nl/