multipart/form-data filename encoding: unicode and special characters


multipart/form-data filename encoding: unicode and special characters

by Evan Jones-5

I am not an experienced web standards wonk, so please forgive me if I'm making a mistake here.

When uploading files that contain special characters in their name, it appears to me that it is unspecified as to how those file names should be escaped. As a result, Webkit/Safari/Chrome appear to handle these filenames in one way, while Firefox handles them in another. I'm implementing the server side of this equation, and it is unclear to me what I should be doing. Am I missing something? Webkit even has a bug on this issue that states "I suggest working with WHATWG or HTML WG to get something specified in HTML5, and getting browsers converge on that." Is anyone working on this?


EXAMPLE

Create a file named: bàz'\"hi%22.txt  eg. using the unix command: touch bàz\'\\\"hi%22.txt


Firefox (13.0 beta on Mac) sends the following header, backslash escaping the double quote but not escaping the backslash.

Content-Disposition: form-data; name="somefile"; filename="bàz'\\"hi%22.txt"


Webkit (latest nightly r115711 on Mac): %-escapes the double quote, but does nothing to the literal %

Content-Disposition: form-data; name="somefile"; filename="bàz'\%22hi%22.txt"


THE SPECS: HTML5 states:

http://www.whatwg.org/specs/web-apps/current-work/multipage/association-of-controls-and-forms.html#multipart-form-data

Encode the (now mutated) form data set using the rules described by RFC 2388. […] File names […] must use the character encoding selected above, though the precise name may be approximated if necessary (e.g. […]). User agents must not use the RFC 2231 encoding suggested by RFC 2388.


… this seems contradictory: Encode using RFC 2388, but do not use the encoding suggested by the RFC. Worse, no browser actually follows the RFC (e.g. they all use UTF-8 encoded parameter values), so that doesn't seem like the right answer. Is there a way out of this mess?

Evan

--
http://evanjones.ca/


Re: multipart/form-data filename encoding: unicode and special characters

by Ashley Sheridan-3

On Tue, 2012-05-01 at 21:12 -0400, Evan Jones wrote:

> I am not an experienced web standards wonk, so please forgive me if I'm making a mistake here.
>
> When uploading files that contain special characters in their name, it appears to me that it is unspecified as to how those file names should be escaped. As a result, Webkit/Safari/Chrome appear to handle these filenames in one way, while Firefox handles them in another. I'm implementing the server side of this equation, and it is unclear to me what I should be doing. Am I missing something? Webkit even has a bug on this issue that states "I suggest working with WHATWG or HTML WG to get something specified in HTML5, and getting browsers converge on that." Is anyone working on this?
>
>
> EXAMPLE
>
> Create a file named: bàz'\"hi%22.txt  eg. using the unix command: touch bàz\'\\\"hi%22.txt
>
>
> Firefox (13.0 beta on Mac) sends the following header, backslash escaping the double quote but not escaping the backslash.
>
> Content-Disposition: form-data; name="somefile"; filename="bàz'\\"hi%22.txt"
>
>
> Webkit (latest nightly r115711 on Mac): %-escapes the double quote, but does nothing to the literal %
>
> Content-Disposition: form-data; name="somefile"; filename="bàz'\%22hi%22.txt"
>
>
> THE SPECS: HTML5 states:
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/association-of-controls-and-forms.html#multipart-form-data
>
> Encode the (now mutated) form data set using the rules described by RFC 2388. […] File names […] must use the character encoding selected above, though the precise name may be approximated if necessary (e.g. […]). User agents must not use the RFC 2231 encoding suggested by RFC 2388.
>
>
> … this seems contradictory: Encode using RFC 2388, but do not use the encoding suggested by the RFC. Worse, no browser actually follows the RFC (e.g. they all use UTF-8 encoded parameter values), so that doesn't seem like the right answer. Is there a way out of this mess?
>
> Evan
>
> --
> http://evanjones.ca/
>


Although this test case does show an issue, I would question what real
problem it causes. It uses many characters that are considered unsafe
on the file systems of the most popular operating system (Windows, with
either NTFS or FAT32), so users of other operating systems are probably
(even unconsciously) avoiding those characters too, purely for
interoperability reasons.

The Webkit method looks the better of the two with regards to how
server-side languages might interpret it, but it would need work to
ensure everything that should be escaped is, and that everything that is
unescaped on the server should be and is done so correctly.

--
Thanks,
Ash
http://www.ashleysheridan.co.uk



Re: multipart/form-data filename encoding: unicode and special characters

by Evan Jones-5

On May 1, 2012, at 22:38 , Ashley Sheridan wrote:
> The Webkit method looks the better of the two with regards to how
> server-side languages might interpret it, but it would need work to
> ensure everything that should be escaped is, and that everything that is
> unescaped on the server should be and is done so correctly.

The problem is that currently I am unable to correctly "round trip" an uploaded file name. I would like users to upload a file, and be able to later download the file with the *exact same* file name. If you follow the specifications, this is not possible. Firefox is closer to the MIME RFCs (which specify backslash quoting in quoted-strings), but apparently that will break IE6, 7, and 8:

https://bugs.webkit.org/show_bug.cgi?id=62107
http://java.net/jira/browse/JERSEY-759

Webkit's %-escaping behaviour is *not* part of the referenced MIME RFCs (which specify either backslash quoting in quoted-strings, base64 encoding, or %-escaping in special "filename*=" arguments). Thus, if this is the "right answer," it should be specified somewhere. I'm assuming that this needs to be in the HTML5 spec, since HTTP calls this the "body" of the POST and declares that it is outside the HTTP specification.

Webkit's escaping is also flawed (see bug 62107 above). Files whose names contain %-escapes (e.g. my%22file.txt, admittedly very rare) will get mangled, because there is no difference between my%22file.txt and my"file.txt once they are escaped.
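
To make the collision concrete, here is a minimal Python sketch. The helper name webkit_style_escape is hypothetical; it only mimics the behaviour observed in the example header earlier in this thread (the double quote becomes %22, a literal % is left alone) and is not WebKit's actual code.

```python
def webkit_style_escape(filename: str) -> str:
    """Mimic the observed behaviour: only '"' is rewritten, as %22.

    Hypothetical helper for illustration; not taken from WebKit.
    """
    return filename.replace('"', "%22")


# Two different names on disk...
name_with_quote = 'my"file.txt'
name_with_percent = "my%22file.txt"

# ...produce identical header values, so a server cannot tell them apart.
escaped_a = webkit_style_escape(name_with_quote)
escaped_b = webkit_style_escape(name_with_percent)
print(escaped_a == escaped_b)
```

Because the escaping is not injective, no unescaping rule on the server can recover both original names.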

Currently, I need to detect the browser in order to figure out what kind of unescaping to apply to the file name, and even then in some cases I can't figure out what the right file name is. Webkit claims this is a specification bug, so I'm hoping someone here might tell me if this is the case, and if so where can I file bugs, create test cases, etc?
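
The Firefox-style output has a similar round-trip problem when it is read by a parser that follows the quoted-string rules strictly. The sketch below assumes the behaviour seen in the example header (the double quote is backslash-escaped, the backslash itself is not); both function names are hypothetical and neither is taken from any browser or server.

```python
def firefox_style_escape(filename: str) -> str:
    """Mimic the observed behaviour: '"' becomes '\\"', '\\' is untouched.

    Hypothetical helper for illustration; not Firefox's actual code.
    """
    return filename.replace('"', '\\"')


def strict_quoted_string_decode(value: str) -> str:
    """Decode the inside of a quoted-string the way the MIME RFCs describe:
    a backslash quotes the next character, a bare '"' ends the string."""
    out = []
    chars = iter(value)
    for ch in chars:
        if ch == "\\":
            out.append(next(chars, ""))  # take the quoted character literally
        elif ch == '"':
            break  # unescaped quote terminates the parameter value
        else:
            out.append(ch)
    return "".join(out)


original = "bàz'\\\"hi%22.txt"           # the file name from the example
sent = firefox_style_escape(original)     # what ends up between the quotes
recovered = strict_quoted_string_decode(sent)
print(recovered == original)
```

The literal backslash followed by the escaped quote reads as an escaped backslash followed by a closing quote, so the strict decoder stops early and the original name is lost.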

Evan

--
http://evanjones.ca/


Re: multipart/form-data filename encoding: unicode and special characters

by Julian Reschke

On 2012-05-02 13:05, Evan Jones wrote:

> On May 1, 2012, at 22:38 , Ashley Sheridan wrote:
>> The Webkit method looks the better of the two with regards to how
>> server-side languages might interpret it, but it would need work to
>> ensure everything that should be escaped is, and that everything that is
>> unescaped on the server should be and is done so correctly.
>
> The problem is that currently I am unable to correctly "round trip" an uploaded file name. I would like users to upload a file, and be able to later download the file with the *exact same* file name. If you follow the specifications, this is not possible. Firefox is closer to the MIME RFCs (which specify backslash quoting in quoted-strings), but apparently that will break IE6, 7, and 8:
>
> https://bugs.webkit.org/show_bug.cgi?id=62107
> http://java.net/jira/browse/JERSEY-759
>
> Webkit's %-escaping behaviour is *not* part of the referenced MIME RFCs (which specify either backslash quoting in quoted-strings, base64 encoding, or %-escaping in special "filename*=" arguments). Thus, if this is the "right answer," it should be specified somewhere. I'm assuming that this needs to be in the HTML5 spec, since HTTP calls this the "body" of the POST and declares that it is outside the HTTP specification.
>
> Webkit's escaping is also flawed (see bug 62107 above). Files whose names contain %-escapes (e.g. my%22file.txt, admittedly very rare) will get mangled, because there is no difference between my%22file.txt and my"file.txt once they are escaped.
>
> Currently, I need to detect the browser in order to figure out what kind of unescaping to apply to the file name, and even then in some cases I can't figure out what the right file name is. Webkit claims this is a specification bug, so I'm hoping someone here might tell me if this is the case, and if so where can I file bugs, create test cases, etc?
>
> Evan
>
> --
> http://evanjones.ca/

I did spend a considerable amount of time with Content-Disposition, the
*response* header field (resulting in RFC 6266 and
<http://greenbytes.de/tech/tc2231/>).

However, this has little to do with the representation in form uploads.
If browser implementers want to try something new that will not affect
the old code paths, supporting the encoding defined in RFC 5987 might be
the right thing to do (yes, it's ugly, but it's unambiguous).
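
As a rough illustration of what that encoding looks like for the file name from the start of this thread, here is a minimal Python sketch. The function names and the ATTR_CHAR_EXTRA constant are hypothetical; the sketch assumes UTF-8 and an empty language tag, and relies only on urllib.parse.quote and unquote from the standard library.

```python
from urllib.parse import quote, unquote

# Characters allowed unescaped by RFC 5987 (attr-char) in addition to the
# ones quote() never escapes (letters, digits, "_", ".", "-", "~").
ATTR_CHAR_EXTRA = "!#$&+^`|"


def encode_ext_value(filename: str) -> str:
    """Build an RFC 5987 ext-value: charset, empty language, %-encoded bytes."""
    return "UTF-8''" + quote(filename, safe=ATTR_CHAR_EXTRA, encoding="utf-8")


def decode_ext_value(ext_value: str) -> str:
    """Reverse encode_ext_value(); the language tag is parsed and ignored."""
    charset, _language, pct_encoded = ext_value.split("'", 2)
    return unquote(pct_encoded, encoding=charset)


name = "bàz'\\\"hi%22.txt"
ext = encode_ext_value(name)
print(ext)  # UTF-8''b%C3%A0z%27%5C%22hi%2522.txt
print(decode_ext_value(ext) == name)  # True
```

Since the literal % is itself escaped as %25, the name survives the round trip, which is the property the existing browser behaviours lack.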

Best regards, Julian

Re: multipart/form-data filename encoding: unicode and special characters

by Evan Jones-5

On May 2, 2012, at 7:43 , Julian Reschke wrote:
> If browser implementers want to try something new that will not affect the old code paths, supporting the encoding defined in RFC 5987 might be the right thing to do (yes, it's ugly, but it's unambiguous).

It seems to me like that is a potential solution that could be evaluated. It would be nice to have both the HTTP response header and the POST form encoding be the same. However, a critical question is if the server software that parses the form headers would do the "right thing" if it sees both an ASCII fallback filename= and an escaped filename*= parameter in the Content-Disposition header. Without looking at any code, I suspect some will and some won't.
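
One possible reading of the "right thing" is to prefer filename*= when it is present and fall back to filename= otherwise. The Python sketch below shows that policy only; pick_filename and its regular expression are hypothetical, deliberately simplified, and not modelled on any existing server library.

```python
import re
from urllib.parse import unquote

# Matches "; name=value" where value is a quoted-string or a bare token.
_PARAM = re.compile(r';\s*([A-Za-z0-9*_-]+)=("(?:[^"\\]|\\.)*"|[^;]*)')


def pick_filename(header_value: str):
    """Return the decoded filename* if present, else the plain filename."""
    params = {}
    for name, raw in _PARAM.findall(header_value):
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            # Strip the quotes and undo backslash quoting.
            raw = re.sub(r"\\(.)", r"\1", raw[1:-1])
        params[name.lower()] = raw.strip()
    if "filename*" in params:
        charset, _language, pct_encoded = params["filename*"].split("'", 2)
        return unquote(pct_encoded, encoding=charset)
    return params.get("filename")


both = "form-data; name=\"somefile\"; filename=\"baz.txt\"; filename*=UTF-8''b%C3%A0z.txt"
only_fallback = 'form-data; name="somefile"; filename="baz.txt"'
print(pick_filename(both))
print(pick_filename(only_fallback))
```

A parser that does not know about filename*= would simply see an unknown parameter, which is why whether existing servers ignore it gracefully matters.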

My conclusion: I would be willing to help with bugs, testing, test cases, looking at server code, etc. related to this issue. However, I believe it needs someone who is experienced with the technology and politics of web standards to really champion any change, because I don't fully understand the processes or the issues. If I don't hear anything in a few days, I'll try filing some additional bugs with Webkit, Firefox, and the HTML5 spec and otherwise give up.

Thanks,

Evan Jones

--
http://evanjones.ca/


Re: multipart/form-data filename encoding: unicode and special characters

by Julian Reschke

On 2012-05-02 19:26, Evan Jones wrote:
> On May 2, 2012, at 7:43 , Julian Reschke wrote:
>> If browser implementers want to try something new that will not affect the old code paths, supporting the encoding defined in RFC 5987 might be the right thing to do (yes, it's ugly, but it's unambiguous).
>
> It seems to me like that is a potential solution that could be evaluated. It would be nice to have both the HTTP response header and the POST form encoding be the same. However, a critical question is if the server software that parses the form headers would do the "right thing" if it sees both an ASCII fallback filename= and an escaped filename*= parameter in the Content-Disposition header. Without looking at any code, I suspect some will and some won't.

I'm pretty sure everybody will ignore filename* for now. Which means
servers need to upgrade, but at least it would be an upgrade that
doesn't break any existing behavior.

> My conclusion: I would be willing to help with bugs, testing, test cases, looking at server code, etc. related to this issue. However, I believe it needs someone who is experienced with the technology and politics of web standards to really champion any change, because I don't fully understand the processes or the issues. If I don't hear anything in a few days, I'll try filing some additional bugs with Webkit, Firefox, and the HTML5 spec and otherwise give up.
> ...

Sounds like a plan.

Best regards, Julian

Re: multipart/form-data filename encoding: unicode and special characters

by Anne van Kesteren-2

On Tue, 01 May 2012 18:12:36 -0700, Evan Jones <evanj@...> wrote:
> … this seems contradictory: Encode using RFC 2388, but do not use the
> encoding suggested by the RFC. Worse, no browser actually follows the  
> RFC (e.g. they all use UTF-8 encoded parameter values), so that doesn't  
> seem like the right answer. Is there a way out of this mess?

Yes. I think we should define multipart/form-data directly in HTML and  
thereby obsolete http://tools.ietf.org/html/rfc2388 as it is outdated and  
not maintained.


--
Anne van Kesteren
http://annevankesteren.nl/

Re: multipart/form-data filename encoding: unicode and special characters

by Evan Jones-5

On May 3, 2012, at 17:09 , Anne van Kesteren wrote:
> Yes. I think we should define multipart/form-data directly in HTML and thereby obsolete http://tools.ietf.org/html/rfc2388 as it is outdated and not maintained.

Right; that would be ideal. Despite the fact that HTML5 references that RFC, browsers don't really follow it.

I would be interested in trying to help with this, but again I would certainly need some guidance from people who know more about the vagaries of how the various browsers encode their form parameters / uploaded file names, and why things got that way. It probably would not be helpful for me to try to draft an update to the spec without getting the right implementers on board.

Evan

--
http://evanjones.ca/


Re: multipart/form-data filename encoding: unicode and special characters

by Anne van Kesteren-4

On Fri, 04 May 2012 01:59:15 +0200, Evan Jones <evanj@...> wrote:
> I would be interested in trying to help with this, but again I would
> certainly need some guidance from people who know more about the
> vagaries of how the various browsers encode their form parameters /
> uploaded file names, and why things got that way. It probably would not
> be helpful for me to try to draft an update to the spec without getting
> the right implementers on board.

The best way to start with defining a legacy feature is figuring out
what the various implementations do. The next step is evaluating that
against what is most compatible and desirable and try to come up with
a proposal. Along with talking to / filing bugs on various
implementations to see if they can be converged. And then after a few
iterations (and sometimes quite a bit of time; e.g. the HTML parser
took about six years) you have standardized some legacy behavior and
made it fully interoperable. Additional benefit is that it's now
easier to extend, easier for new implementations, and easier for
existing implementations to rewrite their code if necessary.

PS: Apologies for potentially breaking threading by sending this
email. I'm switching email addresses.


--
Anne van Kesteren
http://annevankesteren.nl/

Re: multipart/form-data filename encoding: unicode and special characters

by Ian Hickson

On Thu, 3 May 2012, Evan Jones wrote:

> On May 3, 2012, at 17:09 , Anne van Kesteren wrote:
> >
> > Yes. I think we should define multipart/form-data directly in HTML and
> > thereby obsolete http://tools.ietf.org/html/rfc2388 as it is outdated
> > and not maintained.
>
> Right; that would be ideal. Despite the fact that HTML5 references that
> RFC, browsers don't really follow it.
>
> I would be interested in trying to help with this, but again I would
> certainly need some guidance from people who know more about the
> vagaries of how the various browsers encode their form parameters /
> uploaded file names, and why things got that way. It probably would not
> be helpful for me to try to draft an update to the spec without getting
> the right implementers on board.

If this is still something for which you have some time available, then
the starting point for anything like this would be test cases, lots and
lots of test cases. In this case, it would have to be something like a
server that echoes the precise bytes sent by the client, for a huge
variety of different setups:

 - various submission encodings
 - various form field names and types
 - various file submission filenames

...etc.
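
A server of that kind can be very small. The following Python sketch uses only the standard library's http.server module; the EchoHandler class and make_server function are hypothetical names, and the sketch echoes only the request body (not the request headers), which a real test harness would probably also want to record.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Reply to every POST with the exact bytes of the request body."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        self.send_response(200)
        # octet-stream keeps the client from reinterpreting the bytes.
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet while running test cases


def make_server(port: int = 0) -> HTTPServer:
    """Create (but do not start) an echo server on localhost.

    Port 0 asks the operating system for any free port.
    """
    return HTTPServer(("127.0.0.1", port), EchoHandler)


# To run it by hand:
#     make_server(8000).serve_forever()
# then point a form with enctype="multipart/form-data" at
# http://127.0.0.1:8000/ and compare what each browser sends.
```

Saving the echoed bytes for each browser, encoding, and file name would give the raw material for the comparison described above.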

I'd be happy to advise if this is something that still interests you.

--
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Re: multipart/form-data filename encoding: unicode and special characters

by Julian Reschke

On 2012-07-09 23:01, Ian Hickson wrote:

> On Thu, 3 May 2012, Evan Jones wrote:
>> On May 3, 2012, at 17:09 , Anne van Kesteren wrote:
>>>
>>> Yes. I think we should define multipart/form-data directly in HTML and
>>> thereby obsolete http://tools.ietf.org/html/rfc2388 as it is outdated
>>> and not maintained.
>>
>> Right; that would be ideal. Despite the fact that HTML5 references that
>> RFC, browsers don't really follow it.
>>
>> I would be interested in trying to help with this, but again I would
>> certainly need some guidance from people who know more about the
>> vagaries of how the various browsers encode their form parameters /
>> uploaded file names, and why things got that way. It probably would not
>> be helpful for me to try to draft an update to the spec without getting
>> the right implementers on board.
>
> If this is still something for which you have some time available, then
> the starting point for anything like this would be test cases, lots and
> lots of test cases. In this case, it would have to be something like a
> server that echoes the precise bytes sent by the client, for a huge
> variety of different setups:
>
>   - various submission encodings
>   - various form field names and types
>   - various file submission filenames
>
> ...etc.
>
> I'd be happy to advise if this is something that still interests you.

I agree with the methodology. However I would suggest to simply revise
RFC 2388.

Best regards, Julian



Re: multipart/form-data filename encoding: unicode and special characters

by Ian Hickson

On Mon, 9 Jul 2012, Julian Reschke wrote:
>
> I agree with the methodology. However I would suggest to simply revise
> RFC 2388.

The precise details of the process of how it's done are up to whoever
writes the spec text, they're not really relevant to this list.

--
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'