Re: Notes on validome test suite / validators comparison

Re: Notes on validome test suite / validators comparison

by Validome-Staff

Hi Olivier,
 
Here is our statement:
 
 
Here Validome advises the user to use our XML-Validator, as an HTML-Validator is not the appropriate tool to check XML... ;-)
 
> * http://www.validome.org/out/ena1005
 
Here we have corrected our claims; sorry for not keeping the comparison up to date. BUT: until you updated the W3C-Validator by implementing LibXML, the statements were right, and you had known this for about a year... (http://lists.w3.org/Archives/Public/www-validator/2006Apr/0072.html)
 
> * http://www.validome.org/out/ena4011
> HTML 4.01 document with no system Id.
> Validome sends a warning... Not necessary per the spec.
> W3C Markup validator passes validation.
> Why is W3C validator marked as faulty here? References please?
 
The other way round: where is it specified that the system identifier can be omitted?
 
> * http://www.validome.org/out/ena4012
> XHTML doctype without system Id, but valid public id.
> Validation should report an error (both validators do), but why does 
> validome count this as a fatal error?
 
That one has coding reasons.
 
> * http://www.validome.org/out/ena4023
> Validome says valid. OpenSP and W3C Markup validator says not valid.
> I'd tend to trust opensp here. The comparison page's claim that 
> validome is the only validator doing the right thing is very dubious.
 
> * http://www.validome.org/out/ena4024
> Ditto above. The comparison page's claim that validome is the only 
> validator doing the right thing is very dubious.
 
What is dubious here? It's about SGML (not HTML) documents.
 
 
> * http://www.validome.org/out/ena8
> W3C markup validator uses algorithm for charset detection, finds 
> none, uses fallback
> Validome uses... exactly the same algorithm (to the point of having 
> almost the same error message...), finds no charset, yields a fatal 
> error.
> I'm very curious to know why validome passes and w3c markup validator 
> fails here. I think the opposite: validome's taste for fatal error is 
> a grave failure in usability.
 
The "old" W3C-Validator fell back to US-ASCII, the "new" one to UTF-8. Can you explain this, please?
We have asked W3C-Germany and Bjoern Hoehrmann many times about the *correct* behaviour of a validator in the case of a fallback, but we never got an *exact* answer. In this case, the specs are very inexact and ambiguous. Please give us a *mandatory* answer - with a link to the appropriate specifications - for this case. The only clear case so far is XHTML, where validators should fall back to UTF-8 (depending on MIME type); HTML is still ambiguous...
 
 
> * http://www.validome.org/out/ena2008
> * http://www.validome.org/out/ena2009
> * http://www.validome.org/out/ena2010
Old, deprecated examples because of unclear and ambiguous specifications.
 
> * http://www.validome.org/out/ena2041
> The comparison page is incorrect. The W3C Markup validator has the 
> proper behavior here.
 
The W3C-Validator doesn't detect the conflict.
 
> * http://www.validome.org/out/ena5020
> I strongly disagree that the W3C Markup's validator behavior is 
> incorrect, here.
> text/html is allowed for XHTML 1.0
 
We don't claim here that the behaviour of the W3C-Validator is wrong; we say that the appropriate note is missing.
According to http://www.w3.org/TR/2002/REC-xhtml1-20020801/#media, an XHTML document should be delivered with MIME type text/html when it meets the HTML compatibility guidelines. That is what a validator should point out, and Validome does.
 
> * http://www.validome.org/out/ena7003
> I'd like to see a reference for this.
 
http://www.w3.org/TR/html401/struct/links.html#h-12.2.3
"...The id attribute, on the other hand, may not contain character references."
 
 
> * http://www.validome.org/out/ena7005 (and 7006)
> This has nothing to do with validation. If validome emulates some of 
> the features of a link checker, compare it to link checkers, not 
> validator. This test is moot.
 
http://www.w3.org/TR/html401/struct/links.html#h-12.2.4
"A reference to an unavailable or unidentifiable resource is an error"
...
"If a user agent cannot locate a linked resource, it should alert the user"
 
Where is the "moot" here? The W3C specification is very clear in this case...
 
 
> * http://www.validome.org/out/ena3002
> This test is bogus. Sorry. An XML declaration also happens to be a 
> proper SGML PI. Giving a warning asking the HTML4 author "are you 
> sure you want this here" may be a good idea. Making this a fatal 
> error is wrong, wrong, wrong.
 
If an XML declaration is allowed in SGML, I'd like to see a reference for this.
*If* this is right, what about the priority of the encoding attribute within the declaration vs. the META element?
 
> * http://www.validome.org/out/ena3006
> The comparison page is incorrect. Output of a warning for a shorttag 
> construct is a good thing (dev version of w3c validator actually does 
> it) but not required. The current W3C Validator's behavior is not wrong.
>
> * http://www.validome.org/out/ena3007
> ditto. Learn about shorttags. Validome is actually wrong here, this 
> should not be reported as an error, at most a warning.
 
Oh, here we have a hundred opinions on the case. Could you please show us an *exact* reference?
 
BTW:
 
At the moment, we are integrating the W3C Validator into a free, out-of-the-box software solution for Windows users, together with Validome.
While doing so, we found some inconsistencies/bugs:
 
1. The W3C-SGML-Parser uses two catalog files: xml.soc and sgml.soc. Within xml.soc there are 21 entries missing, all regarding SVG 1.1 Tiny and SVG 1.1 Basic.
2. We found 6 DTDs missing that are necessary to get the download package running.
3. Your LibXML implementation was not correct - you just use the catalog files of your SGML-Parser instead of following the "official" catalog specification (http://www.xmlsoft.org/catalog.html#Simple).
Because of this, LibXML tries to get the external DTDs instead of the local ones.
 
You write within your CVS (http://dev.w3.org/cvsweb/validator/httpd/cgi-bin/check?rev=1.574&content-type=text/x-cvsweb-markup):
 
# [NOT] loading the XML catalog for entities resolution as it seems to cause a lot of unnecessary DTD/entities fetching (requires >= 1.53 if enabled)
#$xmlparser->load_catalog( File::Spec->catfile($CFG->{Paths}->{SGML}->{Library}, 'xml.soc') );
 
That is not right, as your implementation is not correct. BUT: You can download the fixes from http://www.validome.org/W3C_fix.rar; with the fixes it works (problems 1, 2 and 3 solved).
 
Best regards,
 
Alex

Re: Notes on validome test suite / validators comparison

by Frank Ellermann


Validome-Staff wrote:

> Validome advises the user to use our XML-Validator, as an HTML-Validator
> is not the appropriate tool to check XML... ;-)

When I try to validate <http://idn.icann.org/IDNwiki> at your site
I get the same incorrect "valid" result as with the W3C validator.

For the W3C validator I know that it can't (yet) check URI syntax,
but it's disappointing that your validator also fails.  Is that an
issue in the "XHTML 1.0 transitional" schema or in your code ?

 Frank




Re: Notes on validome test suite / validators comparison

by Validome-Staff


Hi Frank,

> Is that an
> issue in the "XHTML 1.0 transitional" schema or in your code ?

Neither an issue in our schema, nor an issue in our code. Our schema
validator in the current version simply verifies URIs in accordance with
the applicable requirements:
http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/#anyURI
 "such rules and restrictions are not part of type validity and are not
checked by ·minimally conforming· processors. Thus in practice the above
definition imposes only very modest obligations on ·minimally conforming·
processors. "

As you know, providing a reliable URI check is not as simple as you claim.
At the moment Validome handles URIs with the help of a (simple) schema
validation... It's not exactly brilliant, but - as we know - there is
no validator at the moment that handles it much better. It is necessary to
develop another *concept* for URI checking. Someone in our team is
currently working on URI handling concepts for our validator. As
"URI" is a nontrivial issue, it will take some more time to model and
code an acceptable solution for a sustainable URI check. This will
probably be in V3.0 (current version: 2.6.1).

BTW:
Validome has supported validation of IDN domains since May 2007:
http://www.validome.org/validate/?uri=http://www.h%c3%a4ndewaschen.de
This would also make sense for the W3C-Validator, as it cannot handle IDN
domain validation so far:
http://validator.w3.org/check?uri=http%3A%2F%2Fwww.h%C3%A4ndewaschen.de&charset=%28detect+automatically%29&doctype=Inline&group=0
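
For readers unfamiliar with the two forms involved in the URLs above: the `%c3%a4` in the query string is the UTF-8 percent-encoding of "ä", and an IDN-aware client converts the host name to its ASCII ("xn--") form before looking it up. A minimal Python sketch, assuming the standard library's built-in `idna` codec (which implements the older IDNA 2003 rules) is good enough for illustration; it says nothing about how either validator actually does this:

```python
from urllib.parse import unquote

# Host name as it appears, percent-encoded, in the validator URL above.
encoded_host = "www.h%c3%a4ndewaschen.de"

# Step 1: undo the percent-encoding (UTF-8 is the default).
unicode_host = unquote(encoded_host)

# Step 2: convert each label to its ASCII-compatible ("xn--") form.
ascii_host = unicode_host.encode("idna").decode("ascii")

# Step 3: the conversion is reversible.
round_trip = ascii_host.encode("ascii").decode("idna")

print(unicode_host)
print(ascii_host)
print(round_trip == unicode_host)
```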



Re: Notes on validome test suite / validators comparison

by Frank Ellermann


Validome-Staff wrote:
 
> http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/#anyURI
> "such rules and restrictions are not part of type validity
> and are not checked by ·minimally conforming· processors.
> Thus in practice the above definition imposes only very
> modest obligations on ·minimally conforming· processors. "

The 2nd edition 2004 still has the same text talking about
RFC 2396 as amended by 2732 instead of RFC 3986 (STD 66) -
okay, just checked it, STD 66 was published in January 2005.

> As you know, providing a reliable URI check is not as simple
> as you claim.

The regexp in STD 66 is a one-liner, and determining the set
of visible ASCII characters allowed in an URI is "possible".
(Actually it's trivial, but it took me almost a year to figure
 it out with the help of Roy and others on the W3C URI list. ;-)
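
The "one-liner" mentioned here is presumably the regular expression from Appendix B of RFC 3986, which splits a URI reference into its five components; on its own it does not reject illegal characters. A rough Python sketch of both ideas follows. The function names are made up for illustration, and the character check is deliberately simplistic (for instance, it does not verify that each "%" is followed by two hex digits):

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

# Visible ASCII characters that may appear in a URI:
# unreserved, gen-delims, sub-delims, and "%" for percent-encoding.
UNRESERVED = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
ALLOWED = set(UNRESERVED + GEN_DELIMS + SUB_DELIMS + "%")


def split_uri(reference):
    """Split a URI reference into (scheme, authority, path, query, fragment)."""
    m = URI_REFERENCE.match(reference)  # every group is optional, so this always matches
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


def has_only_uri_characters(reference):
    """Hypothetical helper: True if every character is in the RFC 3986 repertoire."""
    return all(ch in ALLOWED for ch in reference)


print(split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
print(has_only_uri_characters("http://example.org/a%20b"))
print(has_only_uri_characters("http://example.org/a b"))
```

A checker built this way would simply flag anything outside the allowed set as invalid, without guessing what the author meant.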

> as we know - there is no validator at the moment that
> handles it much better.

Indeed, I just asked WDG and schneegans.de what they think,
they also said "valid".

 Frank



Re: Notes on validome test suite / validators comparison

by Validome-Staff


Hi Frank,

> The regexp in STD 66 is a one-liner, and determining the set
> of visible ASCII characters allowed in an URI is "possible".

My post was not only about the ASCII character issue... There are some more
"URI" problems that a schema validator doesn't catch at the moment. Even MSXML
and Altova DON'T detect them. An "RFC Conformity Checker" for URIs is much
more than this single ASCII issue. We already have about 100 test cases on
issues that schema validators cannot check. So, NONtrivial... ;-)

Best regards,

Alex
---------------------------
http://www.validome.org/



Critical bug 4916 (was: Notes on validome test suite / validators comparison)

by Frank Ellermann


Alex wrote:
 
>> The regexp in STD 66 is a one-liner, and determining the set
>> of visible ASCII characters allowed in an URI is "possible".
 
> My post was not only about the ASCII character issue... There
> are some more "URI" problems that a schema validator doesn't
> catch at the moment.

Well, it's about time to fix this.  After the installation of a
"popular browser" on a "popular OS" virtually all applications
allowing to click on URIs could indirectly start malware.  It's
hard to decide whose fault that is, but saying that it's only
the fault of the user is no option.

All, please "vote" for bug 4916 and support its reclassification
as "critical" with "priority 1" for an immediate fix.  We all had
almost three years to think about RFC 3986 and 3987.  It's a good
thing that the IDN test finally forces some action.
 
> An "RFC Conformity Checker" for URIs is much more than this single
> ASCII issue.

The generic RFC 3986 syntax is no rocket science, just ignore all
idiosyncrasies of legacy definitions as in RFC 2368, admittedly
mailto: is a hard case.  The syntax in the expired mailto-bis draft
is better.

For a validator you're not forced to guess what invalid syntax is
supposed to mean, simply flag it as invalid and be done with it.

> NONtrivial...;-)

Maybe we can agree on an "interesting clerical task".  The xmpp
folks (i.e. Peter) had to fix their syntax for 3986-compatibility,
they (i.e. he) managed.

 Frank



Re: Critical bug 4916 (was: Notes on validome test suite / validators comparison)

by Validome-Staff

Hi Frank,
 
>> The regexp in STD 66 is a one-liner, and determining the set
>> of visible ASCII characters allowed in an URI is "possible".
 
and
 
> all applications
> allowing to click on URIs could indirectly start malware.
 
Fixing the schema is one thing... What about the same issue within HTML documents?
No XHTML --> no schema validation... As you see, what we need is much more than a schema fix. It is a new concept and code for a reliable URI check in XHTML *and* HTML. A URI checker that covers ALL kinds of failures...
Frank, to be honest, you could create hundreds of URI test cases that even schema validators don't detect. Just try it and you'll see: it is about new, reliable solutions and not about "patchy fixes"... ;-)
 
Perhaps it's time to retire the term "validator" as it is and seriously discuss "conformity checkers", as a validator - as defined at the moment - cannot keep pace with the new requirements and fast application development of today.
 
Best regards,
 
Alex

Re: Notes on validome test suite / validators comparison

by olivier Thereaux


Hi Alex,

Thanks a lot for going through the list, and giving more references.  
This is very useful.

On Oct 20, 2007, at 00:05 , Validome-Staff wrote:
> Here Validome advises the user to use our XML-Validator, as an
> HTML-Validator is not the appropriate tool to check XML... ;-)

Understood, but as I wrote, I think it's not very good usability to  
call this a fatal error, when you could transparently redirect to  
your XML checker.


> Here we corrected our claims, sorry for not keeping the comparison  
> up to date.

Appreciated.


> > * http://www.validome.org/out/ena4011
> > HTML 4.01 document with no system Id.
> > Validome sends a warning... Not necessary per the spec.
> > W3C Markup validator passes validation.
> > Why is W3C validator marked as faulty here? References please?
>
> http://www.w3.org/TR/1999/REC-html401-19991224/struct/global.html#h-7.2
> The other way round: where is it specified that the system
> identifier can be omitted?

SGML, which HTML 4.01 is an application of.
Only in XML is the system identifier required, per:
http://www.w3.org/TR/xml/#NT-ExternalID


> > * http://www.validome.org/out/ena4023
> > Validome says valid. OpenSP and W3C Markup validator says not valid.
> > I'd tend to trust opensp here. The comparison page's claim that
> > validome is the only validator doing the right thing is very dubious.
>
> > * http://www.validome.org/out/ena4024
> > Ditto above. The comparison page's claim that validome is the only
> > validator doing the right thing is very dubious.
>
> What is dubious here? It's about SGML (not HTML) documents.

And?


> The "old" W3C-Validator fell back to US-ASCII, the "new" one to
> UTF-8. Can you explain this, please?
> We have asked W3C-Germany and Bjoern Hoehrmann many times about
> the *correct* behaviour of a validator in the case of a fallback,
> but we never got an *exact* answer. In this case, the specs are
> very inexact and ambiguous. Please give us a *mandatory* answer -
> with a link to the appropriate specifications - for this case.
> The only clear case so far is XHTML, where validators should fall
> back to UTF-8 (depending on MIME type); HTML is still ambiguous...

There is no authoritative answer as far as I can tell, which supports  
my question: why do you consider your sending a fatal error the right  
thing to do, and other validators trying a fallback wrong? If there  
is no rule, you are not supposed to make arbitrary ones and claim you  
are the only ones to respect them.
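
The disagreement above comes down to what a checker does once all of its detection steps come up empty. A minimal Python sketch, assuming a simplified detection order (HTTP header, then XML declaration, then META element) and a hypothetical `detect_charset` helper; it is not the code of either validator, and the regular expressions are only rough approximations:

```python
import re

TOKEN = r"([\w.:-]+)"  # rough approximation of a charset name


def detect_charset(content_type, body, fallback="utf-8"):
    """Return (charset, source). With fallback=None, a missing charset is an error."""
    # 1. charset parameter of the HTTP Content-Type header
    if content_type:
        m = re.search(r"charset\s*=\s*\"?" + TOKEN, content_type, re.I)
        if m:
            return m.group(1).lower(), "http"
    # Only look at the start of the document, read as ASCII.
    head = body[:1024].decode("ascii", errors="replace")
    # 2. encoding pseudo-attribute of an XML declaration
    m = re.match(r"<\?xml[^>]*\bencoding\s*=\s*[\"']" + TOKEN, head)
    if m:
        return m.group(1).lower(), "xml declaration"
    # 3. charset given in a META element
    m = re.search(r"<meta[^>]+charset\s*=\s*[\"']?" + TOKEN, head, re.I)
    if m:
        return m.group(1).lower(), "meta"
    # 4. nothing found: the policy decision under discussion
    if fallback is None:
        raise ValueError("no character encoding declared")
    return fallback, "fallback"


print(detect_charset("text/html; charset=ISO-8859-1", b"<html></html>"))
print(detect_charset("text/html", b"<html></html>"))
print(detect_charset("text/html", b"<html></html>", fallback="us-ascii"))
```

In this sketch, passing `fallback="us-ascii"`, `fallback="utf-8"` or `fallback=None` corresponds to the three behaviours discussed in the thread; the detection steps themselves are identical in all three cases.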


> > * http://www.validome.org/out/ena7003
> > I'd like to see a reference for this.
>
> http://www.w3.org/TR/html401/struct/links.html#h-12.2.3
> "...The id attribute, on the other hand, may not contain character  
> references."

Interesting discrepancy between prose and DTD here, thanks for the  
pointer.


> > * http://www.validome.org/out/ena7005 (and 7006)
> > This has nothing to do with validation. If validome emulates some of
> > the features of a link checker, compare it to link checkers, not
> > validator. This test is moot.
>
> http://www.w3.org/TR/html401/struct/links.html#h-12.2.4
> "A reference to an unavailable or unidentifiable resource is an error"
> ...
> "If a user agent cannot locate a linked resource, it should alert  
> the user"
>
> Where is the "moot" here? The W3C specification is very clear in  
> this case...

This is the usual confusion between user agent conformance (which the  
sections you quote are about) and document conformance (which  
validome and the markup validator are checking).


> > * http://www.validome.org/out/ena3002
> > This test is bogus. Sorry. An XML declaration also happens to be a
> > proper SGML PI. Giving a warning asking the HTML4 author "are you
> > sure you want this here" may be a good idea. Making this a fatal
> > error is wrong, wrong, wrong.
>
> If an XML declaration is allowed in SGML, I'd like to see a  
> reference for this.

What I have is the SGML spec, chapter 8. Processing instructions.


[on shorttags]
> Oh, here we have a hundred opinions on the case. Could you please show  
> us an *exact* reference?

The best I have is the informative:
http://www.w3.org/TR/html401/appendix/notes.html#h-B.3.7
and the normative DTD, which allows the shorttags. As such, the spec  
clearly allows the construct, while informatively warning against it.


I'll reply to your notes on distributing the markup validator in a  
separate mail.

Thank you,
--
olivier




Re: validator catalogs

by olivier Thereaux


Hi Alex,

On Oct 20, 2007, at 00:05 , Validome-Staff wrote:
> 1. The W3C-SGML-Parser uses two catalog files: xml.soc and  
> sgml.soc. Within xml.soc there are 21 entries missing, all regarding  
> SVG 1.1 Tiny and SVG 1.1 Basic.

The issues with SVG 1.1 Tiny and Basic are actually a bit more  
complicated.
See this mail:
http://lists.w3.org/Archives/Public/www-svg/2007Oct/0005.html
I think the workaround we found last month is better, see:
http://lists.w3.org/Archives/Public/www-validator-cvs/2007Oct/0018.html

I note that you also added a number of modules and files for XHTML  
print and basic, good idea.


> 2. We found 6 DTDs missing that are necessary to get the download package running.

Added, thanks.

> 3. Your LibXML implementation was not correct - you just use the  
> catalog files of your SGML-Parser instead of following the  
> "official" catalog specification
> (http://www.xmlsoft.org/catalog.html#Simple).
> Because of this, LibXML tries to get the external DTDs instead of  
> the local ones.

Indeed, it was incorrect, but in the end we decided to not fix it,  
because loading of the catalogue is only supported after a certain  
version of XML::LibXML - hence we just didn't load anything and muted  
the entities errors. It may still be a good idea to fix it, although  
I'm not sure what version of XML::LibXML is supported by most systems.

Thank you,
--
olivier Thereaux - W3C - http://www.w3.org/People/olivier/
W3C Open Source Software: http://www.w3.org/Status





Re: Notes on validome test suite / validators comparison

by olivier Thereaux


Frank,

On Oct 20, 2007, at 23:22 , Frank Ellermann wrote:
> When I try to validate <http://idn.icann.org/IDNwiki> at your site
> I get the same incorrect "valid" result as with the W3C validator.
>
> For the W3C validator I know that it can't (yet) check URI syntax,
> but it's disappointing that your validator also fails.  Is that an
> issue in the "XHTML 1.0 transitional" schema or in your code ?

I'm curious as to why you so adamantly want to ban non-ascii IRIs  
from HTML?

More on this later, but from what I am gathering from the experts,  
given the spirit of the specs (written before IDNs and IRIs) and the  
level of support for IDNs, barking at IRIs in href and src would be  
counterproductive for the internationalization of the web.

--
olivier


Re: validator catalogs

by olivier Thereaux


Alex, all

On Oct 24, 2007, at 15:04 , olivier Thereaux wrote:
> I note that you also added a number of modules and files for XHTML  
> print and basic, good idea.

FWIW, the file included in the RAR had a number of typos, that would  
break any validator using them.

I committed to CVS a proper version, I suggest you use this if you  
are to package the w3c markup validator.

Thanks again.
--
olivier




Re: Notes on validome test suite / validators comparison

by Frank Ellermann


olivier Thereaux wrote:
 
> I'm curious as to why you so adamantly want to ban non-ascii IRIs  
> from HTML?

Please tell me that you're joking.  Native IRIs are nice where they
are permitted.  But on ICANN's Wiki using XHTML 1.0 they will cause
havoc:

Sooner or later mediawiki will be fixed to generate valid XHTML 1.0,
translating native IRIs to equivalent URIs on the fly.  After all
that's REQUIRED for backwards compatibility in the numerous Wikis
based on mediawiki.  Users want that something happens when they
click on a link, without upgrading their browser.  And native IRIs
are designed to have an equivalent URI-form.

Sooner or later validators will be fixed to validate URIs, what with
all those "URI exploits" we've seen in the last weeks for XP after
the installation of IE7. And when validators do their job all users
who naively followed ICANN and W3C into the realms of "who cares
about validity if it works" will be seriously annoyed.

I can still tell you the day when the W3C validator started to flag
&#128; as invalid on a windows-1252 page. I was working on this
page, it was stunning.

> from what I am gathering from the experts, given the spirit of the
> specs (written before IDNs and IRIs) and the level of support for
> IDNs, barking at IRIs in href and src would be counterproductive
> for the internationalization of the web.

I'm curious which expert advocates violating specifications.  Want
to know how long it took me to create an XHTML ersatz-DTD permitting
IRIs everywhere ?

30 minutes.  Check out
http://hmdmhdfmhdjmzdtjmzdtzktdkztdjz.googlepages.com/IDN-IRI-test.html

 Frank



Re: Notes on validome test suite / validators comparison

by Validome-Staff


Hi Frank,

> Sooner or later validators will be fixed to validate URIs

I asked our guy working on it; he told me it will be in beta at the beginning
of December 2007. Could you take a look at it before release? Severe
criticism is welcome.

Best regards,

Alex
------------------------
http://www.validome.org/



Re: IRIs in href (Was: Notes on validome test suite / validators comparison)

by olivier Thereaux



On Oct 25, 2007, at 03:39 , Frank Ellermann wrote:
> Users want that something happens when they
> click on a link, without upgrading their browser.  And native IRIs
> are designed to have an equivalent URI-form.

Lack of support for IRIs in legacy user agents is an issue, understood.
Now, if today the HTML 4.01 and XHTML 1.0 specs and above were  
updated to say "IRIs" instead of "URIs", what would you do?

As I wrote before, these specs were written before IRIs were a reality.
The HTML4 spec contains advice on how to treat "URIs containing non-
ASCII characters".
See   http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1
Although it clearly calls these illegal, it prepares the ground for  
IRIs (for which we didn't yet have that name at that time).

Saying that IRIs should not be used because they break in legacy  
software, is an argument I have sympathy for, but have trouble  
accepting. This reminds me of the situation whereby, in Japan, one  
still can't safely use unicode in mails, because so many MUAs or  
webmails just don't support it.

> Sooner or later validators will be fixed to validate URIs, what with
> all those "URI exploits" we've seen in the last weeks for XP after
> the installation of IE7.

This is irrelevant to the discussion about IRIs. Please don't use  
internationalization as a scapegoat for bad coding.

> I can still tell you the day when the W3C validator started to flag
> &#128; as invalid on a windows-1252 page. I was working on this
> page, it was stunning.

There once was a bug, and IIRC it was fixed in a few hours. Now, how  
is that relevant to the discussion at hand?

> I'm curious which expert advocates violating specifications.  Want
> to know how long it took me to create an XHTML ersatz-DTD permitting
> IRIs everywhere ? 30 minutes.

Here you must be joking, bluffing, or mistaken, Frank. The current  
XHTML DTD says that URIs are CDATA, and thus any SGML or XML  
validator has to accept all the characters allowed in the document,  
which includes all those usable in IRIs.

--
olivier



Re: IRIs in href (Was: Notes on validome test suite / validators comparison)

by rubys


olivier Thereaux wrote:

>
>
> On Oct 25, 2007, at 03:39 , Frank Ellermann wrote:
>> Users want that something happens when they
>> click on a link, without upgrading their browser.  And native IRIs
>> are designed to have an equivalent URI-form.
>
> Lack of support for IRIs in legacy user agents is an issue, understood.
> Now, if today the HTML 4.01 and XHTML 1.0 specs and above were updated
> to say "IRIs" instead of "URIs", what would you do?
>
> As I wrote before, these specs were written before IRIs were a reality.
> The HTML4 spec contains advice on how to treat "URIs containing
> non-ASCII characters".
> See   http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1
> Although it clearly calls these illegal, it prepares the ground for IRIs
> (for which we didn't yet have that name at that time).
>
> Saying that IRIs should not be used because they break in legacy
> software, is an argument I have sympathy for, but have trouble
> accepting. This reminds me of the situation whereby, in Japan, one still
> can't safely use unicode in mails, because so many MUAs or webmails just
> don't support it.

What about saying that IRIs should not be used because RFC 3987 section
1.2 item (a) says that this standard is not intended to apply to any
protocol or format element unless those formats or protocols explicitly
say that IRIs are supported?

- Sam Ruby


Re: IRIs in href

by Frank Ellermann


olivier Thereaux wrote:

> Now, if today the HTML 4.01 and XHTML 1.0 specs and above were  
> updated to say "IRIs" instead of "URIs", what would you do?

Maybe ditch the W3C and post the reasons in an Internet Draft.
I'd certainly consider it as unethical.

RFC 3987 does not "update" 3986.  The spec.s should be updated
with s/2396/3986/g, s/3066/4646/g, and similar clerical tasks,
e.g. explaining why xml:lang is forced to be still an NMTOKEN
wrt these document types.

But for incompatible modifications we need new document types.

Not worldwide "upgrade your browser" campaigns, some users
can't, and besides it's completely unnecessary, all IRIs by
definition have an equivalent URI working with "any browser".

> Saying that IRIs should not be used because they break in
> legacy software, is an argument I have sympathy for, but
> have trouble accepting.

I'm not surprised if folks active in the W3C don't care much
about "backwards compatibility".  But admittedly I was very
surprised when you introduced "let's not care about formally
valid" as new concept.  A user armed with an old text mode
browser could take out the ICANN IDN test, AFAIK "formally
invalid" is a FAIL in any accessibility test, isn't it ?

> This reminds me of the situation whereby, in Japan, one  
> still can't safely use unicode in mails, because so many
> MUAs or webmails just don't support it.

Maybe they have plausible reasons why they don't need or
don't like it.  BTW, the (formally valid) IDN test page
I've created last week was the first XHTML page where I
actually needed UTF-8.  Now I'm curious what browsers do
with an IRI in a legacy charset.  RFC 3987 allows this.

>> Sooner or later validators will be fixed to validate
>> URIs, what with all those "URI exploits" we've seen in
>> the last weeks for XP after the installation of IE7.
 
> This is irrelevant to the discussion about IRIs. Please
> don't use internationalization as a scapegoat for bad
> coding.

It's relevant for the discussion of bug 4916 submitted
by you 2007-08-07.  If that bug is fixed it might also
detect IRIs where only URIs are allowed.

Admittedly almost impossible for a validator based on
DTDs, maybe you end up with a clumsy hack working only
for a few very important document types.  

>> I can still tell you the day when the W3C validator
>> started to flag &#128; as invalid on a windows-1252
>> page. I was working on this page, it was stunning.
 
> There once was a bug, and IIRC it was fixed in a few
> hours.

Two days after 9/11, it's good if it only took you a few
hours.  But it took me several months to figure out why
I need octet 128 instead of NCR &#128;.
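
The distinction Frank describes can be reproduced in a few lines of Python. This only illustrates the character arithmetic, not the validator's behaviour at the time: byte 128 means the euro sign only when it is decoded as windows-1252, whereas a numeric character reference names a Unicode code point, and code point 128 is a C1 control character.

```python
import unicodedata

euro = "\u20ac"  # EURO SIGN

# In windows-1252 the euro sign is the single octet 128 (0x80).
octet = euro.encode("cp1252")
print(octet)
print(bytes([128]).decode("cp1252") == euro)

# A numeric character reference refers to a Unicode code point instead.
# Code point 128 (U+0080) is a control character, not the euro sign.
print(unicodedata.category(chr(128)))  # "Cc" means control character
print(ord(euro))  # the code point a reference to the euro sign would need
```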

> Now, how is that relevant to the discussion at hand?

If everybody and his dog start to use IRIs in document
types where it's not permitted, and some time later an
improved validator informs them that this was invalid,
the disturbed users will be annoyed.

> The current XHTML DTD says that URIs are CDATA

Sure, the details are specified in the prose, the DTD
only uses an entity name %URI;  It could also use %FOO;
or %IRI; as name.  Likewise the RFC 2396 in the DTD is
only a comment.  

 Frank



Re: IRIs in href

by "Martin J. Dürst"


Frank Ellermann wrote:

>olivier Thereaux wrote:
>
>> Now, if today the HTML 4.01 and XHTML 1.0 specs and above were  
>> updated to say "IRIs" instead of "URIs", what would you do?
>
>Maybe ditch the W3C and post the reasons in an Internet Draft.
>I'd certainly consider it as unethical.

I can understand your feelings, but they'd only clarify what
they meant when they recommended, in
http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1,
to convert URIs with non-ASCII characters to UTF-8 and
then to use percent-encoding, or what they meant when
they used %URI; but declared it as being CDATA in the DTD.


>RFC 3987 does not "update" 3986.

Very correct. It was never intended as that, it was intended
to serve as a stable specification for all those specs
(including HTML4) that wanted to use the concept but had
to use circumscriptive language.


>The spec.s should be updated
>with s/2396/3986/g, s/3066/4646/g, and similar clerical tasks,
>e.g. explaining why xml:lang is forced to be still an NMTOKEN
>wrt these document types.
>
>But for incompatible modifications we need new document types.

The new document type would not at all differ in functionality
from the old one. The only changes might be comments and
the names of parameter entities, but as with programs, that
doesn't change the functionality at all.


>Not worldwide "upgrade your browser" campaigns, some users
>can't, and besides it's completely unnecessary, all IRIs by
>definition have an equivalent URI working with "any browser".

Yes, but for people who can actually read and understand
"испытание" better than "testing", испытание
may be very helpful, whereas
%D0%B8%D1%81%D0%BF%D1%8B%D1%82%D0%B0%D0%BD%D0%B8%D0%B5
would be just garbage for them.
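
To make the relation between the two forms concrete, here is a
minimal Python sketch of the conversion recommended in HTML 4
B.2.1 (encode as UTF-8, then percent-encode each octet). It only
handles a single path segment, and the variable names are purely
illustrative.

```python
from urllib.parse import quote, unquote

# Readable form, as a Russian reader would type or read it.
segment = "испытание"

# UTF-8 first, then percent-encode every resulting octet.
encoded = quote(segment, safe="", encoding="utf-8")
print(encoded)

# The mapping is reversible, so nothing is lost either way.
assert unquote(encoded, encoding="utf-8") == segment
```

The encoded form is what goes over the wire; the readable form is
what a person would want to see in the document source.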


>> Saying that IRIs should not be used because they break in
>> legacy software, is an argument I have sympathy for, but
>> have trouble accepting.
>
>I'm not surprised if folks active in the W3C don't care much
>about "backwards compatibility".  But admittedly I was very
>surprised when you introduced "let's not care about formally
>valid" as new concept.

Formally valid means valid according to the DTD, I guess.
In this respect, IRIs have always been valid.

>A user armed with an old text mode
>browser could take out the ICANN IDN test, AFAIK "formally
>invalid" is a FAIL in any accessibility test, isn't it?

I'm not sure what you are after here, but if you want to claim
that IRIs somehow are anti-accessibility, then I think you
should consider the following two points:
a) There are temporary accessibility issues and long-term
   accessibility issues. Temporary accessibility issues are
   issues of the kind "The current screen readers/audio
   browsers/... only support foo, so in order to be accessible,
   use foo, not bar". Once the technology has caught up
   (and accessibility technology improves in the same way
   other technology improves), such a requirement may no
   longer apply.
b) A Russian screen reader will definitely do a better job
   with испытание than with
   %D0%B8%D1%81%D0%BF%D1%8B%D1%82%D0%B0%D0%BD%D0%B8%D0%B5.
   Other screen readers may have problems, but then,
   испытание will mostly appear in Russian
   documents.


>> This reminds me of the situation whereby, in Japan, one  
>> still can't safely use unicode in mails, because so many
>> MUAs or webmails just don't support it.
>
>Maybe they have plausible reasons why they don't need or
>don't like it.  BTW, the (formally valid) IDN test page
>I've created last week was the first XHTML page where I
>actually needed UTF-8.  Now I'm curious what browsers do
>with an IRI in a legacy charset.  RFC 3987 allows this.

For some tests, please see
http://www.sw.it.aoyama.ac.jp/2005/iritest/,
in particular the "Legacy Human" section at
http://www.sw.it.aoyama.ac.jp/2005/iritest/HTML/index.html.


>>> Sooner or later validators will be fixed to validate
>>> URIs, what with all those "URI exploits" we've seen in
>>> the last weeks for XP after the installation of IE7.
>
>> This is irrelevant to the discussion about IRIs. Please
>> don't use internationalization as a scapegoat for bad
>> coding.
>
>It's relevant for the discussion of bug 4916 submitted
>by you 2007-08-07.  If that bug is fixed it might also
>detect IRIs where only URIs are allowed.

In the original mail, Olivier actually wrote:

>>>>
1) a parser to check that a given string is a proper URI/IRI
This surely already exists, hopefully as open source code, or even  
better, as a perl module. Does anyone want to investigate this?
>>>>

That then got shortened in the bug report.

But there is a rather fundamental reason why this actually
may be a bad idea: URIs/IRIs are supposed to be very
flexible. If somebody came along tomorrow with a very
great idea for an extension to the URI syntax, and the
community agreed with that extension, even if it wouldn't
fit the current syntax definition, then this would lead
to an update of the URI spec.

Let me illustrate the idea behind the above with a somewhat
more concrete example:
If you want to create some software that tries to spot
potential mistakes in an HTML document, I'd guess you'd
surely flag something like <a href='htpp://www.w3.org'...
But even if you read the URI spec in detail, there's
nothing there that says it's illegal. Somebody could
register the "htpp:" scheme at any time. As another
example, consider the following:
   <img src='http://example.org/top.html'>
Again, this clearly looks like a mistake, one wouldn't
use a link to a Web page in an src attribute. But you
never know when some browsers might actually implement
something like a thumbnail view of a web page in such
a case (apart from the fact that you also don't know
that top.html is a Web page and not some image).
Again, <img src='mailto:abc@...'> looks like
nonsense, but again, it may make sense in the future.

The point is that URIs and IRIs are intended to be a very
general mechanism to connect resources on the Web, and
that any restriction has to be considered very carefully.
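
As a rough illustration of how little the generic syntax
constrains, here is a small Python sketch using the standard
library's generic splitter (not a validator, and not anything
the W3C validator actually does). Both examples from above come
apart into components without complaint.

```python
from urllib.parse import urlsplit

examples = [
    "htpp://www.w3.org",            # probably a typo, but a well-formed scheme
    "http://example.org/top.html",  # odd in <img src>, still a fine URI
]

for ref in examples:
    parts = urlsplit(ref)
    # A generic parser only splits the reference into components;
    # it has no opinion on whether the scheme is registered or sensible.
    print(parts.scheme, parts.netloc, parts.path)
```

Anything beyond this split is a policy decision layered on top
of the syntax.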


>Admittedly almost impossible for a validator based on
>DTDs, maybe you end up with a clumsy hack working only
>for a few very important document types.  

Yes, trying to restrict the syntax in a field declared
as CDATA (for attributes) or PCDATA (for elements) based
only on the name of the field, the name of a parameter
entity, or a comment found nearby would be difficult.


>>> I can still tell you the day when the W3C validator
>>> started to flag &#128; as invalid on a windows-1252
>>> page. I was working on this page, it was stunning.
>
>> There once was a bug, and IIRC it was fixed in a few
>> hours.
>
>Two days after 911, it's good if it only took you a few
>hours.  But it took me several months to figure out why
>I need octet 128 instead of NCR &#128;.

Sorry to be a bit direct here, but if it took you several
months to figure out why you need octet 128 rather than
NCR &#128;, then at least at that point in time, you
didn't really know much about the fundamentals of Web
internationalization. If you look at the SGML declaration
and at the DTD, it's very clear that &#128; is illegal
and non-valid.
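
For readers who have not run into this before, a short Python
sketch of the distinction (an illustration only, using Python's
"cp1252" codec as a stand-in for windows-1252): the octet belongs
to the page encoding, while a numeric character reference always
names a code point of the document character set.

```python
import unicodedata

# Octet 128 (0x80) in a windows-1252 byte stream is the euro sign.
euro = bytes([0x80]).decode("cp1252")
print(euro, hex(ord(euro)))

# A numeric character reference names a character of the document
# character set (ISO 10646), not an octet of the page encoding.
# Code point 128 is a C1 control character there.
print(unicodedata.category(chr(128)))

# References that do denote the euro sign use its code point.
print(f"&#{ord(euro)};", f"&#x{ord(euro):X};")
```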


>> Now, how is that relevant to the discussion at hand?
>
>If everybody and his dog start to use IRIs in document
>types where it's not permitted, and some time later an
>improved validator informs them that this was invalid,
>the disturbed users will be annoyed.

Well, this is a circular argument. "Let's annoy users
now so that we don't need to annoy them later." doesn't
make sense if "Let's not annoy them at all." is the
best option anyway.


>> The current XHTML DTD says that URIs are CDATA
>
>Sure, the details are specified in the prose, the DTD
>only uses an entity name %URI;  It could also use %FOO;
>or %IRI; as name.  Likewise the RFC 2396 in the DTD is
>only a comment.  

Exactly. Validation means validation according to the DTD,
and parameter entity names and comments don't affect
this process.

Regards,    Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@...    



Re: IRIs in href

by Frank Ellermann


Martin Duerst wrote:

> they'd only clarify what they meant when they recommended,
> in http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1,
> to convert URIs with non-ASCII characters to UTF-8 and
> then to use percent-encoding

The given example states that it's <strong>illegal</strong>;
after that it explains a best-guess implementation, clearly
written before you published RFC 3987.  It doesn't address
IDNs, IDNs didn't exist in 1999.  When browsers try to guess
what broken URIs mean they could run into the recent flood
of "XP with IE7" security issues.

>> for incompatible modifications we need new document types.
 
> The new document type would not at all differ in functionality
> from the old one. The only changes might be comments and
> the names of parameter entities, but as with programs, that
> doesn't change the functionality at all.

It's a "formally valid" experiment with unencoded IRIs in links,
that can be (legally) relevant for accessibility.  It might also
help for an RFC 3987 implementation and interoperability report,
"it's illegal but it works" wouldn't be convincing (of course
there would still be Atom and XMPP if all else fails).  Maybe
somebody creates a corresponding schema that allows checking IRI
syntax.
 
> %D0%B8%D1%81%D0%BF%D1%8B%D1%82%D0%B0%D0%BD%D0%B8%D0%B5
> would be just garbage for them.

The W3C validator apparently hates this in a system identifier:
http://hmdmhdfmhdjmzdtjmzdtzktdkztdjz.googlepages.com/IDN-XML-test.htm
(sorry, I can't read your unencoded ISO-2022-JP examples at the
 moment)

> Formally valid means valid according to the DTD, I guess.

No, I meant the prose 2396 specification of URI, not CDATA.

> Temporary accessibility issues are issues of the kind "The
> current screen readers/audio browsers/... only support foo,
> so in order to be accessible, use foo, not bar". Once the
> technology has caught up (and accessibility technology
> improves in the same way other technology improves), such
> a requirement may no longer apply.

"Temporary" can be a rather long time, RFC 2277 talks about
50 years wrt UTF-8.  Worldwide upgrades take some time.  The
"real" IDN TLD test started less than four weeks ago, and on
another list you argued that not much will happen before real
IDN TLDs are introduced.

> For some tests, please see
> http://www.sw.it.aoyama.ac.jp/2005/iritest/,

Thanks, Firefox 2.0.0.9 fails already in the Latin-1 "Bücher"
test, of course it works for UTF-8.  

What I had in mind would be minimally harder, using "Bücher"
in an unencoded IDN on a Web page using a legacy charset.
Obviously I can forget this for now, if it doesn't work in
an <ipath> it also won't work in an <ihost>.
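
To spell out what I mean by "legacy charset", a small Python
sketch.  It does not reproduce what Firefox does internally, it
only shows why the page encoding can matter for an <ipath>, and
what the corresponding <ihost> conversion looks like.  The host
"bücher.example" is made up for the illustration.

```python
from urllib.parse import quote

name = "Bücher"

# RFC 3987 maps an IRI to a URI via UTF-8, whatever charset the page uses.
print(quote(name, encoding="utf-8"))

# A client that percent-encodes the raw octets of a Latin-1 page
# instead produces a different URI, which the server may not recognise.
print(quote(name, encoding="iso-8859-1"))

# For the host part, IDNA (here Python's built-in IDNA 2003 codec)
# yields an ASCII label instead of percent-encoding.
print("bücher.example".encode("idna"))
```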

> URIs/IRIs are supposed to be very flexible.

Actually I'm lost with LEIRIs, HRRIs, options that allow
unencoded ASCII characters in IRIs which URIs do not
permit, and the recent discussion about allowing unencoded
square brackets outside of <IP-literal>.  

With URIs it's clear, if they're valid they must match the
generic STD 66 syntax.  No unencoded spaces etc.

> If somebody came along tomorrow with a very great idea
> for an extension to the URI syntax, and the community
> agreed with that extension, even if it wouldn't fit the
> current syntax definition, then this would lead to an
> update of the URI spec.

There's no "updates RFC 3986" in the URI template draft.

> If you want to create some software that tries to spot
> potential mistakes in an HTML document, I'd guess you'd
> surely flag something like <a href='htpp://www.w3.org'...

Flag and warn yes, but it's no STD 66 syntax error.  The
tool could restrict schemes to registered schemes and allow
additional unregistered schemes to be configured.
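
Roughly what I have in mind, as a Python sketch.  The function
name and the tiny scheme list are invented for the example, a
real tool would load the IANA registry instead.

```python
from urllib.parse import urlsplit

# Illustrative subset only; a real tool would load the IANA registry.
REGISTERED = {"http", "https", "ftp", "mailto", "news", "urn"}

def scheme_warning(ref, extra_schemes=frozenset()):
    """Return a warning string for an unknown scheme, else None."""
    scheme = urlsplit(ref).scheme.lower()
    if not scheme:
        return None  # relative reference, nothing to check
    if scheme in REGISTERED or scheme in extra_schemes:
        return None
    return f"unknown scheme '{scheme}:' in {ref!r}"

print(scheme_warning("htpp://www.w3.org"))
print(scheme_warning("htpp://www.w3.org", extra_schemes={"htpp"}))
```

That's a warning, not an error: the reference stays syntactically
valid either way.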

> example, consider the following:
>   <img src='http://example.org/top.html'>
> Again, this clearly looks like a mistake

The "top.html" is just a name, admittedly a bad name if
the resource is something that can be displayed as image.
A legal URI.  OTOH "bücher.html" isn't a legal URI.

> Again, <img src='mailto:abc@...'> looks like
> nonsense, but again, it may make sense in the future.

"Syntactically valid" isn't the same as "makes sense",
I think we don't disagree about this.  Where we might
disagree is about "syntactically invalid".  Browsers
are forced to make sense out of (some kinds of) garbage,
but a syntax check is supposed to report syntax errors.
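
A crude Python sketch of the kind of check I mean.  It only tests
the character repertoire of STD 66 (unreserved, reserved, and
percent-encoded octets), not the full grammar, so passing it is
necessary but not sufficient.  The function name is made up.

```python
import re

# Characters RFC 3986 allows somewhere in a URI reference:
# unreserved, reserved (gen-delims and sub-delims), and
# "%" only as the start of a percent-encoded octet.
_URI_CHARS = re.compile(
    r"(?:[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]|%[0-9A-Fa-f]{2})*"
)

def has_only_uri_characters(ref):
    """Necessary, not sufficient: checks the repertoire, not the grammar."""
    return _URI_CHARS.fullmatch(ref) is not None

for ref in ["top.html", "bücher.html", "b%C3%BCcher.html", "a b.html"]:
    print(ref, has_only_uri_characters(ref))
```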

> if it took you several months to figure out why you
> need octet 128 rather than NCR &#128;, then at least
> at that point in time, you didn't really know much
> about the fundamentals of Web internationalization.

In 2001 I knew _nothing_ about it, I was armed with a
Netscape 2.02 not supporting UTF-8 and treating &#128;
as Euro, an O'Reilly book with "XHTML" in its title
published in 2000, the W3C validator for online syntax
checks, and a box with local codepages "850" + 437.

> Well, this is a circular argument. "Let's annoy users
> now so that we don't need to annoy them later." doesn't
> make sense if "Let's not annoy them at all." is the
> best option anyway.

"Let's not annoy them at all" won't fly if the STD 66
syntax is checked later.  It was good when the validator
finally (2001-09-13) informed me that &#128; is crap, it
would have been better if it had done that a few months
earlier.  

 Frank



Re: IRIs in href

by Frank Ellermann


Martin Dürst wrote:
 
> For some tests, please see
> http://www.sw.it.aoyama.ac.jp/2005/iritest/,
> in particular the "Legacy Human" section at
> http://www.sw.it.aoyama.ac.jp/2005/iritest/HTML/index.html.

Update:  While Firefox 2 fails for the Latin-1 test case in this test suite at
http://www.sw.it.aoyama.ac.jp/2005/iritest/HTML/UTF-8check/iso-8859-1_deB/a_href/index.html
it works fine for the two IRIs (Cyrillic and Greek) on my KOI8-R test page
http://hmdmhdfmhdjmzdtjmzdtzktdkztdjz.googlepages.com/IDN-IRI-koi8-r.html

 Frank