Case insensitive mimetype matching edge case

I got a bug report[1] recently about a problem with case insensitive
globs.

The spec has this to say:
  Applications MUST first try a case-sensitive match, then try again with
  the filename converted to lower-case if that fails. This is so that
  main.C will be seen as a C++ file, but IMAGE.GIF will still use the
  *.gif pattern.

Now, take the case of foo.ps.gz, foo.PS.GZ, and foo.PS.gz.
The first matches .ps.gz and .gz case sensitively, picking the longer
match. The second one does the same, but case insensitively.
The third one however, matches only .gz case sensitively, so picks
that instead of trying an insensitive match...

Not sure how to handle this. Maybe we should always do the insensitive
match and handle conflicts like we do multiple globs with the same
extension?

[1] http://bugzilla.gnome.org/show_bug.cgi?id=575010
_______________________________________________
xdg mailing list
xdg@...
http://lists.freedesktop.org/mailman/listinfo/xdg

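To make the edge case concrete, here is a minimal C sketch of the two-step lookup as the quoted spec text describes it. The table, the struct and the function names are invented for illustration, only "*suffix" patterns are modelled, and this is not the xdgmime or KDE code.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature glob table; only "*suffix" patterns are modelled. */
struct demo_glob {
    const char *suffix;     /* pattern without the leading '*' */
    const char *mime_type;
};

static const struct demo_glob demo_globs[] = {
    { ".ps.gz", "application/x-gzpostscript" },
    { ".gz",    "application/x-gzip" },
    { ".gif",   "image/gif" },
    { ".C",     "text/x-c++src" },
    { ".c",     "text/x-csrc" },
};

/* Return the MIME type of the longest suffix that matches name exactly
 * (case-sensitively), or NULL when nothing matches. */
static const char *demo_longest_match(const char *name)
{
    const char *best = NULL;
    size_t best_len = 0, name_len = strlen(name), i;

    for (i = 0; i < sizeof demo_globs / sizeof demo_globs[0]; i++) {
        size_t len = strlen(demo_globs[i].suffix);
        if (len <= name_len && len > best_len &&
            strcmp(name + name_len - len, demo_globs[i].suffix) == 0) {
            best = demo_globs[i].mime_type;
            best_len = len;
        }
    }
    return best;
}

/* Step 1: case-sensitive match. Step 2, only if step 1 found nothing:
 * lower-case the filename and try again. */
static const char *demo_two_step_lookup(const char *name)
{
    char lower[256];
    size_t i, n = strlen(name);
    const char *type = demo_longest_match(name);

    if (type != NULL || n >= sizeof lower)
        return type;
    for (i = 0; i <= n; i++)            /* copies the terminating NUL too */
        lower[i] = (char)tolower((unsigned char)name[i]);
    return demo_longest_match(lower);
}
```

With this table, foo.PS.gz stops in step 1 because *.gz already matches, so the longer *.ps.gz pattern is never tried against the lower-cased name.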
Re: Case insensitive mimetype matching edge case

On Wednesday 20 May 2009, Alexander Larsson wrote:
> I got a bug report[1] recently about a problem with case insensitive
> globs.
>
> The spec has this to say:
> Applications MUST first try a case-sensitive match, then try again with
> the filename converted to lower-case if that fails. This is so that
> main.C will be seen as a C++ file, but IMAGE.GIF will still use the
> *.gif pattern.
>
> Now, take the case of foo.ps.gz, foo.PS.GZ, and foo.PS.gz.
> The first matches .ps.gz and .gz case sensitively, picking the longer
> match. The second one does the same, but case insensitively.
> The third one however, matches only .gz case sensitively, so picks
> that instead of trying an insensitive match...

Ouch, indeed.

$ kmimetypefinder foo.ps.gz
application/x-gzpostscript
$ kmimetypefinder foo.PS.gz
application/x-gzip

> Not sure how to handle this. Maybe we should always do the insensitive
> match

What was the use case for case-sensitive matches?
Looking at the globs file... Ah, I see one use case for it:
*.C (text/x-c++src) vs *.c (text/x-csrc).

> and handle conflicts like we do multiple globs with the same
> extension?

I don't think this would work. What we do is:
" If several globs match, and sniffing gives a result we do:
"   if sniffed prio >= 80, use sniffed type
"   for glob_match in glob_matches:
"     if glob_match is subclass or equal to sniffed_type, use glob_match
"
" If several globs match, and sniffing fails, or doesn't help:
"   fall back to the first glob match

But we lose the information that *.C matched and not *.c, therefore we are
looking at a random c-or-c++ file so adjusting glob priorities wouldn't help,
and sniffing doesn't help with such files either.

Seems to me that we should instead introduce an attribute for
case-sensitivity: <glob pattern="*.C" case-sensitive="true"/>
and do everything else case-insensitively.

--
David Faure, faure@..., sponsored by Qt Software @ Nokia to work on KDE,
Konqueror (http://www.konqueror.org), and KOffice (http://www.koffice.org).

Re: Case insensitive mimetype matching edge case

On Wed, 2009-05-20 at 15:20 +0200, Alexander Larsson wrote:
> I got a bug report[1] recently about a problem with case insensitive
> globs.
>
> The spec has this to say:
> Applications MUST first try a case-sensitive match, then try again with
> the filename converted to lower-case if that fails. This is so that
> main.C will be seen as a C++ file, but IMAGE.GIF will still use the
> *.gif pattern.
>
> Now, take the case of foo.ps.gz, foo.PS.GZ, and foo.PS.gz.
> The first matches .ps.gz and .gz case sensitively, picking the longer
> match. The second one does the same, but case insensitively.
> The third one however, matches only .gz case sensitively, so picks
> that instead of trying an insensitive match...
>
> Not sure how to handle this. Maybe we should always do the insensitive
> match and handle conflicts like we do multiple globs with the same
> extension?

Instead of doing the case-sensitive match first, couldn't we do it
last as a tie-breaker between matches of equal length?

Round 1:
  lowercase(foo.C) matches lowercase(*.C)
  lowercase(foo.C) matches lowercase(*.c)

We have two matches of equal length, so

Round 2:
  foo.C matches *.C
  foo.C does not match *.c

So *.C wins. This would require doing the case-insensitive match
by lowercasing both the filename and the pattern. I don't know
if that causes any problems.

It would also mean that a match for *.ext.c would win over a
match for *.C. But since *.C seems to be the only use case
we have, I'm not sure that's an issue.

--
Shaun

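The tie-breaker idea above can be sketched in C as follows. This is only one reading of the proposal: the table and all names are hypothetical, only "*suffix" patterns are handled, and the folded comparison stands in for "lowercasing both the filename and the pattern".

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table for illustration; only "*suffix" globs are modelled. */
struct tb_glob {
    const char *suffix;     /* pattern without the leading '*' */
    const char *mime_type;
};

static const struct tb_glob tb_globs[] = {
    { ".c",     "text/x-csrc" },
    { ".C",     "text/x-c++src" },
    { ".gz",    "application/x-gzip" },
    { ".ps.gz", "application/x-gzpostscript" },
};

/* Does name end with suffix? Case is folded when fold is non-zero. */
static int tb_ends_with(const char *name, const char *suffix, int fold)
{
    size_t n = strlen(name), s = strlen(suffix), i;

    if (s > n)
        return 0;
    name += n - s;
    for (i = 0; i < s; i++) {
        unsigned char a = (unsigned char)name[i];
        unsigned char b = (unsigned char)suffix[i];
        if (fold ? tolower(a) != tolower(b) : a != b)
            return 0;
    }
    return 1;
}

/* Round 1: keep the longest case-folded match.
 * Round 2: between matches of equal length, prefer the exact-case one. */
static const char *tb_lookup(const char *name)
{
    const struct tb_glob *best = NULL;
    size_t best_len = 0, i;

    for (i = 0; i < sizeof tb_globs / sizeof tb_globs[0]; i++) {
        const struct tb_glob *g = &tb_globs[i];
        size_t len = strlen(g->suffix);

        if (!tb_ends_with(name, g->suffix, 1))
            continue;
        if (best == NULL || len > best_len ||
            (len == best_len && tb_ends_with(name, g->suffix, 0) &&
             !tb_ends_with(name, best->suffix, 0))) {
            best = g;
            best_len = len;
        }
    }
    return best != NULL ? best->mime_type : NULL;
}
```

Because length is compared before case, foo.PS.gz now reaches *.ps.gz, while foo.C and foo.c are still told apart in round 2.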
Re: Case insensitive mimetype matching edge case

On Wednesday 05 August 2009, David Faure wrote:
> Seems to me that we should instead introduce an attribute for
> case-sensitivity: <glob pattern="*.C" case-sensitive="true"/>
> and do everything else case-insensitively.

This would also fix bug https://bugs.freedesktop.org/show_bug.cgi?id=22634
because we could say that
<glob pattern="core"/>
should be case-sensitive="true" as well, so that it doesn't match files named
"Core", which are definitely no core dumps.

Any reason against case-sensitive="true"?

--
David Faure, faure@..., sponsored by Qt Software @ Nokia to work on KDE,
Konqueror (http://www.konqueror.org), and KOffice (http://www.koffice.org).

Re: Case insensitive mimetype matching edge case

On Tue, 2009-08-18 at 15:25 +0200, David Faure wrote:
> On Wednesday 05 August 2009, David Faure wrote:
> > Seems to me that we should instead introduce an attribute for
> > case-sensitivity: <glob pattern="*.C" case-sensitive="true"/>
> > and do everything else case-insensitively.
>
> This would also fix bug https://bugs.freedesktop.org/show_bug.cgi?id=22634
> because we could say that
> <glob pattern="core"/>
> should be case-sensitive="true" as well, so that it doesn't match files named
> "Core", which are definitely no core dumps.
>
> Any reason against case-sensitive="true"?

Probably not, except status quo. Feel free to fix.

Cheers

Re: Case insensitive mimetype matching edge case

On Tuesday 18 August 2009, Bastien Nocera wrote:
> On Tue, 2009-08-18 at 15:25 +0200, David Faure wrote:
> > On Wednesday 05 August 2009, David Faure wrote:
> > > Seems to me that we should instead introduce an attribute for
> > > case-sensitivity: <glob pattern="*.C" case-sensitive="true"/>
> > > and do everything else case-insensitively.
> >
> > This would also fix bug
> > https://bugs.freedesktop.org/show_bug.cgi?id=22634 because we could say
> > that
> > <glob pattern="core"/>
> > should be case-sensitive="true" as well, so that it doesn't match files
> > named "Core", which are definitely no core dumps.
> >
> > Any reason against case-sensitive="true"?
>
> Probably not, except status quo. Feel free to fix.

OK, I'm starting to work on this.

One problem is again the generated globs2 file. It has to contain this
attribute too, in some form, but any change in the format of this file breaks
existing parsers.
So either we need a globs3 file (*), or we need a hack like using comments:
# case-sensitive=true
50:text/x-c++src:*.C

(*) if we go for a globs3 file instead, adding another 26K of bloat even on
small devices :-), I suggest we make this more extensible for future changes
so that we don't need a globs4... We could specify that parsers should ignore
everything after the last ":" they know about. That is, they should be ready
for
50:text/x-c++src:*.C:[anything here]
where they expect
50:text/x-c++src:*.C

I guess for now the globs3 format would be
50:text/x-c++src:*.C:cs
(cs == case sensitive) and
50:text/x-csrc:*.c

but which could later be extended to
50:text/x-c++src:*.C:cs:[future extension]
50:text/x-csrc:*.c::[future extension]
(note the empty "cs" field)

What do you think? Extensible format in globs3, comment-based hack
(probably a bit ugly to implement), or do you have another idea?

--
David Faure, faure@..., sponsored by Qt Software @ Nokia to work on KDE,
Konqueror (http://www.konqueror.org), and KOffice (http://www.koffice.org).

Re: Case insensitive mimetype matching edge case

On Tue, 2009-08-18 at 19:42 +0200, David Faure wrote:
> On Tuesday 18 August 2009, Bastien Nocera wrote:
> > On Tue, 2009-08-18 at 15:25 +0200, David Faure wrote:
> > > On Wednesday 05 August 2009, David Faure wrote:
> > > > Seems to me that we should instead introduce an attribute for
> > > > case-sensitivity: <glob pattern="*.C" case-sensitive="true"/>
> > > > and do everything else case-insensitively.
> > >
> > > This would also fix bug
> > > https://bugs.freedesktop.org/show_bug.cgi?id=22634 because we could say
> > > that
> > > <glob pattern="core"/>
> > > should be case-sensitive="true" as well, so that it doesn't match files
> > > named "Core", which are definitely no core dumps.
> > >
> > > Any reason against case-sensitive="true"?
> >
> > Probably not, except status quo. Feel free to fix.
>
> OK, I'm starting to work on this.
>
> One problem is again the generated globs2 file. It has to contain this
> attribute too, in some form, but any change in the format of this file breaks
> existing parsers.
> So either we need a globs3 file (*), or we need a hack like using comments:
> # case-sensitive=true
> 50:text/x-c++src:*.C
>
> (*) if we go for a globs3 file instead, adding another 26K of bloat even on
> small devices :-), I suggest we make this more extensible for future changes
> so that we don't need a globs4... We could specify that parsers should ignore
> everything after the last ":" they know about. That is, they should be ready
> for
> 50:text/x-c++src:*.C:[anything here]
> where they expect
> 50:text/x-c++src:*.C
>
> I guess for now the globs3 format would be
> 50:text/x-c++src:*.C:cs
> (cs == case sensitive) and
> 50:text/x-csrc:*.c
>
> but which could later be extended to
> 50:text/x-c++src:*.C:cs:[future extension]
> 50:text/x-csrc:*.c::[future extension]
> (note the empty "cs" field)
>
> What do you think? Extensible format in globs3, comment-based hack
> (probably a bit ugly to implement), or do you have another idea?

Ugh. Additionally we have to extend the mime.cache format more. Maybe we
can solve this with a hack. What about this:

All case insensitive globs are converted to lower case in the globs
file. Glob lookup is done by first matching the real filename against
the globs, then (on failure) convert the name to lower case and try
again. This will result in a case insensitive match except for things
marked as case sensitive that have at least one uppercase character.

We can't do case-sensitive matching of only-lowercase globs, but we
don't currently have any example of this in the databases.

Re: Case insensitive mimetype matching edge case

On Wednesday 19 August 2009, Alexander Larsson wrote:
> Ugh. Additionally we have to extend the mime.cache format more. Maybe we
> can solve this with a hack. What about this:
>
> All case insensitive globs are converted to lower case in the globs
> file. Glob lookup is done by first matching the real filename against
> the globs, then (on failure) convert the name to lower case and try
> again. This will result in a case insensitive match except for things
> marked as case sensitive that have at least one uppercase character.
>
> We can't do case-sensitive matching of only-lowercase globs, but we
> don't currently have any example of this in the databases.

But I do want to do one of those, to solve bug 22634: I want
<glob pattern="core"/> to be case-sensitive="true".

How about a different hack:
we generate in globs2 two lines, in case of case-sensitive:
50:text/x-c++src:*.C
50:text/x-c++src:*.C:cs
Old parsers will create an entry for "*.C:cs", which will probably never match
any real file, so no big deal, while new parsers will take the second line as
an indication that the *.C glob (parsed one line above) should be understood
to be case sensitive.

--
David Faure, faure@..., sponsored by Qt Software @ Nokia to work on KDE,
Konqueror (http://www.konqueror.org), and KOffice (http://www.koffice.org).

Re: Case insensitive mimetype matching edge case

On Wed, 2009-08-19 at 10:02 +0200, David Faure wrote:
> On Wednesday 19 August 2009, Alexander Larsson wrote:
> > Ugh. Additionally we have to extend the mime.cache format more. Maybe we
> > can solve this with a hack. What about this:
> >
> > All case insensitive globs are converted to lower case in the globs
> > file. Glob lookup is done by first matching the real filename against
> > the globs, then (on failure) convert the name to lower case and try
> > again. This will result in a case insensitive match except for things
> > marked as case sensitive that have at least one uppercase character.
> >
> > We can't do case-sensitive matching of only-lowercase globs, but we
> > don't currently have any example of this in the databases.
>
> But I do want to do one of those, to solve bug 22634: I want
> <glob pattern="core"/> to be case-sensitive="true".
>
> How about a different hack:
> we generate in globs2 two lines, in case of case-sensitive:
> 50:text/x-c++src:*.C
> 50:text/x-c++src:*.C:cs
> Old parsers will create an entry for "*.C:cs", which will probably never match
> any real file, so no big deal, while new parsers will take the second line as
> an indication that the *.C glob (parsed one line above) should be understood
> to be case sensitive.

Hmmm. I like this one. Sounds good to me. But let's make it extensible
when we're doing it, i.e. have a comma-separated list of flags with
"cs" being one known one. Unknown flags are ignored, anything after
another : is ignored.

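A rough C sketch of how a new-style parser could read one globs2 line with the optional flags field, following the rules proposed above (comma-separated flags, unknown flags ignored, anything after a further ":" ignored). The struct, the function name and the GLOB_FLAG_CS constant are assumptions made for this example. Merging a flagged line with the unflagged line written just before it is left out.

```c
#include <stdlib.h>
#include <string.h>

#define GLOB_FLAG_CS 0x1u   /* hypothetical bit for the "cs" flag */

struct globs2_line {
    int weight;
    char mime_type[128];
    char pattern[128];
    unsigned flags;
};

/* Parse "weight:mime/type:pattern[:flag,flag...[:ignored...]]".
 * Returns 0 on success, -1 for comments and malformed lines. */
static int parse_globs2_line(const char *line, struct globs2_line *out)
{
    char buf[512];
    char *fields[4] = { NULL, NULL, NULL, NULL };
    char *p, *flag;
    int n = 0;

    if (line[0] == '#' || strlen(line) >= sizeof buf)
        return -1;
    strcpy(buf, line);
    buf[strcspn(buf, "\n")] = '\0';

    /* Split on ':'. Fields beyond the fourth are deliberately dropped. */
    for (p = buf; p != NULL && n < 4; n++) {
        fields[n] = p;
        p = strchr(p, ':');
        if (p != NULL)
            *p++ = '\0';
    }
    if (n < 3 || strlen(fields[1]) >= sizeof out->mime_type ||
        strlen(fields[2]) >= sizeof out->pattern)
        return -1;

    out->weight = atoi(fields[0]);
    strcpy(out->mime_type, fields[1]);
    strcpy(out->pattern, fields[2]);
    out->flags = 0;

    if (fields[3] != NULL) {
        for (flag = strtok(fields[3], ","); flag != NULL;
             flag = strtok(NULL, ",")) {
            if (strcmp(flag, "cs") == 0)
                out->flags |= GLOB_FLAG_CS;
            /* any other flag is unknown and skipped */
        }
    }
    return 0;
}
```

An old-style parser that only knows three fields would instead treat everything after the second ":" as the pattern, which is why the unflagged line is still written first.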
Re: Case insensitive mimetype matching edge case

On Wednesday 19 August 2009, Alexander Larsson wrote:
> On Wed, 2009-08-19 at 10:02 +0200, David Faure wrote:
> > On Wednesday 19 August 2009, Alexander Larsson wrote:
> > > Ugh. Additionally we have to extend the mime.cache format more. Maybe
> > > we can solve this with a hack. What about this:
> > >
> > > All case insensitive globs are converted to lower case in the globs
> > > file. Glob lookup is done by first matching the real filename against
> > > the globs, then (on failure) convert the name to lower case and try
> > > again. This will result in a case insensitive match except for things
> > > marked as case sensitive that have at least one uppercase character.
> > >
> > > We can't do case-sensitive matching of only-lowercase globs, but we
> > > don't currently have any example of this in the databases.
> >
> > But I do want to do one of those, to solve bug 22634: I want
> > <glob pattern="core"/> to be case-sensitive="true".
> >
> > How about a different hack:
> > we generate in globs2 two lines, in case of case-sensitive:
> > 50:text/x-c++src:*.C
> > 50:text/x-c++src:*.C:cs
> > Old parsers will create an entry for "*.C:cs", which will probably never
> > match any real file, so no big deal, while new parsers will take the
> > second line as an indication that the *.C glob (parsed one line above)
> > should be understood to be case sensitive.
>
> Hmmm. I like this one. Sounds good to me. But let's make it extensible
> when we're doing it, i.e. have a comma-separated list of flags with
> "cs" being one known one. Unknown flags are ignored, anything after
> another : is ignored.

Good idea.
I made the changes in the spec, in the definition of the two mimetypes,
and in update-mime-database.c (for parsing, and globs2 generation).
Please find patch attached (I can commit if you're ok with it).

I included a suggested format change for the mime.cache file, but I'll
have to let you implement that part, I don't know all the details about the
suffix tree etc. Same for the xdgmime implementation.

I like that this is going to improve performance, too: no need to do the
two-step glob matching anymore (case insensitive + case sensitive), it will
now be one -or- the other, for a given glob.

--
David Faure, faure@..., sponsored by Qt Software @ Nokia to work on KDE,
Konqueror (http://www.konqueror.org), and KOffice (http://www.koffice.org).

[shared-mime-info-case-sensitive.diff]
Index: ChangeLog
===================================================================
RCS file: /cvs/mime/shared-mime-info/ChangeLog,v
retrieving revision 1.551
diff -u -p -r1.551 ChangeLog
--- ChangeLog	31 Jul 2009 15:15:36 -0000	1.551
+++ ChangeLog	19 Aug 2009 19:51:40 -0000
@@ -1,3 +1,16 @@
+2009-08-19  David Faure  <faure@...>
+
+	* shared-mime-info-spec.xml: Define new case-sensitive attribute, and
+	change the text from the two-step glob match to a single glob match,
+	that is either case sensitive or case insensitive. Add flags field to
+	globs2 file, define the "cs" flag.
+
+	* freedesktop.org.xml.in: Add case-sensitive="true" to the globs *.C,
+	and core (Closes: #22634)
+
+	* update-mime-database.c (process_freedesktop_node)
+	(write_out_glob2): Process case-sensitive flag, write it out in globs2.
+
 2009-07-31  Bastien Nocera  <hadess@...>
 
 	* tests/list: Fix test suite for those pesky PS files
Index: freedesktop.org.xml.in
===================================================================
RCS file: /cvs/mime/shared-mime-info/freedesktop.org.xml.in,v
retrieving revision 1.416
diff -u -p -r1.416 freedesktop.org.xml.in
--- freedesktop.org.xml.in	31 Jul 2009 15:13:45 -0000	1.416
+++ freedesktop.org.xml.in	19 Aug 2009 19:51:40 -0000
@@ -26,6 +26,7 @@
   <!ELEMENT glob EMPTY>
   <!ATTLIST glob pattern CDATA #REQUIRED>
   <!ATTLIST glob weight CDATA #IMPLIED>
+  <!ATTLIST glob case-sensitive CDATA #IMPLIED>
 
   <!ELEMENT magic (match)+>
   <!ATTLIST magic priority CDATA #IMPLIED>
@@ -1268,7 +1269,7 @@ command to generate the output files.
       <match type="string" value="Core\001" offset="0"/>
       <match type="string" value="Core\002" offset="0"/>
     </magic>
-    <glob pattern="core"/>
+    <glob pattern="core" case-sensitive="true"/>
   </mime-type>
   <mime-type type="application/x-cpio">
     <_comment>CPIO archive</_comment>
@@ -4292,7 +4293,7 @@ command to generate the output files.
     <glob pattern="*.cpp"/>
     <glob pattern="*.cxx"/>
     <glob pattern="*.cc"/>
-    <glob pattern="*.C"/>
+    <glob pattern="*.C" case-sensitive="true"/>
     <glob pattern="*.c++"/>
   </mime-type>
   <mime-type type="text/x-changelog">
Index: shared-mime-info-spec.xml
===================================================================
RCS file: /cvs/mime/shared-mime-info/shared-mime-info-spec.xml,v
retrieving revision 1.67
diff -u -p -r1.67 shared-mime-info-spec.xml
--- shared-mime-info-spec.xml	5 Aug 2009 13:08:17 -0000	1.67
+++ shared-mime-info-spec.xml	19 Aug 2009 19:51:40 -0000
@@ -456,6 +456,7 @@ For example:
 ...
 55:text/x-diff:*.patch
 50:text/x-diff:*.diff
+50:text/x-c++src:*.C:cs
 ...
 ]]></programlisting>
 			</para>
@@ -474,8 +475,8 @@ text/x-diff:*.diff
 ]]></programlisting>
 			</para>
 			<para>
-Applications MUST first try a case-sensitive match, then try again with the
-filename converted to lower-case if that fails.
+Applications MUST match globs case-insensitively, except when the case-sensitive
+attribute is set to true.
 This is so that <filename>main.C</filename> will be seen as a C++ file,
 but <filename>IMAGE.GIF</filename> will still use the *.gif pattern.
 			</para>
@@ -531,6 +532,16 @@ Group's package, which MUST be required
 specification. Since each application will then only be providing information
 about its own types, conflicts should be rare.
 			</para>
+			<para>
+The fourth field ("cs" in the first globs2 example) contains a list of comma-separated flags.
+The flags currently defined are: cs (for case-sensitive). Implementations should ignore
+unknown flags.
+			</para>
+			<para>
+Implementations should also ignore further fields, so that the syntax of the globs2 file
+can be extended in the future. Example: "50:text/x-c++src:*.C:cs,newflag:newfeature:somethingelse"
+should currently be parsed as "50:text/x-c++src:*.C:cs".
+			</para>
 		</sect2>
 		<sect2>
 			<title>The magic files</title>
@@ -717,21 +728,23 @@ Parents:
 
 LiteralList:
 4    CARD32    N_LITERALS
-12*N_LITERALS    LiteralEntry
+16*N_LITERALS    LiteralEntry
 
 LiteralEntry:
 4    CARD32    LITERAL_OFFSET
 4    CARD32    MIME_TYPE_OFFSET
 4    CARD32    WEIGHT
+4    CARD32    FLAGS (0x1: case-sensitive)
 
 GlobList:
 4    CARD32    N_GLOBS
-12*N_GLOBS    GlobEntry
+16*N_GLOBS    GlobEntry
 
 GlobEntry:
 4    CARD32    GLOB_OFFSET
 4    CARD32    MIME_TYPE_OFFSET
 4    CARD32    WEIGHT
+4    CARD32    FLAGS (0x1: case-sensitive)
 
 ReverseSuffixTree:
 4    CARD32    N_ROOTS
Index: update-mime-database.c
===================================================================
RCS file: /cvs/mime/shared-mime-info/update-mime-database.c,v
retrieving revision 1.51
diff -u -p -r1.51 update-mime-database.c
--- update-mime-database.c	20 Apr 2009 16:45:43 -0000	1.51
+++ update-mime-database.c	19 Aug 2009 19:51:40 -0000
@@ -94,6 +94,7 @@ struct _Glob {
 	char *pattern;
 	Type *type;
 	gboolean noglob;
+	gboolean case_sensitive;
 };
 
 struct _Magic {
@@ -307,6 +308,24 @@ static int get_weight(xmlNode *node)
 	return get_int_attribute (node, "weight");
 }
 
+/* Return the value of a false/true attribute, which defaults to false.
+ * Returns 0 or 1.
+ */
+static gboolean get_boolean_attribute(xmlNode *node, const char* name)
+{
+	char *attr;
+	attr = my_xmlGetNsProp(node, name, NULL);
+	if (attr)
+	{
+		if (strcmp (attr, "true") == 0)
+		{
+			return TRUE;
+		}
+		xmlFree(attr);
+	}
+	return FALSE;
+}
+
 /* Process a <root-XML> element by adding a rule to namespace_hash */
 static void add_namespace(Type *type, const char *namespaceURI,
 			  const char *localName, GError **error)
@@ -363,8 +382,10 @@ static gboolean process_freedesktop_node
 	{
 		gchar *pattern;
 		gint weight;
+		gboolean case_sensitive;
 
 		weight = get_weight(field);
+		case_sensitive = get_boolean_attribute(field, "case-sensitive");
 
 		if (weight == -1)
 		{
@@ -382,6 +403,7 @@ static gboolean process_freedesktop_node
 			glob->pattern = g_strdup (pattern);
 			glob->type = type;
 			glob->weight = weight;
+			glob->case_sensitive = case_sensitive;
 			list = g_list_append (list, glob);
 			g_hash_table_insert(globs_hash, g_strdup (glob->pattern), list);
 			xmlFree(pattern);
@@ -406,6 +428,7 @@ static gboolean process_freedesktop_node
 		glob->type = type;
 		glob->weight = 0;
 		glob->noglob = TRUE;
+		glob->case_sensitive = FALSE;
 		list = g_list_append (list, glob);
 		g_hash_table_insert(globs_hash, g_strdup (glob->pattern), list);
 		copy_to_xml = TRUE;
@@ -847,6 +870,7 @@ static void write_out_glob2(GList *globs
 	GList *list;
 	Glob *glob;
 	Type *type;
+	gboolean need_flags;
 
 	for (list = globs ; list; list = list->next) {
 		glob = (Glob *)list->data;
@@ -855,9 +879,22 @@ static void write_out_glob2(GList *globs
 			g_warning("* Glob patterns can't contain literal newlines "
 				  "(%s in type %s/%s)\n", glob->pattern,
 				  type->media, type->subtype);
-		else
+		else
+		{
+			/* Always write the line without the flags, for older parsers */
 			g_fprintf(stream, "%d:%s/%s:%s\n",
-				  glob->weight, type->media, type->subtype, glob->pattern);
+				  glob->weight, type->media, type->subtype, glob->pattern);
+
+			need_flags = FALSE;
+			if (glob->case_sensitive)
+				need_flags = TRUE;
+
+			if (need_flags) {
+				g_fprintf(stream, "%d:%s/%s:%s%s\n",
+					  glob->weight, type->media, type->subtype, glob->pattern,
+					  glob->case_sensitive ? ":cs" : "");
+			}
+		}
 	}
 }
 

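One detail in get_boolean_attribute in the patch above: the TRUE branch returns before xmlFree(attr) is reached, so the attribute string looks like it is leaked whenever the value is "true". A minimal sketch of the same logic that releases the string on both paths follows. It is written against plain malloc/free so that it stands alone; the function name is hypothetical, and in update-mime-database.c the string would come from my_xmlGetNsProp and be released with xmlFree.

```c
#include <stdlib.h>
#include <string.h>

/* Takes ownership of attr (may be NULL), in the way the patch owns the
 * string returned by my_xmlGetNsProp. free() stands in for xmlFree().
 * Returns 1 only for the exact value "true", 0 otherwise. */
static int boolean_from_owned_attr(char *attr)
{
    int value = 0;

    if (attr != NULL) {
        value = (strcmp(attr, "true") == 0);
        free(attr);             /* released whatever the outcome */
    }
    return value;
}
```

Computing the result first and returning once at the end keeps the cleanup in a single place.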
Re: Case insensitive mimetype matching edge case

On Wed, 2009-08-19 at 21:53 +0200, David Faure wrote:
> On Wednesday 19 August 2009, Alexander Larsson wrote:
> > On Wed, 2009-08-19 at 10:02 +0200, David Faure wrote:
> > > On Wednesday 19 August 2009, Alexander Larsson wrote:
> > > > Ugh. Additionally we have to extend the mime.cache format more. Maybe
> > > > we can solve this with a hack. What about this:
> > > >
> > > > All case insensitive globs are converted to lower case in the globs
> > > > file. Glob lookup is done by first matching the real filename against
> > > > the globs, then (on failure) convert the name to lower case and try
> > > > again. This will result in a case insensitive match except for things
> > > > marked as case sensitive that have at least one uppercase character.
> > > >
> > > > We can't do case-sensitive matching of only-lowercase globs, but we
> > > > don't currently have any example of this in the databases.
> > >
> > > But I do want to do one of those, to solve bug 22634: I want
> > > <glob pattern="core"/> to be case-sensitive="true".
> > >
> > > How about a different hack:
> > > we generate in globs2 two lines, in case of case-sensitive:
> > > 50:text/x-c++src:*.C
> > > 50:text/x-c++src:*.C:cs
> > > Old parsers will create an entry for "*.C:cs", which will probably never
> > > match any real file, so no big deal, while new parsers will take the
> > > second line as an indication that the *.C glob (parsed one line above)
> > > should be understood to be case sensitive.
> >
> > Hmmm. I like this one. Sounds good to me. But let's make it extensible
> > when we're doing it, i.e. have a comma-separated list of flags with
> > "cs" being one known one. Unknown flags are ignored, anything after
> > another : is ignored.
>
> Good idea.
> I made the changes in the spec, in the definition of the two mimetypes,
> and in update-mime-database.c (for parsing, and globs2 generation).
> Please find patch attached (I can commit if you're ok with it).
>
> I included a suggested format change for the mime.cache file, but I'll
> have to let you implement that part, I don't know all the details about the
> suffix tree etc. Same for the xdgmime implementation.
>
> I like that this is going to improve performance, too: no need to do the
> two-step glob matching anymore (case insensitive + case sensitive), it will
> now be one -or- the other, for a given glob.

Do you have the equivalent xdgmime changes as well?

Cheers

Re: Case insensitive mimetype matching edge case

On Wednesday 19 August 2009, Bastien Nocera wrote:
> On Wed, 2009-08-19 at 21:53 +0200, David Faure wrote:
> > On Wednesday 19 August 2009, Alexander Larsson wrote:
> > > On Wed, 2009-08-19 at 10:02 +0200, David Faure wrote:
> > > > On Wednesday 19 August 2009, Alexander Larsson wrote:
> > > > > Ugh. Additionally we have to extend the mime.cache format more.
> > > > > Maybe we can solve this with a hack. What about this:
> > > > >
> > > > > All case insensitive globs are converted to lower case in the globs
> > > > > file. Glob lookup is done by first matching the real filename
> > > > > against the globs, then (on failure) convert the name to lower case
> > > > > and try again. This will result in a case insensitive match except
> > > > > for things marked as case sensitive that have at least one uppercase
> > > > > character.
> > > > >
> > > > > We can't do case-sensitive matching of only-lowercase globs, but we
> > > > > don't currently have any example of this in the databases.
> > > >
> > > > But I do want to do one of those, to solve bug 22634: I want
> > > > <glob pattern="core"/> to be case-sensitive="true".
> > > >
> > > > How about a different hack:
> > > > we generate in globs2 two lines, in case of case-sensitive:
> > > > 50:text/x-c++src:*.C
> > > > 50:text/x-c++src:*.C:cs
> > > > Old parsers will create an entry for "*.C:cs", which will probably
> > > > never match any real file, so no big deal, while new parsers will
> > > > take the second line as an indication that the *.C glob (parsed one
> > > > line above) should be understood to be case sensitive.
> > >
> > > Hmmm. I like this one. Sounds good to me. But let's make it extensible
> > > when we're doing it, i.e. have a comma-separated list of flags with
> > > "cs" being one known one. Unknown flags are ignored, anything after
> > > another : is ignored.
> >
> > Good idea.
> > I made the changes in the spec, in the definition of the two mimetypes,
> > and in update-mime-database.c (for parsing, and globs2 generation).
> > Please find patch attached (I can commit if you're ok with it).
> >
> > I included a suggested format change for the mime.cache file, but
> > I'll have to let you implement that part, I don't know all the details
> > about the suffix tree etc. Same for the xdgmime implementation.
> >
> > I like that this is going to improve performance, too: no need to do the
> > two-step glob matching anymore (case insensitive + case sensitive), it
> > will now be one -or- the other, for a given glob.
>
> Do you have the equivalent xdgmime changes as well?

No, my C hacking is not that good (and I thought it was dependent on the
mime.cache change, but now I see it probably is not).
Can I let you do it? It should be "as simple as" parsing the fields and
splitting at commas, and replacing the two-step glob matching with a
single-step glob matching (either case sensitive or case insensitive).
I'm doing exactly that in the KDE implementation as we speak.

--
David Faure, faure@..., sponsored by Qt Software @ Nokia to work on KDE,
Konqueror (http://www.konqueror.org), and KOffice (http://www.koffice.org).

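The single-step matching described above could look roughly like this in C. This is a sketch under assumptions, not the xdgmime or KDE implementation: the struct and function names are made up, and only literal patterns and "*suffix" patterns are handled. Each glob is compared exactly once, case-sensitively or not depending on its own flag.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

struct cs_glob {
    const char *pattern;    /* a literal name, or '*' followed by a suffix */
    const char *mime_type;
    int case_sensitive;     /* mirrors the proposed "cs" flag */
};

/* Compare n bytes, folding case unless the glob is case sensitive. */
static int cs_equal_n(const char *a, const char *b, size_t n, int case_sensitive)
{
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned char x = (unsigned char)a[i];
        unsigned char y = (unsigned char)b[i];
        if (case_sensitive ? x != y : tolower(x) != tolower(y))
            return 0;
    }
    return 1;
}

/* One comparison per glob: no second pass over a lower-cased name. */
static int cs_glob_matches(const struct cs_glob *g, const char *name)
{
    size_t n = strlen(name), p = strlen(g->pattern);

    if (g->pattern[0] == '*') {
        size_t s = p - 1;   /* suffix length */
        return s <= n &&
               cs_equal_n(name + n - s, g->pattern + 1, s, g->case_sensitive);
    }
    return p == n && cs_equal_n(name, g->pattern, n, g->case_sensitive);
}

/* Return the MIME type of the longest matching pattern, or NULL. */
static const char *cs_lookup(const struct cs_glob *globs, size_t count,
                             const char *name)
{
    const char *best = NULL;
    size_t best_len = 0, i;

    for (i = 0; i < count; i++) {
        size_t len = strlen(globs[i].pattern);
        if (len > best_len && cs_glob_matches(&globs[i], name)) {
            best = globs[i].mime_type;
            best_len = len;
        }
    }
    return best;
}
```

With *.C, *.c and core marked case sensitive and *.gz and *.ps.gz left insensitive, foo.PS.gz resolves to the *.ps.gz type, main.C and main.c stay distinct, and a file named "Core" no longer matches the core glob.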

Re: Case insensitive mimetype matching edge case
On Wednesday 19 August 2009, David Faure wrote:
> I made the changes in the spec, in the definition of the two mimetypes,
> and in update-mime-database.c (for parsing, and globs2 generation).
> Please find patch attached (I can commit if you're ok with it).

I forgot one thing. For "foo.C" not to match the glob "*.c", we should make
that one case-sensitive as well.
Updated patch for freedesktop.org.xml.in attached.

--
David Faure, faure@..., sponsored by Qt Software @ Nokia to work on KDE,
Konqueror (http://www.konqueror.org), and KOffice (http://www.koffice.org).

[freedesktop.org.xml.in.diff]

Index: freedesktop.org.xml.in
===================================================================
RCS file: /cvs/mime/shared-mime-info/freedesktop.org.xml.in,v
retrieving revision 1.416
diff -u -p -r1.416 freedesktop.org.xml.in
--- freedesktop.org.xml.in 31 Jul 2009 15:13:45 -0000 1.416
+++ freedesktop.org.xml.in 19 Aug 2009 22:59:24 -0000
@@ -26,6 +26,7 @@
 <!ELEMENT glob EMPTY>
 <!ATTLIST glob pattern CDATA #REQUIRED>
 <!ATTLIST glob weight CDATA #IMPLIED>
+<!ATTLIST glob case-sensitive CDATA #IMPLIED>
 <!ELEMENT magic (match)+>
 <!ATTLIST magic priority CDATA #IMPLIED>
@@ -1268,7 +1269,7 @@ command to generate the output files.
       <match type="string" value="Core\001" offset="0"/>
       <match type="string" value="Core\002" offset="0"/>
     </magic>
-    <glob pattern="core"/>
+    <glob pattern="core" case-sensitive="true"/>
   </mime-type>
   <mime-type type="application/x-cpio">
     <_comment>CPIO archive</_comment>
@@ -4292,7 +4293,7 @@ command to generate the output files.
     <glob pattern="*.cpp"/>
     <glob pattern="*.cxx"/>
     <glob pattern="*.cc"/>
-    <glob pattern="*.C"/>
+    <glob pattern="*.C" case-sensitive="true"/>
     <glob pattern="*.c++"/>
   </mime-type>
   <mime-type type="text/x-changelog">
@@ -4331,7 +4332,7 @@ command to generate the output files.
     <_comment>C source code</_comment>
     <sub-class-of type="text/plain"/>
     <alias type="text/x-c"/>
-    <glob pattern="*.c"/>
+    <glob pattern="*.c" case-sensitive="true"/>
     <magic priority="30">
       <match type="string" value="/*" offset="0"/>
       <match type="string" value="//" offset="0"/>
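To illustrate what the patch above is meant to achieve, here is a rough sketch of single-step matching where each glob carries its own case-sensitivity flag. It is not the KDE or xdgmime code: the `GLOBS` table and `match_globs` are hypothetical, the MIME type shown for "core" is an assumption, and ties between several matches are resolved by pattern length only (weights and sniffing are left out).

```python
from fnmatch import fnmatchcase

# (pattern, mime type, case_sensitive); an illustrative subset only.
GLOBS = [
    ("*.C", "text/x-c++src", True),
    ("*.c", "text/x-csrc", True),
    ("core", "application/x-core", True),  # mime type assumed for illustration
    ("*.gif", "image/gif", False),
    ("*.gz", "application/x-gzip", False),
    ("*.ps.gz", "application/x-gzpostscript", False),
]


def match_globs(filename):
    """Return the mime type of the longest matching glob, or None."""
    lowered = filename.lower()
    hits = []
    for pattern, mime, case_sensitive in GLOBS:
        if case_sensitive:
            # Compare the filename exactly as given.
            matched = fnmatchcase(filename, pattern)
        else:
            # Compare lower-cased filename against lower-cased pattern.
            matched = fnmatchcase(lowered, pattern.lower())
        if matched:
            hits.append((len(pattern), mime))
    if not hits:
        return None
    # Longest pattern wins, as with *.ps.gz versus *.gz.
    return max(hits, key=lambda hit: hit[0])[1]
```

With both `*.c` and `*.C` marked case-sensitive, each file matches at most one of the two, so no priority rule between them is needed, and foo.PS.gz still reaches the longer `*.ps.gz` glob.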

Re: Case insensitive mimetype matching edge case
On Wed, 2009-08-19 at 21:53 +0200, David Faure wrote:
> On Wednesday 19 August 2009, Alexander Larsson wrote:
> > On Wed, 2009-08-19 at 10:02 +0200, David Faure wrote:
> > > On Wednesday 19 August 2009, Alexander Larsson wrote:
> > > > Ugh. Additionally we have to extend the mime.cache format more. Maybe
> > > > we can solve this with a hack. What about this:
> > > >
> > > > All case insensitive globs are converted to lower case in the globs
> > > > file. Glob lookup is done by first matching the real filename against
> > > > the globs, then (on failure) convert the name to lower case and try
> > > > again. This will result in a case insensitive match except for things
> > > > marked as case sensitive that has at least one uppercase character.
> > > >
> > > > We can't do case-sensitive matching of only-lowercase globs, but we
> > > > don't currently have any example of this in the databases.
> > >
> > > But I do want to do one of those, to solve bug 22634: I want
> > > <glob pattern="core"/> to be case-sensitive="true".
> > >
> > > How about a different hack:
> > > we generate in globs2 two lines, in case of case-sensitive:
> > > 50:text/x-c++src:*.C
> > > 50:text/x-c++src:*.C:cs
> > > Old parsers will create an entry for "*.C:cs", which will probably never
> > > match any real file, so no big deal, while new parsers will take the
> > > second line as an indication that the *.C glob (parsed one line above)
> > > should be understood to be case sensitive.
> >
> > Hmmm. I like this one. Sounds good to me. But lets make it extensible
> > when we're doing it, i.e. have a comma-separated list of flags with
> > "cs" being one known one. Unknown flags are ignored, anything after
> > another : is ignored.
>
> Good idea.
> I made the changes in the spec, in the definition of the two mimetypes,
> and in update-mime-database.c (for parsing, and globs2 generation).
> Please find patch attached (I can commit if you're ok with it).

We must also mention that if a case-sensitive match matches, it has
priority over the case-insensitive match; otherwise the *.c vs *.C match
will not work.

> I included a suggested format change for the mimeinfo.cache file, but I'll
> have to let you implement that part, I don't know all the details about the
> suffix tree etc. Same for the xdgmime implementation.

I don't think the mimeinfo.cache changes are quite right. The literals are
stored sorted and looked up with a binary search. We can't apply the flag
after finding the match, because without the flag we won't find the match
in the first place. Rather we have to add an additional
CaseInsensitiveLiteralList. Also, we'd have to specify that the elements in
this list are stored in lower case, sorted, so that case insensitive bsearch
works.

For the glob list a simple flag works.

The suffix tree is a search tree starting from the end of the filename. It
works like this:
Take a string "foobar.tar.gz", then start at the root node of the tree and
look for the current char (z in this case) by doing a binary search on the
children. If you find z, follow that and continue looking for a hit on the
next char. If any search fails to match the current char, look for a child
with character 0; if found, it points to the matching mimetype. If there is
no such hit, backtrack to look for shorter matches.

This has the same issue as the literals, so we need two trees, one case
sensitive and one case insensitive.

Furthermore, these changes are incompatible, so we need to bump the minor
version in the header.
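As an aid to reading the suffix tree description above, here is a small in-memory sketch. It is not the mime.cache binary layout: `Node`, `insert`, `lookup` and `find_mime` are hypothetical names, the `mime` attribute stands in for the "character 0" child, and instead of backtracking explicitly the lookup remembers the deepest match seen so far, which gives the same result. Trying the case-sensitive tree first is one possible ordering, following the priority rule proposed in the message.

```python
from bisect import bisect_left


class Node:
    def __init__(self):
        self.chars = []     # child characters, kept sorted
        self.children = []  # child nodes, parallel to self.chars
        self.mime = None    # stands in for the "character 0" child

    def child(self, ch, create=False):
        # Binary search among the sorted children.
        i = bisect_left(self.chars, ch)
        if i < len(self.chars) and self.chars[i] == ch:
            return self.children[i]
        if not create:
            return None
        node = Node()
        self.chars.insert(i, ch)
        self.children.insert(i, node)
        return node


def insert(root, suffix, mime):
    node = root
    for ch in reversed(suffix):  # the tree is keyed from the end of the name
        node = node.child(ch, create=True)
    node.mime = mime


def lookup(root, filename):
    node, best = root, None
    for ch in reversed(filename):
        node = node.child(ch)
        if node is None:
            break
        if node.mime is not None:
            best = node.mime  # longest suffix matched so far
    return best


def find_mime(cs_tree, ci_tree, filename):
    # Case-sensitive tree first, then the lower-cased name in the other tree.
    return lookup(cs_tree, filename) or lookup(ci_tree, filename.lower())


cs_tree, ci_tree = Node(), Node()
insert(cs_tree, ".C", "text/x-c++src")
insert(cs_tree, ".c", "text/x-csrc")
insert(ci_tree, ".gz", "application/x-gzip")
insert(ci_tree, ".ps.gz", "application/x-gzpostscript")
insert(ci_tree, ".tar.gz", "application/x-compressed-tar")
```

Keeping two trees means the case-insensitive one can store only lower-case suffixes, so the per-node binary search stays a plain character comparison.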

Re: Case insensitive mimetype matching edge case
On Friday 21 August 2009, Alexander Larsson wrote:
> On Wed, 2009-08-19 at 21:53 +0200, David Faure wrote:
> > On Wednesday 19 August 2009, Alexander Larsson wrote:
> > > On Wed, 2009-08-19 at 10:02 +0200, David Faure wrote:
> > > > On Wednesday 19 August 2009, Alexander Larsson wrote:
> > > > > Ugh. Additionally we have to extend the mime.cache format more.
> > > > > Maybe we can solve this with a hack. What about this:
> > > > >
> > > > > All case insensitive globs are converted to lower case in the globs
> > > > > file. Glob lookup is done by first matching the real filename
> > > > > against the globs, then (on failure) convert the name to lower case
> > > > > and try again. This will result in a case insensitive match except
> > > > > for things marked as case sensitive that has at least one uppercase
> > > > > character.
> > > > >
> > > > > We can't do case-sensitive matching of only-lowercase globs, but we
> > > > > don't currently have any example of this in the databases.
> > > >
> > > > But I do want to do one of those, to solve bug 22634: I want
> > > > <glob pattern="core"/> to be case-sensitive="true".
> > > >
> > > > How about a different hack:
> > > > we generate in globs2 two lines, in case of case-sensitive:
> > > > 50:text/x-c++src:*.C
> > > > 50:text/x-c++src:*.C:cs
> > > > Old parsers will create an entry for "*.C:cs", which will probably
> > > > never match any real file, so no big deal, while new parsers will
> > > > take the second line as an indication that the *.C glob (parsed one
> > > > line above) should be understood to be case sensitive.
> > >
> > > Hmmm. I like this one. Sounds good to me. But lets make it extensible
> > > when we're doing it, i.e. have a comma-separated list of flags with
> > > "cs" being one known one. Unknown flags are ignored, anything after
> > > another : is ignored.
> >
> > Good idea.
> > I made the changes in the spec, in the definition of the two mimetypes,
> > and in update-mime-database.c (for parsing, and globs2 generation).
> > Please find patch attached (I can commit if you're ok with it).
>
> We must also mention that if a case-sensitive match matches, it has
> priority over the case-insensitive match; otherwise the *.c vs *.C match
> will not work.

I think it's simpler to make *.c case-sensitive as well, so that we don't
need to care about priorities. That's what I did in my most recent patch
sent to this list and in my implementation, works fine.

> > I included a suggested format change for the mimeinfo.cache file, but
> > I'll have to let you implement that part, I don't know all the details
> > about the suffix tree etc. Same for the xdgmime implementation.
>
> I don't think the mimeinfo.cache changes are quite right. The literals
> are stored sorted and looked up with a binary search. We can't apply the
> flag after finding the match, because without the flag we won't find the
> match in the first place. Rather we have to add an additional
> CaseInsensitiveLiteralList. Also, we'd have to specify that the elements
> in this list are stored in lower case, sorted, so that case insensitive
> bsearch works.

OK.... this is exactly why I didn't touch mimeinfo.cache, I don't know
enough about it. Can I let you do the changes there?

> This has the same issue as the literals, so we need two trees, one case
> sensitive and one case insensitive.

I see. Well there are only three case-sensitive globs right now, you could
also just go for a linear list for those ;-)

--
David Faure, faure@..., sponsored by Qt Software @ Nokia to work on KDE,
Konqueror (http://www.konqueror.org), and KOffice (http://www.koffice.org).
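The literal-list point and the "linear list" suggestion can be put side by side in a short sketch. This is only an illustration under assumptions: the entries and MIME types are examples, `CI_NAMES`, `CI_MIMES`, `CS_LITERALS` and `lookup_literal` are hypothetical names, and the real cache stores these lists in a binary format rather than as Python lists.

```python
from bisect import bisect_left

# Case-insensitive literals: stored lower-case and sorted, so a plain
# binary search on the lower-cased filename finds them.
_ci_entries = sorted([
    ("makefile", "text/x-makefile"),
    ("copying", "text/x-copying"),
])
CI_NAMES = [name for name, _ in _ci_entries]
CI_MIMES = [mime for _, mime in _ci_entries]

# Case-sensitive literals: few enough that a linear scan is acceptable.
CS_LITERALS = [("core", "application/x-core")]


def lookup_literal(filename):
    """Return the mime type for an exact filename match, or None."""
    # Case-sensitive entries are compared against the name as given.
    for name, mime in CS_LITERALS:
        if name == filename:
            return mime
    # Case-insensitive entries: lower-case the key, then binary search.
    key = filename.lower()
    i = bisect_left(CI_NAMES, key)
    if i < len(CI_NAMES) and CI_NAMES[i] == key:
        return CI_MIMES[i]
    return None
```

The sorted list must hold lower-case names because a list sorted case-sensitively is not in the right order for a case-insensitive comparison, which is why a flag applied after the search would come too late.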

Re: Case insensitive mimetype matching edge case
David Faure wrote:
[...]
> I think it's simpler to make *.c case-sensitive as well, so that we don't
> need to care about priorities. That's what I did in my most recent patch
> sent to this list and in my implementation, works fine.

It seems to me that the lack of a defined order will make it hard for
third parties (ISVs) to add case-sensitive globs that conflict with an
existing case-insensitive glob.

So imagine we go back a few decades and an ISV ports his compiler for a
newfangled object-oriented programming language stored in .C source
files. Because there's already a glob for .c files, that ISV would be
unable to add a glob for .C files without also modifying the file
shipped by the desktop environment for .c files. Bad!

Of course this kind of situation may also never happen again.
(And creating a new extension that differs from an existing one only in
case was a pretty bad idea in the first place.)

--
Francois Gouget
fgouget@...

Re: Case insensitive mimetype matching edge case
On Wednesday 02 September 2009, Francois Gouget wrote:
> David Faure wrote:
> [...]
>
> > I think it's simpler to make *.c case-sensitive as well, so that we don't
> > need to care about priorities. That's what I did in my most recent patch
> > sent to this list and in my implementation, works fine.
>
> It seems to me that the lack of a defined order will make it hard for
> third parties (ISVs) to add case-sensitive globs that conflict with an
> existing case-insensitive glob.
>
> So imagine we go back a few decades and an ISV ports his compiler for a
> newfangled object-oriented programming language stored in .C source
> files. Because there's already a glob for .c files, that ISV would be
> unable to add a glob for .C files without also modifying the file
> shipped by the desktop environment for .c files. Bad!
>
> Of course this kind of situation may also never happen again.
> (And creating a new extension that differs from an existing one only in
> case was a pretty bad idea in the first place.)

Yes. I can see the theoretical problem, but I don't think solving it is
worth it, because such a setup (case-specific file extensions) is a very bad
idea in the first place (confusing and not portable).

I'd much rather keep it simple (there are enough "priority" concepts in the
spec already) and fast.

--
David Faure, faure@..., sponsored by Nokia to work on KDE,
Konqueror (http://www.konqueror.org), and KOffice (http://www.koffice.org).