|
View:
New views
9 Messages
—
Rating Filter:
Alert me
|
|
|
Optimize jabber_id_new()
by Mark Doliner
::
Rate this Message:
Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message The jabber_id_new() function in libpurple/protocols/jabber/jutil.c is
pretty expensive. It creates a JabberID struct given the string version of a Jabber username (i.e. it splits "mark.doliner@.../Home" into "mark.doliner" "gmail.com" and "Home"). It also lowercases the node and domain, does utf8 normalization, and does stringprep validation to ensure the JID is comprised only of characters allowed by the XMPP RFC. We've optimized this function at Meebo. In our testing we found that the vast majority of JIDs are made of these characters: a-z A-Z 0-9 @ / { | } ~ . [ \ ] ^ _ ; And so we do a quick first pass over the given string. If the string contains only these characters than we skip g_utf8_normalize() and skip stringprep and only lowercase the node and domain. Otherwise we do everything. How do people feel about me checking this change into the jabber code in libpurple? Meebo probably has a larger percentage of English-speaking users than Pidgin, so maybe our results are unfairly biased. Does anyone know how common non-ASCII JIDs are? I suspect that even for the case where the jid contains non-ASCII characters our optimized version won't be very much slower, and might even be faster (it makes one pass over the string to determine the location of @ and / instead of calling g_utf8_strchr() twice (but that's easy to fix on its own)). In other words: How does everyone feel about the attached patch? -Mark [optimize_jabber_id_new.diff] # # old_revision [22e14265a47cdddb4c9eb1ee0d2ce2989f9fce61] # # patch "libpurple/protocols/jabber/jutil.c" # from [912799b76983833354a4420afc830bbb9b370f21] # to [31de021d4923c45a328aa1f6ae688d9b0a44e4d6] # ============================================================ --- libpurple/protocols/jabber/jutil.c 912799b76983833354a4420afc830bbb9b370f21 +++ libpurple/protocols/jabber/jutil.c 31de021d4923c45a328aa1f6ae688d9b0a44e4d6 @@ -103,20 +103,86 @@ jabber_id_new(const char *str) JabberID* jabber_id_new(const char *str) { - char *at; - char *slash; + const char *at = NULL; + const char *slash = NULL; + const char *c; + gboolean needs_validation = FALSE; char *node = NULL; char *domain; JabberID *jid; - if(!str || !g_utf8_validate(str, -1, NULL)) + if (!str) return NULL; + for (c = str; *c != '\0'; c++) { + switch (*c) { + case '@': + if (!at) + at = c; + break; + case '/': + if (!at) { + /* + * If at is NULL then c is still pointing to the node + * name, and node names can not contain forward slashes + */ + return NULL; + } + if (!slash) + slash = c; + break; + default: + /* make sure this character falls within the allowed ascii characters + * specified in the nodeprep RFC. If it's outside of this range, + * the character is probably unicode and will be validated using the + * more expensive UTF-8 compliant nodeprep functions + */ + if ( !( ('a' <= *c && *c <= '~') || /*a-z{|}~*/ + ('.' <= *c && *c <= '9') || /*./0123456789*/ + ('A' <= *c && *c <= '_') || /*A-Z[\]^_*/ + (*c == ';') )) /*;*/ + { + needs_validation = TRUE; + } + break; + } + } + jid = g_new0(JabberID, 1); - at = g_utf8_strchr(str, -1, '@'); - slash = g_utf8_strchr(str, -1, '/'); + if (!needs_validation) { + /* No UTF-8 characters in the jid--just lowercase and return */ + if (at) { + jid->node = g_utf8_strdown(str, at-str); + if(slash) { + jid->domain = g_utf8_strdown(at + 1, slash - (at + 1)); + jid->resource = g_strdup(slash + 1); + } else { + jid->domain = g_utf8_strdown(at + 1, -1); + } + } else { + if(slash) { + jid->domain = g_utf8_strdown(str, slash - str); + jid->resource = g_strdup(slash + 1); + } else { + jid->domain = g_utf8_strdown(str, -1); + } + } + return jid; + } + /* + * If we get here, there are some non-ASCII chars in the string, so + * we'll need to validate it, normalize, and finally do a full jabber + * nodeprep on the jid. + */ + + if (!g_utf8_validate(str, -1, NULL)) { + jabber_id_free(jid); + return NULL; + } + + /* normalization */ if(at) { node = g_utf8_normalize(str, at-str, G_NORMALIZE_NFKC); if(slash) { @@ -144,6 +210,7 @@ jabber_id_new(const char *str) g_free(domain); } + /* and finally the jabber nodeprep */ if(!jabber_nodeprep_validate(jid->node) || !jabber_nameprep_validate(jid->domain) || !jabber_resourceprep_validate(jid->resource)) { _______________________________________________ Devel mailing list Devel@... http://pidgin.im/cgi-bin/mailman/listinfo/devel |
|
|
Re: Optimize jabber_id_new()
by datallah
::
Rate this Message:
Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message On Tue, Jun 30, 2009 at 4:31 AM, Mark Doliner<mark@...> wrote:
> The jabber_id_new() function in libpurple/protocols/jabber/jutil.c is > pretty expensive. <SNIP/> > How do people feel about me checking this change into the jabber code > in libpurple? Meebo probably has a larger percentage of > English-speaking users than Pidgin, so maybe our results are unfairly > biased. Does anyone know how common non-ASCII JIDs are? > > I suspect that even for the case where the jid contains non-ASCII > characters our optimized version won't be very much slower, and might > even be faster (it makes one pass over the string to determine the > location of @ and / instead of calling g_utf8_strchr() twice (but > that's easy to fix on its own)). > > In other words: How does everyone feel about the attached patch? My (completely gut-based) guess is that non-ASCII JIDs are going to be pretty uncommon "in the wild". My impression is that even for non-ASCII languages, people tend to use ASCII transliterations. Perhaps Peter Saint-Andre can offer some more useful context-specific insight . I think the patch looks very reasonable - even in the case where it ends up having to g_utf8_validate(), the new overhead appears to be very minimal. -D _______________________________________________ Devel mailing list Devel@... http://pidgin.im/cgi-bin/mailman/listinfo/devel |
|
|
Re: Optimize jabber_id_new()
by Ethan Blanton-3
::
Rate this Message:
Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message Mark Doliner spake unto us the following wisdom:
> The jabber_id_new() function in libpurple/protocols/jabber/jutil.c is > pretty expensive. It creates a JabberID struct given the string > version of a Jabber username (i.e. it splits > "mark.doliner@.../Home" into "mark.doliner" "gmail.com" and > "Home"). It also lowercases the node and domain, does utf8 > normalization, and does stringprep validation to ensure the JID is > comprised only of characters allowed by the XMPP RFC. > > We've optimized this function at Meebo. In our testing we found that > the vast majority of JIDs are made of these characters: a-z A-Z 0-9 @ > / { | } ~ . [ \ ] ^ _ ; And so we do a quick first pass over the > given string. If the string contains only these characters than we > skip g_utf8_normalize() and skip stringprep and only lowercase the > node and domain. Otherwise we do everything. > > How do people feel about me checking this change into the jabber code > in libpurple? Meebo probably has a larger percentage of > English-speaking users than Pidgin, so maybe our results are unfairly > biased. Does anyone know how common non-ASCII JIDs are? expensive enough that Meebo cares about them, we should avoid them when they are unnecessary. If it turns out that the short-circuit checks are too expensive when jids *are* non-ASCII (which, looking at the source, I doubt), we can revisit this again. I share your and Daniel's intuition that most jids will be ASCII anyway. I say commit it as-is. :-) Ethan -- The laws that forbid the carrying of arms are laws [that have no remedy for evils]. They disarm only those who are neither inclined nor determined to commit crimes. -- Cesare Beccaria, "On Crimes and Punishments", 1764 _______________________________________________ Devel mailing list Devel@... http://pidgin.im/cgi-bin/mailman/listinfo/devel |
|
|
Re: Optimize jabber_id_new()
by Peter Saint-Andre-2
::
Rate this Message:
Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message -----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1 On 6/30/09 10:33 AM, Ethan Blanton wrote: > Mark Doliner spake unto us the following wisdom: >> The jabber_id_new() function in libpurple/protocols/jabber/jutil.c is >> pretty expensive. It creates a JabberID struct given the string >> version of a Jabber username (i.e. it splits >> "mark.doliner@.../Home" into "mark.doliner" "gmail.com" and >> "Home"). It also lowercases the node and domain, does utf8 >> normalization, and does stringprep validation to ensure the JID is >> comprised only of characters allowed by the XMPP RFC. >> >> We've optimized this function at Meebo. In our testing we found that >> the vast majority of JIDs are made of these characters: a-z A-Z 0-9 @ >> / { | } ~ . [ \ ] ^ _ ; And so we do a quick first pass over the >> given string. If the string contains only these characters than we >> skip g_utf8_normalize() and skip stringprep and only lowercase the >> node and domain. Otherwise we do everything. >> >> How do people feel about me checking this change into the jabber code >> in libpurple? Meebo probably has a larger percentage of >> English-speaking users than Pidgin, so maybe our results are unfairly >> biased. Does anyone know how common non-ASCII JIDs are? > > This seems very reasonable to me. If the "expensive" checks are > expensive enough that Meebo cares about them, we should avoid them > when they are unnecessary. If it turns out that the short-circuit > checks are too expensive when jids *are* non-ASCII (which, looking at > the source, I doubt), we can revisit this again. I share your and > Daniel's intuition that most jids will be ASCII anyway. This is my experience as well (e.g., at the jabber.org service). Sounds like a smart optimization to me. Peter - -- Peter Saint-Andre https://stpeter.im/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.8 (Darwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAkpKP2wACgkQNL8k5A2w/vzMfwCfbRs5NCEa6DhOSFS3FYD1QBT4 bSQAn0dNCHpPEk3yaYTL78MBe/p2bup3 =gJm8 -----END PGP SIGNATURE----- _______________________________________________ Devel mailing list Devel@... http://pidgin.im/cgi-bin/mailman/listinfo/devel |
|
|
Re: Optimize jabber_id_new()
by Paul Aurich-4
::
Rate this Message:
Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message And Mark Doliner spoke on 06/30/2009 01:31 AM, saying:
<snip> > In other words: How does everyone feel about the attached patch? Like Daniel and Ethan, I think this is fine, with a few comments (see below). > > -Mark > + for (c = str; *c != '\0'; c++) { + switch (*c) { + case '@': + if (!at) + at = c; + break; + case '/': + if (!at) { + /* + * If at is NULL then c is still pointing to the node + * name, and node names can not contain forward slashes + */ + return NULL; + } + if (!slash) + slash = c; + break; + default: This introduces a few corner cases where invalid JIDs would get through (I think something like "foo@@bar.com"). That's probably not too much of an issue. It's also technically valid to have a JID of the form "domain/resource", which this would fail to validate (again, not actually too big of an issue). I'd envision something like this (assuming enum { NODE, DOMAIN, RESOURCE } state), which I don't think should add much overhead, but addresses both of those issues: case '@': if (state == NODE && !at) { at = c; state = DOMAIN; } else if (state == DOMAIN) { /* An '@' isn't valid in the domain portion of a JID */ return NULL; } break; case '/': if (!slash) { slash = c; state = RESOURCE; if (*(c+1) == '\0') /* * A JID containing '/' cannot have a 0-length * resource. */ return NULL; } break; For the needs_validation == FALSE case, I would recommend moving to g_ascii_strdown(). ~Paul _______________________________________________ Devel mailing list Devel@... http://pidgin.im/cgi-bin/mailman/listinfo/devel |
|
|
Re: Optimize jabber_id_new()
by Mark Doliner
::
Rate this Message:
Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message On Tue, Jun 30, 2009 at 11:04 AM, Paul Aurich<paul@...> wrote:
> And Mark Doliner spoke on 06/30/2009 01:31 AM, saying: > <snip> >> In other words: How does everyone feel about the attached patch? > > Like Daniel and Ethan, I think this is fine, with a few comments (see below). Cool. I'll make a few changes and check something in. > + for (c = str; *c != '\0'; c++) { > + switch (*c) { > + case '@': > + if (!at) > + at = c; > + break; > + case '/': > + if (!at) { > + /* > + * If at is NULL then c is still pointing to the node > + * name, and node names can not contain forward slashes > + */ > + return NULL; > + } > + if (!slash) > + slash = c; > + break; > + default: > > This introduces a few corner cases where invalid JIDs would get through (I > think something like "foo@@bar.com"). That's probably not too much of an > issue. It's also technically valid to have a JID of the form > "domain/resource", which this would fail to validate (again, not actually > too big of an issue). Yeah, probably not too common, but I think correctness here is pretty important. > I'd envision something like this (assuming enum { NODE, DOMAIN, RESOURCE } > state), which I don't think should add much overhead, but addresses both of > those issues: > > case '@': > if (state == NODE && !at) { > at = c; > state = DOMAIN; > } else if (state == DOMAIN) { > /* An '@' isn't valid in the domain portion of a JID */ > return NULL; > } > break; > > case '/': > if (!slash) { > slash = c; > state = RESOURCE; > > if (*(c+1) == '\0') > /* > * A JID containing '/' cannot have a 0-length > * resource. > */ > return NULL; > } > break; > > For the needs_validation == FALSE case, I would recommend moving to > g_ascii_strdown(). I think these changes make sense. I went with g_utf8_strdown() instead of g_ascii_strdown() because the former takes a length argument, and we want to dup and lowercase pieces of the string. I'll check if g_utf8_strdown() is significantly more expensive, and if it is then maybe write our own g_ascii_strdown_len(). I just realized that there might also be a problem where we allow some characters in the node that should only be allowed in the resource. I'll try to unit test this before checking it in. -Mark _______________________________________________ Devel mailing list Devel@... http://pidgin.im/cgi-bin/mailman/listinfo/devel |
|
|
Re: Optimize jabber_id_new()
by Paul Aurich-4
::
Rate this Message:
Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message On Jun 30, 2009, at 11:51, Mark Doliner wrote:
> I think these changes make sense. I went with g_utf8_strdown() > instead of g_ascii_strdown() because the former takes a length > argument, and we want to dup and lowercase pieces of the string. I'll > check if g_utf8_strdown() is significantly more expensive, and if it > is then maybe write our own g_ascii_strdown_len(). I think you might be confusing g_strdown with g_ascii_strdown, which has the same type signature as g_utf8_strdown. ~Paul _______________________________________________ Devel mailing list Devel@... http://pidgin.im/cgi-bin/mailman/listinfo/devel |
|
|
Re: Optimize jabber_id_new()
by Mark Doliner
::
Rate this Message:
Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message On Tue, Jun 30, 2009 at 12:05 PM, Paul Aurich<paul@...> wrote:
> On Jun 30, 2009, at 11:51, Mark Doliner wrote: >> >> I think these changes make sense. I went with g_utf8_strdown() >> instead of g_ascii_strdown() because the former takes a length >> argument, and we want to dup and lowercase pieces of the string. I'll >> check if g_utf8_strdown() is significantly more expensive, and if it >> is then maybe write our own g_ascii_strdown_len(). > > I think you might be confusing g_strdown with g_ascii_strdown, which has > the same type signature as g_utf8_strdown. Woah, yeah, or maybe I was looking at g_ascii_tolower(). Thanks for pointing that out! -Mark _______________________________________________ Devel mailing list Devel@... http://pidgin.im/cgi-bin/mailman/listinfo/devel |
|
|
Re: Optimize jabber_id_new()
by Jan Kaluza
::
Rate this Message:
Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message On Tue, Jun 30, 2009 at 5:28 PM, Daniel Atallah<daniel.atallah@...> wrote:
> On Tue, Jun 30, 2009 at 4:31 AM, Mark Doliner<mark@...> wrote: >> The jabber_id_new() function in libpurple/protocols/jabber/jutil.c is >> pretty expensive. > > <SNIP/> > >> How do people feel about me checking this change into the jabber code >> in libpurple? Meebo probably has a larger percentage of >> English-speaking users than Pidgin, so maybe our results are unfairly >> biased. Does anyone know how common non-ASCII JIDs are? >> >> I suspect that even for the case where the jid contains non-ASCII >> characters our optimized version won't be very much slower, and might >> even be faster (it makes one pass over the string to determine the >> location of @ and / instead of calling g_utf8_strchr() twice (but >> that's easy to fix on its own)). >> >> In other words: How does everyone feel about the attached patch? > > My (completely gut-based) guess is that non-ASCII JIDs are going to be > pretty uncommon "in the wild". My impression is that even for > non-ASCII languages, people tend to use ASCII transliterations. > Perhaps Peter Saint-Andre can offer some more useful context-specific > insight . In Czech Republic (our language uses non-ascii characters) people use in almost 100% cases ascii JIDs, because they are used to do it because of emails... So I agree with you :). > I think the patch looks very reasonable - even in the case where it > ends up having to g_utf8_validate(), the new overhead appears to be > very minimal. > > -D > > _______________________________________________ > Devel mailing list > Devel@... > http://pidgin.im/cgi-bin/mailman/listinfo/devel > Jan "HanzZ" Kaluza _______________________________________________ Devel mailing list Devel@... http://pidgin.im/cgi-bin/mailman/listinfo/devel |
| Free embeddable forum powered by Nabble | Forum Help |