Optimize jabber_id_new()

View: New views
9 Messages — Rating Filter:   Alert me  

Optimize jabber_id_new()

by Mark Doliner :: Rate this Message:

Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message

The jabber_id_new() function in libpurple/protocols/jabber/jutil.c is
pretty expensive.  It creates a JabberID struct given the string
version of a Jabber username (i.e. it splits
"mark.doliner@.../Home" into "mark.doliner" "gmail.com" and
"Home").  It also lowercases the node and domain, does utf8
normalization, and does stringprep validation to ensure the JID is
comprised only of characters allowed by the XMPP RFC.

We've optimized this function at Meebo.  In our testing we found that
the vast majority of JIDs are made of these characters: a-z A-Z 0-9 @
/ { | } ~ . [ \ ] ^ _ ;  And so we do a quick first pass over the
given string.  If the string contains only these characters than we
skip g_utf8_normalize() and skip stringprep and only lowercase the
node and domain.  Otherwise we do everything.

How do people feel about me checking this change into the jabber code
in libpurple?  Meebo probably has a larger percentage of
English-speaking users than Pidgin, so maybe our results are unfairly
biased.  Does anyone know how common non-ASCII JIDs are?

I suspect that even for the case where the jid contains non-ASCII
characters our optimized version won't be very much slower, and might
even be faster (it makes one pass over the string to determine the
location of @ and / instead of calling g_utf8_strchr() twice (but
that's easy to fix on its own)).

In other words: How does everyone feel about the attached patch?

-Mark

[optimize_jabber_id_new.diff]

#
# old_revision [22e14265a47cdddb4c9eb1ee0d2ce2989f9fce61]
#
# patch "libpurple/protocols/jabber/jutil.c"
#  from [912799b76983833354a4420afc830bbb9b370f21]
#    to [31de021d4923c45a328aa1f6ae688d9b0a44e4d6]
#
============================================================
--- libpurple/protocols/jabber/jutil.c 912799b76983833354a4420afc830bbb9b370f21
+++ libpurple/protocols/jabber/jutil.c 31de021d4923c45a328aa1f6ae688d9b0a44e4d6
@@ -103,20 +103,86 @@ jabber_id_new(const char *str)
 JabberID*
 jabber_id_new(const char *str)
 {
- char *at;
- char *slash;
+ const char *at = NULL;
+ const char *slash = NULL;
+ const char *c;
+ gboolean needs_validation = FALSE;
  char *node = NULL;
  char *domain;
  JabberID *jid;
 
- if(!str || !g_utf8_validate(str, -1, NULL))
+ if (!str)
  return NULL;
 
+ for (c = str; *c != '\0'; c++) {
+ switch (*c) {
+ case '@':
+ if (!at)
+ at = c;
+ break;
+ case '/':
+ if (!at) {
+ /*
+ * If at is NULL then c is still pointing to the node
+ * name, and node names can not contain forward slashes
+ */
+ return NULL;
+ }
+ if (!slash)
+ slash = c;
+ break;
+ default:
+ /* make sure this character falls within the allowed ascii characters
+ * specified in the nodeprep RFC.  If it's outside of this range,
+ * the character is probably unicode and will be validated using the
+ * more expensive UTF-8 compliant nodeprep functions
+ */
+ if ( !( ('a' <= *c && *c <= '~') ||  /*a-z{|}~*/
+ ('.' <= *c && *c <= '9') ||  /*./0123456789*/
+ ('A' <= *c && *c <= '_') ||  /*A-Z[\]^_*/
+ (*c == ';')  ))              /*;*/
+ {
+ needs_validation = TRUE;
+ }
+ break;
+ }
+ }
+
  jid = g_new0(JabberID, 1);
 
- at = g_utf8_strchr(str, -1, '@');
- slash = g_utf8_strchr(str, -1, '/');
+ if (!needs_validation) {
+ /* No UTF-8 characters in the jid--just lowercase and return */
+ if (at) {
+ jid->node = g_utf8_strdown(str, at-str);
+ if(slash) {
+ jid->domain = g_utf8_strdown(at + 1, slash - (at + 1));
+ jid->resource = g_strdup(slash + 1);
+ } else {
+ jid->domain = g_utf8_strdown(at + 1, -1);
+ }
+ } else {
+ if(slash) {
+ jid->domain = g_utf8_strdown(str, slash - str);
+ jid->resource = g_strdup(slash + 1);
+ } else {
+ jid->domain = g_utf8_strdown(str, -1);
+ }
+ }
+ return jid;
+ }
 
+ /*
+ * If we get here, there are some non-ASCII chars in the string, so
+ * we'll need to validate it, normalize, and finally do a full jabber
+ * nodeprep on the jid.
+ */
+
+ if (!g_utf8_validate(str, -1, NULL)) {
+ jabber_id_free(jid);
+ return NULL;
+ }
+
+ /* normalization */
  if(at) {
  node = g_utf8_normalize(str, at-str, G_NORMALIZE_NFKC);
  if(slash) {
@@ -144,6 +210,7 @@ jabber_id_new(const char *str)
  g_free(domain);
  }
 
+ /* and finally the jabber nodeprep */
  if(!jabber_nodeprep_validate(jid->node) ||
  !jabber_nameprep_validate(jid->domain) ||
  !jabber_resourceprep_validate(jid->resource)) {


_______________________________________________
Devel mailing list
Devel@...
http://pidgin.im/cgi-bin/mailman/listinfo/devel

Re: Optimize jabber_id_new()

by datallah :: Rate this Message:

Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message

On Tue, Jun 30, 2009 at 4:31 AM, Mark Doliner<mark@...> wrote:
> The jabber_id_new() function in libpurple/protocols/jabber/jutil.c is
> pretty expensive.

<SNIP/>

> How do people feel about me checking this change into the jabber code
> in libpurple?  Meebo probably has a larger percentage of
> English-speaking users than Pidgin, so maybe our results are unfairly
> biased.  Does anyone know how common non-ASCII JIDs are?
>
> I suspect that even for the case where the jid contains non-ASCII
> characters our optimized version won't be very much slower, and might
> even be faster (it makes one pass over the string to determine the
> location of @ and / instead of calling g_utf8_strchr() twice (but
> that's easy to fix on its own)).
>
> In other words: How does everyone feel about the attached patch?

My (completely gut-based) guess is that non-ASCII JIDs are going to be
pretty uncommon "in the wild".  My impression is that even for
non-ASCII languages, people tend to use ASCII transliterations.
Perhaps Peter Saint-Andre can offer some more useful context-specific
insight .

I think the patch looks very reasonable - even in the case where it
ends up having to g_utf8_validate(), the new overhead appears to be
very minimal.

-D

_______________________________________________
Devel mailing list
Devel@...
http://pidgin.im/cgi-bin/mailman/listinfo/devel

Re: Optimize jabber_id_new()

by Ethan Blanton-3 :: Rate this Message:

Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message

Mark Doliner spake unto us the following wisdom:

> The jabber_id_new() function in libpurple/protocols/jabber/jutil.c is
> pretty expensive.  It creates a JabberID struct given the string
> version of a Jabber username (i.e. it splits
> "mark.doliner@.../Home" into "mark.doliner" "gmail.com" and
> "Home").  It also lowercases the node and domain, does utf8
> normalization, and does stringprep validation to ensure the JID is
> comprised only of characters allowed by the XMPP RFC.
>
> We've optimized this function at Meebo.  In our testing we found that
> the vast majority of JIDs are made of these characters: a-z A-Z 0-9 @
> / { | } ~ . [ \ ] ^ _ ;  And so we do a quick first pass over the
> given string.  If the string contains only these characters than we
> skip g_utf8_normalize() and skip stringprep and only lowercase the
> node and domain.  Otherwise we do everything.
>
> How do people feel about me checking this change into the jabber code
> in libpurple?  Meebo probably has a larger percentage of
> English-speaking users than Pidgin, so maybe our results are unfairly
> biased.  Does anyone know how common non-ASCII JIDs are?
This seems very reasonable to me.  If the "expensive" checks are
expensive enough that Meebo cares about them, we should avoid them
when they are unnecessary.  If it turns out that the short-circuit
checks are too expensive when jids *are* non-ASCII (which, looking at
the source, I doubt), we can revisit this again.  I share your and
Daniel's intuition that most jids will be ASCII anyway.

I say commit it as-is.  :-)

Ethan

--
The laws that forbid the carrying of arms are laws [that have no remedy
for evils].  They disarm only those who are neither inclined nor
determined to commit crimes.
                -- Cesare Beccaria, "On Crimes and Punishments", 1764


_______________________________________________
Devel mailing list
Devel@...
http://pidgin.im/cgi-bin/mailman/listinfo/devel

signature.asc (492 bytes) Download Attachment

Re: Optimize jabber_id_new()

by Peter Saint-Andre-2 :: Rate this Message:

Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 6/30/09 10:33 AM, Ethan Blanton wrote:

> Mark Doliner spake unto us the following wisdom:
>> The jabber_id_new() function in libpurple/protocols/jabber/jutil.c is
>> pretty expensive.  It creates a JabberID struct given the string
>> version of a Jabber username (i.e. it splits
>> "mark.doliner@.../Home" into "mark.doliner" "gmail.com" and
>> "Home").  It also lowercases the node and domain, does utf8
>> normalization, and does stringprep validation to ensure the JID is
>> comprised only of characters allowed by the XMPP RFC.
>>
>> We've optimized this function at Meebo.  In our testing we found that
>> the vast majority of JIDs are made of these characters: a-z A-Z 0-9 @
>> / { | } ~ . [ \ ] ^ _ ;  And so we do a quick first pass over the
>> given string.  If the string contains only these characters than we
>> skip g_utf8_normalize() and skip stringprep and only lowercase the
>> node and domain.  Otherwise we do everything.
>>
>> How do people feel about me checking this change into the jabber code
>> in libpurple?  Meebo probably has a larger percentage of
>> English-speaking users than Pidgin, so maybe our results are unfairly
>> biased.  Does anyone know how common non-ASCII JIDs are?
>
> This seems very reasonable to me.  If the "expensive" checks are
> expensive enough that Meebo cares about them, we should avoid them
> when they are unnecessary.  If it turns out that the short-circuit
> checks are too expensive when jids *are* non-ASCII (which, looking at
> the source, I doubt), we can revisit this again.  I share your and
> Daniel's intuition that most jids will be ASCII anyway.

This is my experience as well (e.g., at the jabber.org service). Sounds
like a smart optimization to me.

Peter

- --
Peter Saint-Andre
https://stpeter.im/


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.8 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkpKP2wACgkQNL8k5A2w/vzMfwCfbRs5NCEa6DhOSFS3FYD1QBT4
bSQAn0dNCHpPEk3yaYTL78MBe/p2bup3
=gJm8
-----END PGP SIGNATURE-----

_______________________________________________
Devel mailing list
Devel@...
http://pidgin.im/cgi-bin/mailman/listinfo/devel

Re: Optimize jabber_id_new()

by Paul Aurich-4 :: Rate this Message:

Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message

And Mark Doliner spoke on 06/30/2009 01:31 AM, saying:
<snip>
> In other words: How does everyone feel about the attached patch?

Like Daniel and Ethan, I think this is fine, with a few comments (see below).

>
> -Mark
>

+ for (c = str; *c != '\0'; c++) {
+ switch (*c) {
+ case '@':
+ if (!at)
+ at = c;
+ break;
+ case '/':
+ if (!at) {
+ /*
+ * If at is NULL then c is still pointing to the node
+ * name, and node names can not contain forward slashes
+ */
+ return NULL;
+ }
+ if (!slash)
+ slash = c;
+ break;
+ default:

This introduces a few corner cases where invalid JIDs would get through (I
think something like "foo@@bar.com"). That's probably not too much of an
issue. It's also technically valid to have a JID of the form
"domain/resource", which this would fail to validate (again, not actually
too big of an issue).

I'd envision something like this (assuming enum { NODE, DOMAIN, RESOURCE }
state), which I don't think should add much overhead, but addresses both of
those issues:

case '@':
        if (state == NODE && !at) {
                at = c;
                state = DOMAIN;
        } else if (state == DOMAIN) {
                /* An '@' isn't valid in the domain portion of a JID */
                return NULL;
        }
        break;

case '/':
        if (!slash) {
                slash = c;
                state = RESOURCE;

                if (*(c+1) == '\0')
                        /*
                         * A JID containing '/' cannot have a 0-length
                         * resource.
                         */
                        return NULL;
        }
        break;

For the needs_validation == FALSE case, I would recommend moving to
g_ascii_strdown().

~Paul

_______________________________________________
Devel mailing list
Devel@...
http://pidgin.im/cgi-bin/mailman/listinfo/devel

Re: Optimize jabber_id_new()

by Mark Doliner :: Rate this Message:

Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message

On Tue, Jun 30, 2009 at 11:04 AM, Paul Aurich<paul@...> wrote:
> And Mark Doliner spoke on 06/30/2009 01:31 AM, saying:
> <snip>
>> In other words: How does everyone feel about the attached patch?
>
> Like Daniel and Ethan, I think this is fine, with a few comments (see below).

Cool.  I'll make a few changes and check something in.

> +       for (c = str; *c != '\0'; c++) {
> +               switch (*c) {
> +                       case '@':
> +                               if (!at)
> +                                       at = c;
> +                               break;
> +                       case '/':
> +                               if (!at) {
> +                                       /*
> +                                        * If at is NULL then c is still pointing to the node
> +                                        * name, and node names can not contain forward slashes
> +                                        */
> +                                       return NULL;
> +                               }
> +                               if (!slash)
> +                                       slash = c;
> +                               break;
> +                       default:
>
> This introduces a few corner cases where invalid JIDs would get through (I
> think something like "foo@@bar.com"). That's probably not too much of an
> issue. It's also technically valid to have a JID of the form
> "domain/resource", which this would fail to validate (again, not actually
> too big of an issue).

Yeah, probably not too common, but I think correctness here is pretty important.

> I'd envision something like this (assuming enum { NODE, DOMAIN, RESOURCE }
> state), which I don't think should add much overhead, but addresses both of
> those issues:
>
> case '@':
>        if (state == NODE && !at) {
>                at = c;
>                state = DOMAIN;
>        } else if (state == DOMAIN) {
>                /* An '@' isn't valid in the domain portion of a JID */
>                return NULL;
>        }
>        break;
>
> case '/':
>        if (!slash) {
>                slash = c;
>                state = RESOURCE;
>
>                if (*(c+1) == '\0')
>                        /*
>                         * A JID containing '/' cannot have a 0-length
>                         * resource.
>                         */
>                        return NULL;
>        }
>        break;
>
> For the needs_validation == FALSE case, I would recommend moving to
> g_ascii_strdown().

I think these changes make sense.  I went with g_utf8_strdown()
instead of g_ascii_strdown() because the former takes a length
argument, and we want to dup and lowercase pieces of the string.  I'll
check if g_utf8_strdown() is significantly more expensive, and if it
is then maybe write our own g_ascii_strdown_len().

I just realized that there might also be a problem where we allow some
characters in the node that should only be allowed in the resource.
I'll try to unit test this before checking it in.

-Mark
_______________________________________________
Devel mailing list
Devel@...
http://pidgin.im/cgi-bin/mailman/listinfo/devel

Re: Optimize jabber_id_new()

by Paul Aurich-4 :: Rate this Message:

Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message

On Jun 30, 2009, at 11:51, Mark Doliner wrote:
> I think these changes make sense.  I went with g_utf8_strdown()
> instead of g_ascii_strdown() because the former takes a length
> argument, and we want to dup and lowercase pieces of the string.  I'll
> check if g_utf8_strdown() is significantly more expensive, and if it
> is then maybe write our own g_ascii_strdown_len().

I think you might be confusing g_strdown with g_ascii_strdown, which has
the same type signature as g_utf8_strdown.

~Paul

_______________________________________________
Devel mailing list
Devel@...
http://pidgin.im/cgi-bin/mailman/listinfo/devel

Re: Optimize jabber_id_new()

by Mark Doliner :: Rate this Message:

Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message

On Tue, Jun 30, 2009 at 12:05 PM, Paul Aurich<paul@...> wrote:

> On Jun 30, 2009, at 11:51, Mark Doliner wrote:
>>
>> I think these changes make sense.  I went with g_utf8_strdown()
>> instead of g_ascii_strdown() because the former takes a length
>> argument, and we want to dup and lowercase pieces of the string.  I'll
>> check if g_utf8_strdown() is significantly more expensive, and if it
>> is then maybe write our own g_ascii_strdown_len().
>
> I think you might be confusing g_strdown with g_ascii_strdown, which has
> the same type signature as g_utf8_strdown.

Woah, yeah, or maybe I was looking at g_ascii_tolower().  Thanks for
pointing that out!

-Mark

_______________________________________________
Devel mailing list
Devel@...
http://pidgin.im/cgi-bin/mailman/listinfo/devel

Re: Optimize jabber_id_new()

by Jan Kaluza :: Rate this Message:

Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message

On Tue, Jun 30, 2009 at 5:28 PM, Daniel Atallah<daniel.atallah@...> wrote:

> On Tue, Jun 30, 2009 at 4:31 AM, Mark Doliner<mark@...> wrote:
>> The jabber_id_new() function in libpurple/protocols/jabber/jutil.c is
>> pretty expensive.
>
> <SNIP/>
>
>> How do people feel about me checking this change into the jabber code
>> in libpurple?  Meebo probably has a larger percentage of
>> English-speaking users than Pidgin, so maybe our results are unfairly
>> biased.  Does anyone know how common non-ASCII JIDs are?
>>
>> I suspect that even for the case where the jid contains non-ASCII
>> characters our optimized version won't be very much slower, and might
>> even be faster (it makes one pass over the string to determine the
>> location of @ and / instead of calling g_utf8_strchr() twice (but
>> that's easy to fix on its own)).
>>
>> In other words: How does everyone feel about the attached patch?
>
> My (completely gut-based) guess is that non-ASCII JIDs are going to be
> pretty uncommon "in the wild".  My impression is that even for
> non-ASCII languages, people tend to use ASCII transliterations.
> Perhaps Peter Saint-Andre can offer some more useful context-specific
> insight .

In Czech Republic (our language uses non-ascii characters) people use
in almost 100% cases ascii JIDs, because they are used to do it
because of emails... So I agree with you :).

> I think the patch looks very reasonable - even in the case where it
> ends up having to g_utf8_validate(), the new overhead appears to be
> very minimal.
>
> -D
>
> _______________________________________________
> Devel mailing list
> Devel@...
> http://pidgin.im/cgi-bin/mailman/listinfo/devel
>

Jan "HanzZ" Kaluza

_______________________________________________
Devel mailing list
Devel@...
http://pidgin.im/cgi-bin/mailman/listinfo/devel