gtk-gnutella-devel Digest, Vol 26, Issue 1

View: New views
1 Messages — Rating Filter:   Alert me  

gtk-gnutella-devel Digest, Vol 26, Issue 1

by gtk-gnutella-devel-request :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Send gtk-gnutella-devel mailing list submissions to
        gtk-gnutella-devel@...

To subscribe or unsubscribe via the World Wide Web, visit
        https://lists.sourceforge.net/lists/listinfo/gtk-gnutella-devel
or, via email, send a message with subject or body 'help' to
        gtk-gnutella-devel-request@...

You can reach the person managing the list at
        gtk-gnutella-devel-owner@...

When replying, please edit your Subject line so it is more specific
than "Re: Contents of gtk-gnutella-devel digest..."


Today's Topics:

   1. Re:  QRP changes. (Christian Biere)
   2. Re:  QRP changes. (Bill Pringlemeir)
   3. Re:  QRP changes. (Christian Biere)
   4. Re:  QRP changes. (Bill Pringlemeir)
   5. Re:  QRP changes. (Christian Biere)
   6.  Starting out with gtk-gnutella development (Robbie)


----------------------------------------------------------------------

Message: 1
Date: Wed, 17 Sep 2008 17:18:30 +0200
From: Christian Biere <christianbiere@...>
Subject: Re: [gtk-gnutella-devel] QRP changes.
To: gtk-gnutella-devel@...
Message-ID: <20080917151830.GA1577@...>
Content-Type: text/plain; charset=utf-8

Bill Pringlemeir wrote:

>
> I have committed some changes to core/qrp.c (r15843),
>
>    Change qrp_can_route to use a function pointer in the routing_table
>    structure. Groups of route decision routines optimize for tables
>    tables from 16k to 2M (corresponding to 14 to 21 bits), in URN and
>    non-URN version are 'templated'.  The fixed shift allows compilers
>    to do a much better job of precomputing work.
>
>    Profiling indicates that the 2^16 and 2^17 are the most common
>    tables sizes (most likely due to lime wire).  However, all variants
>    were left.  The code size is very small compared to the table
>    sizes.  Also, the hottest routines will be cached.  It may be more
>    sensible to order the URN and non-URN together (same cache page).
>    However, it seems that URN searches are very rare (not supported by
>    LW?).
>
>    Alternate data structures could be employed with the function
>    pointer method.  Sparse tables could be represented by a trees.
>    Alternate function pointers could decode tables in either array
>    format or tree format; However, the tree structure would trade time
>    for space.
>
> My system has a dynamic clock [Intel Prescott].  The rates are 400,
> 800, 1200, 2400, 3200 MHz.  Previously, the CPU would spend time in
> many of the ranges.  With these changes, it is staying in the 400/800
> MHz range (also 8deg C cooler).  Top measurements are useless with a
> dynamic clock.  'gprof' also indicates an improvement.  However,
> multiple measurements are probably best for performance improvements.
>
> If you have top numbers before/after getting r15843 as an ultra-node.
> It might be helpful to know if these changes should stay.  Although,
> the changes should be architecture independent.

Isn't this a bit absurd? Maybe I'm missing something but all I can see
is that your code adds these two optimizations:

        32 - variable -> const
        hashcode >> variable -> hashcode >> const

It adds two assertion checks, one level of indirection and increases
the code size. Sure, the negatives might not outweigh the optimization
on average but I don't understand how these fairly minor optimizations
could make a significant difference. That is, unless the CPU design
is a complete failure. The website that shall not be named also claims
that the temperature sensor on the Prescott is reporting too high
values which means the performance difference is likely exaggerated
if measured in terms of temperature. By the way, the temperature
measurements reported in Celsius, Fahrenheit or Kelvin by diagnosis
tools are always fairly off anyway because these sensors don't work like
thermometers:

http://www.heise-online.co.uk/news/IDF-Why-many-system-info-tools-give-incorrect-CPU-temperatures--/111384

If this code is really so sensitive to optimization, I'd like to know
whether removing those assertion checks makes a significant difference.
Changing the loop to count towards zero might also gain a tiny
improvement. If non-constant shifting is so expensive, RT_SLOT_READ()
could use a small lookup-table instead.

--
Christian



------------------------------

Message: 2
Date: Wed, 17 Sep 2008 11:55:48 -0500
From: Bill Pringlemeir <bpringle@...>
Subject: Re: [gtk-gnutella-devel] QRP changes.
To: gtk-gnutella-devel@...
Message-ID: <87vdwuvhij.fsf@...>
Content-Type: text/plain; charset=us-ascii

On 17 Sep 2008, christianbiere@... wrote:

> Isn't this a bit absurd? Maybe I'm missing something but all I can see
> is that your code adds these two optimizations:
>
> 32 - variable -> const
> hashcode >> variable -> hashcode >> const

The hashcode is also used in the inline function RT_READ_SLOT.  I
added the asserts to protect against error on my part in this
implementation.  They can be removed if there are no reported
problems.  However, I was also unsure whether the URN lookups are
supported?  The splitting of the URN case is also a fairly large gain.
For a non-URN lookups, there is only one branch (if versus if/else).
Non-URN seems to be the most common form of query; this test can be
moved to the caller which can be constant over the node iteration.  A
call through a function pointer is not usually that big a hit (there
was a function call to a fixed address already).  Not having a has_urn
variable also seems to decrease register pressure.  The code size is
very small compared to the 2-256k tables.

Try objdump -S qrp.o to see the code size difference between
qrp_can_route_default and the macro code.  For cases like
'qrp_can_route_16', shifts are not always needed as 32/16/8 bit
manipulation can accomplish some of the work.  The 'qrp_can_route_16'
is the most common case in any runs I have observed.  Also not the
frame addressing [on x86 like -0x15(%epb)].  In the macro version,
most variables are in registers.

Regarding the temperature, I do not take it to be completely accurate.
However, all three temperature sensors have decrease by similar
amount.  Eight degrees is significant.  If it were two or three
degrees it would not be; I realize they are not high precision
instruments.  The clock frequency reported by 'powertop' would also
have be out (and gprof, etc).  I am looking into disabling the
frequency switching to get more accurate numbers via vmstat and top.

However, it is still beneficial to know if it performs better on PPC,
etc.

Sorry about the C++ comments.  Emacs did that automatically.

Regards,
Bill Pringlemeir.




------------------------------

Message: 3
Date: Wed, 17 Sep 2008 19:15:26 +0200
From: Christian Biere <christianbiere@...>
Subject: Re: [gtk-gnutella-devel] QRP changes.
To: gtk-gnutella-devel@...
Message-ID: <20080917171526.GA27745@...>
Content-Type: text/plain; charset=utf-8

Bill Pringlemeir wrote:
> On 17 Sep 2008, christianbiere@... wrote:
>
> > Isn't this a bit absurd? Maybe I'm missing something but all I can see
> > is that your code adds these two optimizations:
> >
> > 32 - variable -> const
> > hashcode >> variable -> hashcode >> const
>
> The hashcode is also used in the inline function RT_READ_SLOT.

Yes, but the shifted hashcode isn't a constant, so RT_READ_SLOT()
can hardly benefit from the constant shift value.

> I added the asserts to protect against error on my part in this
> implementation.  They can be removed if there are no reported
> problems.  However, I was also unsure whether the URN lookups are
> supported?

For the moment being, they are still routed. gtk-gnutella in
current SVN does not insert URNs in the QRT any longer, so
URN queries can never match locally. LimeWire and derivatives
haven't inserted URNs into the QRT for years. I believe they
support single-hop URN queries though.

> The splitting of the URN case is also a fairly large gain.
> For a non-URN lookups, there is only one branch (if versus if/else).
> Non-URN seems to be the most common form of query; this test can be
> moved to the caller which can be constant over the node iteration.  A
> call through a function pointer is not usually that big a hit (there
> was a function call to a fixed address already).  Not having a has_urn
> variable also seems to decrease register pressure.  The code size is
> very small compared to the 2-256k tables.

Sure, but I think code cache much smaller than data cache. I wasn't
trying to say that this is a problem with your optimization but
it's a negative factor no matter how minor it may be.

> Try objdump -S qrp.o to see the code size difference between
> qrp_can_route_default and the macro code.  For cases like
> 'qrp_can_route_16', shifts are not always needed as 32/16/8 bit
> manipulation can accomplish some of the work.  The 'qrp_can_route_16'
> is the most common case in any runs I have observed.  Also not the
> frame addressing [on x86 like -0x15(%epb)].  In the macro version,
> most variables are in registers.

Are you compiling with any -march flag? I believe for most of the code
optimizations are largely irrelevant but code like this will often
benefit significantly from CPU-specific optimizations.

--
Christian



------------------------------

Message: 4
Date: Wed, 17 Sep 2008 21:40:58 -0500
From: Bill Pringlemeir <bpringle@...>
Subject: Re: [gtk-gnutella-devel] QRP changes.
To: gtk-gnutella-devel@...
Message-ID: <878wtq18hx.fsf@...>
Content-Type: text/plain; charset=us-ascii

>> On 17 Sep 2008, christianbiere@... wrote:

>>> Isn't this a bit absurd? Maybe I'm missing something but all I can see
>>> is that your code adds these two optimizations:

> Bill Pringlemeir wrote:

>>> 32 - variable -> const
>>> hashcode >> variable -> hashcode >> const

>> The hashcode is also used in the inline function RT_READ_SLOT.

On 17 Sep 2008, christianbiere@... wrote:

> Yes, but the shifted hashcode isn't a constant, so RT_READ_SLOT()
> can hardly benefit from the constant shift value.

It looks like this,

        const guint   shift = 32 - bits;                                  \
        guint32 idx = qhv->vec[i].hashcode >> shift;                  \
 RT_SLOT_READ -> return 0 != (arena[idx >> 3] & (0x80U >> (i & 0x7)));

So if bits is 16, it is

    shift = 32 - 16 = 16
    idx = hash >> 16

***  RT_SLOT_READ arena[ hash >> 16 >> 3 ] & 0x80 >> (hash >> 16 & 7)

There are many more constant shift than there use to be.  For all x86
machines this is actually a big win as the variable shift has to be in
the 'c' register (afaik).  So the calculated shift has to be loaded
into cx.  I agree that x86 is stupid, but it also widely used.  I also
wanted to make sure that I didn't harm any other processors like the
PPC, ARM, MIPS, etc.  I think that majority of processors are x86 or
PPC.

> Sure, but I think code cache much smaller than data cache. I wasn't
> trying to say that this is a problem with your optimization but
> it's a negative factor no matter how minor it may be.

True. It ends up that many of the routines are unused.  Only the 16
and 17 non-URN are the frequent cases (80%+).  However, I think that
these two routines are about the size of the original general case
(about 40 instruction each versus 80 for the original).  The unused
code is just occupying RAM (like the tables).

> Are you compiling with any -march flag? I believe for most of the code
> optimizations are largely irrelevant but code like this will often
> benefit significantly from CPU-specific optimizations.

I do compile with flags specific to my CPU.  However, I never looked
at the assembler until your post.  gprof was indicating better
performance, but when I look at the number closely they don't seem to
make total sense.  My processor isn't switching frequencies as much,
but this could be just due to network variance.

Did you see no difference with whatever machine(s) you have?

Thanks,
Bill Pringlemeir.





------------------------------

Message: 5
Date: Thu, 18 Sep 2008 17:59:30 +0200
From: Christian Biere <christianbiere@...>
Subject: Re: [gtk-gnutella-devel] QRP changes.
To: gtk-gnutella-devel@...
Message-ID: <20080918155930.GA27426@...>
Content-Type: text/plain; charset=utf-8

Bill Pringlemeir wrote:

> >> On 17 Sep 2008, christianbiere@... wrote:
>
> >>> Isn't this a bit absurd? Maybe I'm missing something but all I can see
> >>> is that your code adds these two optimizations:
>
> > Bill Pringlemeir wrote:
>
> >>> 32 - variable -> const
> >>> hashcode >> variable -> hashcode >> const
>
> >> The hashcode is also used in the inline function RT_READ_SLOT.
>
> On 17 Sep 2008, christianbiere@... wrote:
>
> > Yes, but the shifted hashcode isn't a constant, so RT_READ_SLOT()
> > can hardly benefit from the constant shift value.
>
> It looks like this,
>
> const guint   shift = 32 - bits;                                  \
> guint32 idx = qhv->vec[i].hashcode >> shift;                  \
>  RT_SLOT_READ -> return 0 != (arena[idx >> 3] & (0x80U >> (i & 0x7)));

I looked at the generated code and the variants really only differed
with respect to the constants used for the shift opcodes. Otherwise,
the code and size was identical. There are no further shortcuts
applied. I think the only case where some additional optimization
was possible is "shift = 16" but you have to do that manually, GCC
at least doesn't make use of it. Maybe not because it doesn't know
but because it's not worth. Assembly wisdom, especially with respect
to x86, becomes outdated very quickly.

> So if bits is 16, it is
>
>     shift = 32 - 16 = 16
>     idx = hash >> 16
>
> ***  RT_SLOT_READ arena[ hash >> 16 >> 3 ] & 0x80 >> (hash >> 16 & 7)
>
> There are many more constant shift than there use to be.  For all x86
> machines this is actually a big win as the variable shift has to be in
> the 'c' register (afaik).  So the calculated shift has to be loaded
> into cx.  I agree that x86 is stupid, but it also widely used.  I also
> wanted to make sure that I didn't harm any other processors like the
> PPC, ARM, MIPS, etc.  I think that majority of processors are x86 or
> PPC.

I've supplied three different variants now. The one using a lookup
table to avoid shifting 0x80 seems to yield the smallest code. It's
not necessarily the fastest. Using div() looks worthwhile to me
as on x86 div yields both quotient and remainder and it's although
fast on modern x86, so it could be faster than doing two shifts
even if it increases code size slightly.

qrp_can_route_?? should be smaller by about 30% now. Compiling with
the default flags (that is -O2 and no -march) yields 104 byte for
these functions here and it was about 150 before (after removing
the assertion checks).

The most significant improvement comes actually from using qhv->count
differently (saving 31 byte). Using it as loop condition was very
sub-optimal.  Of course that's also due to the small size of the
function. In a larger function where registers are spilled all the time,
it wouldn't matter.

> I do compile with flags specific to my CPU.  However, I never looked
> at the assembler until your post.  gprof was indicating better
> performance, but when I look at the number closely they don't seem to
> make total sense.  My processor isn't switching frequencies as much,
> but this could be just due to network variance.
>
> Did you see no difference with whatever machine(s) you have?

So far I didn't benchmark any of it. While a micro-benchmark doesn't
necessarily yield realistic results, the macro benchmark depends on
far too many factors. I expect just being connected to a few leaves with
large QRTs could make a significant difference that would completely
overshadow your optimizations.

--
Christian



------------------------------

Message: 6
Date: Tue, 11 Nov 2008 18:11:44 +1100
From: Robbie <robbiesyd@...>
Subject: [gtk-gnutella-devel] Starting out with gtk-gnutella
        development
To: gtk-gnutella-devel@...
Message-ID:
        <addd36b60811102311h2d2499di626bd1b3a2bc4ee3@...>
Content-Type: text/plain; charset="iso-8859-1"

Whats the difference between a gnutella header and a gnutella message?
Is there any?

If there is, how is http involved? Is it just used for connecting, uploading
and downloading or is it used for
everything else as well?

Whats a safe, harmless header/message (both a header and message if theyre
different) that I can send at any time in an existing gnutella connection,
to a gnutella servent so that I can test my header/message sending
abilities? Im going to connect with gtk-gnutella and have its source code
send the header/message to the node.

I have found the gnutella_node object as well as the gnutella_socket, and I
can get a file descriptor for that
socket. I tried to send() an X-Alt header in ascii text (as defined in the
huge specification) to the file descriptor and the remote host disconnected
me (I was assuming that might be a safe header to send).

Anybody know where the function in gnutella is to send a header object or do
you just manually write to the socket's file descriptor? gtk-gnutella doesnt
have much documentation.

Regards,
Robbie
-------------- next part --------------
An HTML attachment was scrubbed...

------------------------------

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/

------------------------------

_______________________________________________
gtk-gnutella-devel mailing list
gtk-gnutella-devel@...
https://lists.sourceforge.net/lists/listinfo/gtk-gnutella-devel


End of gtk-gnutella-devel Digest, Vol 26, Issue 1
*************************************************