|
View:
New views
1 Messages
—
Rating Filter:
Alert me
|
|
|
gtk-gnutella-devel Digest, Vol 26, Issue 1Send gtk-gnutella-devel mailing list submissions to
gtk-gnutella-devel@... To subscribe or unsubscribe via the World Wide Web, visit https://lists.sourceforge.net/lists/listinfo/gtk-gnutella-devel or, via email, send a message with subject or body 'help' to gtk-gnutella-devel-request@... You can reach the person managing the list at gtk-gnutella-devel-owner@... When replying, please edit your Subject line so it is more specific than "Re: Contents of gtk-gnutella-devel digest..." Today's Topics: 1. Re: QRP changes. (Christian Biere) 2. Re: QRP changes. (Bill Pringlemeir) 3. Re: QRP changes. (Christian Biere) 4. Re: QRP changes. (Bill Pringlemeir) 5. Re: QRP changes. (Christian Biere) 6. Starting out with gtk-gnutella development (Robbie) ---------------------------------------------------------------------- Message: 1 Date: Wed, 17 Sep 2008 17:18:30 +0200 From: Christian Biere <christianbiere@...> Subject: Re: [gtk-gnutella-devel] QRP changes. To: gtk-gnutella-devel@... Message-ID: <20080917151830.GA1577@...> Content-Type: text/plain; charset=utf-8 Bill Pringlemeir wrote: > > I have committed some changes to core/qrp.c (r15843), > > Change qrp_can_route to use a function pointer in the routing_table > structure. Groups of route decision routines optimize for tables > tables from 16k to 2M (corresponding to 14 to 21 bits), in URN and > non-URN version are 'templated'. The fixed shift allows compilers > to do a much better job of precomputing work. > > Profiling indicates that the 2^16 and 2^17 are the most common > tables sizes (most likely due to lime wire). However, all variants > were left. The code size is very small compared to the table > sizes. Also, the hottest routines will be cached. It may be more > sensible to order the URN and non-URN together (same cache page). > However, it seems that URN searches are very rare (not supported by > LW?). > > Alternate data structures could be employed with the function > pointer method. Sparse tables could be represented by a trees. > Alternate function pointers could decode tables in either array > format or tree format; However, the tree structure would trade time > for space. > > My system has a dynamic clock [Intel Prescott]. The rates are 400, > 800, 1200, 2400, 3200 MHz. Previously, the CPU would spend time in > many of the ranges. With these changes, it is staying in the 400/800 > MHz range (also 8deg C cooler). Top measurements are useless with a > dynamic clock. 'gprof' also indicates an improvement. However, > multiple measurements are probably best for performance improvements. > > If you have top numbers before/after getting r15843 as an ultra-node. > It might be helpful to know if these changes should stay. Although, > the changes should be architecture independent. Isn't this a bit absurd? Maybe I'm missing something but all I can see is that your code adds these two optimizations: 32 - variable -> const hashcode >> variable -> hashcode >> const It adds two assertion checks, one level of indirection and increases the code size. Sure, the negatives might not outweigh the optimization on average but I don't understand how these fairly minor optimizations could make a significant difference. That is, unless the CPU design is a complete failure. The website that shall not be named also claims that the temperature sensor on the Prescott is reporting too high values which means the performance difference is likely exaggerated if measured in terms of temperature. By the way, the temperature measurements reported in Celsius, Fahrenheit or Kelvin by diagnosis tools are always fairly off anyway because these sensors don't work like thermometers: http://www.heise-online.co.uk/news/IDF-Why-many-system-info-tools-give-incorrect-CPU-temperatures--/111384 If this code is really so sensitive to optimization, I'd like to know whether removing those assertion checks makes a significant difference. Changing the loop to count towards zero might also gain a tiny improvement. If non-constant shifting is so expensive, RT_SLOT_READ() could use a small lookup-table instead. -- Christian ------------------------------ Message: 2 Date: Wed, 17 Sep 2008 11:55:48 -0500 From: Bill Pringlemeir <bpringle@...> Subject: Re: [gtk-gnutella-devel] QRP changes. To: gtk-gnutella-devel@... Message-ID: <87vdwuvhij.fsf@...> Content-Type: text/plain; charset=us-ascii On 17 Sep 2008, christianbiere@... wrote: > Isn't this a bit absurd? Maybe I'm missing something but all I can see > is that your code adds these two optimizations: > > 32 - variable -> const > hashcode >> variable -> hashcode >> const The hashcode is also used in the inline function RT_READ_SLOT. I added the asserts to protect against error on my part in this implementation. They can be removed if there are no reported problems. However, I was also unsure whether the URN lookups are supported? The splitting of the URN case is also a fairly large gain. For a non-URN lookups, there is only one branch (if versus if/else). Non-URN seems to be the most common form of query; this test can be moved to the caller which can be constant over the node iteration. A call through a function pointer is not usually that big a hit (there was a function call to a fixed address already). Not having a has_urn variable also seems to decrease register pressure. The code size is very small compared to the 2-256k tables. Try objdump -S qrp.o to see the code size difference between qrp_can_route_default and the macro code. For cases like 'qrp_can_route_16', shifts are not always needed as 32/16/8 bit manipulation can accomplish some of the work. The 'qrp_can_route_16' is the most common case in any runs I have observed. Also not the frame addressing [on x86 like -0x15(%epb)]. In the macro version, most variables are in registers. Regarding the temperature, I do not take it to be completely accurate. However, all three temperature sensors have decrease by similar amount. Eight degrees is significant. If it were two or three degrees it would not be; I realize they are not high precision instruments. The clock frequency reported by 'powertop' would also have be out (and gprof, etc). I am looking into disabling the frequency switching to get more accurate numbers via vmstat and top. However, it is still beneficial to know if it performs better on PPC, etc. Sorry about the C++ comments. Emacs did that automatically. Regards, Bill Pringlemeir. ------------------------------ Message: 3 Date: Wed, 17 Sep 2008 19:15:26 +0200 From: Christian Biere <christianbiere@...> Subject: Re: [gtk-gnutella-devel] QRP changes. To: gtk-gnutella-devel@... Message-ID: <20080917171526.GA27745@...> Content-Type: text/plain; charset=utf-8 Bill Pringlemeir wrote: > On 17 Sep 2008, christianbiere@... wrote: > > > Isn't this a bit absurd? Maybe I'm missing something but all I can see > > is that your code adds these two optimizations: > > > > 32 - variable -> const > > hashcode >> variable -> hashcode >> const > > The hashcode is also used in the inline function RT_READ_SLOT. Yes, but the shifted hashcode isn't a constant, so RT_READ_SLOT() can hardly benefit from the constant shift value. > I added the asserts to protect against error on my part in this > implementation. They can be removed if there are no reported > problems. However, I was also unsure whether the URN lookups are > supported? For the moment being, they are still routed. gtk-gnutella in current SVN does not insert URNs in the QRT any longer, so URN queries can never match locally. LimeWire and derivatives haven't inserted URNs into the QRT for years. I believe they support single-hop URN queries though. > The splitting of the URN case is also a fairly large gain. > For a non-URN lookups, there is only one branch (if versus if/else). > Non-URN seems to be the most common form of query; this test can be > moved to the caller which can be constant over the node iteration. A > call through a function pointer is not usually that big a hit (there > was a function call to a fixed address already). Not having a has_urn > variable also seems to decrease register pressure. The code size is > very small compared to the 2-256k tables. Sure, but I think code cache much smaller than data cache. I wasn't trying to say that this is a problem with your optimization but it's a negative factor no matter how minor it may be. > Try objdump -S qrp.o to see the code size difference between > qrp_can_route_default and the macro code. For cases like > 'qrp_can_route_16', shifts are not always needed as 32/16/8 bit > manipulation can accomplish some of the work. The 'qrp_can_route_16' > is the most common case in any runs I have observed. Also not the > frame addressing [on x86 like -0x15(%epb)]. In the macro version, > most variables are in registers. Are you compiling with any -march flag? I believe for most of the code optimizations are largely irrelevant but code like this will often benefit significantly from CPU-specific optimizations. -- Christian ------------------------------ Message: 4 Date: Wed, 17 Sep 2008 21:40:58 -0500 From: Bill Pringlemeir <bpringle@...> Subject: Re: [gtk-gnutella-devel] QRP changes. To: gtk-gnutella-devel@... Message-ID: <878wtq18hx.fsf@...> Content-Type: text/plain; charset=us-ascii >> On 17 Sep 2008, christianbiere@... wrote: >>> Isn't this a bit absurd? Maybe I'm missing something but all I can see >>> is that your code adds these two optimizations: > Bill Pringlemeir wrote: >>> 32 - variable -> const >>> hashcode >> variable -> hashcode >> const >> The hashcode is also used in the inline function RT_READ_SLOT. On 17 Sep 2008, christianbiere@... wrote: > Yes, but the shifted hashcode isn't a constant, so RT_READ_SLOT() > can hardly benefit from the constant shift value. It looks like this, const guint shift = 32 - bits; \ guint32 idx = qhv->vec[i].hashcode >> shift; \ RT_SLOT_READ -> return 0 != (arena[idx >> 3] & (0x80U >> (i & 0x7))); So if bits is 16, it is shift = 32 - 16 = 16 idx = hash >> 16 *** RT_SLOT_READ arena[ hash >> 16 >> 3 ] & 0x80 >> (hash >> 16 & 7) There are many more constant shift than there use to be. For all x86 machines this is actually a big win as the variable shift has to be in the 'c' register (afaik). So the calculated shift has to be loaded into cx. I agree that x86 is stupid, but it also widely used. I also wanted to make sure that I didn't harm any other processors like the PPC, ARM, MIPS, etc. I think that majority of processors are x86 or PPC. > Sure, but I think code cache much smaller than data cache. I wasn't > trying to say that this is a problem with your optimization but > it's a negative factor no matter how minor it may be. True. It ends up that many of the routines are unused. Only the 16 and 17 non-URN are the frequent cases (80%+). However, I think that these two routines are about the size of the original general case (about 40 instruction each versus 80 for the original). The unused code is just occupying RAM (like the tables). > Are you compiling with any -march flag? I believe for most of the code > optimizations are largely irrelevant but code like this will often > benefit significantly from CPU-specific optimizations. I do compile with flags specific to my CPU. However, I never looked at the assembler until your post. gprof was indicating better performance, but when I look at the number closely they don't seem to make total sense. My processor isn't switching frequencies as much, but this could be just due to network variance. Did you see no difference with whatever machine(s) you have? Thanks, Bill Pringlemeir. ------------------------------ Message: 5 Date: Thu, 18 Sep 2008 17:59:30 +0200 From: Christian Biere <christianbiere@...> Subject: Re: [gtk-gnutella-devel] QRP changes. To: gtk-gnutella-devel@... Message-ID: <20080918155930.GA27426@...> Content-Type: text/plain; charset=utf-8 Bill Pringlemeir wrote: > >> On 17 Sep 2008, christianbiere@... wrote: > > >>> Isn't this a bit absurd? Maybe I'm missing something but all I can see > >>> is that your code adds these two optimizations: > > > Bill Pringlemeir wrote: > > >>> 32 - variable -> const > >>> hashcode >> variable -> hashcode >> const > > >> The hashcode is also used in the inline function RT_READ_SLOT. > > On 17 Sep 2008, christianbiere@... wrote: > > > Yes, but the shifted hashcode isn't a constant, so RT_READ_SLOT() > > can hardly benefit from the constant shift value. > > It looks like this, > > const guint shift = 32 - bits; \ > guint32 idx = qhv->vec[i].hashcode >> shift; \ > RT_SLOT_READ -> return 0 != (arena[idx >> 3] & (0x80U >> (i & 0x7))); I looked at the generated code and the variants really only differed with respect to the constants used for the shift opcodes. Otherwise, the code and size was identical. There are no further shortcuts applied. I think the only case where some additional optimization was possible is "shift = 16" but you have to do that manually, GCC at least doesn't make use of it. Maybe not because it doesn't know but because it's not worth. Assembly wisdom, especially with respect to x86, becomes outdated very quickly. > So if bits is 16, it is > > shift = 32 - 16 = 16 > idx = hash >> 16 > > *** RT_SLOT_READ arena[ hash >> 16 >> 3 ] & 0x80 >> (hash >> 16 & 7) > > There are many more constant shift than there use to be. For all x86 > machines this is actually a big win as the variable shift has to be in > the 'c' register (afaik). So the calculated shift has to be loaded > into cx. I agree that x86 is stupid, but it also widely used. I also > wanted to make sure that I didn't harm any other processors like the > PPC, ARM, MIPS, etc. I think that majority of processors are x86 or > PPC. I've supplied three different variants now. The one using a lookup table to avoid shifting 0x80 seems to yield the smallest code. It's not necessarily the fastest. Using div() looks worthwhile to me as on x86 div yields both quotient and remainder and it's although fast on modern x86, so it could be faster than doing two shifts even if it increases code size slightly. qrp_can_route_?? should be smaller by about 30% now. Compiling with the default flags (that is -O2 and no -march) yields 104 byte for these functions here and it was about 150 before (after removing the assertion checks). The most significant improvement comes actually from using qhv->count differently (saving 31 byte). Using it as loop condition was very sub-optimal. Of course that's also due to the small size of the function. In a larger function where registers are spilled all the time, it wouldn't matter. > I do compile with flags specific to my CPU. However, I never looked > at the assembler until your post. gprof was indicating better > performance, but when I look at the number closely they don't seem to > make total sense. My processor isn't switching frequencies as much, > but this could be just due to network variance. > > Did you see no difference with whatever machine(s) you have? So far I didn't benchmark any of it. While a micro-benchmark doesn't necessarily yield realistic results, the macro benchmark depends on far too many factors. I expect just being connected to a few leaves with large QRTs could make a significant difference that would completely overshadow your optimizations. -- Christian ------------------------------ Message: 6 Date: Tue, 11 Nov 2008 18:11:44 +1100 From: Robbie <robbiesyd@...> Subject: [gtk-gnutella-devel] Starting out with gtk-gnutella development To: gtk-gnutella-devel@... Message-ID: <addd36b60811102311h2d2499di626bd1b3a2bc4ee3@...> Content-Type: text/plain; charset="iso-8859-1" Whats the difference between a gnutella header and a gnutella message? Is there any? If there is, how is http involved? Is it just used for connecting, uploading and downloading or is it used for everything else as well? Whats a safe, harmless header/message (both a header and message if theyre different) that I can send at any time in an existing gnutella connection, to a gnutella servent so that I can test my header/message sending abilities? Im going to connect with gtk-gnutella and have its source code send the header/message to the node. I have found the gnutella_node object as well as the gnutella_socket, and I can get a file descriptor for that socket. I tried to send() an X-Alt header in ascii text (as defined in the huge specification) to the file descriptor and the remote host disconnected me (I was assuming that might be a safe header to send). Anybody know where the function in gnutella is to send a header object or do you just manually write to the socket's file descriptor? gtk-gnutella doesnt have much documentation. Regards, Robbie -------------- next part -------------- An HTML attachment was scrubbed... ------------------------------ ------------------------------------------------------------------------- This SF.Net email is sponsored by the Moblin Your Move Developer's challenge Build the coolest Linux based applications with Moblin SDK & win great prizes Grand prize is a trip for two to an Open Source event anywhere in the world http://moblin-contest.org/redirect.php?banner_id=100&url=/ ------------------------------ _______________________________________________ gtk-gnutella-devel mailing list gtk-gnutella-devel@... https://lists.sourceforge.net/lists/listinfo/gtk-gnutella-devel End of gtk-gnutella-devel Digest, Vol 26, Issue 1 ************************************************* |
| Free embeddable forum powered by Nabble | Forum Help |