|
View:
New views
20 Messages
—
Rating Filter:
Alert me
|
| < Prev | 1 - 2 | Next > |
|
|
Understanding IRAHi Vladimir,
I have just begun working for Icera Inc, on a private port of GCC, and my first task has been to investigate the performance impact of using the Chaitin-Briggs algorithm of IRA vs the priority algorithm that we currently have enabled in our production compiler (due to not defining IRA_COVER_CLASSES). After running various tests, I measured little performance changes except for some test cases that have regressed quite considerably. My suspicion was that these regressions were due to our architecture having some instructions that can only use a subset of the register bank, so I set about trying to understand IRA enough to confirm this. I now believe I have now reached a point where I understand IRA enough to start thinking about how I can improve these regressed cases, but I thought it might be a good idea to formulate an email with my current understanding of IRA in it, so you could correct any errors before I continue. (I figure this exchange between us will also serve useful to other newbies in the field of GCC register allocation!) Without further ado, here is my current understanding of IRA (numbered for ease of reference). 1. You can enable full IRA, with Chaitin-Briggs coloring, by defining IRA_COVER_CLASSES, but you can just as easily disable it again by supplying -fira-algorithm=priority at the command line. (The latter is useful to know because it means you can compare two versions without recompiling.) 2. The priority algorithm allocates in an order determined by the allocno_priority_compare_func and calls into assign_hard_reg in that order to determine which allocno gets what register (or memory). 3. The CB algorithm puts allocnos into colored and uncolored buckets, according to conflicts between allocnos. The more conflicts, the more chance of being put in the uncolored bucket. Being in the uncolored bucket increases an allocno's chance of being spilled; being in the colored bucket guarantees that it will get a hard register. 4. When you run at -O0, the conflicts aspect of IRA is disabled (ira_conflict_p is set to false), and this subsequently causes the fast_allocation function to be used instead of the do_coloring IRA function. (This makes sense since you cannot separate into buckets without inspecting conflicts.) 5. Assuming ira_conflicts_p is enabled and you are using the CB algorithm, the two buckets are sorted using a bucket_allocno_compare_func, and this ordering is important, since it dictates the order each allocno will be pushed onto the stack and hence the order it will be popped from it (and assign_hard_reg called). 6. Allocnos from the colorable bucket always go on the stack first (in the order they were sorted) and then uncolorable ones, in an order based on which is the best candidate for spilling. Each time an allocno is added to the stack, conflicts may be reduced and there is the possibility of moving one or more allocnos from the uncolorable bucket to the colorable one (so they end up being added to the stack sooner than the other uncolorable ones). Near the end of processing the uncolorable bucket, there comes an opportunity to move all remaining allocnos to the colorable bucket (due to few remaining conflicts) and these get put on the stack last, and hence popped first, and so they all get registers. These final uncolorables will be those that had the highest spill cost and/or the most remaining conflicts. 7. The spill cost and priority calculation for determining which uncolorable to put on the stack first is important, for two main reasons: firstly, whichever allocno is picked is a more likely candidate for spilling; secondly, it will free other allocnos up to be added to the nicer colored bucket, from which an allocno always receives a hard register. If you could let me know if and where I've gone wrong with any of the above, I would be extremely grateful. Then, once we are on the same page, I hope you could also make some suggestions as to how I might help IRA work well with our instructions that can only use a subset of the register bank. Best regards, Ian Bolton (Compiler Engineer, Icera Inc.) |
|
|
Re: Understanding IRAIan Bolton wrote:
> Hi Vladimir, > > I have just begun working for Icera Inc, on a private port of GCC, > and my first task has been to investigate the performance impact of > using the Chaitin-Briggs algorithm of IRA vs the priority algorithm > that we currently have enabled in our production compiler (due to not > defining IRA_COVER_CLASSES). > > After running various tests, I measured little performance changes > except for some test cases that have regressed quite considerably. > My suspicion was that these regressions were due to our architecture > having some instructions that can only use a subset of the register > bank, so I set about trying to understand IRA enough to confirm this. > there are instructions using specific register subsets. It is well known and some researchers tried to solve the problem. The most known article about this is "A generalized algorithm for graph-coloring register allocation": http://130.203.133.121:8080/viewdoc/summary?doi=10.1.1.107.113 I am still working on this problem but I don't know when it will be finally in gcc. > I now believe I have now reached a point where I understand IRA > enough to start thinking about how I can improve these regressed > cases, but I thought it might be a good idea to formulate an email > with my current understanding of IRA in it, so you could correct any > errors before I continue. (I figure this exchange between us will > also serve useful to other newbies in the field of GCC register > allocation!) > The biggest problem of GCC RA is still reload pass. It frequently changes decisions of IRA not in a good way. > Without further ado, here is my current understanding of IRA > (numbered for ease of reference). > > 1. You can enable full IRA, with Chaitin-Briggs coloring, by defining > IRA_COVER_CLASSES, but you can just as easily disable it again by > supplying -fira-algorithm=priority at the command line. (The latter > is useful to know because it means you can compare two versions > without recompiling.) > Correct. > 2. The priority algorithm allocates in an order determined by the > allocno_priority_compare_func and calls into assign_hard_reg in that > order to determine which allocno gets what register (or memory). > Correct. > 3. The CB algorithm puts allocnos into colored and uncolored buckets, > according to conflicts between allocnos. The more conflicts, the > more chance of being put in the uncolored bucket. Being in the > uncolored bucket increases an allocno's chance of being spilled; > being in the colored bucket guarantees that it will get a hard > register. > I'd say if allocno goes to the coloring stack from the uncolored bucket, it will be probably spilled. Going to the coloring stack from coloring bucket means that the allocno most probably get a hard register. In theory it should be guaranteed but in practice it might not happen because of multi-registers allocnos, hard register conflicts with allocno, non-profitability of some hard registers (e.g. callee-saved vs callee-clobbered registers). Being in colored or uncolored bucket is defined by trivial colorability criteria which is classically based on conflicts left in the graph for given allocno. The criteria can be more complicated and exact but it might slow down RA a lot. > 4. When you run at -O0, the conflicts aspect of IRA is disabled > (ira_conflict_p is set to false), and this subsequently causes the > fast_allocation function to be used instead of the do_coloring IRA > function. (This makes sense since you cannot separate into buckets > without inspecting conflicts.) > Correct. Building conflicts is an expansive operation. Therefore -O0 (or fast) RA is based on live ranges. > 5. Assuming ira_conflicts_p is enabled and you are using the CB > algorithm, the two buckets are sorted using a > bucket_allocno_compare_func, and this ordering is important, since it > dictates the order each allocno will be pushed onto the stack and > hence the order it will be popped from it (and assign_hard_reg > called). > Correct. But I should say it is only used for currently colored allocnos. You can not reorder two allocnos when putting the first allocno on the stack (in order words, removing trivially colored allocno from the graph) results in making the second allocno trivially colored. > 6. Allocnos from the colorable bucket always go on the stack first > (in the order they were sorted) and then uncolorable ones, in an > order based on which is the best candidate for spilling. Each time > an allocno is added to the stack, conflicts may be reduced and there > is the possibility of moving one or more allocnos from the > uncolorable bucket to the colorable one (so they end up being added > to the stack sooner than the other uncolorable ones). Near the end > of processing the uncolorable bucket, there comes an opportunity to > move all remaining allocnos to the colorable bucket (due to few > remaining conflicts) and these get put on the stack last, and hence > popped first, and so they all get registers. can be found in any compiler book. > These final uncolorables > will be those that had the highest spill cost and/or the most > remaining conflicts. > Sometimes such allocnos get hard register (it is implemented by optimistic coloring which is the biggest invention of Briggs). It is very hard to predict when it happens. > 7. The spill cost and priority calculation for determining which > uncolorable to put on the stack first is important, for two main > reasons: firstly, whichever allocno is picked is a more likely > candidate for spilling; secondly, it will free other allocnos up to > be added to the nicer colored bucket, from which an allocno always > receives a hard register. > Correct with mentioned above problem that allocnos from colored bucket might not get a hard register. Chaitin-Briggs algorithm is good at coloring but not good at spilling which should be used on register pressure criteria and this is very different from # conflicts. Priority coloring, imho, better dealing with spilling because its priority takes register pressure into account by using live range length. From well known RA algorithms, ELS (extended linear scan RA) probably does the best decision about spilling but it is not good about coloring (especially for code with complicated CFG). > If you could let me know if and where I've gone wrong with any of the > above, I would be extremely grateful. Then, once we are on the same > page, I hope you could also make some suggestions as to how I might > help IRA work well with our instructions that can only use a subset > of the register bank. > Subsets are complicated issue. I've been working on different approaches to this problem for year without significant improvements. As I wrote I hope it finally will be in GCC in some time and necessity for IRA_COVER_CLASS will finally go away. > Best regards, Ian Bolton (Compiler Engineer, Icera Inc.) |
|
|
Re: Understanding IRAIan Bolton wrote:
> I hope you could also make some suggestions as to how I might > help IRA work well with our instructions that can only use a subset > of the register bank. > I forgot to write: thanks, it would be interesting for me to see your suggestions :) |
|
|
Re: Understanding IRAOn 10/16/09 08:53, Vladimir Makarov wrote:
> > The biggest problem of GCC RA is still reload pass. It frequently > changes decisions of IRA not in a good way. Agreed. Not only may reload make a bad choice, it's horribly unpredictable. Trivial changes often lead to drastically different reloading decisions which in turn drastically change the final output. One of my favorites right now is the round-robin selection of spill registers to encourage reload inheritance. While I certainly understand the reasoning behind the code, it's amazingly frustrating to watch reload choose the worst possible spill register simply because of the round-robin selection. I've got a little hack in the reload-v2 branch which queries IRA to mitigate this problem, but it's merely a short-term hack until I can make reload inheritance go away. jeff |
|
|
RE: Understanding IRAHi Jeff and Vladimir.
Jeff: I'd be interested in trying the patch if you can send it my way. Vladimir: Today I have seen reasonable improvements on progressively-trimmed-down versions of a *single* test case by applying some cost adjustments in the case when an operand wants one of our bottom registers and IRA is considering giving it an alternative register or memory. I will be experimenting with different values for these two new costs and some more test cases tomorrow. I wonder if the optimal register allocator will just end up being one that tries all the algorithms and picks the best output for a given case? When it comes to optimisation, I prefer to bend the rules if at all possible! Best regards, Ian -----Original Message----- From: Jeff Law [mailto:law@...] Sent: 16 October 2009 16:45 To: Vladimir Makarov Cc: Ian Bolton; gcc@... Subject: Re: Understanding IRA On 10/16/09 08:53, Vladimir Makarov wrote: > > The biggest problem of GCC RA is still reload pass. It frequently > changes decisions of IRA not in a good way. Agreed. Not only may reload make a bad choice, it's horribly unpredictable. Trivial changes often lead to drastically different reloading decisions which in turn drastically change the final output. One of my favorites right now is the round-robin selection of spill registers to encourage reload inheritance. While I certainly understand the reasoning behind the code, it's amazingly frustrating to watch reload choose the worst possible spill register simply because of the round-robin selection. I've got a little hack in the reload-v2 branch which queries IRA to mitigate this problem, but it's merely a short-term hack until I can make reload inheritance go away. jeff |
|
|
Re: Understanding IRAIan Bolton wrote:
> Hi Jeff and Vladimir. > > Jeff: I'd be interested in trying the patch if you can send it my way. > > Vladimir: Today I have seen reasonable improvements on > progressively-trimmed-down versions of a *single* test case by applying > some cost adjustments in the case when an operand wants one of our > bottom registers and IRA is considering giving it an alternative > register or memory. I will be experimenting with different values for > these two new costs and some more test cases tomorrow. > > I wonder if the optimal register allocator will just end up being one > that tries all the algorithms and picks the best output for a given > case? When it comes to optimisation, I prefer to bend the rules if at > all possible! > > doubts) but I did a lot of experiments recently (last year) with ILP based register allocators: from simple spill model (which pseudos should be spilled) to RA with coalescing, live range splitting, rematerialization, and code selection (it needs models on hard reg levels) including some models closed to Apel's and two Wilken's approaches. The simplest model can be only used to solve medium-size functions in reasonable time (about 10 minutes). But even this model can not be used for function like reload.c::find_reloads. More complex model problems can not be solved even for many tiny benchmarks (like heapsort) in reasonable time. Although I used GLPK which is much slower than the best package (CPLEX). Other algorithms for soliving ILP (like Balas' one) are even much worse. The improvement for the simplest model was not significant unless profiling was used (in this case some SPECInt2000 benchmarks were improved upto 20% on x86). In any case, I did not find optimal RA based on ILP is practical. About 4 years ago, I tried QAP (quadratic assignment problem) model approaches with the same conclusion. Although I did not try multi commodity flow network approach for which there are better specialized algorithms because there are no good GPL-based packages for this (unfortunately the best one MCF has no such license). It could be an interesting research. |
|
|
Re: Understanding IRAOn 10/19/09 12:30, Ian Bolton wrote:
> Hi Jeff and Vladimir. > > Jeff: I'd be interested in trying the patch if you can send it my way. > It's nothing special. /* Return nonzero if REGNO is a particularly bad choice for reloading X. */ static int ira_bad_reload_regno_1 (int regno, rtx x) { int x_regno; ira_allocno_t a; enum reg_class pref; /* We only deal with pseudo regs. */ if (! x || GET_CODE (x) != REG) return 0; x_regno = REGNO (x); if (x_regno < FIRST_PSEUDO_REGISTER) return 0; /* If the pseudo prefers REGNO explicitly, then do not consider REGNO a bad spill choice. */ pref = reg_preferred_class (x_regno); if (reg_class_size[pref] == 1 && TEST_HARD_REG_BIT (reg_class_contents[pref], regno)) return 0; /* If the pseudo conflicts with REGNO, then we consider REGNO a poor choice for a reload regno. */ a = ira_regno_allocno_map[x_regno]; if (TEST_HARD_REG_BIT (ALLOCNO_TOTAL_CONFLICT_HARD_REGS (a), regno)) return 1; return 0; } /* Return nonzero if REGNO is a particularly bad choice for reloading IN or OUT. */ int ira_bad_reload_regno (int regno, rtx in, rtx out) { return (ira_bad_reload_regno_1 (regno, in) || ira_bad_reload_regno_1 (regno, out)); } Then change the loop in allocate_reload_reg to iterate 3 times intead of 2. And add this fragment if (pass == 1 && ira_bad_reload_regno (regnum, rld[r].in, rld[r].out)) continue; To body of hte conditional starting with if ((reload_reg_free_p ... It's really just a hack. I don't want to spend much time on that code as ultimately I want it all to go away. Jeff |
|
|
RE: Understanding IRAHi again, Vladimir,
I am pleased to report some performance improvements after altering ira-costs.c. A key benchmark for us has improved by 5%. Specifically, in record_reg_classes(), after the alt_cost has been calculated and it will be applied to pp->mem_cost and pp->cost[k], I check whether this particular operand wanted one of our BOTTOM_REGS (r0-r15) and I further increase the pp->mem_cost by an arbitrary amount and also increase pp->cost[k] by an arbitrary amount if k does not represent the BOTTOM_REGS class. My aim here is to nudge IRA in the right direction for operands that just want BOTTOM_REGS. After experimenting with different values for my "arbitrary amounts", I discovered some that successfully made IRA more likely to give BOTTOM_REGS to those instructions/operands that want BOTTOM_REGS, since any other regs and memory ended up with high enough costs for IRA to try and avoid using them. I have included a snippet from my version of record_reg_classes() below: ==================================== op_cost_add = alt_cost * frequency; /* Finally, update the costs with the information we've calculated about this alternative. */ for (i = 0; i < n_ops; i++) if (REG_P (ops[i]) && REGNO (ops[i]) >= FIRST_PSEUDO_REGISTER) { struct costs *pp = op_costs[i], *qq = this_op_costs[i]; int scale = 1 + (recog_data.operand_type[i] == OP_INOUT); /* If this operand really wanted a BOTTOM_REG, add an extra cost onto memory to nudge IRA away from putting it in memory */ if (allocno_pref && allocno_pref[ALLOCNO_NUM(ira_curr_regno_allocno_map [REGNO (ops[i])])] == BOTTOM_REGS) { pp->mem_cost = MIN (pp->mem_cost, (qq->mem_cost + op_cost_add + (flag_ira_preferred_register_cost_memory * frequency)) * scale); } else { pp->mem_cost = MIN (pp->mem_cost, (qq->mem_cost + op_cost_add) * scale); } for (k = 0; k < cost_classes_num; k++) { /* If this operand really wanted a BOTTOM_REG, add an extra cost onto any register class that isn't BOTTOM_REGS to nudge IRA away from putting it in a hard register of that class */ if (allocno_pref && allocno_pref[ALLOCNO_NUM(ira_curr_regno_allocno_map [REGNO (ops[i])])] == BOTTOM_REGS) { switch(cost_classes[k]) { case BOTTOM_REGS: op_cost_add = alt_cost * frequency; break; case TOP_CREGS: case C_REGS: op_cost_add = (alt_cost + flag_ira_preferred_register_cost_register) * frequency; break; default: op_cost_add = alt_cost * frequency; break; } } pp->cost[k] = MIN (pp->cost[k], (qq->cost[k] + op_cost_add) * scale); } } ==================================== So far, I have found the best value for flag_ira_preferred_register_cost_memory to be 20 and the best value for flag_ira_preferred_register_cost_register to be 6. I appreciate that these numbers do not really correlate with the other cost units but they were the ones that made the impact. In terms of coloring algorithms, we are still seeing better performance with the priority algorithm on our benchmarks, but the cost adjustments above improved both priority algorithm and the CB algorithm, with ira-region=mixed and ira-region=one. If you have any thoughts you'd like to share then I'd definitely be interested, but this post is mainly because you said in a previous email that you wanted to hear my suggestions :) Best regards, Ian >Ian Bolton wrote: >> I hope you could also make some suggestions as to how I might >> help IRA work well with our instructions that can only use a >> subset of the register bank. >> >I forgot to write: thanks, it would be interesting for me to see >your suggestions :) |
|
|
Re: Understanding IRAOn 11/03/09 09:29, Ian Bolton wrote:
> Hi again, Vladimir, > > I am pleased to report some performance improvements after altering > ira-costs.c. A key benchmark for us has improved by 5%. > > Specifically, in record_reg_classes(), after the alt_cost has been > calculated and it will be applied to pp->mem_cost and pp->cost[k], I > check whether this particular operand wanted one of our BOTTOM_REGS > (r0-r15) and I further increase the pp->mem_cost by an arbitrary > amount and also increase pp->cost[k] by an arbitrary amount if k > does not represent the BOTTOM_REGS class. My aim here is to nudge > IRA in the right direction for operands that just want BOTTOM_REGS. > > After experimenting with different values for my "arbitrary > amounts", I discovered some that successfully made IRA more likely > to give BOTTOM_REGS to those instructions/operands that want > BOTTOM_REGS, since any other regs and memory ended up with high > enough costs for IRA to try and avoid using them. > > I have included a snippet from my version of record_reg_classes() > below: > > IRA aren't showing a lower cost for using BOTTOM_REGS (or a higher cost for TOP_REGS). i.e. I don't think any of this should be necessary as IRA should already be doing something similar. This may be a case where your backend hasn't indicated that TOP_REGS has a higher cost than BOTTOM_REGS in particular situations. jeff |
|
|
Re: Understanding IRAJeff Law wrote:
> On 11/03/09 09:29, Ian Bolton wrote: >> Hi again, Vladimir, >> >> I am pleased to report some performance improvements after altering >> ira-costs.c. A key benchmark for us has improved by 5%. >> >> Specifically, in record_reg_classes(), after the alt_cost has been >> calculated and it will be applied to pp->mem_cost and pp->cost[k], I >> check whether this particular operand wanted one of our BOTTOM_REGS >> (r0-r15) and I further increase the pp->mem_cost by an arbitrary >> amount and also increase pp->cost[k] by an arbitrary amount if k >> does not represent the BOTTOM_REGS class. My aim here is to nudge >> IRA in the right direction for operands that just want BOTTOM_REGS. >> >> After experimenting with different values for my "arbitrary >> amounts", I discovered some that successfully made IRA more likely >> to give BOTTOM_REGS to those instructions/operands that want >> BOTTOM_REGS, since any other regs and memory ended up with high >> enough costs for IRA to try and avoid using them. >> >> I have included a snippet from my version of record_reg_classes() >> below: >> >> > What I don't understand at this point is why the current mechanisms in > IRA aren't showing a lower cost for using BOTTOM_REGS (or a higher > cost for TOP_REGS). i.e. I don't think any of this should be > necessary as IRA should already be doing something similar. > > This may be a case where your backend hasn't indicated that TOP_REGS > has a higher cost than BOTTOM_REGS in particular situations. > without the architecture knowledge and some macro values in your port (IRA_COVER_CLASSES, MEMORY_MOVE_COST, and REGISTER_MOVE_COST). I'd also add that besides right macro value definitions, you could use insn alternative hints near register constraints like ? or even *. |
|
|
RE: Understanding IRAHi Jeff,
From an empirical perspective, the value of your patch is hard to determine at this stage - one benchmark improved about 0.5% but others have marginally regressed. From an intellectual perspective, however, your patch is clearly a Good Thing. If I am understanding correctly, your aim is to prevent cases where reload evicts a pseudo from a register that conflicts with the pseudo that we are trying to reload, thereby undoing the clever cost- based logic in IRA that gave the register to the current owner in the first place? My guess is that the performance drop could be attributed to reload being lucky in some cases and your patch is preventing this luck from happening. Whilst I'm more of a risk-taker by nature, my employer would prefer more predictable behaviour from its compiler, so we will likely commit your patch to our development branch. Thanks very much! Is there an ETA on when reload will be gone away? ;-) Best regards, Ian > -----Original Message----- > From: Jeff Law [mailto:law@...] > Sent: 22 October 2009 23:05 > To: Ian Bolton > Cc: Vladimir Makarov; gcc@... > Subject: Re: Understanding IRA > > On 10/19/09 12:30, Ian Bolton wrote: > > Hi Jeff and Vladimir. > > > > Jeff: I'd be interested in trying the patch if you can send it my > way. > > > It's nothing special. > > /* Return nonzero if REGNO is a particularly bad choice for reloading > X. */ > static int > ira_bad_reload_regno_1 (int regno, rtx x) > { > int x_regno; > ira_allocno_t a; > enum reg_class pref; > > /* We only deal with pseudo regs. */ > if (! x || GET_CODE (x) != REG) > return 0; > > x_regno = REGNO (x); > if (x_regno < FIRST_PSEUDO_REGISTER) > return 0; > > /* If the pseudo prefers REGNO explicitly, then do not consider > REGNO a bad spill choice. */ > pref = reg_preferred_class (x_regno); > if (reg_class_size[pref] == 1 > && TEST_HARD_REG_BIT (reg_class_contents[pref], regno)) > return 0; > > /* If the pseudo conflicts with REGNO, then we consider REGNO a > poor choice for a reload regno. */ > a = ira_regno_allocno_map[x_regno]; > if (TEST_HARD_REG_BIT (ALLOCNO_TOTAL_CONFLICT_HARD_REGS (a), > return 1; > > return 0; > } > > /* Return nonzero if REGNO is a particularly bad choice for reloading > IN or OUT. */ > int > ira_bad_reload_regno (int regno, rtx in, rtx out) > { > return (ira_bad_reload_regno_1 (regno, in) > || ira_bad_reload_regno_1 (regno, out)); > } > > Then change the loop in allocate_reload_reg to iterate 3 times intead > of > 2. And add this fragment > > > > if (pass == 1 > && ira_bad_reload_regno (regnum, rld[r].in, rld[r].out)) > continue; > > > To body of hte conditional starting with > > if ((reload_reg_free_p ... > > > It's really just a hack. I don't want to spend much time on that > as ultimately I want it all to go away. > > Jeff > > > |
|
|
Re: Understanding IRAOn 11/04/09 10:52, Ian Bolton wrote:
> Hi Jeff, > > From an empirical perspective, the value of your patch is hard to > determine at this stage - one benchmark improved about 0.5% but others > have marginally regressed. > It's a hack, no doubt about it. Your results are about what I expected, a few things get better a few things get worse. The purpose of the patch was twofold. First, hopefully make reload's actions more predictable. Second hopefully improve the overall code in at least some marginal fashion. I didn't want to burn a lot of time on it since the reload inheritance is, IMHO, the biggest ball of hair in reload that I want to kill. If I'm successful in killing the need for reload inheritance, then this little hack silently goes away. > From an intellectual perspective, however, your patch is clearly a Good > Thing. If I am understanding correctly, your aim is to prevent cases > where reload evicts a pseudo from a register that conflicts with the > pseudo that we are trying to reload, thereby undoing the clever cost- > based logic in IRA that gave the register to the current owner in the > first place? > Basically, yes. Note that we still allow eviction if no other register can be found. > My guess is that the performance drop could be attributed to reload > being lucky in some cases and your patch is preventing this luck from > happening. Precisely. > Whilst I'm more of a risk-taker by nature, my employer > would prefer more predictable behaviour from its compiler, so we will > likely commit your patch to our development branch. Thanks very much! > There's definitely a lot more that could be done with this code. If it were going to be around for a while, you'd want to sort the spill register array based on a number of criteria. Part of those criteria would be the conflicts & costs recorded by IRA, whether or not the spill reg holds a value interesting for this insn, etc. Given my goals, I didn't want to spend that much time on it. > Is there an ETA on when reload will be gone away? ;-) > It'll be a long time, though I'd like to start staging in pieces of the work for 4.6. jeff |
|
|
Re: Understanding IRAOn 11/04/09 10:14, Vladimir Makarov wrote:
> > I am agree with Jeff. It is hard to understand what you are doing > without the architecture knowledge and some macro values in your port > (IRA_COVER_CLASSES, MEMORY_MOVE_COST, and REGISTER_MOVE_COST). > > I'd also add that besides right macro value definitions, you could use > insn alternative hints near register constraints like ? or even *. I was wondering about the constraints in particular. ISTM that for insns where TOP_REGS and BOTTOM_REGS have differing costs that the constraint letters for TOP_REGS ought to have a '?' to show they're slightly more costly than BOTTOM_REGS. Alternately IRA_HARD_REGNO_ADD_COST_MULTIPLER might be used to get BOTTOM_REGS preferred without having to add a zillion '?' modifiers to the constraints in the machine description. jeff |
|
|
|
|
|
RE: Understanding IRA> Ian Bolton wrote: > > (NOTE: this is a repost to gcc@... of a private message > originally sent to just Jeff Law and Vladimir Makarov.) > > > Vladimir Makarov wrote: > > Jeff Law wrote: > > > On 11/03/09 09:29, Ian Bolton wrote: > > >> Hi again, Vladimir, > > >> > > >> I am pleased to report some performance improvements after > altering > > >> ira-costs.c. A key benchmark for us has improved by 5%. > > >> > > >> Specifically, in record_reg_classes(), after the alt_cost has > > >> calculated and it will be applied to pp->mem_cost and pp->cost[k], > I > > >> check whether this particular operand wanted one of our > BOTTOM_REGS > > >> (r0-r15) and I further increase the pp->mem_cost by an arbitrary > > >> amount and also increase pp->cost[k] by an arbitrary amount if k > > >> does not represent the BOTTOM_REGS class. My aim here is to nudge > > >> IRA in the right direction for operands that just want > BOTTOM_REGS. > > >> > > >> After experimenting with different values for my "arbitrary > > >> amounts", I discovered some that successfully made IRA more likely > > >> to give BOTTOM_REGS to those instructions/operands that want > > >> BOTTOM_REGS, since any other regs and memory ended up with high > > >> enough costs for IRA to try and avoid using them. > > >> > > >> I have included a snippet from my version of record_reg_classes() > > >> below: > > >> > > >> > > > What I don't understand at this point is why the current mechanisms > > in > > > IRA aren't showing a lower cost for using BOTTOM_REGS (or a higher > > > cost for TOP_REGS). i.e. I don't think any of this should be > > > necessary as IRA should already be doing something similar. > > > > > > This may be a case where your backend hasn't indicated that > TOP_REGS > > > has a higher cost than BOTTOM_REGS in particular situations. > > > > > I am agree with Jeff. It is hard to understand what you are doing > > without the architecture knowledge and some macro values in your > > (IRA_COVER_CLASSES, MEMORY_MOVE_COST, and REGISTER_MOVE_COST). > > > > I'd also add that besides right macro value definitions, you could > use > > insn alternative hints near register constraints like ? or even *. > > > > FYI, my IRA_COVER_CLASSES is simply this: > > #define IRA_COVER_CLASSES \ > { \ > GENERAL_REGS, \ > LIM_REG_CLASSES \ > } > > REGISTER_MOVE_COST is currently set to the default of 2, in > On our architecture, it takes 1 cycle to move from one register to > another, so I assume 2 would be the correct number if this is meant to > represent the cost of moving there and back again? Or is the cost 2 > to represent the cycle impact (1) plus the size impact (1)? > > MEMORY_MOVE_COST is set to be 4 + memory_move_secondary_cost in > reload.h. I've not delved into what is happening in that function for > my particular case, but I see it mentions REGISTER_MOVE_COST in there > too. > > (As you can see from my previous email, I have increased the costs for > allocnos where we want BOTTOM_REGS and get TOP_CREGS or MEM by 6*freq > and 20*freq, respectively. I do not know whether these numbers are > sensible at this stage, since I do not fully understand what the costs > are meant to represent.) > > After Jeff mentioned that IRA should already be doing the right thing, > I had another look at the 176r.ira dump file and realised that I had > originally only seen "Pass 0 for finding allocno costs" in that file. > In fact, some costs for allocnos are added in Pass 1. > > For example, in unmodified IRA with priority algorithm and mixed > region: > > a0(r230,l0) costs: BOTTOM_REGS:0,0 TOP_CREGS:0,0 C_REGS:0,0 MEM:32000 > a1(r203,l0) costs: BOTTOM_REGS:0,0 TOP_CREGS:0,8000 C_REGS:0,8000 > MEM:8000 > ... > a29(r203,l1) costs: BOTTOM_REGS:0,0 TOP_CREGS:8000,8000 > C_REGS:8000,8000 > MEM:20000 > > From looking at 174r.sched1, I can see that r203 is defined in l0 and > used as an operand that needed BOTTOM_REGS in l1. Notice that the > allocno cost is 0 for l0, but the total cost is 8000, since this has > been passed on from the 8000 for allocno in l1. > > r203 ends up getting register #23, which is not a BOTTOM_REG, so we > need > to generate a move in region l1 before we can use this value. > > What is interesting is that in my modified version of IRA, both lines > that mention r203 have an allocno cost: > > a0(r230,l0) costs: BOTTOM_REGS:0,0 TOP_CREGS:0,0 C_REGS:0,0 MEM:32000 > a1(r203,l0) costs: BOTTOM_REGS:0,0 TOP_CREGS:12000,50000 > C_REGS:12000,50000 MEM:48000 > a29(r203,l1) costs: BOTTOM_REGS:0,0 TOP_CREGS:38000,38000 > C_REGS:38000,38000 MEM:138000 > > I'm not really sure how my change in record_reg_classes has ended up > altering the allocno cost in l0, but that does seem to be the key. > I will continue investigating this. > > Another thing you can notice from the above cost output is that > allocnos > that are going to be used for memory operations are getting their MEM > cost set high, to prevent them being put in MEM. (r230 gets used > several > times to load in values and store out values.) > > But in my modified IRA, my grossly inflated MEM costs for allocnos > want BOTTOM_REGS are causing some of those allocnos to get a higher > priority than these ones that are to be used as memory address. See > that a29/r203 has a MEM score of 138000, which means it gets first pick > of the available registers, so more chance of getting the BOTTOM_REG > that it wants. > > I wonder if the value being used for these memory address allocnos > could > be tuned. I also wonder if we could persuade them to prefer TOP_CREGS, > thus leaving more BOTTOM_REGS for those that want them. > I think I may have made a breakthrough! As mentioned above, IRA is correctly increasing the cost for TOP_REGS when an allocno in region 1 is being used in one of our special instructions that needs BOTTOM_REGS. We can also see that IRA is passing that allocno cost up to the associated allocno in region 0 and altering the total_cost for that allocno. However, it does not appear as though IRA is doing anything with the total_costs, apart from using it to determine the preferred class for the allocno. When the main logic of the algorithms comes into play, we only look at allocno_costs, which do not take into account the future usage of that register in the allocno in region 1. I believe that if the decisions were made based on total_costs then I would get a better allocation with existing IRA. Before I experiment with this, I was wondering what the motivation is for only looking at allocno_costs? BTW, I did look into using the ! and * constraints, but I don't think they could help. Our architecture is like Motorola 68k in that we have some instructions that can use all the registers and some that can only use the subset (BOTTOM_REGS). The latter type can only specify "b" for their constraint, since they can only go in BOTTOM_REGS. The former type might benefit from "b,!t", so we are more likely to get things in BOTTOM_REGS for when they are later needed there, but the flip-side is that we might also fill up BOTTOM_REGS when actually there was no subsequent need for the value to be in BOTTOM_REGS. I may have misunderstood how all this works, but it looks like constraints will not help here and, in fact, the total_costs calculations that I mention above are precisely the way IRA passes information about register requirements upwards. Best regards, Ian |
|
|
RE: Understanding IRAOn Thu, 2009-11-05 at 18:05 +0000, Ian Bolton wrote:
> I think I may have made a breakthrough! > > As mentioned above, IRA is correctly increasing the cost for TOP_REGS > when an allocno in region 1 is being used in one of our special > instructions that needs BOTTOM_REGS. We can also see that IRA is > passing that allocno cost up to the associated allocno in region 0 and > altering the total_cost for that allocno. > > However, it does not appear as though IRA is doing anything with the > total_costs, apart from using it to determine the preferred class for > the allocno. When the main logic of the algorithms comes into play, > we only look at allocno_costs, which do not take into account the > future usage of that register in the allocno in region 1. > > I believe that if the decisions were made based on total_costs then I > would get a better allocation with existing IRA. Before I experiment > with this, I was wondering what the motivation is for only looking > at allocno_costs? > > BTW, I did look into using the ! and * constraints, but I don't think > they could help. Our architecture is like Motorola 68k in that we have > some instructions that can use all the registers and some that can > only use the subset (BOTTOM_REGS). The latter type can only specify > "b" for their constraint, since they can only go in BOTTOM_REGS. The > former type might benefit from "b,!t", so we are more likely to get > things in BOTTOM_REGS for when they are later needed there, but the > flip-side is that we might also fill up BOTTOM_REGS when actually there > was no subsequent need for the value to be in BOTTOM_REGS. I may have > misunderstood how all this works, but it looks like constraints will > not help here and, in fact, the total_costs calculations that I > mention above are precisely the way IRA passes information about > register requirements upwards. I've been working on gcc for an ISA (Ubicom32) that seems to have some similarities to the problem you're seeing (we have some regs that can be used for many things but not all) and was seeing a ton a pointless moves introduced by reload. I've ended up trying quite a few different strategies including defining different cover classes and the strategy we're now using is to leave the cover classes the same way you have them but found that the most critical thing was to ensure that REGNO_REG_CLASS was returning a minimal class correctly. Once I had this right then IRA started to do the right thing most of the time (-Os still isn't great but -O2 usually looks very good now). We still see some problems but they're usually a result of reload messing things up or suboptimal handling of auto-incrementing in MEMs. I haven't found using "*" and "!" helped much except where we had a couple of different ways of encoding the same behaviour and where we wanted to avoid using one unless the register allocations worked out right. Cheers, Dave |
|
|
RE: Understanding IRADave Hudson wrote:
> On Thu, 2009-11-05 at 18:05 +0000, Ian Bolton wrote: > > I think I may have made a breakthrough! > > > > As mentioned above, IRA is correctly increasing the cost for TOP_REGS > > when an allocno in region 1 is being used in one of our special > > instructions that needs BOTTOM_REGS. We can also see that IRA is > > passing that allocno cost up to the associated allocno in region 0 > and > > altering the total_cost for that allocno. > > > > However, it does not appear as though IRA is doing anything with the > > total_costs, apart from using it to determine the preferred class for > > the allocno. When the main logic of the algorithms comes into play, > > we only look at allocno_costs, which do not take into account the > > future usage of that register in the allocno in region 1. > > > > I believe that if the decisions were made based on total_costs then I > > would get a better allocation with existing IRA. Before I experiment > > with this, I was wondering what the motivation is for only looking > > at allocno_costs? > > > > BTW, I did look into using the ! and * constraints, but I don't think > > they could help. Our architecture is like Motorola 68k in that we > have > > some instructions that can use all the registers and some that can > > only use the subset (BOTTOM_REGS). The latter type can only specify > > "b" for their constraint, since they can only go in BOTTOM_REGS. The > > former type might benefit from "b,!t", so we are more likely to get > > things in BOTTOM_REGS for when they are later needed there, but the > > flip-side is that we might also fill up BOTTOM_REGS when actually > there > > was no subsequent need for the value to be in BOTTOM_REGS. I may > have > > misunderstood how all this works, but it looks like constraints will > > not help here and, in fact, the total_costs calculations that I > > mention above are precisely the way IRA passes information about > > register requirements upwards. > > I've been working on gcc for an ISA (Ubicom32) that seems to have some > similarities to the problem you're seeing (we have some regs that can > be > used for many things but not all) and was seeing a ton a pointless > moves > introduced by reload. > > I've ended up trying quite a few different strategies including > defining > different cover classes and the strategy we're now using is to leave > the > cover classes the same way you have them but found that the most > critical thing was to ensure that REGNO_REG_CLASS was returning a > minimal class correctly. Once I had this right then IRA started to do > the right thing most of the time (-Os still isn't great but -O2 usually > looks very good now). We still see some problems but they're usually a > result of reload messing things up or suboptimal handling of > auto-incrementing in MEMs. Thanks for the info, Dave. Given that C_REGS = r0-r31, BOTTOM_REGS = r0-r15, TOP_CREGS = r16-r31, when you say "minimal class", does that mean that I should change my macro from this: /* Map a register number to a register class. */ #define REGNO_REG_CLASS(REGNO) \ (BOTTOM_C_REG_P (REGNO) ? BOTTOM_REGS : \ (REGNO) == LINK_POINTER_REGNUM ? LINK_REGS : \ C_REG_P (REGNO) ? C_REGS : \ D_REG_P (REGNO) ? D_REGS : \ CSR_REG_P (REGNO) ? CSR_REGS : \ (unsigned)(REGNO) < FIRST_PSEUDO_REGISTER ? INTERNAL_REGS : ALL_REGS) to this (see C_REG_P line for change): /* Map a register number to a register class. */ #define REGNO_REG_CLASS(REGNO) \ (BOTTOM_C_REG_P (REGNO) ? BOTTOM_REGS : \ (REGNO) == LINK_POINTER_REGNUM ? LINK_REGS : \ C_REG_P (REGNO) ? TOP_CREGS : \ D_REG_P (REGNO) ? D_REGS : \ CSR_REG_P (REGNO) ? CSR_REGS : \ (unsigned)(REGNO) < FIRST_PSEUDO_REGISTER ? INTERNAL_REGS : ALL_REGS) I made the change and nothing noticeable happened, but maybe once I fix whatever else needs fixing then the second version of the macro will be better? |
|
|
RE: Understanding IRAOn Mon, 2009-11-09 at 14:13 +0000, Ian Bolton wrote:
> Dave Hudson wrote: > > I've been working on gcc for an ISA (Ubicom32) that seems to have some > > similarities to the problem you're seeing (we have some regs that can > > be > > used for many things but not all) and was seeing a ton a pointless > > moves > > introduced by reload. > > > > I've ended up trying quite a few different strategies including > > defining > > different cover classes and the strategy we're now using is to leave > > the > > cover classes the same way you have them but found that the most > > critical thing was to ensure that REGNO_REG_CLASS was returning a > > minimal class correctly. Once I had this right then IRA started to do > > the right thing most of the time (-Os still isn't great but -O2 usually > > looks very good now). We still see some problems but they're usually a > > result of reload messing things up or suboptimal handling of > > auto-incrementing in MEMs. > > Thanks for the info, Dave. > > Given that C_REGS = r0-r31, BOTTOM_REGS = r0-r15, TOP_CREGS = r16-r31, > when you say "minimal class", does that mean that I should change my > macro from this: > > /* Map a register number to a register class. */ > #define REGNO_REG_CLASS(REGNO) \ > (BOTTOM_C_REG_P (REGNO) ? BOTTOM_REGS : \ > (REGNO) == LINK_POINTER_REGNUM ? LINK_REGS : \ > C_REG_P (REGNO) ? C_REGS : \ > D_REG_P (REGNO) ? D_REGS : \ > CSR_REG_P (REGNO) ? CSR_REGS : \ > (unsigned)(REGNO) < FIRST_PSEUDO_REGISTER ? INTERNAL_REGS : ALL_REGS) > > to this (see C_REG_P line for change): > > /* Map a register number to a register class. */ > #define REGNO_REG_CLASS(REGNO) \ > (BOTTOM_C_REG_P (REGNO) ? BOTTOM_REGS : \ > (REGNO) == LINK_POINTER_REGNUM ? LINK_REGS : \ > C_REG_P (REGNO) ? TOP_CREGS : \ > D_REG_P (REGNO) ? D_REGS : \ > CSR_REG_P (REGNO) ? CSR_REGS : \ > (unsigned)(REGNO) < FIRST_PSEUDO_REGISTER ? INTERNAL_REGS : ALL_REGS) > > I made the change and nothing noticeable happened, but maybe once I fix > whatever else needs fixing then the second version of the macro will be > better? I think this looks more like what I had to do. FWIW I found it easier to use an array in our port rather than using a cascading conditional sequence (I can't remember which port I got that idea from): #define REG_CLASS_CONTENTS \ { \ {0x00000000, 0x00000000}, /* No regs */ \ {0x0000ffff, 0x00000000}, /* DATA_REGS */ \ {0x00010000, 0x00000000}, /* FDPIC_REG */ \ {0x20fe0000, 0x00000000}, /* ADDRESS_REGS */ \ {0x20ff0000, 0x00000000}, /* ALL_ADDRESS_REGS */ \ {0x0a000000, 0x00000000}, /* ACC_LO_REGS */ \ {0x0f000000, 0x00000000}, /* ACC_REGS */ \ {0x40000000, 0x00000000}, /* CC_REG */ \ {0x0f00ffff, 0x00000000}, /* DATA_ACC_REGS */ \ {0x10000000, 0x00000000}, /* SOURGE3_REG */ \ {0x80000000, 0x0000007f}, /* SPECIAL_REGS */ \ {0xbfffffff, 0x0000007f}, /* GENERAL_REGS */ \ {0xbfffffff, 0x0000007f} /* ALL_REGS */ \ } extern enum reg_class const ubicom32_regclass_map[FIRST_PSEUDO_REGISTER]; /* A C expression whose value is a register class containing hard register REGNO. In general there is more than one such class; choose a class which is "minimal", meaning that no smaller class also contains the register. */ #define REGNO_REG_CLASS(REGNO) (ubicom32_regclass_map[REGNO]) and: enum reg_class const ubicom32_regclass_map[FIRST_PSEUDO_REGISTER] = { DATA_REGS, DATA_REGS, DATA_REGS, DATA_REGS, DATA_REGS, DATA_REGS, DATA_REGS, DATA_REGS, DATA_REGS, DATA_REGS, DATA_REGS, DATA_REGS, DATA_REGS, DATA_REGS, DATA_REGS, DATA_REGS, FDPIC_REG, ADDRESS_REGS, ADDRESS_REGS, ADDRESS_REGS, ADDRESS_REGS, ADDRESS_REGS, ADDRESS_REGS, ADDRESS_REGS, ACC_REGS, ACC_LO_REGS, ACC_REGS, ACC_LO_REGS, SOURCE3_REG, ADDRESS_REGS, NO_REGS, /* CC_REG must be NO_REGS */ SPECIAL_REGS, SPECIAL_REGS, SPECIAL_REGS, SPECIAL_REGS, SPECIAL_REGS, SPECIAL_REGS, SPECIAL_REGS, SPECIAL_REGS }; |
|
|
Re: Understanding IRAOn 11/06/09 05:53, Dave Hudson wrote:
> the most > critical thing was to ensure that REGNO_REG_CLASS was returning a > minimal class correctly. I believe that's been documented as the right thing to do for about 15 years :-) So, yes, you definitely want REGNO_REG_CLASS to return the smallest class to which a register belongs. Jeff |
|
|
RE: Understanding IRA> On 11/06/09 05:53, Dave Hudson wrote:
> > the most > > critical thing was to ensure that REGNO_REG_CLASS was returning a > > minimal class correctly. > I believe that's been documented as the right thing to do for about 15 > years :-) So, yes, you definitely want REGNO_REG_CLASS to return the > smallest class to which a register belongs. > > Jeff So I should definitely change my function then! If something isn't BOTTOM_REGS and it is in C_REGS then the smallest class it's in is actually TOP_CREGS. BTW, today I have achieved some good results with existing IRA by doing the following: 1) Changed the REG_ALLOC_ORDER so that TOP_CREGS are given out before BOTTOM_REGS. My previous hacked version worked by increasing the priority of those that wanted BOTTOM_REGS, so they got first pick; this new version makes them wait their turn, but ensures those with higher priority take TOP_CREGS before BOTTOM_REGS. The analogy, I think, is of giving out meals on an airplane. Most people will eat meat or vegetarian meals, whereas vegetarians only want vegetarian meals. My hacked version was effectively allowing the vegetarians to push to the front of the queue and get their meals; this new version works by encouraging the meat eaters to eat the meaty meals first, leaving more suitable meals for the vegetarians further down the plane. 2) I have forced propagation of allocnos to parent regions with the following hack in find_allocno_class_costs(): { /* Propagate costs to upper levels in the region tree. */ parent_a_num = ALLOCNO_NUM (parent_a); for (k = 0; k < cost_classes_num; k++) COSTS_OF_ALLOCNO (total_costs, parent_a_num)->cost[k] += COSTS_OF_ALLOCNO (total_costs, a_num)->cost[k]; COSTS_OF_ALLOCNO (total_costs, parent_a_num)->mem_cost += COSTS_OF_ALLOCNO (total_costs, a_num)->mem_cost; /* BEGIN IGB-IRA CHANGE 2 */ /* Force total_costs to propagate upwards, by setting allocno_costs to be total_costs */ for (k = 0; k < cost_classes_num; k++) COSTS_OF_ALLOCNO (allocno_costs, parent_a_num)->cost[k] = COSTS_OF_ALLOCNO (total_costs, parent_a_num)->cost[k]; COSTS_OF_ALLOCNO (allocno_costs, parent_a_num)->mem_cost = COSTS_OF_ALLOCNO (total_costs, parent_a_num)->mem_cost; /* END IGB-IRA CHANGE 2 */ } I don't know why propagation isn't happening normally, but this total_costs hack achieves the same thing for me at the moment. Is there any information I could provide to help someone tell me why propagation isn't happening? |
| < Prev | 1 - 2 | Next > |
| Free embeddable forum powered by Nabble | Forum Help |