What should happen when a distributed agent dies?

I've been working with Distributed Builds lately, and I've found that it works if everything is perfect, but if something goes wrong it has a hard time coping with the problem, and it doesn't recover.

For example, it's a given that at some point, an agent is going to die without being properly removed first.

Currently if this happens, the Queues page breaks (error/stack trace) and you can't edit or delete the offending agent to disable or get rid of it.

The agent is also still shown as 'enabled' on the Distributed Agents page even though it's not responding.

What should happen in this case?

I'm all for having the system automatically disable any agent that is not behaving properly. At first, the admin may have to manually re-enable it. In the future we might come up with a way for it to auto-recover.

Thoughts?

--
Wendy
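To make the auto-disable proposal concrete, here is a minimal sketch, written in Python for brevity (Continuum itself is Java). Every name in it (`Agent`, `AgentRegistry`, `ping`, `max_failures`) is hypothetical, and the failure threshold is an assumption that does not come from the thread. It is not Continuum's API, only an illustration of "disable after repeated failed checks, re-enable by hand".

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Agent:
    """Hypothetical record for one build agent (not Continuum's data model)."""
    url: str
    enabled: bool = True
    failures: int = 0


class AgentRegistry:
    """Tracks agents and disables any that fail too many consecutive pings."""

    def __init__(self, ping: Callable[[str], bool], max_failures: int = 3):
        self._ping = ping                  # returns True if the agent answered
        self._max_failures = max_failures  # assumed threshold, not from the thread
        self.agents: Dict[str, Agent] = {}

    def add(self, url: str) -> Agent:
        agent = Agent(url)
        self.agents[url] = agent
        return agent

    def check_all(self) -> None:
        """Ping every enabled agent; disable those at the failure threshold."""
        for agent in self.agents.values():
            if not agent.enabled:
                continue  # disabled agents are never contacted
            try:
                alive = self._ping(agent.url)
            except Exception:
                alive = False  # a raised error counts as a failed ping
            if alive:
                agent.failures = 0
            else:
                agent.failures += 1
                if agent.failures >= self._max_failures:
                    agent.enabled = False

    def re_enable(self, url: str) -> None:
        """Manual step for the admin, as proposed in the message above."""
        agent = self.agents[url]
        agent.enabled = True
        agent.failures = 0
```

Counting consecutive failures instead of disabling on the first miss is a design choice in this sketch, so that a single dropped request does not take an agent out of rotation.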
|
|
Re: What should happen when a distributed agent dies?

On Tue, Sep 29, 2009 at 1:03 AM, Wendy Smoak <wsmoak@...> wrote:
> I've been working with Distributed Builds lately, and I've found that
> it works if everything is perfect, but if something goes wrong it has
> a hard time coping with the problem, and it doesn't recover.
>
> For example, it's a given that at some point, an agent is going to die
> without being properly removed first.
>
> Currently if this happens, the Queues page breaks (error/stack trace)
> and you can't edit or delete the offending agent to disable or get rid
> of it.
>
> The agent is also still shown as 'enabled' on the Distributed Agents
> page even though it's not responding.
>
> What should happen in this case?
>
> I'm all for having the system automatically disable any agent that is
> not behaving properly. At first, the admin may have to manually
> re-enable it. In the future we might come up with a way for it to
> auto-recover.
>
> Thoughts?
>
> --
> Wendy

+1
|
|
Re: What should happen when a distributed agent dies?

On Mon, Sep 28, 2009 at 10:03 AM, Wendy Smoak <wsmoak@...> wrote:
> I'm all for having the system automatically disable any agent that is
> not behaving properly. At first, the admin may have to manually
> re-enable it. In the future we might come up with a way for it to
> auto-recover.

Unfortunately, disabling the agent doesn't seem to help. The second time I tried this, I _was_ able to edit the agent to disable it, but the Queues page was still broken. Same after I deleted the offending agent, Continuum still tried to contact it.

In possibly related news, I notice that even though the agent no longer shows on the 'Build Agents' page, it is still in the continuum.xml config file, at least until you re-start.

Interestingly, if you add a new agent that does not respond, it will get added as disabled, and this does _not_ break the Queues page.

Where is the "real" data about what agents there are and which ones are enabled?

Anecdotally, it seems like it's working from continuum.xml and not what I see in the web UI.

--
Wendy
|
|
Re: What should happen when a distributed agent dies?

On Wed, Sep 30, 2009 at 5:19 AM, Wendy Smoak <wsmoak@...> wrote:
> On Mon, Sep 28, 2009 at 10:03 AM, Wendy Smoak <wsmoak@...> wrote:
> > I'm all for having the system automatically disable any agent that is
> > not behaving properly. At first, the admin may have to manually
> > re-enable it. In the future we might come up with a way for it to
> > auto-recover.
>
> Unfortunately, disabling the agent doesn't seem to help. The second
> time I tried this, I _was_ able to edit the agent to disable it, but
> the Queues page was still broken. Same after I deleted the offending
> agent, Continuum still tried to contact it.
>
> In possibly related news, I notice that even though the agent no
> longer shows on the 'Build Agents' page, it is still in the
> continuum.xml config file, at least until you re-start.
>
> Interestingly, if you add a new agent that does not respond, it will
> get added as disabled, and this does _not_ break the Queues page.
>
> Where is the "real" data about what agents there are and which ones
> are enabled?
>
> Anecdotally, it seems like it's working from continuum.xml and not
> what I see in the web UI.
>
> --
> Wendy

Continuum reads the continuum.xml for the build agents. The way it should work is when you update or remove a build agent, it should be reflected immediately in the continuum.xml. So all "real" data about the agents should be in the continuum.xml. Apparently, this is not the case and should be filed as a bug.

Thanks,
--
Marica
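The "reflected immediately" behaviour described above can be illustrated with a small write-through sketch, again in Python for brevity. It is an assumption-laden illustration, not Continuum's code: the class `AgentConfig`, the file name `agents-demo.xml`, the example URLs, and the `<agents>`/`<agent>` layout are all hypothetical and are not the real continuum.xml schema. The only point is that every update or removal rewrites the file before returning, so the file and the in-memory view cannot drift apart until a restart.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


class AgentConfig:
    """Keeps agents in memory and rewrites the config file on every change."""

    def __init__(self, path: str):
        self.path = path
        self.agents = {}  # url -> enabled flag
        if os.path.exists(path):
            for node in ET.parse(path).getroot().findall("agent"):
                self.agents[node.get("url")] = node.get("enabled") == "true"

    def _save(self) -> None:
        # Hypothetical layout: <agents><agent url="..." enabled="true"/></agents>
        root = ET.Element("agents")
        for url, enabled in sorted(self.agents.items()):
            ET.SubElement(root, "agent", url=url, enabled=str(enabled).lower())
        ET.ElementTree(root).write(self.path, encoding="utf-8", xml_declaration=True)

    def set_agent(self, url: str, enabled: bool) -> None:
        self.agents[url] = enabled
        self._save()  # persist immediately, not at shutdown

    def remove_agent(self, url: str) -> None:
        self.agents.pop(url, None)
        self._save()


def demo() -> dict:
    """Add two agents, remove one, then reload from disk to confirm."""
    path = os.path.join(tempfile.mkdtemp(), "agents-demo.xml")
    config = AgentConfig(path)
    config.set_agent("http://agent-a.example:8181", True)
    config.set_agent("http://agent-b.example:8181", True)
    config.remove_agent("http://agent-b.example:8181")
    return AgentConfig(path).agents
```

Calling `demo()` reloads the file through a fresh `AgentConfig`, which is the same check a restart would perform: the removed agent should no longer be present.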
|
|
Re: What should happen when a distributed agent dies?

I created a JIRA [CONTINUUM-2377] for build agents not in continuum.xml.

On Wed, Sep 30, 2009 at 7:37 AM, Marica Tan <marica.tan@...> wrote:
> On Wed, Sep 30, 2009 at 5:19 AM, Wendy Smoak <wsmoak@...> wrote:
>
>> On Mon, Sep 28, 2009 at 10:03 AM, Wendy Smoak <wsmoak@...> wrote:
>> > I'm all for having the system automatically disable any agent that is
>> > not behaving properly. At first, the admin may have to manually
>> > re-enable it. In the future we might come up with a way for it to
>> > auto-recover.
>>
>> Unfortunately, disabling the agent doesn't seem to help. The second
>> time I tried this, I _was_ able to edit the agent to disable it, but
>> the Queues page was still broken. Same after I deleted the offending
>> agent, Continuum still tried to contact it.
>>
>> In possibly related news, I notice that even though the agent no
>> longer shows on the 'Build Agents' page, it is still in the
>> continuum.xml config file, at least until you re-start.
>>
>> Interestingly, if you add a new agent that does not respond, it will
>> get added as disabled, and this does _not_ break the Queues page.
>>
>> Where is the "real" data about what agents there are and which ones
>> are enabled?
>>
>> Anecdotally, it seems like it's working from continuum.xml and not
>> what I see in the web UI.
>>
>> --
>> Wendy
>>
>
> Continuum reads the continuum.xml for the build agents. The way it should
> work is when you update or remove a build agent, it should be reflected
> immediately in the continuum.xml. So all "real" data about the agents should
> be in the continuum.xml. Apparently, this is not the case and should be
> filed as a bug.
>
>
> Thanks,
> --
> Marica
|
|
Re: What should happen when a distributed agent dies?

So I went through the cluster of issues you opened and put them in 1.3.5, then realised that we had already kind of settled on the list of things for 1.3.5 :)

Should we:
1) keep them where they are?
2) push them to 1.3.6?
3) push them to 1.4.0?
4) or are they already addressed by Marica's other fix?

Cheers,
Brett

On 29/09/2009, at 3:03 AM, Wendy Smoak wrote:

> I've been working with Distributed Builds lately, and I've found that
> it works if everything is perfect, but if something goes wrong it has
> a hard time coping with the problem, and it doesn't recover.
>
> For example, it's a given that at some point, an agent is going to die
> without being properly removed first.
>
> Currently if this happens, the Queues page breaks (error/stack trace)
> and you can't edit or delete the offending agent to disable or get rid
> of it.
>
> The agent is also still shown as 'enabled' on the Distributed Agents
> page even though it's not responding.
>
> What should happen in this case?
>
> I'm all for having the system automatically disable any agent that is
> not behaving properly. At first, the admin may have to manually
> re-enable it. In the future we might come up with a way for it to
> auto-recover.
>
> Thoughts?
>
> --
> Wendy
|
|
Re: What should happen when a distributed agent dies?

On Wed, Sep 30, 2009 at 12:35 AM, Brett Porter <brett@...> wrote:
> So I went through the cluster of issues you opened and put them in 1.3.5,
> then realised that we had already kind of settled on the list of things for
> 1.3.5 :)
>
> Should we:
> 1) keep them where they are?
> 2) push them to 1.3.6?
> 3) push them to 1.4.0?
> 4) or are they already addressed by Marica's other fix?

I don't think we should add anything else to 1.3.5 unless someone is specifically volunteering to do it. 1.3.x or 1.4.x depends on whether we consider them necessary to fix in order to finally call 1.3.x GA.

What fix are you talking about in 4?

--
Wendy
|
|
Re: What should happen when a distributed agent dies?

On 30/09/2009, at 9:55 PM, Wendy Smoak wrote:

> On Wed, Sep 30, 2009 at 12:35 AM, Brett Porter <brett@...> wrote:
>> So I went through the cluster of issues you opened and put them in 1.3.5,
>> then realised that we had already kind of settled on the list of things for
>> 1.3.5 :)
>>
>> Should we:
>> 1) keep them where they are?
>> 2) push them to 1.3.6?
>> 3) push them to 1.4.0?
>> 4) or are they already addressed by Marica's other fix?
>
> I don't think we should add anything else to 1.3.5 unless someone is
> specifically volunteering to do it. 1.3.x or 1.4.x depends on whether
> we consider them necessary to fix in order to finally call 1.3.x GA.

Yep, I agree... so do you think these fall into that category for a 1.3.x?

> What fix are you talking about in 4?

Only skimmed it, but I got the impression that CONTINUUM-2377 might have helped some of the issues you raised?

- Brett
|
|
Re: What should happen when a distributed agent dies?

On Wed, Sep 30, 2009 at 5:01 AM, Brett Porter <brett@...> wrote:
> Yep, I agree... so do you think these fall into that category for a 1.3.x?
>
> Only skimmed it, but I got the impression that CONTINUUM-2377 might have
> helped some of the issues you raised?

I don't think it addresses the case where a build agent dies unexpectedly. In that case it will still show as 'active' in the configuration and Continuum will try to contact it, which will break the Queues page when the agent is not reachable.

These issues (especially with the 2377 fix, thanks Marica!) are not a blocker for GA for me, and IMO can wait until 1.4.x. With that fix, in most cases you should be able to edit the agents through the UI and get rid of the offending agent. If for some reason that fails, you can stop the master, edit the continuum.xml file, and re-start it.

--
Wendy
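The failure mode described above, where one unreachable agent takes down the whole Queues page, can be avoided by handling the error per agent. A minimal sketch of that idea, in Python for brevity; `queue_overview` and `fetch_queue` are hypothetical names and do not correspond to anything in Continuum:

```python
from typing import Callable, Dict, List


def queue_overview(
    agents: Dict[str, bool],
    fetch_queue: Callable[[str], List[str]],
) -> Dict[str, str]:
    """Build one status line per agent; one failure must not abort the rest."""
    rows = {}
    for url, enabled in agents.items():
        if not enabled:
            rows[url] = "disabled"  # never contact a disabled agent
            continue
        try:
            builds = fetch_queue(url)
        except OSError as err:
            # Connection refused, timeouts and similar network errors are
            # subclasses of OSError; report them instead of propagating.
            rows[url] = f"unreachable ({err})"
        else:
            rows[url] = f"{len(builds)} queued"
    return rows
```

With this shape, a dead agent shows up as an "unreachable" row while the remaining agents are still listed, which also gives the admin a place to disable or delete it.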
|
|
Re: What should happen when a distributed agent dies?

On 30/09/2009, at 10:05 PM, Wendy Smoak wrote:

> On Wed, Sep 30, 2009 at 5:01 AM, Brett Porter <brett@...> wrote:
>
>> Yep, I agree... so do you think these fall into that category for a 1.3.x?
>>
>> Only skimmed it, but I got the impression that CONTINUUM-2377 might have
>> helped some of the issues you raised?
>
> I don't think it addresses the case where a build agent dies
> unexpectedly. In that case it will still show as 'active' in the
> configuration and Continuum will try to contact it, which will break
> the Queues page when the agent is not reachable.
>
> These issues (especially with the 2377 fix, thanks Marica!) are not a
> blocker for GA for me, and IMO can wait until 1.4.x.

Ok, I've pushed them into 1.4.0 which needs shrinking when we get that far too. I did leave a couple of new things in 1.3.5 that look like they are needed for GA.

- Brett