What should happen when a distributed agent dies?

by Wendy Smoak-3

I've been working with Distributed Builds lately, and I've found that
it works if everything is perfect, but if something goes wrong it has
a hard time coping with the problem, and it doesn't recover.

For example, it's a given that at some point, an agent is going to die
without being properly removed first.

Currently if this happens, the Queues page breaks (error/stack trace)
and you can't edit or delete the offending agent to disable or get rid
of it.

The agent is also still shown as 'enabled' on the Distributed Agents
page even though it's not responding.

What should happen in this case?

I'm all for having the system automatically disable any agent that is
not behaving properly.  At first, the admin may have to manually
re-enable it.  In the future we might come up with a way for it to
auto-recover.
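
A minimal sketch of the auto-disable idea, for illustration only. The `Agent` record, `disable_unresponsive`, `reachable_agents`, and the `ping` callable are hypothetical names and do not come from Continuum's code. The assumption is that any failure to contact an enabled agent flips it to disabled, and that pages such as Queues only ever look at enabled agents.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Agent:
    """Hypothetical record for a build agent; not Continuum's real class."""
    url: str
    enabled: bool = True


def disable_unresponsive(agents: Iterable[Agent],
                         ping: Callable[[str], bool]) -> List[Agent]:
    """Disable every enabled agent whose ping fails or raises.

    Returns the agents that were disabled so an admin can be notified.
    """
    disabled = []
    for agent in agents:
        if not agent.enabled:
            continue  # already disabled: never contact it
        try:
            alive = ping(agent.url)
        except Exception:
            alive = False  # an error while contacting the agent counts as dead
        if not alive:
            agent.enabled = False
            disabled.append(agent)
    return disabled


def reachable_agents(agents: Iterable[Agent]) -> List[Agent]:
    """What a page such as Queues would iterate over: enabled agents only."""
    return [a for a in agents if a.enabled]
```

Re-enabling stays a manual step here, matching the "admin re-enables it" starting point; auto-recovery would need a separate retry policy.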

Thoughts?

--
Wendy

Re: What should happen when a distributed agent dies?

by Marica Tan-2

On Tue, Sep 29, 2009 at 1:03 AM, Wendy Smoak <wsmoak@...> wrote:

> I've been working with Distributed Builds lately, and I've found that
> it works if everything is perfect, but if something goes wrong it has
> a hard time coping with the problem, and it doesn't recover.
>
> For example, it's a given that at some point, an agent is going to die
> without being properly removed first.
>
> Currently if this happens, the Queues page breaks (error/stack trace)
> and you can't edit or delete the offending agent to disable or get rid
> of it.
>
> The agent is also still shown as 'enabled' on the Distributed Agents
> page even though it's not responding.
>
> What should happen in this case?
>
> I'm all for having the system automatically disable any agent that is
> not behaving properly.  At first, the admin may have to manually
> re-enable it.  In the future we might come up with a way for it to
> auto-recover.
>
> Thoughts?
>
> --
> Wendy
>

+1

Re: What should happen when a distributed agent dies?

by Wendy Smoak-3

On Mon, Sep 28, 2009 at 10:03 AM, Wendy Smoak <wsmoak@...> wrote:
> I'm all for having the system automatically disable any agent that is
> not behaving properly.  At first, the admin may have to manually
> re-enable it.  In the future we might come up with a way for it to
> auto-recover.

Unfortunately, disabling the agent doesn't seem to help.  The second
time I tried this, I _was_ able to edit the agent to disable it, but
the Queues page was still broken.  The same happened after I deleted
the offending agent: Continuum still tried to contact it.

In possibly related news, I notice that even though the agent no
longer shows on the 'Build Agents' page, it is still in the
continuum.xml config file, at least until you re-start.

Interestingly, if you add a new agent that does not respond, it will
get added as disabled, and this does _not_ break the Queues page.

Where is the "real" data about what agents there are and which ones
are enabled?

Anecdotally, it seems like it's working from continuum.xml and not
what I see in the web UI.

--
Wendy

Re: What should happen when a distributed agent dies?

by Marica Tan-2

On Wed, Sep 30, 2009 at 5:19 AM, Wendy Smoak <wsmoak@...> wrote:

> On Mon, Sep 28, 2009 at 10:03 AM, Wendy Smoak <wsmoak@...> wrote:
> > I'm all for having the system automatically disable any agent that is
> > not behaving properly.  At first, the admin may have to manually
> > re-enable it.  In the future we might come up with a way for it to
> > auto-recover.
>
> Unfortunately, disabling the agent doesn't seem to help.  The second
> time I tried this, I _was_ able to edit the agent to disable it, but
> the Queues page was still broken.  Same after I deleted the offending
> agent, Continuum still tried to contact it.
>
> In possibly related news, I notice that even though the agent no
> longer shows on the 'Build Agents' page, it is still in the
> continuum.xml config file, at least until you re-start.
>
> Interestingly, if you add a new agent that does not respond, it will
> get added as disabled, and this does _not_ break the Queues page.
>
> Where is the "real" data about what agents there are and which ones
> are enabled?
>
> Anecdotally, it seems like it's working from continuum.xml and not
> what I see in the web UI.
>
> --
> Wendy
>

Continuum reads the continuum.xml for the build agents. The way it should
work is when you update or remove a build agent, it should be reflected
immediately in the continuum.xml. So all "real" data about the agents should
be in the continuum.xml. Apparently, this is not the case and should be
filed as a bug.


Thanks,
--
Marica

Re: What should happen when a distributed agent dies?

by Marica Tan-2

I created a JIRA [CONTINUUM-2377] for build agents not in continuum.xml.


On Wed, Sep 30, 2009 at 7:37 AM, Marica Tan <marica.tan@...> wrote:

>
>
> On Wed, Sep 30, 2009 at 5:19 AM, Wendy Smoak <wsmoak@...> wrote:
>
>> On Mon, Sep 28, 2009 at 10:03 AM, Wendy Smoak <wsmoak@...> wrote:
>> > I'm all for having the system automatically disable any agent that is
>> > not behaving properly.  At first, the admin may have to manually
>> > re-enable it.  In the future we might come up with a way for it to
>> > auto-recover.
>>
>> Unfortunately, disabling the agent doesn't seem to help.  The second
>> time I tried this, I _was_ able to edit the agent to disable it, but
>> the Queues page was still broken.  Same after I deleted the offending
>> agent, Continuum still tried to contact it.
>>
>> In possibly related news, I notice that even though the agent no
>> longer shows on the 'Build Agents' page, it is still in the
>> continuum.xml config file, at least until you re-start.
>>
>> Interestingly, if you add a new agent that does not respond, it will
>> get added as disabled, and this does _not_ break the Queues page.
>>
>> Where is the "real" data about what agents there are and which ones
>> are enabled?
>>
>> Anecdotally, it seems like it's working from continuum.xml and not
>> what I see in the web UI.
>>
>> --
>> Wendy
>>
>
> Continuum reads the continuum.xml for the build agents. The way it should
> work is when you update or remove a build agent, it should be reflected
> immediately in the continuum.xml. So all "real" data about the agents should
> be in the continuum.xml. Apparently, this is not the case and should be
> filed as a bug.
>
>
> Thanks,
> --
> Marica
>

Re: What should happen when a distributed agent dies?

by brettporter

So I went through the cluster of issues you opened and put them in  
1.3.5, then realised that we had already kind of settled on the list  
of things for 1.3.5 :)

Should we:
1) keep them where they are?
2) push them to 1.3.6?
3) push them to 1.4.0?
4) or are they already addressed by Marica's other fix?

Cheers,
Brett

On 29/09/2009, at 3:03 AM, Wendy Smoak wrote:

> I've been working with Distributed Builds lately, and I've found that
> it works if everything is perfect, but if something goes wrong it has
> a hard time coping with the problem, and it doesn't recover.
>
> For example, it's a given that at some point, an agent is going to die
> without being properly removed first.
>
> Currently if this happens, the Queues page breaks (error/stack trace)
> and you can't edit or delete the offending agent to disable or get rid
> of it.
>
> The agent is also still shown as 'enabled' on the Distributed Agents
> page even though it's not responding.
>
> What should happen in this case?
>
> I'm all for having the system automatically disable any agent that is
> not behaving properly.  At first, the admin may have to manually
> re-enable it.  In the future we might come up with a way for it to
> auto-recover.
>
> Thoughts?
>
> --
> Wendy


Re: What should happen when a distributed agent dies?

by Wendy Smoak-3

On Wed, Sep 30, 2009 at 12:35 AM, Brett Porter <brett@...> wrote:
> So I went through the cluster of issues you opened and put them in 1.3.5,
> then realised that we had already kind of settled on the list of things for
> 1.3.5 :)
>
> Should we:
> 1) keep them where they are?
> 2) push them to 1.3.6?
> 3) push them to 1.4.0?
> 4) or are they already addressed by Marica's other fix?

I don't think we should add anything else to 1.3.5 unless someone is
specifically volunteering to do it.  1.3.x or 1.4.x depends on whether
we consider them necessary to fix in order to finally call 1.3.x GA.

What fix are you talking about in 4?

--
Wendy

Re: What should happen when a distributed agent dies?

by brettporter


On 30/09/2009, at 9:55 PM, Wendy Smoak wrote:

> On Wed, Sep 30, 2009 at 12:35 AM, Brett Porter <brett@...> wrote:
>> So I went through the cluster of issues you opened and put them in 1.3.5,
>> then realised that we had already kind of settled on the list of things for
>> 1.3.5 :)
>>
>> Should we:
>> 1) keep them where they are?
>> 2) push them to 1.3.6?
>> 3) push them to 1.4.0?
>> 4) or are they already addressed by Marica's other fix?
>
> I don't think we should add anything else to 1.3.5 unless someone is
> specifically volunteering to do it.  1.3.x or 1.4.x depends on whether
> we consider them necessary to fix in order to finally call 1.3.x GA.

Yep, I agree... so do you think these fall into that category for a  
1.3.x?

>
> What fix are you talking about in 4?

Only skimmed it, but I got the impression that CONTINUUM-2377 might  
have helped some of the issues you raised?

- Brett

Re: What should happen when a distributed agent dies?

by Wendy Smoak-3

On Wed, Sep 30, 2009 at 5:01 AM, Brett Porter <brett@...> wrote:

> Yep, I agree... so do you think these fall into that category for a 1.3.x?
>
> Only skimmed it, but I got the impression that CONTINUUM-2377 might have
> helped some of the issues you raised?

I don't think it addresses the case where a build agent dies
unexpectedly.  In that case it will still show as 'active' in the
configuration and Continuum will try to contact it, which will break
the Queues page when the agent is not reachable.

These issues (especially with the 2377 fix, thanks Marica!) are not a
blocker for GA for me, and IMO can wait until 1.4.x.

With that fix, in most cases you should be able to edit the agents
through the UI and get rid of the offending agent.  If for some reason
that fails, you can stop the master, edit the continuum.xml file, and
restart it.
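
A rough sketch of the manual cleanup, for anyone who would rather script it than hand-edit. The element names (`buildAgents`, `buildAgent`, `url`) are an assumed layout used for illustration, not a verified schema, so check them against your own continuum.xml first. `remove_agent` is a hypothetical helper; it only transforms a string, so reading and writing the file (with the master stopped and a backup taken) is left to you.

```python
import xml.etree.ElementTree as ET


def remove_agent(xml_text: str, url: str) -> str:
    """Return xml_text without the <buildAgent> whose <url> equals url.

    The element names used here are an assumed layout, not a verified schema.
    """
    root = ET.fromstring(xml_text)
    # Snapshot both lists so removing children never disturbs iteration.
    for parent in list(root.iter("buildAgents")):
        for agent in list(parent.findall("buildAgent")):
            if (agent.findtext("url") or "").strip() == url:
                parent.remove(agent)
    return ET.tostring(root, encoding="unicode")
```

Passing a URL that matches no agent leaves the configuration unchanged, so the helper is safe to run more than once.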

--
Wendy

Re: What should happen when a distributed agent dies?

by brettporter


On 30/09/2009, at 10:05 PM, Wendy Smoak wrote:

> On Wed, Sep 30, 2009 at 5:01 AM, Brett Porter <brett@...> wrote:
>
>> Yep, I agree... so do you think these fall into that category for a 1.3.x?
>>
>> Only skimmed it, but I got the impression that CONTINUUM-2377 might have
>> helped some of the issues you raised?
>
> I don't think it addresses the case where a build agent dies
> unexpectedly.  In that case it will still show as 'active' in the
> configuration and Continuum will try to contact it, which will break
> the Queues page when the agent is not reachable.
>
> These issues (especially with the 2377 fix, thanks Marica!) are not a
> blocker for GA for me, and IMO can wait until 1.4.x.

OK, I've pushed them into 1.4.0, which will also need shrinking when we
get that far.

I did leave a couple of new things in 1.3.5 that look like they are  
needed for GA.

- Brett