
Re: [RFC] Remove pendingLimit from OSyncQueue

by Graham Cobb-4


On Tuesday 14 April 2009 13:39:43 Daniel Gollub wrote:
> On Tuesday 14 April 2009 02:25:29 pm Michael Bell wrote:
> > I don't think that a dependency between two pipes is a good idea but I
> > understand that IPC has limits. I also used IBM MQ series in the past
> > which has nothing to do with IPC. So is our message queue implementation
> > a real IPC implementation which requires limits?
>
> No idea - maybe Graham can answer this.

Sorry about the delay -- I have been away and have not had a chance to spend
any time on this until today.

I understand the problem, and the current timeout/limit mechanism definitely
deadlocks with the way the async plugins work today (I am ignoring any
changes suggested in this thread as I am not sure I understand exactly what
has been proposed).

The pendingLimit is there to allow timeouts to work properly.  If the
pendingLimit is just removed, the timeouts will break as they did before (the
timeout starts counting at the wrong time and so if there are a large number
of transactions queued up the timeout fires too early).  But let's review
what timeouts are for and how we would **like** them to behave.
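To show the failure mode concretely, here is a minimal Python simulation. It assumes a plugin that handles queued transactions strictly one at a time, and a timeout that starts counting when each transaction is queued. The function name and the numbers are hypothetical. They are not part of the OpenSync API.

```python
def spurious_timeouts(n_queued, service_time, timeout):
    """Count transactions whose timeout would fire although each one only
    took `service_time` to process.

    Assumes all transactions are queued at t=0, the timer for each starts
    at that moment, and the plugin processes them one after another.
    """
    fired = 0
    for position in range(n_queued):
        # A transaction must wait for every earlier one before it completes.
        completion_time = (position + 1) * service_time
        if completion_time > timeout:
            fired += 1
    return fired


# 100 transactions of 1 s each against a 10 s timeout: all but the first
# ten "time out", even though no single transaction was slow.
print(spurious_timeouts(100, 1.0, 10.0))  # 90
```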

As I understand it, the main purpose of the timeouts is to deal with cases
where the remote device (or some intermediary) has got stuck and is no longer
proceeding with transactions (but not returning errors).  It also helps with
cases where the plugin tries to send a message but does not notice that there
is an error (e.g. a socket has been disconnected) and it will never get a
reply.  This is, of course, a plugin bug but it is useful that the timeout
mechanism also protects against that problem.

There is a secondary use for timeouts and that is to protect against problems
in the IPC mechanism itself -- e.g. a process has stopped and is no longer
reading the pipe.  This is a smaller consideration and can be handled by
mechanisms within the IPC itself if necessary, so let's ignore it for now.

There seem to be three plugin architectures which are relevant (I thought,
when I was rewriting the timeout code, that there were only the first two but
I now realise there is a third):

1) Synchronous plugin (most plugins are like this, I believe): when a
transaction is received by the plugin (e.g. Connect or Get Changes or Commit)
it does synchronous writes to send messages to the device and synchronous
reads to get messages from the device.  If the device stops responding, the
thread executing the plugin will just wait.  No other plugin messages will be
handled while it is waiting.

2) Asynchronous but transaction-at-a-time plugin (maybe there are none like
this): when the transaction is received, the plugin sends the message to the
device and then returns.  The thread polls the socket and resumes when the
response is received. However, other plugin messages can be handled while it
is waiting -- so further updates will cause further messages to be sent to
the device.  If the device stops responding, the engine will keep sending
updates which the plugin will send, although it is not seeing any responses.

3) Asynchronous, multiple-transaction plugin (like SyncML): when the
transaction is received, it is stored internally to the plugin.  Nothing is
sent until a message fills up or the last transaction is received. Then all
updates are sent and, when responses are received, the updates are completed.  
If the device stops responding then all or some updates will not receive a
response.
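Architecture 3 can be modelled roughly as follows. Commits are buffered inside the plugin and are only sent as a message when the buffer is full or commit_all arrives. This is a hypothetical Python model, and the class and method names are invented for illustration. It is not the SyncML plugin's actual code.

```python
class BatchingPlugin:
    """Hypothetical model of a multiple-transaction (SyncML-like) plugin."""

    def __init__(self, max_per_message):
        self.max_per_message = max_per_message
        self.buffered = []        # commits received but not yet sent
        self.sent_messages = []   # each entry: change ids sent in one message

    def commit(self, change_id):
        # The commit returns immediately; nothing reaches the device yet.
        self.buffered.append(change_id)
        if len(self.buffered) >= self.max_per_message:
            self._send_message()

    def commit_all(self):
        # The last transaction: flush whatever is left as a final message.
        if self.buffered:
            self._send_message()

    def _send_message(self):
        self.sent_messages.append(self.buffered)
        self.buffered = []


plugin = BatchingPlugin(max_per_message=3)
for change in range(7):
    plugin.commit(change)
print(plugin.sent_messages)  # [[0, 1, 2], [3, 4, 5]]; change 6 still buffered
plugin.commit_all()
print(plugin.sent_messages)  # [[0, 1, 2], [3, 4, 5], [6]]
```

If the device stops answering at any point, every change in an unanswered message is left waiting. No single commit's own timer reflects how long that message has been outstanding.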

For 1 and 2, the timeout is protecting each single commit and the value should
be set based on the time needed for that transaction.  In the case of 2, this
means the pendingLimit is needed -- it limits the number of updates that
might already be queued ahead of this one and so allows the timeout value to
be calculated (i.e. pendingLimit * maximum time for one update).
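The bound for case 2 can be checked with a small deterministic simulation. It makes the following hypothetical assumptions:

* All updates are available at t=0.
* The engine hands an update to the plugin only while fewer than pendingLimit are outstanding.
* The per-update timer starts at that hand-off.
* The device answers updates serially, taking service_time seconds each.

The function name is invented for illustration.

```python
def worst_wait_after_handoff(n_updates, pending_limit, service_time):
    """Longest time any update waits between hand-off and completion."""
    completion = []
    worst = 0.0
    for i in range(n_updates):
        # Hand-off happens immediately while fewer than pending_limit updates
        # are outstanding; otherwise when the update pending_limit places
        # ahead of this one completes.
        handed_off = completion[i - pending_limit] if i >= pending_limit else 0.0
        # The device serves updates one at a time, in order.
        started = max(handed_off, completion[i - 1] if i > 0 else 0.0)
        done = started + service_time
        completion.append(done)
        worst = max(worst, done - handed_off)
    return worst


pending_limit, max_update_seconds = 5, 2.0
print(worst_wait_after_handoff(100, pending_limit, max_update_seconds))  # 10.0
print(pending_limit * max_update_seconds)                                # 10.0
```

However many updates are queued, no update waits longer than pendingLimit * (time for one update) after its timer starts. That is why the formula gives a safe timeout value in case 2.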

For 3, however, it is much harder.  One option would be to set the timeouts
for each commit (and the commit_all) based on the time the device needs to
complete a maximum sized message of updates.  On the other hand, that doesn't
allow for the fact that the OpenSync engine itself might take some time (in
complex cases) to even provide enough updates to fill a message: timeouts
were not intended to have to take into account OpenSync engine processing.

Another option for 3 is to set no timeouts at all on the individual commits,
but set a timeout on the commit_all.  The problem with that is that the
timeout value for the commit_all is potentially unlimited (it is not limited
to a single message of commits as the commit_all will start as soon as the
commits have all been queued and hundreds of messages may have been sent to
the device).
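A rough calculation shows why the value is unbounded. If a single commit_all timer has to cover every message the plugin sends, the value it needs grows with the number of changes. The helper below is hypothetical and assumes each message is answered serially in a fixed time.

```python
import math


def commit_all_timeout(n_changes, changes_per_message, seconds_per_message):
    """Timeout a single commit_all timer would need to cover every message."""
    messages = math.ceil(n_changes / changes_per_message)
    return messages * seconds_per_message


print(commit_all_timeout(50, 25, 30))    # 60   (2 messages)
print(commit_all_timeout(5000, 25, 30))  # 6000 (200 messages)
```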

On the other hand, the plugin itself knows what is going on.  So, I think the
best option for case 3 is for the plugin itself to control the timeouts.  I
suggest that in this case, there are no timeouts on the commit or commit_all,
but that the plugin itself sets a timeout when it has assembled a message and
sent it to the device.  That means adding some sort of OSyncStartTimeout(int timeout)
and OSyncStopTimeout() calls.  The plugin would start the timeout when the
first message was sent to the device.  Whenever a response is received, the
timeout would be stopped and, if there were one or more messages still
waiting for responses, it would be started again.  If the timeout actually
fires, then the OSyncQueue code completes all pending operations with a
timeout error (just as in the existing timeout processing).
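Here is a minimal sketch of the start/stop behaviour described above, in Python with a simulated clock. MessageTimer, Plugin, send_message, on_response and poll are hypothetical names invented for illustration. They only model the proposed OSyncStartTimeout/OSyncStopTimeout calls and the existing OSyncQueue timeout handling, and they are not an OpenSync API.

```python
TIMEOUT_ERROR = "timeout"


class MessageTimer:
    """Hypothetical stand-in for the proposed OSyncStartTimeout() /
    OSyncStopTimeout() pair, driven by an explicit clock for determinism."""

    def __init__(self):
        self.deadline = None

    def start(self, now, timeout):
        self.deadline = now + timeout

    def stop(self):
        self.deadline = None

    def expired(self, now):
        return self.deadline is not None and now >= self.deadline


class Plugin:
    """Architecture 3 plugin that manages its own timeout."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.timer = MessageTimer()
        self.outstanding = []  # messages sent, awaiting a response (oldest first)
        self.results = {}      # change id -> "ok" or TIMEOUT_ERROR

    def send_message(self, changes, now):
        self.outstanding.append(list(changes))
        if len(self.outstanding) == 1:
            # First message in flight: start the timer.
            self.timer.start(now, self.timeout)

    def on_response(self, now):
        for change in self.outstanding.pop(0):
            self.results[change] = "ok"
        self.timer.stop()
        if self.outstanding:
            # More messages still awaiting replies: restart the timer.
            self.timer.start(now, self.timeout)

    def poll(self, now):
        # Stands in for the queue code: if the timer fires, complete every
        # pending change with a timeout error.
        if self.timer.expired(now):
            self.timer.stop()
            for message in self.outstanding:
                for change in message:
                    self.results[change] = TIMEOUT_ERROR
            self.outstanding.clear()


plugin = Plugin(timeout=10)
plugin.send_message([1, 2], now=0)
plugin.send_message([3], now=1)   # timer already running; not restarted
plugin.on_response(now=5)         # answers [1, 2]; timer restarted until t=15
plugin.poll(now=12)               # nothing fires: deadline is 15, not 10
plugin.poll(now=15)               # device silent since t=5: change 3 times out
print(plugin.results)             # {1: 'ok', 2: 'ok', 3: 'timeout'}
```

The timer measures only how long the device has gone without answering. It does not measure how long the engine took to produce the changes, which is the property the per-commit timeouts could not provide for this architecture.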

This does mean that for plugins using this third architecture, they have to
have some extra complexity.  But I can't see an alternative, if we want to
keep the timeout protection.  Of course, we could decide to disable timeout
processing altogether for this release of OpenSync, and add it back in
later (with additions to the API at that time).

Does anyone have an alternative suggestion?  If not, I will spec up a
suggested API for the timeout operations.

Graham


_______________________________________________
Opensync-devel mailing list
Opensync-devel@...
https://lists.sourceforge.net/lists/listinfo/opensync-devel
