Problems with replication on two servers.

Hello,

I'm trying to get Coda working with one replication server.

On the SCM: coda-server-6.9.4-1.i386
On the replica: coda-server-6.9.4-0.3.rc2.i386

I'm installing both servers with the vice-setup script. All goes well. At the end of the installation, I gather /vice/db/vicetab and /vice/db/servers, and restart all servers.

The root volume is created at the end of the SCM installation, so I guess it's not replicated on the replica.

Then I create a volume like this:

root@scm# createvol_rep test scm.myrealm.yeh/vicepa replica.myrealm.yeh/vicepa

On my client I've edited /etc/coda/realms:

myrealm.yeh scm.myrealm.yeh replica.myrealm.yeh

Then I've executed:

root@client# venus-setup myrealm.yeh 20000
root@client# clog -coda admincoda
root@client# cfs mkmount /coda/myrealm.yeh/test test
root@client# ls /coda/myrealm.yeh/test

Until this step it's okay. I can create files in volume test.

It becomes complicated when, on the SCM, I block all traffic using iptables. I see the client start sending messages to the replica (via tcpdump). But when I unblock the traffic on the SCM I always get the same error.

On the SCM:

18:25:10 GetVolObj: Volume (1000002) already write locked
18:25:10 RS_LockAndFetch: Error 11 during GetVolObj for 1000002.1.1
18:25:46 LockQueue Manager: found entry for volume 0x1000002

On the replica:

18:34:36 Going to spool log entry for phase3
18:34:38 CheckRetCodes: server 132.227.168.169 returned error 11
18:34:38 ViceResolve: Couldnt lock volume 7f000001 at all accessible servers
18:34:38 Entering RecovDirResolve 7f000001.1.1
18:34:38 ComputeCompOps: fid(0x7f000001.1.1)
18:34:38 RS_ShipLogs - returning 0

On the client I got a dangling symlink for volume test.

My question is: isn't Coda fault tolerant? Or am I missing something in my installation/configuration?

Thanks for your great work, and your help.

Marc.
|
|
Re: Problems with replication on two servers.

Hi Marc,

On Thu, Apr 23, 2009 at 11:42:22AM +0200, Marc SCHLINGER wrote:
> It becomes complicated when on the scm, I block all traffic using iptables.
> I see the client starting sending messages to the replica (via tcpdump).
> But when I unblock the traffic on the scm I always get the same error.
> On the scm:
> 18:25:10 GetVolObj: Volume (1000002) already write locked
> 18:25:10 RS_LockAndFetch: Error 11 during GetVolObj for 1000002.1.1
> 18:25:46 LockQueue Manager: found entry for volume 0x1000002

There are certainly some locking issues hiding there. I have been hit by "Volume (XXXXXXX) already write locked" as well.

This problem almost certainly stems from one of the original assumptions of Coda's design: the servers are treated as well-connected to each other, in contrast to the clients, which may have unreliable connections.

> On the client I got a dangling symlink for volume test.
>
> My question is: Isn't coda fail tolerant? Or do I miss something in my
> installation/configuration ?

No, I don't think you are missing anything. Coda is quite fault tolerant; it copes pretty well with

- clients losing connection to the net
- a server going down once in a while

It does not cope well with servers intermittently losing contact with each other. I guess this would be relatively hard to fix, given the original assumption named above. AFAIK there are no current plans to do so.

It is nice that you are testing Coda systematically. This might certainly help to discover some hidden bugs and possibly even convince the developers of the need for server-side fault tolerance. There are certainly many potential users who would appreciate weakly connected servers being supported, but this may present some fundamental problems besides the implementation ones.

On the other hand, Coda is very useful as it is, and there are also issues of more immediate interest to fix. The developers' resources are limited, so your best bet would be to join the development. Unfortunately the "entry threshold" is quite high because the code is complex and still reflects the years of research-oriented programming.

Regards,
Rune
|
|
Re: Problems with replication on two servers.

On Thu, Apr 23, 2009 at 11:42:22AM +0200, Marc SCHLINGER wrote:
> The root volume is created at the end of the scm installation so I guess
> it's not replicated on the replica.

Right, that is done to simplify the common case of a new user setting up just one server. You would have to remove the root volume and create a new replicated volume to replace it with, and then probably reinitialize the clients so that they actually forget about the old single replica root.

> root@client# cfs mkmount /coda/myrealm.yeh/test test
> root@client# ls /coda/myrealm.yeh/test
>
> Until this step it's okay. I can create files in volume test.

That is a good start.

> It becomes complicated when on the scm, I block all traffic using iptables.
> I see the client starting sending messages to the replica (via tcpdump).
> But when I unblock the traffic on the scm I always get the same error.
> On the scm:
> 18:25:10 GetVolObj: Volume (1000002) already write locked
> 18:25:10 RS_LockAndFetch: Error 11 during GetVolObj for 1000002.1.1
> 18:25:46 LockQueue Manager: found entry for volume 0x1000002

The "Volume xxx already write locked" message sounds very ominous, but it is really just a debugging message added to help debug Rune's issues. It happens whenever we get a read operation for a volume that is write locked. At this point we used to start waiting for the write to complete, which uses up a server thread. However, Rune was describing some sort of a deadlock issue, so instead of silently sleeping we now loudly complain, return an error, and leave it up to the client to retry the operation.

He is running a non-replicated server, so his testing never hit the resolution case, and either way it doesn't seem to have solved his issues, so I'll probably revert this change. Especially as there is now no queueing on these locks, so readers are in some cases not able to obtain the lock.

> On the replica:
> 18:34:36 Going to spool log entry for phase3
> 18:34:38 CheckRetCodes: server 132.227.168.169 returned error 11
> 18:34:38 ViceResolve: Couldnt lock volume 7f000001 at all accessible servers

Ok, so it fails to lock on the SCM and continues resolving with only the remaining servers (the replica), which of course doesn't really help much. Looks like resolution doesn't really like to get bounced back because the lock happened to be taken.

> On the client I got a dangling symlink for volume test.

Right, resolution failed to get all replicas in sync, so the client is still seeing different copies on different sites and shows the dangling symlink to indicate that the user should 'repair' the problem. In this case repair would probably be something like,

$ repair
repair> beginrepair /coda/myrealm.yeh
repair> comparedirs /tmp/fix
repair> dorepair
repair> end

Of course it would have been nice if resolution had succeeded.

Jan
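The "complain and return an error, let the client retry" behaviour Jan describes can be illustrated with a small sketch. This is not Coda's server code: `VolumeLock` and `read_with_retry` are hypothetical names, and the sketch assumes that error 11 in the logs corresponds to EAGAIN, as it does on Linux.

```python
import errno
import os
import time


class VolumeLock:
    """Toy volume lock: readers are refused, not queued, while a writer holds it."""

    def __init__(self):
        self.write_locked = False

    def try_read(self):
        # Non-blocking: return an error code instead of sleeping on the lock.
        return errno.EAGAIN if self.write_locked else 0


def read_with_retry(lock, attempts=3, delay=0.0):
    """Client side (hypothetical): retry a bounded number of times, then give up."""
    for attempt in range(1, attempts + 1):
        if lock.try_read() == 0:
            return attempt  # number of tries it took
        time.sleep(delay)
    # The lock never became free; surface the error to the caller.
    raise OSError(errno.EAGAIN, os.strerror(errno.EAGAIN))
```

With no queueing, a reader only succeeds if the write lock happens to be free at the moment it asks. If the write lock is never released, every retry fails and the caller ends up with the error.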
|
|
Re: Problems with replication on two servers.

You wrote:
> This problem stems quite certainly from one of the original assumptions of
> Coda design - the servers are treated as well-connected to each other, in
> contrast to the clients which may have unreliable connections.

So, after further tests, I would like to ask whether Coda is designed for mobility. I mean:

Let's assume we work in a multi-site environment. For various reasons we don't want any traffic from our users to go over our VPN (a dedicated inter-site line with 99% availability). So each site runs its own Coda server (one of them is the SCM).

Some users, because they move a lot between sites, use a laptop and have their volume replicated.

Once they are at a site, they can only reach the server of that site. I didn't manage to get a client working this way.

Can Coda handle this scheme? Are the volumes the only things that can move around? Or can a client move from one server to another?

You wrote:
> The developers' resources are limited, so your best bet would be to join
> the development. Unfortunately the "entry threshold" is quite high because
> of the code being complex and still reflecting the years of
> research-oriented programming.

It would be an honor for me to join the dev team, but I don't think I meet the quality requirements.

Thanks a lot.

Marc
|
|
Re: Problems with replication on two servers.

Hello Jan,

On Thu, Apr 23, 2009 at 04:52:48PM -0400, Jan Harkes wrote:
> > 18:25:10 GetVolObj: Volume (1000002) already write locked
> > 18:25:10 RS_LockAndFetch: Error 11 during GetVolObj for 1000002.1.1
> > 18:25:46 LockQueue Manager: found entry for volume 0x1000002
>
> The volume xxx already write locked sounds very ominous, but it is
> really just a debugging message added to help debug Rune's issues.

> He is running a non-replicated server, so his testing never hit the
> resolution case, and either way it doesn't seem to have solved his
> issues, so I'll probably revert this change. Especially as now there is
> no queueing on these locks so readers are in some cases not able to
> obtain the lock.

Trying to minimize confusion:

I get this message in a replicated scenario quite regularly, "forever"; the only way to get out of this situation is to restart the server with the runaway lock. It is not the same as the harmless messages on the single server.

I did not complain loudly, as I see this on servers where one of them has a slow connection which can potentially be flooded (say by sftp's inflexible resend policy) and become unreliable. You said it is an unsupported configuration :)

The error I observe on the clients, though, is "Resource temporarily unavailable", not a dangling link.

I have a realm with a volume in this state right now. It is not going to recover on its own, nor could I use repair.

[So this has nothing to do with our non-replicated server, which apparently does not deadlock, fully conforming to your expectations, but it still has the "unexpected delays" issue, which may look like a deadlock.]

Regards,
Rune
|
|
Re: Problems with replication on two servers.

Hello Marc,

On Thu, Apr 23, 2009 at 10:59:51PM +0200, Marc SCHLINGER wrote:
> Let's take the assumption, we work in multi-site environment.
> For different reason we don't want any stream from our users to go by our
> vpn, (dedicated inter-site line 99% of availability).
> So each site handles his coda server (one handle the scm).

This is apparently unsupported (it may work, we are using geographically distant servers in a couple of realms, but the network between the servers must not be flaky, no matter what).

> Some users because they move a lot between each site use a laptop and have
> their volume replicated.
>
> Once they are in a site they can only join the server of this site.

You are thinking in an "inappropriate" way. Clients interact with _realms_, and internally pick servers to talk to. There is no user interface to those internals and there is not supposed to be one.

> Can coda handle this scheme? Are the volumes the only things that can move
> around? Or can
> a client move from one server to another?

A client will talk to some server(s) of the realm depending on what it thinks of the network bandwidth and of the servers' availability. That's it. It will move "from server to server" when it feels like doing that; it is totally Coda's internal business.

This is not as bad as it may seem. Most often the clients pick the "nearest" server, but you do not have any guarantee that they do so each time; they don't.

> Thanks a lot.

Hope this helps to see what one may expect and what one shouldn't expect from Coda.

Best regards,
Rune
|
|
Re: Problems with replication on two servers.

Hello,

u+codalist-wk5r@... wrote:
> Hello Marc,
>
> On Thu, Apr 23, 2009 at 10:59:51PM +0200, Marc SCHLINGER wrote:
>
>> Let's take the assumption, we work in multi-site environment.
>> For different reason we don't want any stream from our users to go by our
>> vpn, (dedicated inter-site line 99% of availability).
>> So each site handles his coda server (one handle the scm).
>>
>
> This is apparently unsupported (it may work,
> we are using geographically distant servers in a couple of realms,
> but the network between the servers must not be flaky, no matter what)
>
>
>> Some users because they move a lot between each site use a laptop and have
>> their volume replicated.
>>
>> Once they are in a site they can only join the server of this site.
>>
>
> You are thinking in an "inappropriate" way. Clients interact with _realms_,
> and internally pick servers to talk to. There is no user interface
> to those internals and it is not supposed to be present.
>
>
>> Can coda handle this scheme? Are the volumes the only things that can move
>> around? Or can
>> a client move from one server to another?
>>
>
> A client will talk to some server(s) of the realm depending on what it thinks
> of network bandwidth and of the servers' availability. That's it.
> It will move "from server to server" when it feels like doing that,
> it is totally Coda's internal business.
>
> This is not as bad as it may seem, most often the clients pick the "nearest"
> server, but you do not have any guarantee that they do so each time,
> they don't.
>
>
>> Thanks a lot.
>>
>
> Hope this helps to see what one may expect and what one shouldn't expect
> from Coda.
>

I've done a couple of tests, and I've found various pieces of information in the documentation, but they all seem to conflict.

Given that we have two servers: can a client modify a volume, send the changes to the one server it can reach, and have that server keep the other replicas of this volume up to date, without the client connecting to or notifying the second server?

Sorry if I'm insistent, but I'm not fully at ease with English.

Thanks.

Marc
|
|
Re: Problems with replication on two servers.

Hi Marc,

On Wed, Jun 24, 2009 at 03:13:57PM +0200, Marc SCHLINGER wrote:
> Given that we have two servers,
> Can a client modify a volume, send the information to one server he can
> join, and have the server keeps the differents replicas of this volume
> up-to-date, whithout the client connecting to or warning the 2nd server?

No, not without _some_ client having contact with _both_ servers and accessing the modified file.

Then the client notices that the two copies of the file are not congruent (there are so-called version vectors which make it possible to detect such a situation) and tells the servers to synchronise the file, which they do. Then and only then do both file replicas become up to date.

Hope this helps.

Regards,
Rune
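To illustrate how version vectors let a client detect that two replicas differ, here is a minimal sketch of a generic version-vector comparison. It is not taken from the Coda sources: the `Order` and `compare` names, the one-counter-per-server layout and the example counters are all assumptions made for illustration.

```python
from enum import Enum


class Order(Enum):
    EQUAL = "equal"
    NEWER = "newer"        # first vector dominates the second
    OLDER = "older"        # second vector dominates the first
    CONFLICT = "conflict"  # concurrent updates, neither dominates


def compare(a, b):
    """Compare two version vectors holding one update counter per server."""
    if len(a) != len(b):
        raise ValueError("version vectors must have the same length")
    a_ahead = any(x > y for x, y in zip(a, b))
    b_ahead = any(y > x for x, y in zip(a, b))
    if a_ahead and b_ahead:
        return Order.CONFLICT
    if a_ahead:
        return Order.NEWER
    if b_ahead:
        return Order.OLDER
    return Order.EQUAL


# Hypothetical counters, ordered as (scm, replica).
scm_copy = (1, 1)      # missed the update while it was unreachable
replica_copy = (1, 2)  # accepted the client's write
print(compare(scm_copy, replica_copy))  # Order.OLDER: the scm copy is stale
```

When one vector dominates the other, the stale copy can simply be brought up to date. When neither dominates, the copies were updated independently and need to be reconciled, which in this thread is the case where repair comes in.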