Sun Cluster mediator problem in a two-node cluster
Recently we found a "bug" in Sun Cluster 3.2. In a two-node "metropolitan cluster", a site separation ends in the death of the SVM-based device group. The situation is the following.
We have two nodes (V1, V2), and we use a quorum server (qserver) to provide the necessary votes. We run an SVM device group (test) and a resource group (nfs-rg) with a HAStoragePlus resource (nfs-stor) that manages a volume from the disk set test:
V1:~# clrg status
=== Cluster Resource Groups ===
Group Name Node Name Suspended Status
---------- --------- --------- ------
nfs-rg V1 No Online
V2 No Offline
V1:~# clrs status
=== Cluster Resources ===
Resource Name Node Name State Status Message
------------- --------- ----- --------------
nfs-server V1 Online Online - LogicalHostname online.
V2 Offline Offline
nfs-stor V1 Online Online
V2 Offline Offline
nfs-res V1 Online Online - Service is online.
V2 Offline Offline
V1:~# cldg status
=== Cluster Device Groups ===
--- Device Group Status ---
Device Group Name Primary Secondary Status
----------------- ------- --------- ------
test V1 V2 Online
V1:~# df -h
Filesystem size used avail capacity Mounted on
/dev/dsk/c1t0d0s0 31G 7.3G 23G 24% /
...
/dev/did/dsk/d2s3 480M 3.6M 428M 1% /global/.devices/node@1
/dev/md/test/dsk/d0 188M 23M 146M 14% /test
/dev/did/dsk/d11s3 480M 3.6M 428M 1% /global/.devices/node@2
V1:~# metaset -s test
Set name = test, Set number = 1
Host Owner
V1 Yes
V2
Mediator Host(s) Aliases
V1
V2
Driv Dbase
d4 Yes
d7 Yes
V1:~# clq status
=== Cluster Quorum ===
--- Quorum Votes Summary ---
Needed Present Possible
------ ------- --------
2 3 3
--- Quorum Votes by Node ---
Node Name Present Possible Status
--------- ------- -------- ------
V1 1 1 Online
V2 1 1 Online
--- Quorum Votes by Device ---
Device Name Present Possible Status
----------- ------- -------- ------
qs 1 1 Online
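The "Needed" column in the summary above is plain majority arithmetic. A minimal Python sketch, assuming the cluster requires strictly more than half of all possible votes (which matches the 2-of-3 shown here); the function name is hypothetical and not part of any Sun Cluster tool:

```python
def votes_needed(possible_votes: int) -> int:
    """Smallest vote count that is a strict majority of possible_votes."""
    return possible_votes // 2 + 1


# One vote each for the two nodes and the quorum server, as in the clq output.
possible = {"V1": 1, "V2": 1, "qs": 1}
needed = votes_needed(sum(possible.values()))

# If one node is lost, the surviving node plus the quorum server remain.
present_without_v1 = possible["V2"] + possible["qs"]
cluster_keeps_quorum = present_without_v1 >= needed
```

Note that this only covers the cluster framework quorum. The SVM disk set has its own, separate replica and mediator accounting.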
Now, simulating a "site failure", we shut down V1 and one storage box, so that half of the metadb replicas and half of the mediators are lost. According to a SunSolve document, the mediators are intended to provide in-memory state database replicas in case of disk failures, but they fail too if a mediator vote is lost at the same time:
3. If the replica quorum is not met, half of the replicas is accessible, the mediator quorum is not met, half of the mediator hosts is accessible, and the replica and mediator data match, the system prompts you to grant or deny access to the diskset.
- Replicas (diskset) == half
- Mediator hosts (diskset) == half
- Replicas (diskset) ~= Mediator hosts (diskset)
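The rule quoted above can be written down as a small decision function. This is only a sketch in Python: the names are hypothetical, only the branch marked "rule 3" comes from the quoted text, and the other branches are my assumption of how the preceding rules work (replica majority wins outright, mediator majority breaks a 50/50 replica tie).

```python
from enum import Enum


class TakeResult(Enum):
    GRANTED = "access granted"
    PROMPT = "user intervention required"
    DENIED = "access denied"


def diskset_take(replicas_ok, replicas_total,
                 mediators_ok, mediators_total, data_match=True):
    """Model of the decision made when a node tries to take a disk set."""
    # Assumed: a strict majority of replicas is sufficient on its own.
    if 2 * replicas_ok > replicas_total:
        return TakeResult.GRANTED
    # Assumed: with fewer than half of the replicas, mediators cannot help.
    if 2 * replicas_ok < replicas_total:
        return TakeResult.DENIED
    # Exactly half of the replicas: the mediators are consulted.
    if not data_match:
        return TakeResult.DENIED
    # Assumed: a strict majority of mediator hosts breaks the tie.
    if 2 * mediators_ok > mediators_total:
        return TakeResult.GRANTED
    # Rule 3 as quoted: half the replicas, half the mediator hosts.
    if 2 * mediators_ok == mediators_total:
        return TakeResult.PROMPT
    return TakeResult.DENIED


# Site failure in this setup: one of two disks, one of two mediator hosts.
result = diskset_take(1, 2, 1, 2)
```

With two mediator hosts that sit at the same two sites as the disks, a site failure always lands in the "prompt" branch, and nobody is there to answer the prompt during an automatic failover.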
In our case this means the death of our disk set:
Feb 18 14:56:41 V2 Cluster.Framework: [ID 801593 daemon.error] stderr: metaset: V2: test: 50replicas & 50mediator hosts available, user intervention required
Feb 18 14:56:41 V2 Cluster.Framework: [ID 801593 daemon.notice] stdout: Only 50% replicas and 50% mediator hosts available for diskset test
Feb 18 14:56:41 V2 cl_runtime: [ID 371250 kern.warning] WARNING: Failed to set this node as primary for service 'test'.
Feb 18 14:56:41 V2 SC[,SUNW.HAStoragePlus:6,nfs-rg,nfs-stor,hastorageplus_prenet_start]: [ID 579208 daemon.error] Global service test associated with path /test is found to be in maintenance state.
As the mediator votes did not help, what about adding more disk-based votes? Let's try iSCSI!
qserver:~# mkdir /iscsi
qserver:~# iscsitadm modify admin -d /iscsi
qserver:~# iscsitadm create target --size 200m qs-disk
V1:~# iscsiadm add discovery-address qserver
V1:~# iscsiadm modify discovery --sendtargets enable
V1:~# iscsiadm list target
Target: iqn.1986-03.com.sun:02:06552d12-12f1-482b-d7ba-951dc13e0187.qs-disk
Alias: qs-disk
TPGT: 1
ISID: 4000002a0000
Connections: 1
V1:~# devfsadm -i iscsi
V1:~# cldev refresh -v
Successfully refreshed DID devices on node V1
V1:~# cldev list -v
DID Device Full Device Path
---------- ----------------
...
d13 V2:/dev/rdsk/c5t010000E081587B3900002A00499C08A1d0
d13 V1:/dev/rdsk/c5t010000E081587B3900002A00499C08A1d0
This procedure must be performed on both nodes. Unfortunately, iSCSI devices cannot be used as shared devices (so they cannot serve as a quorum disk), but for holding an additional replica they are perfect. So let's include the new device in the diskset.
V1:~# metaset -s test -a /dev/did/rdsk/d13
V1:~# metaset -s test
Set name = test, Set number = 1
Host Owner
V1 Yes
V2
Mediator Host(s) Aliases
V1
V2
Drive Dbase
d4 Yes
d7 Yes
d13 Yes
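Why a third drive should help can be checked with a few lines of Python. This is a sketch under two assumptions that the post does not state explicitly: every drive in the set carries the same number of replicas, and the disk set can be taken as long as strictly more than half of the replicas are reachable. The function name is hypothetical.

```python
def replica_quorum_met(available: int, total: int) -> bool:
    """True if strictly more than half of the replicas are reachable."""
    return 2 * available > total


# Before: d4 and d7 only, and a site failure takes one of them away.
two_drives = replica_quorum_met(available=1, total=2)

# After: d13 lives on the quorum server, so it survives either site failing.
three_drives = replica_quorum_met(available=2, total=3)
```

With an odd number of drives spread over three locations, the exact 50% case that triggers the mediator rules no longer arises from a single site failure.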
Let's try a site failure again! As seen below, the disk set and the affected volume need maintenance (because one of the mirror legs became unavailable), but the volume is still available.
V2:~# clrg status
=== Cluster Resource Groups ===
Group Name Node Name Suspended Status
---------- --------- --------- ------
nfs-rg V1 No Offline
V2 No Online
V2:~# clq status
=== Cluster Quorum ===
--- Quorum Votes Summary ---
Needed Present Possible
------ ------- --------
2 2 3
--- Quorum Votes by Node ---
Node Name Present Possible Status
--------- ------- -------- ------
V1 0 1 Offline
V2 1 1 Online
--- Quorum Votes by Device ---
Device Name Present Possible Status
----------- ------- -------- ------
qs 1 1 Online
V2:~# metastat -s test -c
test/d0 p 200MB test/d30
test/d30 m 68GB test/d10 (maint) test/d20
test/d10 s 68GB d4s0 (maint)
test/d20 s 68GB d7s0
Well, that smells like a hack, but it is a solution if you want to avoid buying a VxVM licence or upgrading to Sun Cluster 3.2u3. (In the latest release of Sun Cluster, you are allowed to add an additional server, such as our quorum server, as a mediator to any diskset.)