Sun Cluster 3.3 - Bypassing amnesia prevention
The problem is probably familiar to all of you. If I shut down my nodes one after the other, the last node to leave the cluster (host v1 in our case) should be the first to boot, in order to preserve CCR consistency.
However, if that node refuses to start up (because of some HW error), I am in trouble: all the other nodes get stuck at "attempting to join cluster", since the quorum key on the quorum device belongs to v1 (the last node).
So do we have to wait until the first one is repaired? Not really; here is a solution.
Our starting point is the following: an operational cluster, 3 votes, everything is fine.
bash-3.00# clq status
=== Cluster Quorum ===
--- Quorum Votes Summary from (latest node reconfiguration) ---
Needed   Present   Possible
------   -------   --------
2        3         3
--- Quorum Votes by Node (current status) ---
Node Name   Present   Possible   Status
---------   -------   --------   ------
v1          1         1          Online
v2          1         1          Online
--- Quorum Votes by Device (current status) ---
Device Name   Present   Possible   Status
-----------   -------   --------   ------
d3            1         1          Online
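The "Needed" column follows the usual majority rule: more than half of the possible votes must be present. Here is a minimal sketch of that arithmetic; needed_votes is a helper name made up for this post, not a cluster command.

```shell
#!/bin/sh
# needed_votes: hypothetical helper, NOT part of Sun Cluster.
# Majority rule: needed = floor(possible / 2) + 1
needed_votes() {
    echo $(( $1 / 2 + 1 ))
}

# Two nodes plus one quorum device: 3 possible votes.
needed_votes 3    # prints 2
# A single voting node and no quorum device: 1 possible vote.
needed_votes 1    # prints 1
```

With 3 possible votes, a lone node holding 1 vote can never reach the 2 it needs unless it also gets the vote of the quorum device.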
Then, let us stop v2 first, then v1. If we now try to start v2, it is going to wait forever.
NOTICE: CMM: Node v1 (nodeid = 1) with votecount = 1 added.
NOTICE: CMM: Node v2 (nodeid = 2) with votecount = 1 added.
WARNING: CMM: Open failed for quorum device /dev/did/rdsk/d3s2 with error 1.
NOTICE: clcomm: Adapter bge3 constructed
NOTICE: clcomm: Adapter bge2 constructed
NOTICE: CMM: Node v2: attempting to join cluster.
...
Jan 2 16:33:19 v2 cl_runtime: NOTICE: CMM: Cluster doesn't have operational quorum yet; waiting for quorum.
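If you collect console output or messages somewhere, the hang can be spotted by looking for that last message. This is a small sketch only; waiting_for_quorum is a made-up helper name, and it simply reads log text on standard input.

```shell
#!/bin/sh
# waiting_for_quorum: hypothetical helper, NOT a cluster command.
# Succeeds (exit 0) if the log text on stdin contains the CMM message
# that a node prints while it is blocked waiting for quorum.
waiting_for_quorum() {
    grep "Cluster doesn't have operational quorum yet" >/dev/null
}

if waiting_for_quorum <<'EOF'
NOTICE: CMM: Node v2: attempting to join cluster.
Jan 2 16:33:19 v2 cl_runtime: NOTICE: CMM: Cluster doesn't have operational quorum yet; waiting for quorum.
EOF
then
    echo "node is blocked waiting for quorum"
fi
```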
Let us check the reason for the hang.
root@v2# egrep -i 'nodes...name|key' /etc/cluster/ccr/global/infrastructure
cluster.nodes.1.name v1
cluster.nodes.1.properties.quorum_resv_key 0x4DF9565B00000001
cluster.nodes.2.name v2
cluster.nodes.2.properties.quorum_resv_key 0x4DF9565B00000002
root@v2# egrep -i 'quorum_dev' /etc/cluster/ccr/global/infrastructure
cluster.quorum_devices.1.name d3
cluster.quorum_devices.1.state enabled
cluster.quorum_devices.1.properties.votecount 1
cluster.quorum_devices.1.properties.gdevname /dev/did/rdsk/d3s2
cluster.quorum_devices.1.properties.path_1 enabled
cluster.quorum_devices.1.properties.path_2 enabled
cluster.quorum_devices.1.properties.access_mode scsi2
cluster.quorum_devices.1.properties.type shared_disk
root@v2# /usr/cluster/lib/sc/pgre -c pgre_inkeys -d /dev/did/rdsk/d3s2
key[0]=0x4df9565b00000001.
OK, so v1's key is there, blocking our way to boot up in cluster mode.
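How do we know the key is v1's? Comparing the values above, each quorum_resv_key looks like the cluster_id (0x4DF9565B) followed by the node id. That layout is my reading of these values, not something taken from documentation, so treat the sketch below as an assumption. split_resv_key is a made-up helper name.

```shell
#!/bin/sh
# split_resv_key: hypothetical helper, for illustration only.
# ASSUMPTION: the 64-bit key is <32-bit cluster_id><32-bit node id>,
# which is what the values shown in this post suggest.
split_resv_key() {
    # Drop the 0x prefix and the trailing dot seen in the pgre output,
    # then upper-case the hex digits.
    hex=$(printf '%s' "$1" | sed -e 's/^0[xX]//' -e 's/\.$//' | tr 'a-f' 'A-F')
    cluster_id=$(printf '%s' "$hex" | cut -c1-8)
    node_hex=$(printf '%s' "$hex" | cut -c9-16)
    # printf %d converts the hexadecimal node id to decimal.
    printf 'cluster_id=0x%s nodeid=%d\n' "$cluster_id" "0x$node_hex"
}

split_resv_key 0x4df9565b00000001.
```

For the key read from d3 this gives node id 1, which the CCR maps to v1 (cluster.nodes.1.name).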
It might seem an obvious solution to wipe this key out with pgre -c pgre_scrub, but that would end with the same result: with no keys on the quorum device, we would still have to wait for the other node to reach operational quorum. So what to do then? Edit the CCR, of course B)
Reboot v2 into non-cluster mode (the -x boot flag) and open the infrastructure file:
root@v2$ reboot -- -x
root@v2$ cd /etc/cluster/ccr/global/
root@v2$ vi infrastructure
First, enable install mode.
cluster.name test
cluster.state enabled
cluster.properties.cluster_id 0x4DF9565B
cluster.properties.installmode enabled
Set the vote count of v1 to 0.
cluster.nodes.1.name v1
cluster.nodes.1.state enabled
cluster.nodes.1.properties.private_hostname clusternode1-priv
cluster.nodes.1.properties.quorum_vote 0
And finally, remove every reference to the quorum device.
Do not forget to update the checksum in the file, otherwise the node ends up saying "Corrupted CCR".
root@v2# /usr/cluster/lib/sc/ccradm recover -o infrastructure
root@v2# reboot
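For reference, the three manual edits can be written down as a small filter. This is a sketch under assumptions, not a supported procedure: edit_infrastructure is a made-up name, node 1 is assumed to be the failed node (v1, as in this post), and the "before" values in the sample input (installmode disabled, quorum_vote 1) are my guess at what the file held. The filter only reads stdin and writes stdout; it never touches /etc/cluster/ccr itself.

```shell
#!/bin/sh
# edit_infrastructure: hypothetical filter, NOT a cluster tool.
# Reads "key value" lines on stdin, prints the edited lines on stdout.
# ASSUMPTION: node 1 is the failed node (v1 in this post).
edit_infrastructure() {
    awk '
        # 1. Enable install mode (replace the value, keep the separator).
        $1 == "cluster.properties.installmode" { sub(/[^ \t]+$/, "enabled"); print; next }
        # 2. Give the failed node zero votes.
        $1 == "cluster.nodes.1.properties.quorum_vote" { sub(/[^ \t]+$/, "0"); print; next }
        # 3. Drop every quorum device entry.
        $1 ~ /^cluster\.quorum_devices\./ { next }
        # Everything else passes through unchanged.
        { print }
    '
}

# Sample input; the "before" values are assumed, not taken from a real CCR.
edit_infrastructure <<'EOF'
cluster.properties.installmode disabled
cluster.nodes.1.name v1
cluster.nodes.1.properties.quorum_vote 1
cluster.nodes.2.name v2
cluster.quorum_devices.1.name d3
cluster.quorum_devices.1.properties.votecount 1
EOF
```

If you try something like this, run it on a copy of the file and compare the result by eye. The checksum step with ccradm recover -o is still needed afterwards.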
After the reboot, v2 comes up as a single cluster node. The quorum votes now look like this.
root@v2# clq status
=== Cluster Quorum ===
--- Quorum Votes Summary from (latest node reconfiguration) ---
Needed   Present   Possible
------   -------   --------
1        1         1
--- Quorum Votes by Node (current status) ---
Node Name   Present   Possible   Status
---------   -------   --------   ------
v1          0         0          Offline
v2          1         1          Online
So are we there yet?
bash-3.00# clrs status
=== Cluster Resources ===
Resource Name   Node Name   State     Status Message
-------------   ---------   -----     --------------
test-lh         v1          Offline   Offline
                v2          Online    Online - LogicalHostname online.
Our dummy resource is up and running. That's fine. But can I configure another one?
bash-3.00# clrslh create -g test-rg -h test-ip2 test-lh2
clrslh: v1 not a cluster member
It seems we fail to validate the existence of the hostname test-ip2 on the other side (v1).
So until v1 is repaired, we have to postpone any change in the configuration. Really?
Well, if repairing v1 looks like it will take a long time, we might choose to wipe the entire configuration of v1 and end up with a single-node cluster.
bash-3.00# clnode clear v1
clnode: Node "v1" is still in use by resource group "test-rg".
Well, that's not so easy. First you have to clear the definitions of v1 from all shared disk paths, interconnects, services, resource groups, whatever... That can take long, but if there is no other choice...
But let us see how to get back once v1 is finally repaired.
=== Cluster Quorum ===
--- Quorum Votes Summary from (latest node reconfiguration) ---
Needed   Present   Possible
------   -------   --------
1        1         1
--- Quorum Votes by Node (current status) ---
Node Name   Present   Possible   Status
---------   -------   --------   ------
v1          0         0          Online
v2          1         1          Online
To get out of install mode, simply define a quorum device, the same way as in the "One node at once" type of cluster installation.
bash-3.00# clq add d3
Jan 2 17:05:21 v2 cl_runtime: NOTICE: CMM: Cluster members: v1 v2.
Jan 2 17:05:21 v2 cl_runtime: NOTICE: CMM: node reconfiguration #5 completed.
bash-3.00# clq reset
Jan 2 17:05:29 v2 cl_runtime: NOTICE: CMM: Votecount changed from 0 to 1 for node v1.
Jan 2 17:05:29 v2 cl_runtime: NOTICE: CMM: Cluster members: v1 v2.
Jan 2 17:05:29 v2 cl_runtime: NOTICE: CMM: node reconfiguration #6 completed.
Jan 2 17:05:30 v2 cl_runtime: NOTICE: CMM: Quorum device 1 (/dev/did/rdsk/d3s2) added; votecount = 1, bitmask of nodes with configured paths = 0x3.
Jan 2 17:05:30 v2 cl_runtime: NOTICE: CMM: Registered key on and acquired quorum device 1 (gdevname /dev/did/rdsk/d3s2).
Jan 2 17:05:30 v2 cl_runtime: NOTICE: CMM: Quorum device /dev/did/rdsk/d3s2: owner set to node 2.
Jan 2 17:05:30 v2 cl_runtime: NOTICE: CMM: Cluster members: v1 v2.
Jan 2 17:05:30 v2 cl_runtime: NOTICE: CMM: node reconfiguration #7 completed.
Jan 2 17:05:31 v2 cl_runtime: NOTICE: CMM: Quorum device /dev/did/rdsk/d3s2: owner set to node 2.
So, finally.
=== Cluster Quorum ===
--- Quorum Votes Summary from (latest node reconfiguration) ---
Needed   Present   Possible
------   -------   --------
2        3         3
--- Quorum Votes by Node (current status) ---
Node Name   Present   Possible   Status
---------   -------   --------   ------
v1          1         1          Online
v2          1         1          Online
--- Quorum Votes by Device (current status) ---
Device Name   Present   Possible   Status
-----------   -------   --------   ------
d3            1         1          Online