
Sun Cluster 3.3 - Bypassing amnesia prevention

The problem is probably familiar to all of you. If I shut down my nodes one after the other, the last node to leave the cluster (host v1 in our case) should be the first one to boot, in order to preserve CCR consistency.
However, if that node refuses to start up (because of some HW error), I am in trouble: all the other nodes are stuck at "attempting to join cluster", since the quorum key on the quorum device belongs to v1 (the last node).
So do we have to wait until v1 is repaired? Not really, here is a solution.


So our starting point is the following: an operational cluster, 3 votes (two nodes plus one quorum device), everything is fine.

bash-3.00# clq status

=== Cluster Quorum ===

--- Quorum Votes Summary from (latest node reconfiguration) ---

            Needed   Present   Possible
            ------   -------   --------
            2        3         3


--- Quorum Votes by Node (current status) ---

Node Name       Present       Possible       Status
---------       -------       --------       ------
v1              1             1              Online
v2              1             1              Online


--- Quorum Votes by Device (current status) ---

Device Name       Present      Possible      Status
-----------       -------      --------      ------
d3                1            1             Online
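
Why "Needed 2" out of "Possible 3"? A minimal sketch, assuming the cluster requires a simple majority of all possible votes (this matches the numbers shown in this post; the function name quorum_needed is made up for illustration, it is not a cluster command):

```shell
#!/bin/sh
# Votes needed for quorum, ASSUMING a simple majority of all possible votes.
# quorum_needed is a hypothetical helper, not part of Sun Cluster.
quorum_needed() {
    possible=$1
    echo $(( possible / 2 + 1 ))
}

quorum_needed 3   # two nodes plus one quorum device
quorum_needed 1   # a single voting node
```

With 3 possible votes a lone node holding 1 vote is one short, which is exactly the situation the rest of this post deals with.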

Then let us stop v2 first, then v1. If we now try to start v2, it is going to wait forever.

NOTICE: CMM: Node v1 (nodeid = 1) with votecount = 1 added.
NOTICE: CMM: Node v2 (nodeid = 2) with votecount = 1 added.
WARNING: CMM: Open failed for quorum device /dev/did/rdsk/d3s2 with error 1.
NOTICE: clcomm: Adapter bge3 constructed
NOTICE: clcomm: Adapter bge2 constructed
NOTICE: CMM: Node v2: attempting to join cluster.
...
Jan  2 16:33:19 v2 cl_runtime: NOTICE: CMM: Cluster doesn't have operational quorum yet; waiting for quorum.

Let us check the reason for the hang.

root@v2# egrep -i 'nodes...name|key' /etc/cluster/ccr/global/infrastructure
cluster.nodes.1.name    v1
cluster.nodes.1.properties.quorum_resv_key      0x4DF9565B00000001
cluster.nodes.2.name    v2
cluster.nodes.2.properties.quorum_resv_key      0x4DF9565B00000002
root@v2# egrep -i 'quorum_dev' /etc/cluster/ccr/global/infrastructure
cluster.quorum_devices.1.name   d3
cluster.quorum_devices.1.state  enabled
cluster.quorum_devices.1.properties.votecount   1
cluster.quorum_devices.1.properties.gdevname    /dev/did/rdsk/d3s2
cluster.quorum_devices.1.properties.path_1      enabled
cluster.quorum_devices.1.properties.path_2      enabled
cluster.quorum_devices.1.properties.access_mode scsi2
cluster.quorum_devices.1.properties.type        shared_disk
root@v2# /usr/cluster/lib/sc/pgre  -c pgre_inkeys -d /dev/did/rdsk/d3s2 
key[0]=0x4df9565b00000001.

OK, so v1's key is there, blocking our way to boot up in cluster mode.
It might seem an obvious solution to wipe this key out with pgre -c pgre_scrub, but that would end up with the same result: with no keys on the quorum device, we would still have to wait for the other node to reach operational quorum. So what to do then? Edit the CCR, of course B) Boot v2 in non-cluster mode first (the -x boot flag), then edit the infrastructure file.

root@v2# reboot -- -x
root@v2# cd /etc/cluster/ccr/global/
root@v2# vi infrastructure

First, enable install mode.

cluster.name    test
cluster.state   enabled
cluster.properties.cluster_id   0x4DF9565B
cluster.properties.installmode  enabled

Set the vote count of v1 to 0.

cluster.nodes.1.name    v1
cluster.nodes.1.state   enabled
cluster.nodes.1.properties.private_hostname     clusternode1-priv
cluster.nodes.1.properties.quorum_vote  0

And finally, remove every reference to the quorum device.
Do not forget to update the checksum in the file, otherwise the node ends up saying "Corrupted CCR".

root@v2# /usr/cluster/lib/sc/ccradm recover -o infrastructure
root@v2# reboot 
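
The three hand edits described above (made before the checksum update) could also be scripted. This is only a sketch under assumptions: it works on a copy of the file, it assumes every line is "key, whitespace, value" as in the excerpts shown, and it assumes the installmode line already exists (it does not add one). The function name patch_infrastructure and its arguments are made up for illustration.

```shell
#!/bin/sh
# Apply the three edits to a COPY of the infrastructure file.
# Hypothetical usage: patch_infrastructure <input-copy> <output-file> <nodeid>
# where <nodeid> is the ID of the node that cannot boot (1 for v1 here).
patch_infrastructure() {
    in=$1
    out=$2
    dead=$3
    awk -v dead="$dead" '
        # Drop every quorum device entry.
        $1 ~ /^cluster\.quorum_devices\./ { next }
        # Enable install mode (only rewrites an existing line).
        $1 == "cluster.properties.installmode" { print $1 "\tenabled"; next }
        # Take the vote away from the unavailable node.
        $1 == ("cluster.nodes." dead ".properties.quorum_vote") { print $1 "\t0"; next }
        # Everything else passes through unchanged.
        { print }
    ' "$in" > "$out"
}
```

Compare the result with the original (for example with diff) before putting it in place; the checksum still has to be regenerated with ccradm as shown above.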

After the reboot, v2 comes up as a single cluster node. The quorum votes look like this.

root@v2# clq status

=== Cluster Quorum ===

--- Quorum Votes Summary from (latest node reconfiguration) ---

            Needed   Present   Possible
            ------   -------   --------
            1        1         1


--- Quorum Votes by Node (current status) ---

Node Name       Present       Possible       Status
---------       -------       --------       ------
v1              0             0              Offline
v2              1             1              Online

So are we there yet?

bash-3.00# clrs status

=== Cluster Resources ===

Resource Name       Node Name      State        Status Message
-------------       ---------      -----        --------------
test-lh             v1             Offline      Offline
                    v2             Online       Online - LogicalHostname online.

Our dummy resource is up and running. That's fine. But can I configure another one?


bash-3.00# clrslh create -g test-rg -h test-ip2 test-lh2
clrslh:  v1 not a cluster member

It seems the command fails because the hostname test-ip2 cannot be validated on the other side (v1).
So do we really have to postpone every configuration change until v1 is repaired?
Well, if repairing v1 looks like it will take a long time, we might choose to wipe the entire configuration of v1 and end up with a single-node cluster.

bash-3.00# clnode clear v1
clnode:  Node "v1" is still in use by resource group "test-rg".

Well, that's not so easy. First you have to remove v1 from the definitions of all shared disk paths, interconnects, services, resource groups, whatever... That can take long, but if there is no other choice...

But let us see how to get back once v1 is finally repaired. After it boots and joins the cluster, v1 is online but still has no vote.

=== Cluster Quorum ===

--- Quorum Votes Summary from (latest node reconfiguration) ---

            Needed   Present   Possible
            ------   -------   --------
            1        1         1


--- Quorum Votes by Node (current status) ---

Node Name       Present       Possible       Status
---------       -------       --------       ------
v1              0             0              Online
v2              1             1              Online

To get out of install mode, simply define a quorum device, the same way as in the "one node at a time" type of cluster installation.

bash-3.00# clq add d3

Jan  2 17:05:21 v2 cl_runtime: NOTICE: CMM: Cluster members: v1 v2.
Jan  2 17:05:21 v2 cl_runtime: NOTICE: CMM: node reconfiguration #5 completed.

bash-3.00# clq reset 

Jan  2 17:05:29 v2 cl_runtime: NOTICE: CMM: Votecount changed from 0 to 1 for node v1.
Jan  2 17:05:29 v2 cl_runtime: NOTICE: CMM: Cluster members: v1 v2.
Jan  2 17:05:29 v2 cl_runtime: NOTICE: CMM: node reconfiguration #6 completed.
Jan  2 17:05:30 v2 cl_runtime: NOTICE: CMM: Quorum device 1 (/dev/did/rdsk/d3s2) added; votecount = 1, bitmask of nodes with configured paths = 0x3.
Jan  2 17:05:30 v2 cl_runtime: NOTICE: CMM: Registered key on and acquired quorum device 1 (gdevname /dev/did/rdsk/d3s2).
Jan  2 17:05:30 v2 cl_runtime: NOTICE: CMM: Quorum device /dev/did/rdsk/d3s2: owner set to node 2.
Jan  2 17:05:30 v2 cl_runtime: NOTICE: CMM: Cluster members: v1 v2.
Jan  2 17:05:30 v2 cl_runtime: NOTICE: CMM: node reconfiguration #7 completed.
Jan  2 17:05:31 v2 cl_runtime: NOTICE: CMM: Quorum device /dev/did/rdsk/d3s2: owner set to node 2.

So, finally.

=== Cluster Quorum ===

--- Quorum Votes Summary from (latest node reconfiguration) ---

            Needed   Present   Possible
            ------   -------   --------
            2        3         3


--- Quorum Votes by Node (current status) ---

Node Name       Present       Possible       Status
---------       -------       --------       ------
v1              1             1              Online
v2              1             1              Online


--- Quorum Votes by Device (current status) ---

Device Name       Present      Possible      Status
-----------       -------      --------      ------
d3                1            1             Online
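
To check the same thing from a script, the votes summary can be parsed instead of read by eye. A rough sketch, assuming the summary layout is exactly the one printed in this post (a "Needed Present Possible" header, a line of dashes, then the numbers); quorum_met is a made-up helper name:

```shell
#!/bin/sh
# Read "clq status" output on stdin and report whether Present >= Needed.
# Exit status: 0 = quorum met, 1 = not met, 2 = summary not found.
# ASSUMPTION: the summary layout matches the output shown in this post.
# quorum_met is a hypothetical helper, not a cluster command.
quorum_met() {
    awk '
        /Needed/ && /Present/ && /Possible/ { hdr = NR }
        hdr && NR == hdr + 2 { needed = $1; present = $2; found = 1 }
        END {
            if (!found) exit 2
            exit ((present + 0 >= needed + 0) ? 0 : 1)
        }
    '
}
```

It could be used as: clq status | quorum_met && echo "quorum OK"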
