Release notes for OCFSv2 and NetApp iSCSI failover.

 

Here is important information, taken from Oracle Technology Network, about OCFSv2 timeouts.

With iSCSI on a NetApp cluster, failover can take 10 to 30 seconds, so the OCFSv2 timeout must be higher than the default. I have not determined the best timeout yet, but I recommend a threshold of 46, which gives a fence time of 90 seconds, (46 - 1) * 2:

 

  vi /etc/sysconfig/o2cb

 

 O2CB_HEARTBEAT_THRESHOLD=46

 

The script 019-edit-o2cb.sh makes this change.
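
The contents of 019-edit-o2cb.sh are not reproduced in these notes, so the following is only a minimal sketch of how such an edit could be scripted. The function name set_o2cb_threshold is hypothetical, and the sketch assumes a sed that accepts -i with an attached backup suffix (GNU sed does).

```shell
#!/bin/sh
# Hypothetical helper, not the actual 019-edit-o2cb.sh.
# Usage: set_o2cb_threshold FILE VALUE
set_o2cb_threshold() {
    file=$1
    value=$2
    if grep -q '^O2CB_HEARTBEAT_THRESHOLD=' "$file"; then
        # Setting already present: rewrite it in place, keeping FILE.bak as a backup.
        sed -i.bak "s/^O2CB_HEARTBEAT_THRESHOLD=.*/O2CB_HEARTBEAT_THRESHOLD=$value/" "$file"
    else
        # Setting missing: append it to the end of the file.
        printf 'O2CB_HEARTBEAT_THRESHOLD=%s\n' "$value" >> "$file"
    fi
}
```

It would be called as root, for example: set_o2cb_threshold /etc/sysconfig/o2cb 46. The o2cb stack still has to be restarted for the new value to be picked up.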

 

The text below was taken from the ‘Oracle on FireWire’ document (see http://www.oracle.com/technology/pub/articles/hunter_rac10gr2_2.html#16 ).

Adjust the O2CB Heartbeat Threshold


This is a very important section when configuring OCFS2 for use by Oracle Clusterware's two shared files on our FireWire drive. During testing, I was able to install and configure OCFS2, format the new volume, and finally install Oracle Clusterware (with its two required shared files; the voting disk and OCR file), located on the new OCFS2 volume. I was able to install Oracle Clusterware and see the shared drive; however, during my evaluation I was receiving many lock-ups and hanging after about 15 minutes when the Clusterware software was running on both nodes. It always varied on which node would hang (either linux1 or linux2 in my example). It also didn't matter whether there was a high I/O load or none at all for it to crash (hang).

Keep in mind that the configuration you are creating is a rather low-end setup being configured with slow disk access with regards to the FireWire drive. This is by no means a high-end setup and is susceptible to bogus timeouts.

After looking through the trace files for OCFS2, it was apparent that access to the voting disk was too slow (exceeding the O2CB heartbeat threshold) and causing the Oracle Clusterware software (and the node) to crash.

The solution I used was to simply increase the O2CB heartbeat threshold from its default setting of 7, to 601 (and in some cases as high as 900). This is a configurable parameter that is used to compute the time it takes for a node to "fence" itself.

First, let's see how to determine what the O2CB heartbeat threshold is currently set to. This can be done by querying the /proc file system as follows:

# cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold
7

The value is 7, but what does this value represent? Well, it is used in the formula below to determine the fence time (in seconds):

[fence time in seconds] = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2

So, with an O2CB heartbeat threshold of 7, you would have a fence time of:

(7 - 1) * 2 = 12 seconds

You need a much longer fence time (1200 seconds, to be exact) given your slower FireWire disks. For 1200 seconds, you will want an O2CB_HEARTBEAT_THRESHOLD of 601 as shown below:

(601 - 1) * 2 = 1200 seconds
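
The arithmetic above is easy to script when trying out different values. This is a small sketch that uses only the formula quoted in the text; the function names fence_time and threshold_for are made up for illustration.

```shell
#!/bin/sh
# Fence time in seconds for a given threshold: (threshold - 1) * 2
fence_time() {
    echo $(( ($1 - 1) * 2 ))
}

# Smallest threshold whose fence time is at least the given number of seconds.
# This inverts the formula above and rounds up for odd values.
threshold_for() {
    echo $(( ($1 + 1) / 2 + 1 ))
}

fence_time 7        # prints 12
fence_time 601      # prints 1200
threshold_for 90    # prints 46
```

The last call gives the value of 46 recommended at the top of these notes for a 90 second fence time.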

Let's see now how to increase the O2CB heartbeat threshold from 7 to 601. This will need to be performed on both nodes in the cluster. You first need to modify the file /etc/sysconfig/o2cb and set O2CB_HEARTBEAT_THRESHOLD to 601:

# O2CB_ENABELED: 'true' means to load the driver on boot.
O2CB_ENABLED=true
 
# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=ocfs2
 
# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=601

After modifying the file /etc/sysconfig/o2cb, you need to alter the o2cb configuration. Again, this should be performed on all nodes in the cluster.

# umount /u02/oradata/orcl/
# /etc/init.d/o2cb unload
# /etc/init.d/o2cb configure
 
Load O2CB driver on boot (y/n) [y]: y
Cluster to start on boot (Enter "none" to clear) [ocfs2]: ocfs2
Writing O2CB configuration: OK
Loading module "configfs": OK
Mounting configfs filesystem at /config: OK
Loading module "ocfs2_nodemanager": OK
Loading module "ocfs2_dlm": OK
Loading module "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting cluster ocfs2: OK

You can now check again to make sure the settings took effect for the o2cb cluster stack:

# cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold
601
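
Since this check has to be repeated on every node, it may help to script it. The sketch below compares the value in the configuration file with the value the running stack reports. The function name check_o2cb_threshold is hypothetical, and the two default paths are simply the ones used in the text above; they may differ on other OCFS2 releases.

```shell
#!/bin/sh
# Hypothetical helper: compare configured and running heartbeat thresholds.
# Usage: check_o2cb_threshold [CONFIG_FILE] [PROC_FILE]
check_o2cb_threshold() {
    conf=${1:-/etc/sysconfig/o2cb}
    live=${2:-/proc/fs/ocfs2_nodemanager/hb_dead_threshold}
    # Last uncommented assignment in the config file wins.
    want=$(sed -n 's/^O2CB_HEARTBEAT_THRESHOLD=//p' "$conf" | tail -n 1)
    have=$(cat "$live")
    if [ "$want" = "$have" ]; then
        echo "OK: threshold $have, fence time $(( ($have - 1) * 2 )) seconds"
    else
        echo "MISMATCH: configured '$want', running '$have'" >&2
        return 1
    fi
}
```

A non-zero return means the stack has not been reloaded since the file was edited, or the file was not edited on this node.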

Important Note: The value of 601 used for the O2CB heartbeat threshold will not work for all the FireWire drives listed in this guide. Use the following chart to determine the O2CB heartbeat threshold value that should be used.