6
Administering Oracle Real Application Clusters Guard

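This chapter describes how to administer an Oracle Real Application Clusters Guard environment. Its sections cover planned outages, unplanned outages on one or both nodes, application failover, role change notification, configuration changes, log file management, and recovery from a failover while datafiles are in backup mode.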

Administering Planned Outages

This section covers maintenance on the primary node and maintenance on the secondary node.

Maintenance on the Primary Node

Maintenance, such as hardware repair or an operating system upgrade, requires a planned outage so that the primary role can be moved to the secondary node. Plan it for a part of the business cycle that is less busy and give advance notification to users. To administer a planned outage on the primary node, perform the following steps:

From the PFSCTL command line, enter the move_primary command to move the primary role to the secondary instance:
```
PFSCTL> move_primary
```
Complete maintenance.

Restore the pack to the secondary role on the idle node.

PFSCTL> restore

Note:

The system is now resilient, but the primary and secondary roles are reversed from the initial states. If you want to restore the nodes to their initial states, then continue with the following step.

Move the primary role to the original primary node and the secondary role to the original secondary node (optional):
```
PFSCTL> switchover
```

Maintenance on the Secondary Node

Maintenance on the secondary node does not interrupt operation, but the system is not resilient while the secondary node is down. To administer a planned outage on the secondary node, perform the following steps:

Stop the secondary instance:
```
PFSCTL> stop_secondary
```
Complete maintenance.
Restore the pack on the secondary node.
```
PFSCTL> restore
```

Recovering from an Unplanned Outage on One Node

When an unplanned outage occurs on the primary node, Oracle Real Application Clusters Guard automatically fails over to the secondary node and notifies the user that a role change has occurred. At this point, Oracle Real Application Clusters Guard is operating in a nonresilient state with the primary role on the former secondary node.

After you have performed root cause analysis and repaired the source of the fault, restore the secondary role on the former primary node by using the restore command:

PFSCTL> restore

The primary and secondary roles have now been reversed. Choose one of the following actions:

Operate with Reversed Primary and Secondary Roles

After restoring both packs, you can continue to operate with primary and secondary roles that are reversed from the initial state. For sites with symmetric configurations, there is no need to return to the original state. Returning to the original roles requires a planned outage and can be avoided. In fact, some users intentionally operate with role reversal on a fixed schedule (such as every three months) in order to test the capabilities of the system.

Return to the Original Primary/Secondary Configuration

Returning to the original primary/secondary configuration requires a planned outage while the primary role is moved. Plan it for a less busy part of your business cycle and give advance notice to users. Execute it as follows:

# pfsctl
PFSCTL> switchover

Choose a Less Critical Application to Restore

If your system includes more than one uniquely identified database on each node, then performance may be degraded after a failover. For example, if you have a two-node cluster in a primary/secondary configuration and you are also running an unrelated database on the secondary node, then the secondary node runs the primary services as well as the unrelated database after failover and may be overloaded. In this situation, you should move the less critical service to the other node when it is restored.

Perform the following steps for each of the services that are moved to the restored node:

Set the ORACLE_SERVICE and DB_NAME environment variables. For example:
```
$ export ORACLE_SERVICE=SALES
$ export DB_NAME=sales
```
Restore the instance with secondary role:
```
# pfsctl
PFSCTL> restore
```
Move the primary role to the original primary node:
```
PFSCTL> switchover
```

Recovering from Unplanned Outages on Both Nodes

Figure 6-1 and Figure 6-2 show what happens when both instances of a two-node cluster fail.

Figure 6-1 Failure of Both Instances, Part 1

During normal operation, both Node A and Node B are up and operational. Pack A is running on its home node, Node A, and has the primary role. It contains the primary instance and an IP address. Pack B is running on its home node, Node B, and has the secondary role. It contains the secondary instance and an IP address.

If the primary instance fails, then Oracle Real Application Clusters Guard automatically takes the following failover actions:

The secondary instance becomes the primary instance.
Pack A starts on Node B in foreign mode. This means that only its IP address is activated on Node B.

Now both Pack A and Pack B are running on Node B. Pack B contains the primary instance and its IP address. Pack A contains only an IP address. Nothing is running on Node A. The system is not resilient.

If the new primary instance on Node B also fails, then no instance is running, and Pack A and Pack B contain only IP addresses.

Figure 6-2 shows what happens after this second instance failure.

Figure 6-2 Failure of Both Instances, Part 2

Pack B starts on its foreign node (Node A). Pack A is still running on Node B. Only the IP addresses are up on the nodes. Because there is no instance running, Pack B restarts on its home node and tries to restart the primary instance. If restarting the instance is unsuccessful, Pack B again starts on its foreign node. The outcome of double instance failure is:

Both packs are running on their foreign nodes.
Only the IP addresses are up.
No instances are running.
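
The sequence shown in Figure 6-1 and Figure 6-2 can be traced with a small model. The following Python sketch is illustrative only: the `Pack` class and the `fail_instance` function are hypothetical names, not part of Oracle Real Application Clusters Guard, and the model skips the intermediate attempt by Pack B to restart on its home node.

```python
from dataclasses import dataclass


@dataclass
class Pack:
    name: str
    home: str           # home node of the pack
    node: str           # node on which the pack is currently running
    has_instance: bool  # True while the pack runs a database instance

    @property
    def foreign(self) -> bool:
        # A pack runs in foreign mode when it is not on its home node.
        return self.node != self.home


def fail_instance(pack: Pack, other_node: str) -> None:
    """Model an instance failure: the pack starts on the other node
    in foreign mode, where only its IP address is activated."""
    pack.has_instance = False
    pack.node = other_node


# Normal operation: Pack A holds the primary role, Pack B the secondary role.
pack_a = Pack("Pack A", home="Node A", node="Node A", has_instance=True)
pack_b = Pack("Pack B", home="Node B", node="Node B", has_instance=True)

# First failure: the primary instance in Pack A fails.
fail_instance(pack_a, other_node="Node B")

# Second failure: the new primary instance in Pack B fails and cannot restart.
fail_instance(pack_b, other_node="Node A")

for pack in (pack_a, pack_b):
    print(pack.name, "on", pack.node,
          "| foreign:", pack.foreign,
          "| instance running:", pack.has_instance)
```

Both packs end on their foreign nodes with no instance running, which matches the outcome listed above.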

Diagnose and repair the cause of the failures. To restart the instances, you must perform the following steps:

Halt both of the packs. Enter the following command:
```
PFSCTL> pfshalt
```
You should see output similar to the following:
```
pfshalt command succeeded.
```
Start both of the packs. Enter the following command:
```
PFSCTL> pfsboot
```
You should see output similar to the following:
```
pfsboot command succeeded.
```

Administering Failover of the Applications

Oracle Real Application Clusters Guard restores service quickly. The application must restart transactions when it receives an Oracle message that indicates that failure has occurred.

Failing over the application when the primary instance fails is straightforward. The application sessions receive the ORA-1089 and ORA-1034 Oracle errors for new requests and the ORA-1041, ORA-3113, and ORA-3114 Oracle errors for active requests. These errors must be trapped by the application. At reconnection, the application connects transparently to the new primary instance. For example, in the case of a Web server, the server threads are restarted for each connection pool against the new primary instance. The current transactions are then resubmitted by the clients.
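
One way for an application to trap these errors is to classify the error number and then reconnect and resubmit the work. The following Python sketch assumes that the error reaches the application as an exception whose text contains the `ORA-` number. The function names `failover_error_kind` and `run_with_reconnect`, and the retry limit, are hypothetical and are not part of any Oracle interface.

```python
import re
from typing import Callable, Optional

# Error numbers taken from the paragraph above.
NEW_REQUEST_ERRORS = {1034, 1089}
ACTIVE_REQUEST_ERRORS = {1041, 3113, 3114}

# Accepts both "ORA-3113" and the zero-padded form "ORA-03113".
_ORA_CODE = re.compile(r"ORA-0*(\d+)")


def failover_error_kind(message: str) -> Optional[str]:
    """Return 'new', 'active', or None for an Oracle error message."""
    match = _ORA_CODE.search(message)
    if match is None:
        return None
    code = int(match.group(1))
    if code in NEW_REQUEST_ERRORS:
        return "new"
    if code in ACTIVE_REQUEST_ERRORS:
        return "active"
    return None


def run_with_reconnect(work: Callable[[], object],
                       reconnect: Callable[[], None],
                       max_attempts: int = 3) -> object:
    """Run work(); after a failover error, reconnect and resubmit."""
    for attempt in range(1, max_attempts + 1):
        try:
            return work()
        except Exception as exc:
            is_failover = failover_error_kind(str(exc)) is not None
            if not is_failover or attempt == max_attempts:
                raise
            reconnect()
```

Errors that are not in either set are raised unchanged, so ordinary application errors are not retried by mistake.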

Failing over the application when the primary node fails is not straightforward because of TCP/IP time-out, which is a significant problem for high availability. It occurs when a node fails without closing its sockets: requests can still be made to an IP address that is no longer available, and they wait until they time out. For active requests, the delays to the client are the values for TCP_IP_ABORT_CINTERVAL and TCP_IP_ABORT_INTERVAL. For sessions that are waiting for read/write completion, the delay is the value for TCP_KEEPALIVE_INTERVAL. The values for these TCP/IP parameters should be tuned at each site.

Note:

These parameters are specific to your operating system. See your operating system-specific documentation for more information.
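
The parameters above are set at the operating system level. A client program that manages its own sockets can also bound how long it waits on an unresponsive peer. The following Python sketch is an assumption about client-side practice, not a feature of Oracle Real Application Clusters Guard. It uses only the standard `socket` module; the function name `make_client_socket` and the five-second default are hypothetical.

```python
import socket


def make_client_socket(timeout_seconds: float = 5.0) -> socket.socket:
    """Create a TCP socket that probes idle connections and bounds waits."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the operating system to send keepalive probes on this connection.
    # The probe timing itself is still governed by operating system settings.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Bound the time that connect, send, and receive calls may block.
    sock.settimeout(timeout_seconds)
    return sock


sock = make_client_socket()
print("timeout in seconds:", sock.gettimeout())
sock.close()
```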

TCP/IP time-outs are addressed in Oracle Real Application Clusters Guard by using relocatable IP addresses and the call-home feature. Because Oracle Real Application Clusters Guard moves the IP addresses, active requests for an address do not wait to time out. Requests for connection are refused immediately and are routed transparently to the new primary instance. Requests that issue SQL statements receive a broken pipe error (ORA-3113), allowing the application to restart. The application should detect this error and take appropriate action.

See Also:

"Setting Up the Call-Home Feature"

Enhancing Application Failover with Role Change Notification

The role change notification in Oracle Real Application Clusters Guard can enhance application failover. The feature allows you to implement actions such as running or halting applications when the notification of a role change (UP, PLANNED_UP, PLANNED_DOWN, DOWN, CLEANUP) is received. For example, when the instance starts, the notification can be used to start the applications. When the instance terminates, the notification can be used to halt the applications. It is also possible to halt the application when a role starts. This allows secondary applications to halt when the primary role fails over, for example.

Automatic role change notification behaves as follows:

An UP notification occurs
- After the instance (primary or secondary) starts
- After an instance role changes from secondary to primary
A DOWN notification occurs before the instance (primary or secondary) is shut down
A CLEANUP notification occurs after the instance (primary or secondary) is shut down

Manual role notification occurs only when PFSCTL commands are executed, for example, during planned outages. Manual role notification behaves as follows:

A PLANNED_UP notification occurs before the instance (primary or secondary) starts
A PLANNED_DOWN notification occurs before the instance (primary or secondary) is shut down
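
The five notification types lend themselves to a simple dispatch table. The following Python sketch shows one possible mapping of notifications to application actions. The handler names, the `on_role_change` function, and the way the notification and role are passed in are all hypothetical; see "Setting Up Role Change Notification" for the actual call-out interface.

```python
from typing import Callable, Dict, List

actions: List[str] = []  # records what the handlers did, for illustration


def start_applications(role: str) -> None:
    actions.append(f"start applications for {role}")


def halt_applications(role: str) -> None:
    actions.append(f"halt applications for {role}")


def clean_up(role: str) -> None:
    actions.append(f"clean up after {role}")


def no_action(role: str) -> None:
    actions.append(f"no action for {role}")


# PLANNED_UP arrives before the instance starts, so this sketch waits
# for the UP notification before it starts the applications.
HANDLERS: Dict[str, Callable[[str], None]] = {
    "UP": start_applications,
    "PLANNED_UP": no_action,
    "PLANNED_DOWN": halt_applications,
    "DOWN": halt_applications,
    "CLEANUP": clean_up,
}


def on_role_change(notification: str, role: str) -> None:
    """Run the handler registered for a role change notification."""
    try:
        handler = HANDLERS[notification]
    except KeyError:
        raise ValueError(f"unknown notification: {notification}") from None
    handler(role)


on_role_change("PLANNED_DOWN", "primary")
on_role_change("UP", "primary")
print(actions)
```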

See Also:
"Setting Up Role Change Notification"

Changing the Configuration

Most configuration changes can be made to an Oracle Real Application Clusters Guard environment by switching over to the secondary instance, applying the change, and, optionally, switching back. The following types of configuration changes are described in this section:

Changing the Oracle Real Application Clusters Guard Configuration Parameters

There are several ways to change Oracle Real Application Clusters Guard configuration parameters, depending on what kind of parameter needs to be changed. For example, changing $ORACLE_HOME requires the packs to be re-created, while changing the port numbers requires that the packs, the database, and the listener be halted.

See Also:

Chapter 3, "Oracle Real Application Clusters Guard Configuration Parameters" for information about changing configuration parameters

Changing the Configuration of Both Instances of Oracle9i Real Application Clusters

To change initialization parameters for both instances, perform the following steps:

Note:

This applies only to initialization parameters that are not included in the mandatory parameters listed in the $ORACLE_SERVICE_config.pfs, $ORACLE_SERVICE_config.Host.ded.pfs, and init_$ORACLE_SID_Host.ora files. Changing the INSTANCE_NAMES parameter, for example, requires the catpfs.sql script to be rerun.

Modify the desired parameters for both instances.
Stop the secondary instance.
```
PFSCTL> stop_secondary
```
Restart the secondary instance.
```
PFSCTL> restore
```
Move the primary role to the secondary instance.
```
PFSCTL> move_primary
```
Restore the secondary instance on the former primary node.
```
PFSCTL> restore
```
Reverse the roles to their original locations, if desired. (Use the switchover command.)
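
The order of the steps above matters because an instance picks up the modified parameters only when it is restarted. The following Python sketch models the procedure to show that each instance is restarted exactly once while one instance always holds the primary role. The `Cluster` class is a hypothetical model written for this illustration; it does not call PFSCTL, and it treats `switchover` as a plain exchange of roles.

```python
class Cluster:
    """Toy model of a two-node primary/secondary configuration."""

    def __init__(self):
        # Role held by each node: "primary", "secondary", or None when idle.
        self.role = {"A": "primary", "B": "secondary"}
        # True once the instance on a node has restarted with new parameters.
        self.new_params = {"A": False, "B": False}

    def _node_with(self, role):
        for node, held in self.role.items():
            if held == role:
                return node
        raise RuntimeError(f"no node holds the {role} role")

    def stop_secondary(self):
        self.role[self._node_with("secondary")] = None

    def restore(self):
        # The idle node restarts its instance in the secondary role.
        node = self._node_with(None)
        self.role[node] = "secondary"
        self.new_params[node] = True

    def move_primary(self):
        old_primary = self._node_with("primary")
        new_primary = self._node_with("secondary")
        self.role[new_primary] = "primary"
        self.role[old_primary] = None

    def switchover(self):
        primary = self._node_with("primary")
        secondary = self._node_with("secondary")
        self.role[primary], self.role[secondary] = "secondary", "primary"


cluster = Cluster()
cluster.stop_secondary()   # Node B is idle
cluster.restore()          # Node B restarts as secondary with new parameters
cluster.move_primary()     # Node B becomes primary, Node A is idle
cluster.restore()          # Node A restarts as secondary with new parameters
print(cluster.role, cluster.new_params)
cluster.switchover()       # optional: return the roles to their original nodes
print(cluster.role)
```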

See Also:
Chapter 3, "Oracle Real Application Clusters Guard Configuration Parameters" for information about changing configuration parameters

Making Online Changes to the Configuration

Oracle supports many online configuration changes.

Make the online configuration change at the primary instance. For example, enter the following SQL statement:
```
SQL> ALTER SYSTEM SET fast_start_mttr_target = 120;
```

Make the same configuration change to the Oracle configuration files to ensure that the change is preserved at the next failover or restart.

See Also:
Oracle9i Database Reference to find out which initialization parameters can be changed online

Changing the PFS_KEEP_PRIMARY Parameter

The PFS_KEEP_PRIMARY parameter specifies whether the primary pack is left up and running if the secondary pack fails to start after the pfsboot command is entered.

Figure 6-3 shows the effect of entering the pfsboot command during normal operation.

Figure 6-3 Using the pfsboot Command During Normal Operation

Before the command is entered, no packs are running. When the pfsboot command is entered, Oracle Real Application Clusters Guard first starts Pack A on Node A, which becomes the primary node. Then Oracle Real Application Clusters Guard starts Pack B on Node B, which becomes the secondary node.

Figure 6-4 shows what happens when PFS_KEEP_PRIMARY is set to $PFS_TRUE and the second pack does not start.

Figure 6-4 Using the pfsboot Command When PFS_KEEP_PRIMARY=$PFS_TRUE and the Secondary Pack Does Not Start

When the pfsboot command is entered, Oracle Real Application Clusters Guard starts Pack A on Node A, which becomes the primary node. However, the attempt to start Pack B on Node B fails. Because PFS_KEEP_PRIMARY is set to $PFS_TRUE, Pack A remains up, and the system runs without resilience while you diagnose the cause of the failure on Node B.

Figure 6-5 shows what happens when PFS_KEEP_PRIMARY is set to $PFS_FALSE and the second pack does not start.

Figure 6-5 Using the pfsboot Command When PFS_KEEP_PRIMARY=$PFS_FALSE and the Secondary Pack Does Not Start

When the pfsboot command is entered, Oracle Real Application Clusters Guard starts Pack A on Node A, which becomes the primary node. If Oracle Real Application Clusters Guard fails to start Pack B on Node B and PFS_KEEP_PRIMARY is set to $PFS_FALSE, then Oracle Real Application Clusters Guard shuts down Pack A on Node A. No packs are running.
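
The three cases in Figure 6-3, Figure 6-4, and Figure 6-5 reduce to a small decision rule. The following Python sketch restates that rule; the function `pfsboot_outcome` and its Boolean arguments are hypothetical stand-ins for the PFS_KEEP_PRIMARY setting and for whether the secondary pack starts.

```python
from typing import List


def pfsboot_outcome(keep_primary: bool, secondary_starts: bool) -> List[str]:
    """Return the packs left running after pfsboot, as described above."""
    running = ["Pack A"]        # Pack A starts first, on Node A
    if secondary_starts:
        running.append("Pack B")
    elif not keep_primary:
        running.clear()         # PFS_KEEP_PRIMARY=$PFS_FALSE: Pack A is shut down
    return running


for keep in (True, False):
    for starts in (True, False):
        print(f"keep_primary={keep}, secondary_starts={starts}:",
              pfsboot_outcome(keep, starts))
```

The parameter has an effect only when the secondary pack does not start; when both packs start, the outcome is the same for either setting.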

See Also:

"Changing Oracle Real Application Clusters Guard Configuration Parameters" for more information about changing the value of the PFS_KEEP_PRIMARY parameter

Making Online Changes to the ORAPING_CONFIG Table

The heartbeat monitor uses a database table, ORAPING_CONFIG, to record the configuration information. The use of a table ensures that both instances of the cluster always use the same value. This table is refreshed on an interval defined by the CONFIG_INTERVAL parameter.

Table 6-1 shows the parameters in the ORAPING_CONFIG table.

Table 6-1 Parameters in the ORAPING_CONFIG Table

Parameter Name	Default Value	Description
`INTERNAL_TIMEOUT`	`30`	Time in seconds to execute internal `ORACLE_PING` statements
`USER_TIMEOUT`	`60`	Time in seconds to execute customer query
`MAX_TRIES`	`3`	Number of times to try to execute the heartbeat monitor cycle before declaring failure
`SPECIAL_WAIT`	`300`	Time in seconds to wait for special events to complete
`RECOVERY_RAMPUP_TIME`	`300`	Time in seconds to wait for ramp-up after failover
`CYCLE_TIME`	`120`	Time in seconds to execute heartbeat monitor and sleep cycle
`CONNECT_TIMEOUT`	`30`	Time in seconds to establish heartbeat monitor connection
`CONFIG_INTERVAL`	`600`	Time in seconds to wait before reading the `ORAPING_CONFIG` table
`TRACE_FLAG`	`0`	Flag to enable (`1`) or disable (`0`) SQL trace
`TRACE_ITERATIONS`	`1`	Number of heartbeat monitor cycles to trace if trace is enabled
`LOGON_STORM_THRESHOLD`	`50`	If the number of sessions logging on to the database exceeds the value of `LOGON_STORM_THRESHOLD` during the heartbeat monitor cycle, then Oracle Real Application Clusters Guard ignores the `CONNECT_TIMEOUT` parameter.

If performance issues arise during initial testing of the system, then you can run Oracle Real Application Clusters Guard with the values in the ORAPING_CONFIG table raised to a level that allows problems to persist long enough for detailed analysis. You can lower the values again when the system is stable.

Another reason to change the values in the ORAPING_CONFIG table is to customize them for different workloads. False failovers can occur when workloads are so large that timeouts occur simply because the system is busy.

To change the values in the ORAPING_CONFIG table, perform steps similar to the following:

Connect as the $ORACLE_USER and view the default values in the ORAPING_CONFIG table. Enter the following commands:

$ sqlplus /
SQL> SELECT * FROM oraping_config;

You should see the following output:

INTERNAL_TIMEOUT USER_TIMEOUT MAX_RETRIES SPECIAL_WAIT
RECOVERY_RAMPUP_TIME
---------------- ------------ ----------- ------------
--------------------
CYCLE_TIME CONNECT_TIMEOUT CONFIG_INTERVAL TRACE_FLAG
TRACE_ITERATIONS
---------- --------------- --------------- ----------
----------------
LOGON_STORM_THRESHOLD
---------------------
              30           60           3          300
300
       120              30             600          0                1
                   50

Update the ORAPING_CONFIG table. Enter commands similar to the following:

SQL> UPDATE oraping_config SET
  2  cycle_time = 300,
  3  connect_timeout = 120,
  4  user_timeout = 120,
  5  special_wait = 600,
  6  logon_storm_threshold = 100;
1 row updated.
SQL> COMMIT;

View the results of the update. Enter the following command:

SQL> SELECT * FROM oraping_config;

You should see output similar to the following:

INTERNAL_TIMEOUT USER_TIMEOUT MAX_RETRIES SPECIAL_WAIT
RECOVERY_RAMPUP_TIME
---------------- ------------ ----------- ------------
--------------------
CYCLE_TIME CONNECT_TIMEOUT CONFIG_INTERVAL TRACE_FLAG
TRACE_ITERATIONS
---------- --------------- --------------- ----------
----------------
LOGON_STORM_THRESHOLD
---------------------
              30          120           3          600
300
       300             120             600          0                1
                  100
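
Before you apply an update of this kind, it can help to see exactly which values change. The following Python sketch compares the default values from Table 6-1 with the values set by the UPDATE statement above. It works on plain dictionaries and does not connect to the database; the function names `apply_update` and `changed` are hypothetical.

```python
# Default values from Table 6-1. The column appears as MAX_RETRIES in the
# query output above and as MAX_TRIES in Table 6-1.
DEFAULTS = {
    "internal_timeout": 30,
    "user_timeout": 60,
    "max_retries": 3,
    "special_wait": 300,
    "recovery_rampup_time": 300,
    "cycle_time": 120,
    "connect_timeout": 30,
    "config_interval": 600,
    "trace_flag": 0,
    "trace_iterations": 1,
    "logon_storm_threshold": 50,
}

# Values set by the example UPDATE statement.
UPDATE = {
    "cycle_time": 300,
    "connect_timeout": 120,
    "user_timeout": 120,
    "special_wait": 600,
    "logon_storm_threshold": 100,
}


def apply_update(current: dict, changes: dict) -> dict:
    """Return a copy of current with changes applied; reject unknown names."""
    unknown = sorted(set(changes) - set(current))
    if unknown:
        raise KeyError(f"unknown parameters: {unknown}")
    return {**current, **changes}


def changed(before: dict, after: dict) -> dict:
    """Map each changed parameter to its (old, new) pair of values."""
    return {name: (before[name], after[name])
            for name in before if before[name] != after[name]}


for name, (old, new) in changed(DEFAULTS, apply_update(DEFAULTS, UPDATE)).items():
    print(f"{name}: {old} -> {new}")
```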

Managing the Oracle Real Application Clusters Guard Log Files

Note:

Do not delete the Oracle Real Application Clusters Guard log files. They are essential for tracking faults.

Oracle Real Application Clusters Guard writes log files and debug files to the following locations:

OFA configuration: $ORACLE_BASE/admin/$DB_NAME/pfs/pfsdump
Non-OFA configuration: $ORACLE_HOME/pfs/$DB_NAME/log

To find the Oracle Real Application Clusters Guard logs, change to the pfsdump directory. Enter a command similar to the following:

$ cd /mnt1/oracle/admin/sales/pfs/pfsdump

List the contents of the directory. You should see output similar to the following:

pfs_sales_host1.debug     pfs_sales_host1_ping.log
pfs_sales_host1.log

Allow sufficient space for the log files. If the log files become too large, then copy them manually to a backup location. The next time Oracle Real Application Clusters Guard writes to a file that has been archived, it automatically opens a new copy of that file.
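
The copy to a backup location can be scripted. The following Python sketch copies every file in a log directory that exceeds a size limit, and it never deletes anything. The function name `copy_large_logs`, the size limit, and the backup path in the usage comment are assumptions chosen for illustration.

```python
import shutil
from pathlib import Path
from typing import List


def copy_large_logs(log_dir: str, backup_dir: str, max_bytes: int) -> List[str]:
    """Copy files larger than max_bytes from log_dir to backup_dir.

    Returns the names of the files that were copied. The originals are
    left in place, because the log files must not be deleted.
    """
    source = Path(log_dir)
    target = Path(backup_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source.iterdir()):
        if path.is_file() and path.stat().st_size > max_bytes:
            # copy2 also preserves the file modification time.
            shutil.copy2(path, target / path.name)
            copied.append(path.name)
    return copied


# Example call (paths are illustrative):
# copy_large_logs("/mnt1/oracle/admin/sales/pfs/pfsdump",
#                 "/backup/pfsdump", max_bytes=100 * 1024 * 1024)
```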

Recovering from a Failover While Datafiles Are in Backup Mode

When datafiles are in backup mode at the time of a failover, instance recovery sees them as past versions of the files. At the next startup, Oracle therefore issues a message stating that media recovery is required, even though it is not. Solve the problem by taking the following actions:

Stop the packs.
Mount the database.
Take each affected datafile out of backup mode.
Restart the packs.

Note:
RMAN does not encounter this problem. If you use RMAN, this procedure is not necessary.

The steps are shown in more detail as follows:

Halt the packs. Enter the following command:
```
PFSCTL> pfshalt
```
Mount one of the instances. Enter commands similar to the following:
```
$ sqlplus "system/manager as sysdba"
SQL> startup mount;
```

Identify the datafiles that are in backup mode. Enter commands similar to the following:

SELECT file#, recover, fuzzy, tablespace_name, name
FROM v$datafile_header
WHERE fuzzy = 'YES' ;

You should see output similar to the following:

FILE#      REC  FUZ  TABLESPACE  NAME
-----      ---  ---  ----------  ---------------------------------
6          NO   YES  USERS       /dev/vx/rdsk/home-dg/oracle_usr01
7          NO   YES  USERS       /dev/vx/rdsk/home-dg/oracle_usr02

Take the datafiles out of backup mode. Enter SQL statements similar to the following:

SQL> ALTER DATABASE DATAFILE '/dev/vx/rdsk/home-dg/oracle_usr01' END BACKUP;

You should see output similar to the following:

Database altered.

Continue taking affected datafiles out of backup mode.

SQL> ALTER DATABASE DATAFILE '/dev/vx/rdsk/home-dg/oracle_usr02' END BACKUP;

Database altered.

Note:

If you repeat the ALTER DATABASE...END BACKUP statement, then Oracle issues errors. They are not destructive, and you can ignore them.

SQL> ALTER DATABASE DATAFILE '/dev/vx/rdsk/home-dg/oracle_usr01' END BACKUP;

Output similar to the following may occur:

alter database datafile '/dev/vx/rdsk/home-dg/oracle_usr01' end backup
*
ERROR at line 1:
ORA-01235: END BACKUP failed for 1 file(s) and succeeded for 0
ORA-01199: file 6 is not in online backup mode
ORA-01110: data file 6: '/dev/vx/rdsk/home-dg/oracle_usr01'

Dismount the database and shut down the Oracle instance.
```
SQL> shutdown immediate
```

Start the packs.

PFSCTL> pfsboot

Note:

You should also take datafiles out of backup mode before a switchover. You can do it manually, or you can implement it as a call-out from the PLANNED_DOWN state in role change notification.
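
When many datafiles are affected, the ALTER DATABASE...END BACKUP statements can be generated from the names returned by the query on V$DATAFILE_HEADER. The following Python sketch only builds the statement text and does not connect to the database. The function name `end_backup_statements` is hypothetical, and the input list is assumed to hold the NAME column values of the rows where FUZZY is YES.

```python
from typing import Iterable, List


def end_backup_statements(datafiles: Iterable[str]) -> List[str]:
    """Build one ALTER DATABASE ... END BACKUP statement per datafile."""
    statements = []
    for name in datafiles:
        # Double any single quote so that the name stays a valid SQL literal.
        quoted = name.replace("'", "''")
        statements.append(f"ALTER DATABASE DATAFILE '{quoted}' END BACKUP;")
    return statements


fuzzy_files = [
    "/dev/vx/rdsk/home-dg/oracle_usr01",
    "/dev/vx/rdsk/home-dg/oracle_usr02",
]
for statement in end_backup_statements(fuzzy_files):
    print(statement)
```

Review the generated statements before you run them in SQL*Plus against the mounted database.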
