Oracle9i Recovery Manager User's Guide Release 2 (9.2) Part Number A96566-01 |
|
This chapter describes how to troubleshoot Recovery Manager. This chapter contains these topics:
Recovery Manager provides detailed error messages that can aid in troubleshooting problems. Also, the Oracle database server and third-party media vendors generate useful debugging output of their own. This section contains these topics:
Output that is useful for troubleshooting failed or hung RMAN jobs is located in several different places, as explained in the following table.
Type of Output | Produced By | Location | Description |
---|---|---|---|
RMAN messages |
RMAN |
Direct this output to: |
Contains actions relevant to the RMAN job as well as error messages generated by RMAN, the server, and the media vendor. RMAN error messages have an |
|
Oracle database server |
The directory named in the |
Contains a chronological log of errors, initialization parameter settings, and administration operations. Records values for overwritten control file records (refer to "Monitoring the Overwriting of Control File Records"). |
Oracle trace file |
Oracle database server |
The directory specified in the |
Contains detailed output generated by Oracle server processes. This file is created when an |
|
Third-party media management software |
The directory specified in the |
Contains vendor-specific information written by the media management software. Note that this log does not contain Oracle server or RMAN errors. |
Media manager log file |
Third-party media management software |
The filenames for any media manager logs other than |
Contains information on the functioning of the media management device. |
On various occasions it may be important for you to determine whether RMAN successfully executed a command. For example, if you are trying to write a script that performs an unattended backup using RMAN, you may want to know whether the backup was a success or failure.
One way to determine whether RMAN encountered an error is to examine its return code, as described in "Identifying RMAN Return Codes". A second way is to search the Recovery Manager output for the string RMAN-00569
, which is the message number for the error stack banner. All RMAN errors are preceded by this error message. If you do not see an RMAN-00569
message in the output, then there are no errors. Following is sample output for a syntax error:
RMAN-00571: =========================================================== RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS =============== RMAN-00571: =========================================================== RMAN-00558: error encountered while parsing input commands RMAN-01005: syntax error: found ")": expecting one of: "archivelog, backup, backupset, controlfilecopy, current, database, datafile, datafilecopy, (, plus, ;, tablespace" RMAN-01007: at line 1 column 18 file: standard input
Typically, you find the following types of error codes in RMAN message stacks:
RMAN-
ORA-
Additional information:
See Also:
Oracle9i Database Error Messages for explanations of |
Table 15-1 indicates the error ranges for common RMAN error messages, all of which are described in Oracle9i Database Error Messages.
When errors occur through the media management API, RMAN returns an error message number prefixed as follows:
Additional information:
Table 15-2 lists media manager message numbers and their corresponding error text. In the error codes, O/S
stands for operating system. The errors prefixed with an asterisk are internal and should never be seen during normal operation.
Sometimes you may find it difficult to identify the useful messages in the RMAN error stack. Note the following tips and suggestions:
Additional information
followed by an integer. This line indicates a media management error. The integer that follows refers to a code that is explained in the text of the error message.RMAN-03002
or RMAN-03009
message (RMAN-03009
is the same as RMAN-03002
but includes the channel ID), which immediately follows the banner. These messages indicate which command failed. Syntax errors generate RMAN-00558
.You attempt a backup of tablespace users
and receive the following message:
Starting backup at 29-AUG-01 using channel ORA_DISK_1 RMAN-00571: =========================================================== RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS =============== RMAN-00571: =========================================================== RMAN-03002: failure of backup command at 08/29/2001 15:14:03 RMAN-20202: tablespace not found in the recovery catalog RMAN-06019: could not translate tablespace name "USESR"
The RMAN-03002
error indicates that the BACKUP
command failed. You read the last two messages in the stack first and immediately see the problem: no tablespace usesr
appears in the recovery catalog because you mistyped the name.
Assume that you attempt to recover a tablespace and receive the following errors:
RMAN> RECOVER TABLESPACE users; Starting recover at 29-AUG-01 using channel ORA_DISK_1 starting media recovery media recovery failed RMAN-00571: =========================================================== RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS =============== RMAN-00571: =========================================================== RMAN-03002: failure of recover command at 08/29/2001 15:18:43 RMAN-11003: failure during parse/execution of SQL statement: alter database recover if needed tablespace USERS ORA-00283: recovery session canceled due to errors ORA-01124: cannot recover data file 8 - file is in use or recovery ORA-01110: data file 8: '/oracle/oradata/trgt/users01.dbf'
As suggested, you start reading from the bottom up. The ORA-01110
message explains there was a problem with the recovery of datafile users01.dbf
. The second error indicates that Oracle cannot recover the datafile because it is in use or already being recovered. The remaining RMAN errors indicate that the recovery session was cancelled due to the server errors. Hence, you conclude that because you were not already recovering this datafile, the problem must be that the datafile is online and you need to take it offline and restore a backup.
Media management errors in RMAN message output are not uncommon. Assume that you use a tape drive and receive the following output during a backup job:
RMAN-00571: =========================================================== RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS =============== RMAN-00571: =========================================================== RMAN-03009: failure of backup command on c1 channel at 09/04/2001 13:18:19 ORA-19506: failed to create sequential file, name="07d36ecp_1_1", parms="" ORA-27007: failed to open file SVR4 Error: 2: No such file or directory Additional information: 7005 Additional information: 1 ORA-19511: Error received from media manager layer, error text: SBT error = 7005, errno = 2, sbtopen: system error
Following the suggestions for reading error message stacks, you look for the Additional
information
line and notice:
Additional information: 7005
You discover that error 7005
means that the media management device is busy. So, the media management software is not able to write to the device because it is in use or there is a problem with it.
One way to determine whether RMAN encountered an error is to examine its return code. RMAN returns 0 to the operating system if no errors occurred, a nonzero value otherwise. For example, if you are running UNIX with the C shell, then, when RMAN completes, the return code will be in a shell variable called $status
.
On specific platforms, Oracle provides a diagnostic tool called sbttest
. This utility performs a simple test of the media management software by acting as the Oracle database server and attempting to communicate with the media manager.
On UNIX, the sbttest
utility is located in $ORACLE_HOME/bin
. If for some reason the utility is not included with your platform, then contact Oracle Support to obtain the C version of the program. You can compile this version of the program on all UNIX platforms.
Note that on platforms such as Solaris, you do not have to relink when using sbttest
. On other platforms, relinking may be necessary.
For online documentation of sbttest
, issue the following on the command line:
% sbttest
The program displays the list of possible arguments for the program:
Error: backup file name must be specified Usage: sbttest backup_file_name # this is the only required parameter <-dbname database_name> <-trace trace_file_name> <-remove_before> <-no_remove_after> <-read_only> <-no_regular_backup_restore> <-no_proxy_backup> <-no_proxy_restore> <-file_type n> <-copy_number n> <-media_pool n> <-os_res_size n> <-pl_res_size n> <-block_size block_size> <-block_count block_count> <-proxy_file os_file_name bk_file_name [os_res_size pl_res_size block_size block_count]>
The display also indicates the meaning of each argument. For example, following is the description for two optional parameters:
Optional parameters: -dbname specifies the database name which will be used by SBT to identify the backup file. The default is "sbtdb" -trace specifies the name of a file where the Media Management software will write diagnostic messages.
Use sbttest
to perform a quick test of the media manager. The following table explains how to interpret the output.
To use sbttest:
sbttest
at the command line:
% sbttest
If the program is operational, then you should see a display of the online documentation.
some_file.f
and write the output to sbtio.log
:
% sbttest some_file.f -trace sbtio.log
You can also test a backup of an existing datafile. For example, this command tests datafile tbs_33.f
of database prod
:
% sbttest tbs_33.f -dbname prod
libobk.so could not be loaded. Check that it is installed properly, and that LD_ LIBRARY_PATH environment variable (or its equivalent on your platform) includes the directory where this file can be found. Here is some additional information on the cause of this error: ld.so.1: sbttest: fatal: libobk.so: open failed: No such file or directory
Note that in some cases sbttest
can work but an RMAN backup does not. The reasons can be the following:
sbttest
is not the owner of the Oracle processes.sbttest
can still work.sbttest
program passes all environment parameters from the shell but RMAN does not.You have the following methods for terminating an RMAN command while it is executing:
CTRL+C
(or the equivalent "attention" key combination for your system) in the RMAN interface, which is the preferred method. This operation also terminates allocated channels unless they are hung in the media management code, for example, when they are waiting for a tape to be mounted.ALTER
SYSTEM
statement.You can identify the Oracle session ID for an RMAN channel by looking in the RMAN log for messages with the format shown in the following example:
channel ch1: sid=15 devtype=SBT_TAPE
The sid
and devtype
are displayed for each allocated channel. Note that the Oracle sid
is different from the operating system process ID. You can kill the session by specifying the sid
in a SQL statement, but the commands are not the same as the operating system process kill
commands.
You can specify the sid
in the SQL statement ALTER
SYSTEM
KILL
SESSION
command. It takes two arguments (the sid
printed in the RMAN message and a serial number), both of which can be obtained by querying V$SESSION
. For example, run the following statement, where sid_in_rman_output
is the number from the RMAN message:
SELECT SERIAL# FROM V$SESSION WHERE SID=sid_in_rman_output
;
Then, run the following statement, substituting the sid_in_rman_output
and serial number obtained from the query:
ALTER SYSTEM KILL SESSION 'sid_in_rman_output
,serial#
';
Note that this is no more effective than killing at the operating system level if the process is hung in the media manager.
Finding and killing the processes that are associated with the server sessions is operating system specific. On some platforms the server sessions are not associated with any processes at all. Refer to your operating system specific documentation for more information.
You may sometimes need to kill an RMAN job that is hung in the media manager. The best way to terminate RMAN when the channel connections are hung in the media manager is to kill the session in the media manager. If this action does not solve the problem, then the try killing the Oracle processes of the connections. Note that killing the Oracle process can cause problems for the media manager.
This section contains these topics:
The nature of an RMAN session depends on the operating system. In UNIX, an RMAN session has the following processes associated with it:
DUPLICATE
or performing TSPITR, none otherwise)ALLOCATE
CHANNEL
or CONFIGURE
CHANNEL
commands. One polling connection exists for each distinct connect string used in the ALLOCATE
CHANNEL
or CONFIGURE
CHANNEL
command.RMAN usually hangs because one of the channel connections is waiting in the media manager code for a tape resource. The catalog connection and the default channel seem to hang because they are waiting for RMAN to tell them what to do. Polling connections seem to be in an infinite loop while polling the RPC under the control of the RMAN process.
If you kill the RMAN process itself, then you also kill the catalog connection, the auxiliary connection, the default channel, and the polling connections. Target and auxiliary connections that are not hung in the media manager code are also terminated: only the target and auxiliary connections executing in the media management layer remains active. You must manually kill this process because terminating its session does not kill it. Even after termination, the media manager may keep resources busy or continue processing because it does not realize that the Oracle process is gone. This behavior depends on which media manager you use.
Terminating the catalog connection does not cause RMAN to finish because RMAN is not performing catalog operations while the backup or restore is in progress. Removing default channel and polling connections causes the RMAN process to detect that one of the channels has died and then proceed to exit. In this case, the connections to the hung channels remain active as described previously.
The best way to terminate RMAN when the connections for the allocated channels are hung in the media manager is to kill the Oracle process of the connections. The RMAN process detects this termination and proceed to exit, removing all connections except target connections that are still operative in the media management layer. The caveat about the media manager resources still applies in this case.
To terminate an Oracle process that is hung in the media manager:
V$SESSION
and V$SESSION_WAIT
as described in "Monitoring RMAN Interaction with the Media Manager". For example, execute the following query:
COLUMN EVENT FORMAT a10 COLUMN SECONDS_IN_WAIT FORMAT 999 COLUMN STATE FORMAT a20 COLUMN CLIENT_INFO FORMAT a30 SELECT p.SPID, EVENT, SECONDS_IN_WAIT AS SEC_WAIT, STATE, CLIENT_INFO FROM V$SESSION_WAIT sw, V$SESSION s, V$PROCESS p WHERE sw.EVENT LIKE 'sbt%' AND s.SID=sw.SID AND s.PADDR=p.ADDR /
Examine the SQL output to determine which sbt functions are waiting. For example, the output may be as follows:
SPID EVENT SEC_WAIT STATE CLIENT_INFO ---- ---------- ---------- -------------------- ------------------------------ 8642 sbtwrite2 600 WAITING rman channel=ORA_SBT_TAPE_1 8374 sbtwrite2 600 WAITING rman channel=ORA_SBT_TAPE_2
kill
-9
command:
% kill -9 8642 8374
In general, you can use LIST
, REPORT
, and SHOW
to obtain most RMAN information: backups, copies, database incarnations, configuration settings, and so forth. Nevertheless, you may at times want to use V$
views for information that these commands do not display.
Sometimes it is useful to identify what a server session performing a backup or copy operation is doing. You have access to several views that can assist in monitoring the progress of or obtaining information about RMAN jobs, as described in the following table.
V$BACKUP_SYNC_IO
contains rows when the I/O is synchronous to the process (or thread on some platforms) performing the backup. V$BACKUP_ASYNC_IO
contains rows when the I/O is asynchronous. Asynchronous I/O is obtained either with I/O processes or because it is supported by the underlying operating system.
You can use RMAN to perform the checks discussed in the following sections:
To identify which server sessions correspond to which RMAN channels, you can query V$SESSION
and V$PROCESS
. The SPID
column of V$PROCESS
identifies the operating system ID number for the process or thread. For example, on UNIX the SPID
column shows the process ID, whereas on Windows NT the SPID
column shows the thread ID. You have two basic methods for obtaining this information, depending on whether you have multiple RMAN sessions active concurrently.
When only one RMAN session is active, the easiest method for determining the server session ID for an RMAN channel is to execute the following query on the target database while the RMAN job is executing:
COLUMN CLIENT_INFO FORMAT a30 COLUMN SID FORMAT 999 COLUMN SPID FORMAT 9999 SELECT s.SID, p.SPID, s.CLIENT_INFO FROM V$PROCESS p, V$SESSION s WHERE p.ADDR = s.PADDR AND CLIENT_INFO LIKE 'rman%' /
If you do not run the SET
COMMAND
ID
command in the RMAN job, then the CLIENT_INFO
column displays in the following format:
rman channel=channel_id
For example, the following shows sample output:
SID SPID CLIENT_INFO ---- ------------ ------------------------------ 14 8374 rman channel=ORA_SBT_TAPE_1
If more than one RMAN session is active, it is possible for the V$SESSION.CLIENT_INFO
column to yield the same information for a channel in each session. For example:
SID SPID CLIENT_INFO ---- ------------ ------------------------------ 14 8374 rman channel=ORA_SBT_TAPE_1 9 8642 rman channel=ORA_SBT_TAPE_1
In this case, you have the following methods for determining which channel corresponds to which SID
value.
In this method, you must first obtain the sid
values from the RMAN output and then use these values in your SQL query.
To correlate a process with a channel during a backup:
sid
for the channel. For example, the output may show:
Starting backup at 21-AUG-01
allocated channel: ORA_SBT_TAPE_1
channel ORA_SBT_TAPE_1: sid=14 devtype=SBT_TAPE
V$SESSION
and V$PROCESS
views while the RMAN job is executing. For example, enter:
COLUMN CLIENT_INFO FORMAT a30 COLUMN SID FORMAT 999 COLUMN SPID FORMAT 9999 SELECT s.SID, p.SPID, s.CLIENT_INFO FROM V$PROCESS p, V$SESSION s WHERE p.ADDR = s.PADDR AND CLIENT_INFO LIKE 'rman%' /
Use the sid
value obtained from the first step to determine which channel corresponds to which server session:
SID SPID CLIENT_INFO ---------- ------------ ------------------------------ 14 2036 rman channel=ORA_SBT_TAPE_1 12 2066 rman channel=ORA_SBT_TAPE_1
In this method, you specify a command ID string in the RMAN backup script. You can then query V$SESSION.CLIENT_INFO
for this string.
To correlate a process with a channel during a backup:
COMMAND
ID
to a different value after allocating the channels and then back up the desired object. For example, enter the following in session 1:
RUN { ALLOCATE CHANNEL c1 TYPE sbt; SET COMMAND ID TO 'sess1'; BACKUP DATABASE;
}
Set the command ID to a string such as sess2
in the job running in session 2:
RUN { ALLOCATE CHANNEL c1 TYPE sbt; SET COMMAND ID TO 'sess2'; BACKUP DATABASE;
}
V$SESSION
and V$PROCESS
views while the RMAN job is executing. For example, enter:
SELECT SID, SPID, CLIENT_INFO FROM V$PROCESS p, V$SESSION s WHERE p.ADDR = s.PADDR AND CLIENT_INFO LIKE '%id=sess%';
If you run the SET
COMMAND
ID
command in the RMAN job, then the CLIENT_INFO
column displays in the following format:
id=command_id,rman channel=channel_id
For example, the following shows sample output:
SID SPID CLIENT_INFO ---- ------------ ------------------------------ 11 8358 id=sess1 15 8638 id=sess2 14 8374 id=sess1,rman channel=c1 9 8642 id=sess2,rman channel=c1
The rows that contain the string rman channel
show the channel performing the backup. The remaining rows are for the connections to the target database.
See Also:
Oracle9i Recovery Manager Reference for |
Monitor the progress of backups, copies, and restores by querying the view V$SESSION_LONGOPS
. RMAN uses two types of rows in V$SESSION_LONGOPS
: detail and aggregate rows. Detail rows describe the files being processed by one job step, while aggregate rows describe the files processed by all job steps in an RMAN command. A job step is the creation or restore of one backup set or datafile copy. Detail rows are updated with every buffer that is read or written during the backup step, so their granularity of update is small. Aggregate rows are updated when each job step completes, so their granularity of update is large.
Table 15-3 describes column in V$SESSION_LONGOPS
that are most relevant for RMAN. Typically, you will view the detail rows rather than the aggregate rows to determine the progress of each backup set.
Each server session performing a backup, restore, or copy reports its progress compared to the total amount of work required for a job step. For example, if you perform a database restore that uses two channels, and each channel has two backup sets to restore (a total of four sets), then each server session reports its progress through a single backup set. When that set is completely restored, RMAN begins reporting progress on the next set to restore.
To monitor job progress:
RESTORE DATABASE;
SELECT SID, SERIAL#, CONTEXT, SOFAR, TOTALWORK, ROUND(SOFAR/TOTALWORK*100,2) "%_COMPLETE" FROM V$SESSION_LONGOPS WHERE OPNAME LIKE 'RMAN%' AND OPNAME NOT LIKE '%aggregate%' AND TOTALWORK != 0 AND SOFAR <> TOTALWORK /
If you repeat the query while the restore progresses, then you see output such as the following:
SQL> @longops SID SERIAL# CONTEXT SOFAR TOTALWORK %_COMPLETE ---------- ---------- ---------- ---------- ---------- ---------- 8 19 1 10377 36617 28.34 SQL> @longops SID SERIAL# CONTEXT SOFAR TOTALWORK % COMPLETE ---------- ---------- ---------- ---------- ---------- ---------- 8 19 1 21513 36617 58.75 SQL> @longops SID SERIAL# CONTEXT SOFAR TOTALWORK % COMPLETE ---------- ---------- ---------- ---------- ---------- ---------- 8 19 1 29641 36617 80.95 SQL> @longops SID SERIAL# CONTEXT SOFAR TOTALWORK % COMPLETE ---------- ---------- ---------- ---------- ---------- ---------- 8 19 1 35849 36617 97.9 SQL> @longops no rows selected
%
_COMPLETE
column does not increase, then RMAN is encountering a problem. Refer to "Monitoring RMAN Interaction with the Media Manager" to obtain more information.You can also run the following Perl script from the UNIX command line. It creates a temporary SQL query script and then runs the script in an infinite while
loop. When RMAN completes the job, cancel the Perl script:
#!/usr/local/bin/perl5 open (TMP, ">/tmp/test.sql") || die "can't open /tmp/test.sql\n"; print TMP 'SET FEEDBACK OFF SPOOL /tmp/sql.out SELECT SID, SERIAL#, CONTEXT, SOFAR, TOTALWORK, ROUND(SOFAR/TOTALWORK*100,2) "% COMPLETE" FROM V$SESSION_LONGOPS WHERE OPNAME LIKE \'RMAN%\' AND OPNAME NOT LIKE \'%aggregate%\' AND TOTALWORK != 0 AND SOFAR <> TOTALWORK; SPOOL OFF EXIT; EOF'; close TMP; while (1) { # replace with your sqlplus connection information `sqlplus "/ AS SYSDBA" @/tmp/test`; open(SQLOUT, "</tmp/sql.out") || die "can't open /tmp/sql.out\n"; @whole = <SQLOUT>; foreach (@whole) {print if defined;} close(SQLOUT); }
You can use the event names in the dynamic performance event views to monitor RMAN calls to the media management API. The event names have one-to-one correspondence with sbt functions, as shown in the following examples:
sbtinit sbtopen sbtread sbtwrite sbtbackup
Before making a call to any of functions in the media management API, the server adds a row in V$SESSION_WAIT
, with the STATUS
column including the string WAIT
. The V$SESSION_WAIT.SECONDS_IN_WAIT
column shows the number of seconds that the server has been waiting for this call to return. After an sbt
function is returned from the media manager, this row disappears.
A row in V$SESSION_WAIT
corresponding to an sbt
event name does not indicate a problem, because the server updates these rows at runtime. The rows appear and disappear as calls are made and returned. However, if the SECONDS_IN_WAIT
column is high, then the media manager may be hung.
To monitor the sbt
events, you can run the following SQL query:
COLUMN EVENT FORMAT a10 COLUMN SECONDS_IN_WAIT FORMAT 999 COLUMN STATE FORMAT a20 COLUMN CLIENT_INFO FORMAT a30 SELECT p.SPID, EVENT, SECONDS_IN_WAIT AS SEC_WAIT, STATE, CLIENT_INFO FROM V$SESSION_WAIT sw, V$SESSION s, V$PROCESS p WHERE sw.EVENT LIKE 'sbt%' AND s.SID=sw.SID AND s.PADDR=p.ADDR /
Examine the SQL output to determine which sbt
functions are waiting. For example, the following output indicates that RMAN has been waiting for the sbtbackup
function to return for ten minutes:
SPID EVENT SEC_WAIT STATE CLIENT_INFO ---- ---------- ---------- -------------------- ------------------------------ 8642 sbtbackup 600 WAITING rman channel=ORA_SBT_TAPE_1
See Also:
Oracle9i Database Reference for descriptions of |
Monitor backup and restore performance by querying V$BACKUP_SYNC_IO
and V$BACKUP_ASYNC_IO
.
See Also:
Oracle9i Database Reference for more information on these |
You can often use the dynamic performance view V$RECOVER_FILE
to determine which files need to be recovered and why they need to be recovered. The following query shows the file numbers of datafiles that require recovery, as well as the reason for recovery (if known) and the SCN and time when recovery needs to begin:
COL FILE# FORMAT 999 COL ERROR FORMAT a10 SELECT * FROM V$RECOVER_FILE / FILE# ONLINE ONLINE_ ERROR CHANGE# TIME ----- ------- ------- ---------- ---------- -------------------- 4 ONLINE ONLINE FILE NOT 0 FOUND 5 ONLINE ONLINE FILE NOT 0 FOUND 8 OFFLINE OFFLINE OFFLINE 0 NORMAL
Query V$DATAFILE
and V$TABLESPACE
to obtain filenames and tablespace names for datafiles requiring recovery. For example, enter:
SELECT d.NAME, t.NAME FROM V$DATAFILE d, V$TABLESPACE t WHERE t.TS# = d.TS# AND d.FILE# IN (14,15,21); # use values obtained from V$RECOVER_FILE query NAME NAME --------------------------------- --------------- /oracle/oradata/trgt/drsys01.dbf DRSYS /oracle/oradata/trgt/example01.dbf EXAMPLE /oracle/oradata/trgt/users01.dbf USERS
You can combine these queries in the following SQL*Plus script (sample output follows):
COL df# FORMAT 999 COL df_name FORMAT a35 COL tbsp_name FORMAT a10 COL status FORMAT a7 COL error FORMAT a10 SELECT r.FILE# AS df#, d.NAME AS df_name, t.NAME AS tbsp_name, d.STATUS, r.ERROR, r.CHANGE#, r.TIME FROM V$RECOVER_FILE r, V$DATAFILE d, V$TABLESPACE t WHERE t.TS# = d.TS# AND d.FILE# = r.FILE# / DF# DF_NAME TBSP_NAME STATUS ERROR CHANGE# TIME ---- ----------------------------------- ---------- ------- ---------- ---------- -------- 4 /oracle/oradata/trgt/drsys01.dbf DRSYS ONLINE FILE NOT 0 FOUND 5 /oracle/oradata/trgt/example01.dbf EXAMPLE ONLINE FILE NOT 0 FOUND 8 //oracle/oradata/trgt/users01.dbf USERS OFFLINE OFFLINE 0 NORMAL
This section contains these topics:
In this scenario, you install and test the media manager as explained in "Configuring RMAN to Make Backups to a Media Manager", but you still cannot make RMAN back up to tape. For example, after allocating the sbt
channel, you receive an error stack similar to the following:
RMAN-00571: =========================================================== RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS =============== RMAN-00571: =========================================================== RMAN-03009: failure of allocate command on c1 channel at 08/29/2001 17:16:54 ORA-19554: error allocating device, device type: SBT_TAPE, device name: ORA-27211: Failed to load Media Management Library Additional information: 25
The ORA-27211
error indicates that the channel allocation is failing because Oracle is not loading the media management library. If the channel allocation fails, then Oracle generates a trace file in the USER_DUMP_DEST
location that contains the error that caused the channel allocation to fail. The trace file should have the complete path name of the media management library loaded by Oracle as well as any other media manager errors or operating system errors. For example, the trace file on UNIX may be called something like /oracle/rdbms/log/prod1_ora_16226.trc
, and may contain information such as the following:
*** 2001-08-29 17:16:54.385 SKGFQ OSD: Error in function sbtinit on line 2396 SKGFQ OSD: Look for SBT Trace messages in file /oracle/rdbms/log/sbtio.log SBT Initialize failed for oracle.static
The last line of this output indicates that Oracle is loading the default static library instead of the media management library that you installed.
To test the loading of the media management library, try allocating a channel by using the PARMS
parameter SBT_LIBRARY
to force the loading of the media management library. For example, if your library is called /vendor/lib/some_mm_lib.so
and is pointed to by $ORACLE_HOME/lib/libobk.so
, then run a command such as the following, making sure to specify whatever PARMS
settings are required by your media manager:
RUN { ALLOCATE CHANNEL c1 DEVICE TYPE sbt PARMS='SBT_LIBRARY=?/lib/libobk.so', 'ENV=(NSR_SERVER=tape_svr,NSR_CLIENT=oracleclnt,NSR_GROUP=oracle_tapes)'; }
If the channel allocation fails, then check the trace file again to see whether you can learn anything new. If the channel allocation with SBT_LIBRARY
succeeds, but an ordinary sbt
channel allocation fails, then Oracle is probably trying to load a library ($ORACLE_HOME/lib/libobk.so
on UNIX, %ORACLE_HOME%/bin/orasbt.dll
on NT) other than the one you installed. You may have more than one library in the operating system path, and the one that Oracle is loading is the wrong one.
If the problem is that Oracle is not loading the correct library, then make sure that the library is named correctly. For example, on UNIX you should name it $ORACLE_HOME/rdbms/libobk.so
or create a symbolic link with this name that points to your library, and on Windows NT you should name it %ORACLE_HOME%/bin/orasbt.dll
. It is possible to place these files in nondefault directories, but they must be included in the system path so that Oracle can locate them.
See Also: Oracle9i Recovery Manager Reference for descriptions of the legal |
In this scenario, an RMAN backup job starts as normal and then pauses inexplicably:
Recovery Manager: Release 9.2.0.0.0 - Production connected to target database: TRGT connected to recovery catalog database RMAN> BACKUP TABLESPACE SYSTEM, tools; allocated channel: t1 channel t1: sid=16 devtype=SBT_TAPE channel t1: starting datafile backupset set_count=15 set_stamp=338309600 channel t1: including datafile 2 in backupset channel t1: including datafile 1 in backupset channel t1: including current controlfile in backupset # Hanging here for 30 minutes now
If a backup job is hanging, that is, not proceeding, then several scenarios are possible:
As described in "Monitoring RMAN Interaction with the Media Manager", you can query sbt
wait events to gain more information. For example, run the following query on the target instance:
COLUMN EVENT FORMAT a10 COLUMN SECONDS_IN_WAIT FORMAT 999 COLUMN STATE FORMAT a20 COLUMN CLIENT_INFO FORMAT a30 SELECT p.SPID, EVENT, SECONDS_IN_WAIT AS SEC_WAIT, STATE, CLIENT_INFO FROM V$SESSION_WAIT sw, V$SESSION s, V$PROCESS p WHERE sw.EVENT LIKE 'sbt%' AND s.SID=sw.SID AND s.PADDR=p.ADDR /
Examine the SQL output to determine which sbt functions are waiting. For example, the output may be as follows:
SPID EVENT SEC_WAIT STATE CLIENT_INFO ---- ---------- ---------- -------------------- ------------------------------ 8642 sbtbackup 300 WAITING rman channel=ORA_SBT_TAPE_1
Because the causes of a hung backup job can be varied, so are the solutions. For example, backup jobs often hang simply because the tape device has completely filled the current cassette and is waiting for a new tape to be inserted. Ideally, the query of the sbt
wait events should indicate the problem.
If the sbt
wait event query is unhelpful, then examine media manager process, log, and trace files for signs of abnormal termination or other errors (refer to the description of message files in "Identifying Types of Message Output").
See Also: "Terminating an RMAN Session: Basic Steps" to learn how to kill an RMAN session that is hanging |
In this scenario, you run a backup job and receive message output similar to the following:
channel c8: including datafile number 47 in backupset RPC call appears to have failed to start on channel c9 RPC call ok on channel c9 channel c3: including datafile number 18 in backupset
The RPC
call
appears
to
have
failed
message does not usually indicate a problem. The message indicates one of the following:
Timing problems occur in this way. When RMAN begins an RPC, it checks the V$SESSION
performance view. The RPC updates the information in the view to indicate when it starts and finishes. Sometimes RMAN checks V$SESSION
before the RPC has indicated it has started, which in turn generates the following message:
RPC call appears to have failed
If a message stating "RPC
call
ok
" does not appear in the output immediately following the message stating "RPC
call
appears
to
have
failed
", then the backup job encountered a problem.
In this scenario, you attempt a backup and receive the following error messages:
RMAN-3014: Implicit resync of recovery catalog failed RMAN-6038: Recovery catalog package detected an error RMAN-20035: Invalid high RECID error
In one common scenario, you restore a backup control file created through a non-Oracle mechanism, and then open the database without the RESETLOGS
option. If you had created the backup control file through the RMAN BACKUP
command or the SQL ALTER
DATABASE
BACKUP
CONTROLFILE
statement, then Oracle would have required you to reset the online logs.
The control file and the recovery catalog are now not synchronized. The database control file is older than the recovery catalog, because at one time the recovery catalog resynchronized with the old current control file, and now the database is using a backup control file. RMAN detects that the control file currently in use is older than the control file previously used to resynchronize.
Another common scenario occurs when you attempt to copy the target database to a new machine as follows:
CATALOG
to add this control file copy to the repository.RECID
in the recovery catalog moves from 90 to 100.RECID
is 100, but the control file indicates that the highest RECID
is 90. The control file RECID
should always be greater than or equal to the recovery catalog RECID
, so RMAN issues RMAN-20035
.This solution is safest and is strongly recommended:
To reset the database with RMAN:
% sqlplus '/ AS SYSDBA'
ALTER DATABASE MOUNT;
USING
BACKUP
CONTROLFILE
clause stamps the controlfile as a backup, which then permits OPEN
RESETLOGS
. For example, enter:
ALTER DATABASE RECOVER DATABASE UNTIL CANCEL USING BACKUP CONTROLFILE; ALTER DATABASE RECOVER CANCEL;
RESETLOGS
option. For example, enter:
ALTER DATABASE OPEN RESETLOGS;
% rman TARGET SYS/oracle@trgt CATALOG rman/cat@catdb
RESET DATABASE;
BACKUP DATABASE PLUS ARCHIVELOG;
This solution is more difficult than solution 1:
To create the control file with SQL*Plus:
% sqlplus 'SYS/oracle@trgt AS SYSDBA'
SQL> ALTER DATABASE MOUNT;
SQL> ALTER DATABASE BACKUP CONTROLFILE TO TRACE;
# The following commands will create a new control file and use it # to open the database. # Data used by the recovery manager will be lost. Additional logs may # be required for media recovery of offline data files. Use this # only if the current version of all online logs are available. STARTUP NOMOUNT CREATE CONTROLFILE REUSE DATABASE "TRGT" NORESETLOGS ARCHIVELOG -- STANDBY DATABASE CLUSTER CONSISTENT AND UNPROTECTED MAXLOGFILES 32 MAXLOGMEMBERS 2 MAXDATAFILES 32 MAXINSTANCES 1 MAXLOGHISTORY 226 LOGFILE GROUP 1 '/oracle/oradata/trgt/redo01.log' SIZE 25M, GROUP 2 '/oracle/oradata/trgt/redo02.log' SIZE 25M, GROUP 3 '/oracle/oradata/trgt/redo03.log' SIZE 500K -- STANDBY LOGFILE DATAFILE '/oracle/oradata/trgt/system01.dbf', '/oracle/oradata/trgt/undotbs01.dbf', '/oracle/oradata/trgt/cwmlite01.dbf', '/oracle/oradata/trgt/drsys01.dbf', '/oracle/oradata/trgt/example01.dbf', '/oracle/oradata/trgt/indx01.dbf', '/oracle/oradata/trgt/tools01.dbf', '/oracle/oradata/trgt/users01.dbf' CHARACTER SET WE8DEC ; # Take files offline to match current control file. ALTER DATABASE DATAFILE '/oracle/oradata/trgt/tools01.dbf' OFFLINE; ALTER DATABASE DATAFILE '/oracle/oradata/trgt/users01.dbf' OFFLINE; # Configure RMAN configuration record 1 VARIABLE RECNO NUMBER; EXECUTE :RECNO := SYS.DBMS_BACKUP_RESTORE.SETCONFIG('CHANNEL','DEVICE TYPE DISK DEBUG 255'); # Recovery is required if any of the datafiles are restored backups, # or if the last shutdown was not normal or immediate. RECOVER DATABASE # All logs need archiving and a log switch is needed. ALTER SYSTEM ARCHIVE LOG ALL; # Database can now be opened normally. ALTER DATABASE OPEN; # Commands to add tempfiles to temporary tablespaces. # Online tempfiles have complete space information. # Other tempfiles may require adjustment. ALTER TABLESPACE TEMP ADD TEMPFILE '/oracle/oradata/trgt/temp01.dbf' REUSE; # End of tempfile additions.
SHUTDOWN IMMEDIATE
STARTUP NOMOUNT CREATE CONTROLFILE ...; EXECUTE ...; RECOVER DATABASE ALTER SYSTEM ARCHIVE LOG CURRENT; ALTER DATABASE OPEN ...;
In this scenario, a backup job fails because RMAN cannot make a snapshot control file. The message stack is as follows:
RMAN-00571: =========================================================== RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS =============== RMAN-00571: =========================================================== RMAN-03002: failure of backup command at 08/30/2001 22:48:44 ORA-00230: operation disallowed: snapshot controlfile enqueue unavailable
When RMAN needs to back up or resynchronize from the control file, it first creates a snapshot or consistent image of the control file. If one RMAN job is already backing up the control file while another needs to create a new snapshot control file, then you may see the following message:
waiting for snapshot controlfile enqueue
Under normal circumstances, a job that must wait for the control file enqueue waits for a brief interval and then successfully obtains the enqueue. RMAN makes up to five attempts to get the enqueue and then fails the job. The conflict is usually caused when two jobs are both backing up the control file, and the job that first starts backing up the control file waits for service from the media manager.
To determine which job is holding the conflicting enqueue:
RMAN-08512:
waiting
for
snapshot
controlfile
enqueue
", start a new SQL*Plus session on the target database:
% sqlplus 'SYS/oracle@trgt AS SYSDBA'
SELECT s.SID, USERNAME AS "User", PROGRAM, MODULE, ACTION, LOGON_TIME "Logon", l.* FROM V$SESSION s, V$ENQUEUE_LOCK l WHERE l.SID = s.SID AND l.TYPE = 'CF' AND l.ID1 = 0 AND l.ID2 = 2;
You should see output similar to the following (the output in this example has been truncated):
SID User Program Module Action Logon --- ---- -------------------- ------------------------- ---------------- --------- 9 SYS rman@h13 (TNS V1-V3) backup full datafile: c1 0000210 STARTED 21-JUN-01
After you have determined which job is creating the enqueue, you can do one of the following:
Commonly, enqueue situations occur when a job is writing to a tape drive, but the tape drive is waiting for a new cassette to be inserted. If you start a new job in this situation, then you will probably receive the enqueue message because the first job cannot complete until the new tape is loaded.
In this scenario, the database archives automatically to two directories: ?/oradata/trgt/arch
and ?/oradata/trgt/arch2
. You tell RMAN to perform a backup and delete the input archived redo logs afterward in the following script:
BACKUP ARCHIVELOG ALL DELETE INPUT;
You then run a crosscheck to make sure the logs are gone and find the following:
CROSSCHECK ARCHIVELOG ALL; validation succeeded for archived log archivelog filename=/oracle/oradata/trgt/arch2/archive1_964.arc recid=19 stamp=368726072
RMAN deleted one set of logs but not the other.
This problem is not an error. When you specify DELETE
INPUT
without the ALL
keyword, RMAN deletes only one copy of each input log. Even if you archive to five destinations, RMAN deletes logs from only one directory.
To force RMAN to delete all existing archived redo logs, use the DELETE
ALL
INPUT
clause of the BACKUP
command. For example, enter:
BACKUP ARCHIVELOG ALL DELETE ALL INPUT;
In this scenario, you schedule regular backups of the archived redo logs. The next time you make a backup, you receive this error:
RMAN-6089: archive log NAME not found or out of sync with catalog
This problem occurs when the archived log that RMAN is looking for cannot be accessed by RMAN, or the recovery catalog needs to be resynchronized. Often, this error occurs when you delete archived logs with an operating system command, which means that RMAN is unaware of the deletion. The RMAN-6089
error occurs because RMAN attempts to back up a log that the repository indicates still exists.
Make sure that the archived logs exists in the specified directory and that the RMAN catalog is synchronized. Check the following:
RMAN-6089
error exists in the correct directory.owner
=
oracle
, group
=
DBA
) to make sure that RMAN can access the file.RESYNC CATALOG;
If you know that the logs are unavailable because you deleted them by using an operating system utility, then run the following command at the RMAN prompt to update RMAN metadata:
CROSSCHECK ARCHIVELOG ALL;
It is always better to use RMAN to delete logs than to use an operating system utility. The easiest method to remove unwanted logs is to specify the DELETE
INPUT
option when backing up archived logs. For example, enter:
BACKUP DEVICE TYPE sbt ARCHIVELOG ALL DELETE ALL INPUT;
In this scenario, you are trying to connect to a release 8.0.4 target database. You receive the following error messages when you try to connect to the target database:
RMAN-00571: =========================================================== RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ==== RMAN-00571: =========================================================== PLS-00201: identifier 'DBMS_BACKUP_RESTORE.SET_CHARSET' must be declared ORA-06550: line 1, column 7: PL/SQL: Statement ignored RMAN-04015: error setting target database character set to WE8ISO8859P1
Typically, this error message means that the DBMS_BACKUP_RESTORE
package was not created during the installation of the database. Here are possible causes:
If you did not install the PL/SQL option, then install it. If you did install the PL/SQL option, then create the required packages by connecting to SQL*Plus with SYSDBA
privileges and running the following scripts:
SQL> @?/rdbms/admin/dbmsbkrs.sql SQL> @?/rdbms/admin/prvtbkrs.plb
In this scenario, you are connected to the target database while it is not open and attempting to perform an RMAN operation. You receive the following error:
PLS-00553: character set name is not recognized
Typically, this message means that the character set in the client environment, that is, the environment in which you are running the RMAN executable, is different from the character set in the target database environment.
NLS_CHARACTERSET
parameter. For example, run this query:
SQL> SELECT VALUE FROM V$NLS_PARAMETERS WHERE PARAMETER='NLS_CHARACTERSET';
NLS_LANG
environment variable on a UNIX system as follows:
% setenv NLS_LANG american_america.we8dec % setenv NLS_DATE_FORMAT "MON DD YYYY HH24:MI:SS"
RMAN fails with the following errors when trying to connect to the target database:
% rman Recovery Manager: Release 9.2.0.0.0 - Production RMAN> CONNECT TARGET sys/change_on_install@inst1 RMAN-00571: =========================================================== RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS =============== RMAN-00571: =========================================================== ORA-01017: invalid username/password; logon denied
RMAN automatically requests a connection to the target database as SYSDBA
. In order to connect to the target as SYSDBA
, you must do one of the following:
DBA
group with respect to the target database (that is, have the ability to connect with SYSDBA
privileges to the target database without a password).orapwd
command and the initialization parameter REMOTE_LOGIN_PASSWORDFILE
.If the target database does not have a password file, then the user you are logged in as must be validated with operating system authentication.
Either create a password file for the target database or add yourself to the administrator list in the operating system.
See Also:
Oracle9i Database Administrator's Guide to learn how to create a password file |
To learn how to create a password file, see
In this scenario, you attempt to duplicate a database with the DUPLICATE
command, but receive the following error stack:
RMAN-00571: =========================================================== RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS =============== RMAN-00571: =========================================================== RMAN-03002: failure of Duplicate Db command at 09/04/2001 12:11:29 RMAN-03015: error occurred in stored script Memory Script RMAN-06053: unable to perform media recovery because of missing log RMAN-06025: no backup of log thread 1 seq 16 scn 145858 found to restore
The problem is that RMAN is not able to apply all the archived logs needed for complete recovery. For example, if you only backed up logs through sequence 15
, but the most recent archived log is sequence 16
, then DUPLICATE
fails.
When creating the duplication script, use the SET
UNTIL
command to specify a log sequence number for incomplete recovery. For example, to terminate recovery after applying log sequence 15
, enter:
RUN { SET UNTIL SEQUENCE 16 THREAD 1; # recovers up to but not including log 16 DUPLICATE TARGET DATABASE TO 'dupdb';
}
See Also:
"Creating a Non-Current Duplicate Database: Example" for more information about performing incomplete recovery during the duplication operation |
In this scenario, you back up the database, then run the DUPLICATE
command. You receive the following error stack:
RMAN-00571: =========================================================== RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS =============== RMAN-00571: =========================================================== RMAN-03002: failure of Duplicate Db command at 09/04/2001 13:55:11 RMAN-03015: error occurred in stored script Memory Script RMAN-06026: some targets not found - aborting restore RMAN-06023: no backup or copy of datafile 8 found to restore RMAN-06023: no backup or copy of datafile 7 found to restore RMAN-06023: no backup or copy of datafile 6 found to restore RMAN-06023: no backup or copy of datafile 5 found to restore RMAN-06023: no backup or copy of datafile 4 found to restore RMAN-06023: no backup or copy of datafile 3 found to restore RMAN-06023: no backup or copy of datafile 2 found to restore RMAN-06023: no backup or copy of datafile 1 found to restore
The DUPLICATE
command recovers to archived redo logs, but cannot recover into online redo logs. Thus, if the restored backup cannot be made consistent without applying the online redo logs, then duplication fails with RMAN-06023 errors because RMAN is looking for backups created before the most recent archived log.
Issue the following statement after making a backup:
SQL 'ALTER SYSTEM ARCHIVE LOG CURRENT';
This statement archives all records in the online redo logs so that RMAN can now recover the backup by applying the most recent archived redo log.
In this scenario, you list the database incarnations registered in the recovery catalog and see a database with the name UNKNOWN
:
LIST INCARNATION OF DATABASE; RMAN-03022: compiling command: list List of Database Incarnations DB Key Inc Key DB Name DB ID CUR Reset SCN Reset Time ------- ------- ------- ------ --- ---------- ---------- 56 57 TRGT 4052472287 YES 1 Sep 03 2001 06:45:51 1 19 UNKNOWN 4141147584 NO 1 Jan 08 2001 14:47:28 1 2 TRGT 4141147584 YES 14602 Jan 15 2001 15:32:57
One way you get the DB_NAME
of UNKNOWN
is when you register a database that was once opened with the RESETLOGS
option. The DB_NAME
can be changed during a RESETLOGS
operation, so RMAN does not know what the DB_NAME
was for those old incarnations of the database because it was not registered in the recovery catalog at the time. Consequently, RMAN sets the DB_NAME
column to UNKNOWN
when creating the DBINC
record.
The UNKNOWN
name entry is expected behavior after a RESETLOGS
operation. You should not attempt to remove UNKNOWN
entries from the recovery catalog.
|
Copyright © 1996, 2002 Oracle Corporation. All Rights Reserved. |
|