Coskan’s Approach to Oracle

November 18, 2011

On-Demand AWR Report Generator for a Many-Node Cluster

Filed under: Basics, Performance, RAC — coskan @ 11:34 am

Update (10/10/2012): I have changed the script to make it work better with hourly requests, as it was causing issues. Thanks to Dominic Brooks for the trick.
Update (11/10/2012): Added ADDM and global report generation.

Since I got a turbo boost (over 1000 hits a day, wow) from Tanel Poder the Great, I had better write something to deserve the fame. This one is a real quick one which made my life easier.

In my second week at my current company, I had to cover Exadata for 2 weeks, which were a complete nightmare due to some problems. I had to work with Oracle Support far too much, and most of the time they asked for AWR reports from 8 nodes for an x-hour duration. On the first day I was not armed with anything automatic, so I tried to do it manually, but for 8 nodes it was a proper nightmare. Then I said this is a complete rubbish way of wasting my time and I need to automate the process, otherwise my life will be miserable. Finally, after a couple of attempts, I came up with something.

The script below is the result of these automation attempts.

Update (11/10/2012): the not-equal signs in (next_snap_id <> 0 AND previous_snap_id <> 0) may get corrupted during copy/paste, so check them after pasting.

-- Name:    gen_snap.sql
--
-- Purpose: Automating ADDM and AWR and Global RAC report generation 

--Usage:  copy the script to your script directory. Edit the parameters section, then run it;
---it will generate the gen_awr.sql script. Call that script to generate the AWR reports.
---
--- ALWAYS CHECK gen_awr.sql script before you run !!!!!!!!! 
---
-- @gen_snap.sql  
-- @gen_awr 

-- Versioning:
-- Version    Modified by	Date		What Modified
-- -------    ---------------	----------	--------------
-- 1.0	      Coskan Gundogar	11/10/2012	Initial 


------------PARAMETERS - BEGIN -------------------------------

define _interval=2				   --1 = half-hour reports, 2 = one-hour reports (with 30-min snapshots) / 1 = one-hour, 2 = two-hour reports (with 60-min snapshots)
define _database_name='DATABASE_NAME'		 ---- database name 
define _begin_interval_date='111012 07:00:00'	 ----begin date DDMMYY HH24:MI:SS
define _end_interval_date='111012 10:00:00'      ----end date   DDMMYY HH24:MI:SS
define _folder='C:\awr\DATABASE_NAME\111012'  ---report output location
define _option=2 ---1 without ADDM 2 with ADDM 
define _global=1 ---0 if you don't want global reports for RAC - it will not generate a global report for a single instance with any setting 

------------PARAMETERS - END-------------------------------


SET termout OFF 
host mkdir &_folder
SET termout ON
--host dir &_folder
prompt 
prompt 
prompt GEN_AWR.SQL script generation started
prompt 
SET termout OFF 
SET heading off
SET feedback OFF 
SET echo OFF 
SET linesize 155
SET verify OFF 

spool gen_awr.SQL
SELECT 'set veri off;'||CHR(10)||
'set feedback off;' ||CHR(10)||
'set linesize 1500;' ||CHR(10)||
'prompt '||CHR(10)||
'prompt '||CHR(10)||
'prompt REPORT GENERATION STARTED'||CHR(10)||
'prompt '||CHR(10)||
'SET TIME OFF'||CHR(10)||
'set termout off;'||CHR(10)||
'set long 1000000 pagesize 0 longchunksize 1000;'||CHR(10)||
'COLUMN get_clob format a80'||CHR(10)
FROM dual;
with driver as (
select *
from (
select rownum rn,dbid,snap_id,previous_snap_id,next_snap_id,begin_interval_time  from (
SELECT * FROM (
select instance_number,dbid,snap_id,
LEAD(snap_id, &_interval, 0) OVER (PARTITION BY instance_number ORDER BY snap_id DESC NULLS LAST) previous_snap_id,
LAG(snap_id, &_interval, 0) OVER (PARTITION BY instance_number ORDER BY snap_id DESC NULLS LAST) next_snap_id,
begin_interval_time 
from dba_hist_snapshot 
where dbid=(select dbid from v$database where name='&_database_name')
AND instance_number= (select min(inst_id) from gV$database)
AND begin_interval_time>=TO_DATE('&_begin_interval_date','DDMMYY HH24:MI:SS')-1/24
AND begin_interval_time<TO_DATE('&_end_interval_date','DDMMYY HH24:MI:SS')+1/24
) d WHERE next_snap_id<>0 AND previous_snap_id<>0 ---not-equal signs; these can get corrupted during paste
ORDER BY 3
)  s ) where (
CASE WHEN &_INTERVAL = 2 THEN mod(rn,2)  
WHEN &_INTERVAL=1 THEN 1
END)=1
)
SELECT  
CASE &_option
WHEN 1 THEN 
'spool &_folder\awrrptto_&_database_name'||CHR(95)||TO_CHAR(instance_number)||'_'||TO_CHAR(snap_id)||'_'||TO_CHAR(next_snap_id)||'.html'||CHR(10)|| CHR(10)|| 
'SELECT * FROM TABLE(dbms_workload_repository.awr_report_html('||TO_CHAR(dbid)||','||TO_CHAR(instance_number)||','||TO_CHAR(snap_id)||','||TO_CHAR(next_snap_id)||',8));'||CHR(10)
||CHR(10)||'spool off'
WHEN 2 THEN 
'spool &_folder\awrrptto_&_database_name'||CHR(95)||TO_CHAR(instance_number)||'_'||TO_CHAR(snap_id)||'_'||TO_CHAR(next_snap_id)||'.html'||CHR(10)|| CHR(10)|| 
'SELECT * FROM TABLE(dbms_workload_repository.awr_report_html('||TO_CHAR(dbid)||','||TO_CHAR(instance_number)||','||TO_CHAR(snap_id)||','||TO_CHAR(next_snap_id)||',8));'||CHR(10)
||CHR(10)||'spool off'||CHR(10)||
'BEGIN'||CHR(10)||
'DBMS_ADVISOR.create_task (advisor_name      => ''ADDM'',task_name=> ''ADDM_'||TO_CHAR(snap_id)||'_'||TO_CHAR(next_snap_id)||'_AWR_SNAPSHOT'');'||CHR(10)||
'DBMS_ADVISOR.set_task_parameter (task_name => ''ADDM_'||TO_CHAR(snap_id)||'_'||TO_CHAR(next_snap_id)||'_AWR_SNAPSHOT'',parameter => ''START_SNAPSHOT'',value=> '||TO_CHAR(snap_id)||');'||CHR(10)||
'DBMS_ADVISOR.set_task_parameter (task_name => ''ADDM_'||TO_CHAR(snap_id)||'_'||TO_CHAR(next_snap_id)||'_AWR_SNAPSHOT'',parameter => ''END_SNAPSHOT'', value=> '||TO_CHAR(next_snap_id)||');'||CHR(10)||
'DBMS_ADVISOR.set_task_parameter (task_name => ''ADDM_'||TO_CHAR(snap_id)||'_'||TO_CHAR(next_snap_id)||'_AWR_SNAPSHOT'',parameter => ''INSTANCE'', value=> '||TO_CHAR(instance_number)||');'||CHR(10)||
'DBMS_ADVISOR.set_task_parameter (task_name => ''ADDM_'||TO_CHAR(snap_id)||'_'||TO_CHAR(next_snap_id)||'_AWR_SNAPSHOT'',parameter => ''DB_ID'', value=> '||TO_CHAR(dbid)||');'||CHR(10)||
'DBMS_ADVISOR.execute_task(task_name => ''ADDM_'||TO_CHAR(snap_id)||'_'||TO_CHAR(next_snap_id)||'_AWR_SNAPSHOT'');'||CHR(10)||
'END;'||CHR(10)||
'/'||CHR(10)||
'spool &_folder\addm_&_database_name'||CHR(95)||TO_CHAR(instance_number)||'_'||TO_CHAR(snap_id)||'_'||TO_CHAR(next_snap_id)||'_AWR_SNAPSHOT.txt'||CHR(10)||
'SELECT DBMS_ADVISOR.get_task_report(''ADDM_'||TO_CHAR(snap_id)||'_'||TO_CHAR(next_snap_id)||'_AWR_SNAPSHOT'') AS report'||CHR(10)||
'FROM   dual;'||CHR(10)||
'spool off'||CHR(10)
END
FROM (
select ai.inst_id instance_number,d.dbid,d.snap_id,d.previous_snap_id,d.next_snap_id,d.begin_interval_time from driver d,
gv$database ai
order by 1,3);




SELECT 
CASE &_global 
WHEN 1 then
'set timing off space 1 flush on pause off numwidth 10;' ||CHR(10)||
'set echo off feedback off   newpage 1 recsep off;' ||CHR(10)||
'set trimspool on trimout on define "&" concat "." serveroutput on;'||CHR(10)||
'set underline on;'||CHR(10)||
'set heading off;'||CHR(10)||
'set pagesize 50000;'||CHR(10)||
'set echo off;'||CHR(10)||
'set feedback off;'||CHR(10)||
'set linesize 8000;'||CHR(10)||
'set veri off;'||CHR(10)
WHEN 0 THEN '' 
END 
FROM dual;
with driver as (
select *
from (
select rownum rn,dbid,snap_id,previous_snap_id,next_snap_id,begin_interval_time  from (
SELECT * FROM (
select instance_number,dbid,snap_id,
LEAD(snap_id, &_interval, 0) OVER (PARTITION BY instance_number ORDER BY snap_id DESC NULLS LAST) previous_snap_id,
LAG(snap_id, &_interval, 0) OVER (PARTITION BY instance_number ORDER BY snap_id DESC NULLS LAST) next_snap_id,
begin_interval_time 
from dba_hist_snapshot 
where dbid=(select dbid from v$database where name='&_database_name')
AND instance_number= (select min(inst_number) from gV$active_instances)
AND begin_interval_time>=TO_DATE('&_begin_interval_date','DDMMYY HH24:MI:SS')-&_interval/24
AND begin_interval_time<TO_DATE('&_end_interval_date','DDMMYY HH24:MI:SS')+&_interval/24
) d WHERE next_snap_id<>0 AND previous_snap_id<>0   --not-equal signs; these can get corrupted during paste
 ORDER BY 3
)  s ) where (
CASE WHEN &_INTERVAL = 2 THEN mod(rn,2)  
WHEN &_INTERVAL=1 THEN 1
END)=1
)
SELECT  
CASE &_global 
WHEN 1 then
'spool &_folder\global_awrrptto_&_database_name'||CHR(95)||'_'||TO_CHAR(snap_id)||'_'||TO_CHAR(next_snap_id)||'.html'||CHR(10)|| CHR(10)|| 
'SELECT * FROM TABLE(dbms_workload_repository.awr_global_report_html('||TO_CHAR(dbid)||','''','||TO_CHAR(snap_id)||','||TO_CHAR(next_snap_id)||',8));'||CHR(10)
||CHR(10)||'spool off' 
WHEN 0 THEN '' 
END 
FROM driver;



SELECT
'set termout on' ||CHR(10)||
'SET TIME ON' ||CHR(10)||
'prompt '||CHR(10)||
'prompt '||CHR(10)||
'prompt REPORT GENERATION FINISHED'||CHR(10)||
'prompt '||CHR(10)||
'SET TIME OFF'||CHR(10)
FROM dual;
SET termout ON 

spool off

prompt 
prompt 
prompt GEN_AWR.SQL script generation finished
prompt 
prompt 




Nothing rocket science, just dbms_workload_repository.awr_report_html with some junior-level analytics.
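For reference, each report the generated gen_awr.sql produces boils down to a single call like the one below. This is a minimal manual sketch run from SQL*Plus; the dbid, instance number and snapshot ids are placeholder values you would look up in v$database and dba_hist_snapshot.

-- Minimal manual equivalent of one generated report (all values are placeholders):
-- dbid 123456789, instance 1, snapshots 1234 -> 1236, options 8
set long 1000000 pagesize 0 heading off feedback off
spool awr_1_1234_1236.html
SELECT * FROM TABLE(dbms_workload_repository.awr_report_html(123456789, 1, 1234, 1236, 8));
spool off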

All you have to do is give the dates, the interval (only tested for half-hour and one-hour intervals) and the folder the reports should be saved to; the rest is done by the script.

It will create a gen_awr.sql script which you run and then go have a coffee; when you come back, your AWR reports are ready to be sent.

So the sequence is:

@gen_snap  ---assume you save it with this name
@gen_awr 
--coffee break
--zip+upload

I would like to automate it further by doing the same for RAC comparison reports or groups of instances, but I think I am too lazy, or maybe busy with some other non-automatable rubbish :( (I also had some issues with RAC comparison report formatting with the DBMS_WORKLOAD_REPOSITORY package, so I gave up on that one.)
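For reference, the compare-periods flavour of the package looks roughly like the sketch below; all dbid, instance and snapshot values are placeholders, and this is the single-instance diff report rather than the global RAC one that gave me formatting trouble.

-- Compare-periods report sketch: baseline snapshots 1234-1236 vs problem snapshots 1334-1336
-- (dbid, instance and snapshot ids are placeholders)
set long 1000000 pagesize 0 heading off feedback off
spool awr_diff_1234_1236_vs_1334_1336.html
SELECT * FROM TABLE(dbms_workload_repository.awr_diff_report_html(
       123456789, 1, 1234, 1236,
       123456789, 1, 1334, 1336));
spool off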

Hope this tiny script saves your time as much as it saves mine.

December 29, 2010

How to monitor services on 11GR2?

Filed under: Monitoring, RAC — coskan @ 8:10 pm

It is very important for the stability of our system that our services run on their preferred nodes (as long as all nodes are available), and on 11GR2 grid we sometimes see some of our services move between nodes without any good reason. It was a big shame that we only discovered this after the business noticed the situation, so we decided to add a check on service availability. At first I thought about using the DB to check the info, but Oracle does not keep the preferred node information in the DB itself :( The remaining option is to use srvctl output, which means coding is needed. Time was too limited to write it from scratch at my coding speed, so I asked the question on Oracle-L and got a very good response from Yong Huang. His script was written in Perl and ran well on 10G and 11GR1, but it did not give the expected results on our 11GR2 cluster; for this reason, I modified the script a bit to use the 11GR2 output (did I mention I'm better at modifying code than writing it :) ).

Below is the actual Perl script to check service availability.
Nothing rocket science here: it first gets the service names from crs_stat (I know it is deprecated, please do not mention it), then in a loop it checks the output of srvctl config and srvctl status for each service, and finally mails the problem services in a single mail.

We are calling the Perl script from a bash script (bottom of the page), which could also be optimized to run everything in Perl, but I am again too lazy to do it, and I find the current way easier to move between servers: only the tiny bash script needs to be modified. I configured it to run every hour and the job is done.

I hope it helps somebody. Thank you very much to Yong for sharing his work with me.

—Actual Script

#!/usr/bin/perl -w
# Modifier = Coskan Gundogar (original script is from YONG HUANG)
# Purpose = checking if services are running on the right node or not
# Maintenance needs
# CRS_HOME is hardcoded. Needs to be changed when CRS home changes
# Script needs to be called by a script which set env variables

$HOST=`hostname -a`;
$RECIPIENTEMAIL = 'yourdbagroup@yourdomain.com';
$ENV{PATH}="/usr/bin:/bin";
$CRS_HOME="/u01/crs/oracle/product/11.2.0/grid/bin/";
$ORASID=$ENV{'ORACLE_SID'};
$ORAHOME=$ENV{'ORACLE_HOME'};
$msg_run="";
$msg_none="";
$msg_all="";
$MSG_SUB = "";

# Get all service names (domain stripped) from crs_stat output - 11.2 format
$_ = `$CRS_HOME/crs_stat | awk -F. '/^NAME.*$ORASID.*\.svc\$/ {print \$3}' | sort | uniq`;

@line = split /\n/;

foreach (@line)
{
    $service = "$_"; # Our domain names are guaranteed to be this.
    $prefinst = `$ORAHOME/bin/srvctl config service -d $ORASID -s $service | grep -e Preferred | awk '{print \$3}'`;
    $prefinst =~ s/\s+$//;
    $statusline = `$ORAHOME/bin/srvctl status service -d $ORASID -s $service`;

    if ($statusline =~ /is running on instance\(s\) (.*)$/)
    {
        $runinst = $1;
        if ($prefinst ne $runinst) # service is running on the wrong instance
        {
            $msg_avail = "Service \"$service \" preferred instance list differs from service running instance list: \n Preferred : $prefinst \n Running on: $runinst\n";
            $msg_run = $msg_run . "\n" . $msg_avail;
            $MSG_SUB = 'PRD-WARNING: ' . $HOST . ' DB: ' . $ORASID . ' Service Availability Problem';
        }
    }
    else # this service is not running at all
    {
        $msg_none = $msg_none . "\n" . $statusline . "\n";
        $MSG_SUB = "PRD-CRITICAL: " . $HOST . " DB: " . $ORASID . " Service Availability Problem";
    }
    @prefinst = ();
    @runinst = ();
}
$msg_all=$msg_run . "\n" . $msg_none;

if ($msg_run ne "" or $msg_none ne "") {
system "mail -s \"$MSG_SUB\"  $RECIPIENTEMAIL <<EOF
$msg_all
EOF";
}
;

—Calling it from a shell script

export ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_2
export ORACLE_SID=DATABASE_NAME
export PATH=$PATH_BCK:$ORACLE_HOME/bin

/usr/bin/perl /home/oracle/bin/check_services.pl

Note: A problem with the actual script is that when you set the preferred and available instance lists in a non-ordered or reverse-ordered way, Oracle uses the same order you gave in the output of srvctl config. Because the script compares the srvctl config and srvctl status outputs as plain strings, it then reports that services are not running on their preferred instances. I am sure this could be sorted out, but I prefer to set my services in the right order, so I did not change anything to fix it.

September 11, 2010

“DBCA could not startup the ASM instance configured on this node” error for lower versions with 11GR2 GI

Filed under: RAC — coskan @ 9:14 am

For the last 3 days, I have been trying to build a toy environment at home for the 11GR2 upgrade project at work.

The challenge was building exactly the same environment on a virtual machine. The problem is that I did not build the environment at work, so I had no idea about the tricks which are not clear in the docs.

The environment is:

Oracle 2-node RAC with 11GR2 clusterware (July PSU) with ASM storage

Database at 10.2.0.4.2 with no local listener.

I installed the grid with some minor problems, which are not today's topic. The problems were solved and the grid was completely healthy.

I then installed the 10.2.0.4 binaries and pinned the nodes for the lower-version installation, as documented in the official installation doc, like below from the GI home:


[root@cosarac1 ~]# crsctl pin css -n cosarac1,cosarac2
[oracle@cosarac1 ~]$ olsnodes -t -n
cosarac1        1       Pinned
cosarac2        2       Pinned

Then I ran dbca from 10.2.0.4, and when I chose ASM as the storage I started to get the “DBCA could not startup the ASM instance configured on this node. To proceed with the database creation using ASM, ASM instance needs to be up and running. Do you want to recreate ASM instance on this node?” warning.

I checked whether ASM was up and running; it was, and it was also registered with my default listener. So what was the problem?

Google plus the OTN forums were again the answer.

Basically, during the installation you need to link listener.ora in the Oracle home to endpoints_listener.ora in the grid home:

cd <10G_DB_HOME>/network/admin

ln -s <11G_GRID_HOME>/network/admin/endpoints_listener.ora listener.ora

After you create the link, please do not expect that it will just work, because it won't :)

You also need the TNS_ADMIN environment variable set to make dbca work with ASM. (You don't need to set it on both nodes; you only need to set it in the session on the node where you run dbca.)

export TNS_ADMIN=<10G_DB_HOME>/network/admin

There is also a bug and a one-off patch available (8288940), but because our environment did not have the 10.2.0.4 home patched with that one-off patch (at the office the DB was installed before the GI upgrade), I did not bother to install the patch when there is a beautiful, easy-peasy solution.

What I learnt from this practice of creating the same toy environment at home is that building a VM RAC by following a guide is just 3-4 hours of work (for this particular one I followed Tim Hall's 11GR2 RAC installation guide for RAC on VirtualBox), but once you change some of the things in the tutorial you follow, you invite trouble, which is a very instructive practice. I learned a hell of a lot just with this 2-node RAC installation. Once you learn how to install a RAC, I suggest you try to learn how to break a RAC installation :)

The next challenge is building the DR of the same environment (not with RAC, due to resource problems). Maybe I can find something to write about there.

December 7, 2009

root.sh failed after ASM disk creation for 11GR2 Grid Infrastructure

Filed under: RAC — coskan @ 4:51 pm

Update 22/12/2009: After the first comment on the post, I now know that there is an easier way to deal with the problem.

How to Proceed from Failed 11gR2 Grid Infrastructure (CRS) Installation [ID 942166.1] (its last update is later than my post, which might be related :) )

Basically

Step 1: As root, run “$GRID_HOME/crs/install/rootcrs.pl -verbose -deconfig -force” on all nodes, except the last one.

Step 2: As root, run “$GRID_HOME/crs/install/rootcrs.pl -verbose -deconfig -force -lastnode” on the last node. This command will also zero out the OCR and voting disks.


For the last 3 days I was a bit busy installing Oracle RAC on Solaris 10 x64 on VMware. I am planning to write detailed documentation, but I want to write up one issue beforehand, which I managed to solve during the installation.

During the grid infrastructure installation everything went fine until I ran the root.sh script for cluster configuration. The script failed with the error stack below (I truncated the part that worked):

# /u01/app/11.2.0/grid/root.sh
....
....
....
ASM created and started successfully.

DiskGroup DATA created successfully.

Errors in file :
ORA-27091: unable to queue I/O
ORA-15081: failed to submit an I/O operation to a disk
ORA-06512: at line 4
PROT-1: Failed to initialize ocrconfig
Command return code of 255 (65280) from command: /u01/grid/11.2.0/bin/ocrconfig -upgrade grid oinstall
Failed to create Oracle Cluster Registry configuration, rc 255
CRS-2500: Cannot stop resource 'ora.crsd' as it is not running
CRS-4000: Command Stop failed, or completed with errors.
Command return code of 1 (256) from command: /u01/grid/11.2.0/bin/crsctl stop resource ora.crsd -init
Stop of resource "ora.crsd -init" failed
Failed to stop CRSD
CRS-2673: Attempting to stop 'ora.asm' on 'solarac2'
CRS-2677: Stop of 'ora.asm' on 'solarac2' succeeded
CRS-2673: Attempting to stop 'ora.ctssd' on 'solarac2'
CRS-2677: Stop of 'ora.ctssd' on 'solarac2' succeeded
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'solarac2'
CRS-2677: Stop of 'ora.cssdmonitor' on 'solarac2' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'solarac2'
CRS-2677: Stop of 'ora.cssd' on 'solarac2' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'solarac2'
CRS-2677: Stop of 'ora.gpnpd' on 'solarac2' succeeded
CRS-2679: Attempting to clean 'ora.gpnpd' on 'solarac2'
CRS-2681: Clean of 'ora.gpnpd' on 'solarac2' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'solarac2'
CRS-2677: Stop of 'ora.gipcd' on 'solarac2' succeeded
CRS-2673: Attempting to stop 'ora.mdnsd' on 'solarac2'
CRS-2677: Stop of 'ora.mdnsd' on 'solarac2' succeeded
Initial cluster configuration failed.  See /u01/grid/11.2.0/cfgtoollogs/crsconfig/rootcrs_solarac2.log for details

I tried to run root.sh again, which I shouldn't have done because it is documented not to. (I have to confess that I did not read the installation document well.)

The error stack was different this time, like below:

# /u01/app/11.2.0/grid/root.sh
Running Oracle 11g root.sh script...
.........
Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root.sh script.
Now product-specific root actions will be performed.
2009-12-06 22:57:05: Parsing the host name
2009-12-06 22:57:05: Checking for super user privileges
2009-12-06 22:57:05: User has super user privileges
Using configuration parameter file: /u01/11.2.0/grid/crs/install/crsconfig_params
CRS is already configured on this node for crshome=0
Cannot configure two CRS instances on the same cluster.
Please deconfigure before proceeding with the configuration of new home.

As you see, it didn't allow me to re-run it. I needed to find a way to deconfigure the configuration. After a quick search in the official doc I found the way.

According to the doc, all I needed to do was run the command below and re-run root.sh:

$GRID_HOME/crs/install/rootcrs.pl -deconfig

Here is what happened when I ran the deconfigure:

2009-12-07 00:35:17: Parsing the host name
2009-12-07 00:35:17: Checking for super user privileges
2009-12-07 00:35:17: User has super user privileges
Using configuration parameter file: /u01/grid/11.2.0/crs/install/crsconfig_params
Oracle Clusterware stack is not active on this node
Restart the clusterware stack (use /u01/grid/11.2.0/bin/crsctl start crs) and retry
Failed to verify resources

Still wasn't working??? I tried the -force option and it seemed like it deconfigured successfully (maybe :) )

# /u01/grid/11.2.0/crs/install/rootcrs.pl -deconfig -force
2009-12-07 00:39:13: Parsing the host name
2009-12-07 00:39:13: Checking for super user privileges
2009-12-07 00:39:13: User has super user privileges
Using configuration parameter file: /u01/grid/11.2.0/crs/install/crsconfig_params
PRCR-1035 : Failed to look up CRS resource ora.cluster_vip.type for 1
PRCR-1068 : Failed to query resources
Cannot communicate with crsd
PRCR-1070 : Failed to check if resource ora.gsd is registered
Cannot communicate with crsd
PRCR-1070 : Failed to check if resource ora.ons is registered
Cannot communicate with crsd
PRCR-1070 : Failed to check if resource ora.eons is registered
Cannot communicate with crsd

CRS-4133: Oracle High Availability Services has been stopped.
Successfully deconfigured Oracle clusterware stack on this node

It says it successfully deconfigured, but when I ran root.sh again I got this:

Disk Group DATA already exists. Cannot be created again

Configuration of ASM failed, see logs for details
Did not succssfully configure and start ASM
CRS-2500: Cannot stop resource 'ora.crsd' as it is not running
CRS-4000: Command Stop failed, or completed with errors.
Command return code of 1 (256) from command: /u01/grid/11.2.0/bin/crsctl stop resource ora.crsd -init
Stop of resource "ora.crsd -init" failed
Failed to stop CRSD
CRS-2500: Cannot stop resource 'ora.asm' as it is not running
CRS-4000: Command Stop failed, or completed with errors.
Command return code of 1 (256) from command: /u01/grid/11.2.0/bin/crsctl stop resource ora.asm -init
Stop of resource "ora.asm -init" failed
Failed to stop ASM
CRS-2673: Attempting to stop 'ora.ctssd' on 'solarac1'
CRS-2677: Stop of 'ora.ctssd' on 'solarac1' succeeded
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'solarac1'
CRS-2677: Stop of 'ora.cssdmonitor' on 'solarac1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'solarac1'
CRS-2677: Stop of 'ora.cssd' on 'solarac1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'solarac1'
CRS-2677: Stop of 'ora.gpnpd' on 'solarac1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'solarac1'
CRS-2677: Stop of 'ora.gipcd' on 'solarac1' succeeded
CRS-2673: Attempting to stop 'ora.mdnsd' on 'solarac1'
CRS-2677: Stop of 'ora.mdnsd' on 'solarac1' succeeded
Initial cluster configuration failed.  See /u01/grid/11.2.0/cfgtoollogs/crsconfig/rootcrs_solarac2.log for details

The mentioned logfile says:

2009-12-07 00:43:26: Executing as grid: /u01/grid/11.2.0/bin/asmca -silent -diskGroupName DATA -diskList /dev/rdsk/c1t1d0s1,/dev/rdsk/c1t2d0s1,/dev/rdsk/c1t3
d0s1,/dev/rdsk/c1t4d0s1 -redundancy EXTERNAL -configureLocalASM
2009-12-07 00:43:26: Running as user grid: /u01/grid/11.2.0/bin/asmca -silent -diskGroupName DATA -diskList /dev/rdsk/c1t1d0s1,/dev/rdsk/c1t2d0s1,/dev/rdsk/c
1t3d0s1,/dev/rdsk/c1t4d0s1 -redundancy EXTERNAL -configureLocalASM
2009-12-07 00:43:26:   Invoking "/u01/grid/11.2.0/bin/asmca -silent -diskGroupName DATA -diskList /dev/rdsk/c1t1d0s1,/dev/rdsk/c1t2d0s1,/dev/rdsk/c1t3d0s1,/d
ev/rdsk/c1t4d0s1 -redundancy EXTERNAL -configureLocalASM" as user "grid"
2009-12-07 00:43:30: Configuration of ASM failed, see logs for details

Basically it configures ASM with the asmca command. The asmca utility does not have a drop diskgroup option, which makes it unusable for this situation. (There is a deleteASM option, but it does not work here because it needs a working ASM instance, which wasn't possible after the failed root.sh.)

I didn't want to delete the whole CRS installation, so I needed a way to remove the diskgroup information from the ASM disks.

All I needed was the dd command to remove the disk header information from the devices.

I had 4 disks presented for that disk group, so I used the dd command on all of them. (I am not sure; maybe I needed only the first device. I need to check the invaluable presentation by Julian Dyke about ASM internals.)

# dd if=/dev/zero of=/dev/rdsk/c1t2d0s1 bs=1024K count=100
dd: bad numeric argument: "1024K"
bash-3.00# dd if=/dev/zero of=/dev/rdsk/c1t2d0s1 bs=1k count=1000000
1000000+0 records in
1000000+0 records out
# dd if=/dev/zero of=/dev/rdsk/c1t1d0s1 bs=1k count=1000000
1000000+0 records in
1000000+0 records out
# dd if=/dev/zero of=/dev/rdsk/c1t3d0s1 bs=1k count=1000000
1000000+0 records in
1000000+0 records out
# dd if=/dev/zero of=/dev/rdsk/c1t4d0s1 bs=1k count=1000000
1000000+0 records in
1000000+0 records out

After this deletion I re-ran the deconfigure script and re-ran root.sh. Everything worked fine without any problem at all. The story will continue with how to install 11GR2 RAC on Solaris 10 on VMware (give me a bit more time to finish).

Footnote: A similar issue is reported on Metalink for Linux (ML 955550.1).

Sources used
Oracle® Grid Infrastructure Installation Guide 11g Release 2 (11.2) for Solaris Operating System

How to use Files in place of Real Disk Devices for ASM – (Solaris) by Jeff Hunter

How to rerun root.sh during initial installation of GRID Infrastructure. by RACHELP

June 28, 2007

Sub Latch Wait Events with weak documentation

Filed under: RAC — coskan @ 8:57 am

Prior to Oracle Database 10g, all latch waits showed up as the "latch free" wait event. In Oracle Database 10g, common latches are broken out and have independent wait event names and statistics, but documentation about these independent wait events seems to have been forgotten.
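A quick way to see which latch waits have been broken out on your own version is to query v$event_name; a minimal sketch:

-- List the latch waits broken out of the generic "latch free" event
SELECT name
FROM   v$event_name
WHERE  name LIKE 'latch:%'
ORDER  BY name;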

Yesterday I saw long waits on the “latch: KCL gc element parent latch” event in the 10046 trace output of one of our SQLs. I searched the documentation, Metalink and Google; all I found were 15 bug pages on Metalink, no page in the documentation and 6 pages on Google, and none of these pages talk about the definition of the event. They only show that this event has a place in tkprof outputs. But my SQL is suffering from this event; what should I do now to reduce these long waits???

I hope the documentation about latch sub-events will be more definitive in the upcoming 11G.

PS: Any comment about the definition of this wait event is welcome.

January 29, 2007

Changing UNDO_RETENTION on RAC (10.1.0.X)

Filed under: RAC — coskan @ 12:13 pm

4 months ago, while we were archiving our database (2-node RAC), we got continuous ORA-01555 errors for the archiving queries.

I decided to update the undo_retention parameter with the command:

alter system set undo_retention=36000 scope=both;

Then the DB hung. One node was accepting “sqlplus / as sysdba” while the other node (the node where I altered the DB) wasn't.

srvctl stop database -d DB_NAME

srvctl stop instance -d DB_NAME -i INSTANCE_NAME

commands did not work on the unresponsive node.

I took systemstate dumps from the working node and killed the other node's processes from the OS.

After a CRS restart everything worked fine.

I did not understand why until the response to the Metalink TAR.

This was related to the bug https://metalink.oracle.com/metalink/plsql/showdoc?db=Bug&id=4220405

It says

Bug 4220405: ALTER SYSTEM SET UNDO_RETENTION=<N> HANGS RAC INSTANCES (closed as base bug 3023661)
The problem is caused by a deadlock between CKPT and a PZ99 slave. The internal algorithm that queries the current value of undo_retention on each instance and modifies it has a problem: it does an unnecessary gv$ query and a lob$ update when an spfile is used. It is a coding problem. The bug has been fixed in 10.2 and is not backportable to 10.1.0.x due to a code structure change.

The usage of UNDO_RETENTION on RAC (10.1.0.x) is below:

alter system set undo_retention=1800 sid='RAC1';
alter system set undo_retention=1800 sid='RAC2';
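To confirm that every instance picked up the new value, a quick check against gv$parameter works; a minimal sketch:

-- Verify the per-instance undo_retention setting after the change
SELECT inst_id, value
FROM   gv$parameter
WHERE  name = 'undo_retention'
ORDER  BY inst_id;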

Be careful while changing undo_retention on RAC.

 
