Here comes the nice one, What would you check when you are on 3 node RAC where 4 different database are balanced with services and Developer of one of the DBs come and asked you why their newly deployed batch is 3 times slower on PRD comparing to QA box which was snap copied (storage snap) from production box. RAC specs are exactly same plus all parameters are same ?
PRD runtime plan
PROD> @mon_sqlid2 52dr7m8nnx5mk DBMS_SQLTUNE.REPORT_SQL_MONITOR(SQL_ID=>'52DR7M8NNX5MK') ---------------------------------------------------------------------------------------------------------------------------------------------------------- SQL Monitoring Report SQL Text ------------------------------ select PKEY, ISIN from BATCH_TB where BATCH = :1 order by PKEY Global Information ------------------------------ Status : DONE (ALL ROWS) Instance ID : 2 Session : BATCH_APP (414:13998) SQL ID : 52dr7m8nnx5mk SQL Execution ID : 33554695 Execution Started : 06/14/2011 13:25:52 First Refresh Time : 06/14/2011 13:25:56 Last Refresh Time : 06/14/2011 13:26:08 Duration : 16s Module/Action : JDBC Thin Client/- Service : PRD Program : JDBC Thin Client Fetch Calls : 1 Binds ======================================================================================================================== | Name | Position | Type | Value | ======================================================================================================================== | :1 | 1 | NUMBER | 37755 | ======================================================================================================================== Global Stats ========================================================================================= | Elapsed | Cpu | IO | Application | Other | Fetch | Buffer | Read | Read | | Time(s) | Time(s) | Waits(s) | Waits(s) | Waits(s) | Calls | Gets | Reqs | Bytes | ========================================================================================= | 16 | 0.62 | 16 | 0.01 | 0.03 | 1 | 60425 | 3874 | 469MB | ========================================================================================= SQL Plan Monitoring Details (Plan Hash Value=4053290152) ==================================================================================================================================================== | Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Read | Read | Activity | Activity Detail | | | | | (Estim) | | Active(s) | Active | | (Actual) | Reqs | Bytes | (%) | (# samples) | ==================================================================================================================================================== | 0 | SELECT STATEMENT | | | | | | 1 | | | | | | | 1 | SORT ORDER BY | | 347 | 13216 | | | 1 | | | | | | | 2 | TABLE ACCESS FULL | BATCH_TB | 347 | 13215 | 17 | +1 | 1 | 0 | 2996 | 370MB | 100.00 | direct path read (17) | ====================================================================================================================================================
QA runtime plan
QA> @mon_sqlid2 52dr7m8nnx5mk DBMS_SQLTUNE.REPORT_SQL_MONITOR(SQL_ID=>'52DR7M8NNX5MK') -------------------------------------------------------------------------------------------------------------------------------------------------------------------- SQL Monitoring Report SQL Text ------------------------------ select PKEY, ISIN from BATCH_TB where BATCH = :1 order by PKEY Global Information ------------------------------ Status : DONE (ALL ROWS) Instance ID : 3 Session : BATCH_APP (621:853) SQL ID : 52dr7m8nnx5mk SQL Execution ID : 50331656 Execution Started : 06/14/2011 11:57:46 First Refresh Time : 06/14/2011 11:57:50 Last Refresh Time : 06/14/2011 11:57:52 Duration : 6s Module/Action : JDBC Thin Client/- Service : QA Program : JDBC Thin Client Fetch Calls : 1 Binds ======================================================================================================================== | Name | Position | Type | Value | ======================================================================================================================== | :1 | 1 | NUMBER | 37755 | ======================================================================================================================== Global Stats =========================================================================== | Elapsed | Cpu | IO | Cluster | Fetch | Buffer | Read | Read | | Time(s) | Time(s) | Waits(s) | Waits(s) | Calls | Gets | Reqs | Bytes | =========================================================================== | 6.22 | 1.56 | 4.14 | 0.53 | 1 | 59621 | 966 | 107MB | =========================================================================== SQL Plan Monitoring Details (Plan Hash Value=4053290152) ========================================================================================================================================================= | Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Read | Read | Activity | Activity Detail | | | | | (Estim) | | Active(s) | Active | | (Actual) | Reqs | Bytes | (%) | (# samples) | ========================================================================================================================================================= | 0 | SELECT STATEMENT | | | | | | 1 | | | | | | | 1 | SORT ORDER BY | | 338 | 12544 | | | 1 | | | | | | | 2 | TABLE ACCESS FULL | BATCH_TB | 338 | 12543 | 6 | +1 | 1 | 0 | 435 | 47MB | 100.00 | Cpu (2) | | | | | | | | | | | | | | db file scattered read (4) | =========================================================================================================================================================
(I hope you like the posh sql monitoring output which I really enjoy to use. I have to admit, I wasn’t the big fan of sql monitoring on GUI-just because of the speed of OEM- till I discovered the sqlplus options like the ones above, and now I can’t stop myself from using it and it is very good alternative to trace which is expensive by all means on PRD )
From the ouput above, you can see PRD database is doing direct path reads and QA database is using db_file_scattered reads and PRD is 3 times slower for each run. Somehow oracle favours doing 11G direct path implementation for full table scan (Links for direct path Doug Burns,Alex Fatkulin, Dion Cho ) on PROD. If you read the links you can see that this has to do something with the memory the buffer cache and small table threshold (for more information about this threshold Charles Hooper-great discussion with Jonathan Lewis on comments, Tanel Poder ).
Initially when I checked the output, I saw something weird which is against our policy for this DB. We configured the user services to run on Node-2 only and somehow on QA query runs on Node-3 (instance id), I double checked this with developer and we found out that on test app server they are not using services as service name but instead they use instance name, so he is able to run query on node-3 on QA. I thought this might be the case since Oracle instance is not highly active in terms of memory usage on node-3 so Oracle might favour using expensive buffer cache reads instead of direct path reads. This was a goodish as initial shot but ended up disappointment because when I test the same query on node-2 results were again same. Wasted 5 important minutes
After losing 5 minutes on initial shot, now it is time to be more scientific, Lets check what the “documented” memory parameters are
===================QA Parameters ============================ QA> show parameter sga_ NAME TYPE VALUE ------------------------------------ ----------- ------------- sga_max_size big integer 10G sga_target big integer 8G QA> show parameter pga_ NAME TYPE VALUE ------------------------------------ ----------- ------------- pga_aggregate_target big integer 6435M QA> show parameter db_cache_size NAME TYPE VALUE ------------------------------------ ----------- ------------- db_cache_size big integer 0 QA> show parameter shared_pool NAME TYPE VALUE ------------------------------------ ----------- ------------- shared_pool_reserved_size big integer 70464307 shared_pool_size big integer 0 QA> show parameter memory NAME TYPE VALUE ------------------------------------ ----------- ------------ hi_shared_memory_address integer 0 memory_max_target big integer 0 memory_target big integer 0 shared_memory_address integer 0 ===================PRD Parameters ============================ PROD> show parameter sga_ NAME TYPE VALUE ------------------------------------ ----------- ---------------------------- sga_max_size big integer 10G sga_target big integer 8G PROD> show parameter pga_ NAME TYPE VALUE ------------------------------------ ----------- ---------------------------- pga_aggregate_target big integer 6435M PROD> show parameter db_cache_size NAME TYPE VALUE ------------------------------------ ----------- ---------------------------- db_cache_size big integer 0 PROD> show parameter shared_pool NAME TYPE VALUE ------------------------------------ ----------- ---------------------------- shared_pool_reserved_size big integer 291923558 shared_pool_size big integer 0 PROD> show parameter memory NAME TYPE VALUE ------------------------------------ ----------- ---------------------------- hi_shared_memory_address integer 0 memory_max_target big integer 0 memory_target big integer 0 shared_memory_address integer 0 PROD>
This output told me that we have everything same in terms of static parameters, lets check what the small table threshold —> using Tanel Poders pd script for hidden parameters (needs sys access)
QA> @pd small_table_threshold NAME VALUE ISDEFAULT DESCRIPTION ---------------------------------------- ------------------------------ --------- ------------------------------------------------------- _small_table_threshold 15598 TRUE lower threshold level of table size for direct reads PROD> @pd small_table_threshold NAME VALUE ISDEFAULT DESCRIPTION ---------------------------------------- ------------------------------ --------- ------------------------------------------------------ _small_table_threshold 5830 TRUE lower threshold level of table size for direct reads
Lets check if the small_table_threshold*5 (check Tanels answer on the oracle_l question ) is bigger than number of blocks of the BATCH_TBS which is probably the decision maker for favouring direct path reads
=============================QA ====================================
---number of blocks of BATCH_TB table
QA> select blocks from dba_segments where segment_name='BATCH_TB';
BLOCKS
----------
60416
---5 times small_table_threshold
QA> select 15598 * 5 from dual;
15598*5
----------
77990
=============================PRD ====================================
---number of blocks of BATCH_TB table
PRD> select blocks from dba_segments where segment_name='BATCH_TB';
BLOCKS
----------
61440
---5 times small_table_threshold
PRD> select 5830 * 5 from dual;
5830*5
----------
29150
As you can see we find the root cause but I don’t think developer really cares about the root cause. All he need is a an answer Why they hell this code is working on QA and how can we avoid this problem ?
To answer his questions we need to look at the formula of small_table_threshold which is %2 of the buffer cache size. Which means our buffer cache is bigger on QA. Lets check
=============================QA ====================================
QA> select 0.02*current_size/8192 from v$sga_dynamic_components
2 where component='DEFAULT buffer cache';
0.02*CURRENT_SIZE/8192
----------------------
15482.88
QA> select component,current_size/1024/1024 current_size_mb
2 from v$sga_dynamic_components;
COMPONENT CURRENT_SIZE_MB
--------------------------- ---------------
shared pool 1888
large pool 32
java pool 160
streams pool 0
DEFAULT buffer cache 6048
KEEP buffer cache 0
RECYCLE buffer cache 0
DEFAULT 2K buffer cache 0
DEFAULT 4K buffer cache 0
DEFAULT 8K buffer cache 0
DEFAULT 16K buffer cache 0
DEFAULT 32K buffer cache 0
Shared IO Pool 0
ASM Buffer Cache 0
14 rows selected.
=============================PRD ====================================
PRD> select 0.02*current_size/8192 from v$sga_dynamic_components
2 where component='DEFAULT buffer cache';
0.02*CURRENT_SIZE/8192
----------------------
5324.8
PRD> select component,current_size/1024/1024 current_size_mb
2 from v$sga_dynamic_components;
COMPONENT CURRENT_SIZE_MB
---------------------------- ---------------
shared pool 5952
large pool 64
java pool 32
streams pool 0
DEFAULT buffer cache 2080
KEEP buffer cache 0
RECYCLE buffer cache 0
DEFAULT 2K buffer cache 0
DEFAULT 4K buffer cache 0
DEFAULT 8K buffer cache 0
DEFAULT 16K buffer cache 0
DEFAULT 32K buffer cache 0
Shared IO Pool 0
ASM Buffer Cache 0
As you can see %2 of buffer cache formula is true and PRD buffer cache is 3 times less than QA buffer cache.
Why this can be ? Here is the answer
QA> select count(*) from v$sql;
COUNT(*)
———-
26135
PRD> select count(*) from v$sql;
COUNT(*)
———-
68711
We are running 3 times less number of sqls on QA which leads us having 1.8GB shared pool enough on QA but not enough on PRD.
This causes shrink of buffer cache and makes us think we have visually “SAME” environment but end up realistically “DIFFERENT” environments.
My solution for the problem was creating a better index on the table which needed to be there anyway (cannot touch system wide PRD setting):)
The real and long term solution to avoid these kind of QA-PRD different problems is in my opinion having bare minimums by setting db_cache_size and shared_pool_size
and not letting Oracle to decide to shrink below real production levels.
Hope you enjoyed as much as I enjoyed to diagnose it.
!!Make sure you read all the links in this post they are all informative
Note: I needed to stop my upgrade series but will continue as soon as possible
Really sorry if someone is waiting an input from me.
mon_sqlid2 script is like follow
set long 10000000 set longchunksize 10000000 select dbms_sqltune.report_sql_monitor (sql_id=>'&1') from dual;
pd script to get hidden parameters (From Tanel Poder)
col pd_name head NAME for a40
col pd_value head VALUE for a30
column pd_descr heading DESCRIPTION format a55 word_wrap
select n.ksppinm pd_name, c.ksppstvl pd_value, ksppstdf isdefault,n.ksppdesc pd_descr
from x$ksppi n, x$ksppcv c
where n.indx=c.indx
and (
lower(n.ksppinm) like lower('&1')
or lower(n.ksppdesc) like lower('&1')
);
