Here comes the nice one, What would you check when you are on 3 node RAC where 4 different database are balanced with services and Developer of one of the DBs come and asked you why their newly deployed batch is 3 times slower on PRD comparing to QA box which was snap copied (storage snap) from production box. RAC specs are exactly same plus all parameters are same ?
PRD runtime plan
PROD> @mon_sqlid2 52dr7m8nnx5mk DBMS_SQLTUNE.REPORT_SQL_MONITOR(SQL_ID=>'52DR7M8NNX5MK') ---------------------------------------------------------------------------------------------------------------------------------------------------------- SQL Monitoring Report SQL Text ------------------------------ select PKEY, ISIN from BATCH_TB where BATCH = :1 order by PKEY Global Information ------------------------------ Status : DONE (ALL ROWS) Instance ID : 2 Session : BATCH_APP (414:13998) SQL ID : 52dr7m8nnx5mk SQL Execution ID : 33554695 Execution Started : 06/14/2011 13:25:52 First Refresh Time : 06/14/2011 13:25:56 Last Refresh Time : 06/14/2011 13:26:08 Duration : 16s Module/Action : JDBC Thin Client/- Service : PRD Program : JDBC Thin Client Fetch Calls : 1 Binds ======================================================================================================================== | Name | Position | Type | Value | ======================================================================================================================== | :1 | 1 | NUMBER | 37755 | ======================================================================================================================== Global Stats ========================================================================================= | Elapsed | Cpu | IO | Application | Other | Fetch | Buffer | Read | Read | | Time(s) | Time(s) | Waits(s) | Waits(s) | Waits(s) | Calls | Gets | Reqs | Bytes | ========================================================================================= | 16 | 0.62 | 16 | 0.01 | 0.03 | 1 | 60425 | 3874 | 469MB | ========================================================================================= SQL Plan Monitoring Details (Plan Hash Value=4053290152) ==================================================================================================================================================== | Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Read | Read | Activity | Activity Detail | | | | | (Estim) | | Active(s) | Active | | (Actual) | Reqs | Bytes | (%) | (# samples) | ==================================================================================================================================================== | 0 | SELECT STATEMENT | | | | | | 1 | | | | | | | 1 | SORT ORDER BY | | 347 | 13216 | | | 1 | | | | | | | 2 | TABLE ACCESS FULL | BATCH_TB | 347 | 13215 | 17 | +1 | 1 | 0 | 2996 | 370MB | 100.00 | direct path read (17) | ====================================================================================================================================================
QA runtime plan
QA> @mon_sqlid2 52dr7m8nnx5mk DBMS_SQLTUNE.REPORT_SQL_MONITOR(SQL_ID=>'52DR7M8NNX5MK') -------------------------------------------------------------------------------------------------------------------------------------------------------------------- SQL Monitoring Report SQL Text ------------------------------ select PKEY, ISIN from BATCH_TB where BATCH = :1 order by PKEY Global Information ------------------------------ Status : DONE (ALL ROWS) Instance ID : 3 Session : BATCH_APP (621:853) SQL ID : 52dr7m8nnx5mk SQL Execution ID : 50331656 Execution Started : 06/14/2011 11:57:46 First Refresh Time : 06/14/2011 11:57:50 Last Refresh Time : 06/14/2011 11:57:52 Duration : 6s Module/Action : JDBC Thin Client/- Service : QA Program : JDBC Thin Client Fetch Calls : 1 Binds ======================================================================================================================== | Name | Position | Type | Value | ======================================================================================================================== | :1 | 1 | NUMBER | 37755 | ======================================================================================================================== Global Stats =========================================================================== | Elapsed | Cpu | IO | Cluster | Fetch | Buffer | Read | Read | | Time(s) | Time(s) | Waits(s) | Waits(s) | Calls | Gets | Reqs | Bytes | =========================================================================== | 6.22 | 1.56 | 4.14 | 0.53 | 1 | 59621 | 966 | 107MB | =========================================================================== SQL Plan Monitoring Details (Plan Hash Value=4053290152) ========================================================================================================================================================= | Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Read | Read | Activity | Activity Detail | | | | | (Estim) | | Active(s) | Active | | (Actual) | Reqs | Bytes | (%) | (# samples) | ========================================================================================================================================================= | 0 | SELECT STATEMENT | | | | | | 1 | | | | | | | 1 | SORT ORDER BY | | 338 | 12544 | | | 1 | | | | | | | 2 | TABLE ACCESS FULL | BATCH_TB | 338 | 12543 | 6 | +1 | 1 | 0 | 435 | 47MB | 100.00 | Cpu (2) | | | | | | | | | | | | | | db file scattered read (4) | =========================================================================================================================================================
(I hope you like the posh sql monitoring output which I really enjoy to use. I have to admit, I wasn’t the big fan of sql monitoring on GUI-just because of the speed of OEM- till I discovered the sqlplus options like the ones above, and now I can’t stop myself from using it and it is very good alternative to trace which is expensive by all means on PRD )
From the ouput above, you can see PRD database is doing direct path reads and QA database is using db_file_scattered reads and PRD is 3 times slower for each run. Somehow oracle favours doing 11G direct path implementation for full table scan (Links for direct path Doug Burns,Alex Fatkulin, Dion Cho ) on PROD. If you read the links you can see that this has to do something with the memory the buffer cache and small table threshold (for more information about this threshold Charles Hooper-great discussion with Jonathan Lewis on comments, Tanel Poder ).
Initially when I checked the output, I saw something weird which is against our policy for this DB. We configured the user services to run on Node-2 only and somehow on QA query runs on Node-3 (instance id), I double checked this with developer and we found out that on test app server they are not using services as service name but instead they use instance name, so he is able to run query on node-3 on QA. I thought this might be the case since Oracle instance is not highly active in terms of memory usage on node-3 so Oracle might favour using expensive buffer cache reads instead of direct path reads. This was a goodish as initial shot but ended up disappointment because when I test the same query on node-2 results were again same. Wasted 5 important minutes 🙂
After losing 5 minutes on initial shot, now it is time to be more scientific, Lets check what the “documented” memory parameters are
===================QA Parameters ============================ QA> show parameter sga_ NAME TYPE VALUE ------------------------------------ ----------- ------------- sga_max_size big integer 10G sga_target big integer 8G QA> show parameter pga_ NAME TYPE VALUE ------------------------------------ ----------- ------------- pga_aggregate_target big integer 6435M QA> show parameter db_cache_size NAME TYPE VALUE ------------------------------------ ----------- ------------- db_cache_size big integer 0 QA> show parameter shared_pool NAME TYPE VALUE ------------------------------------ ----------- ------------- shared_pool_reserved_size big integer 70464307 shared_pool_size big integer 0 QA> show parameter memory NAME TYPE VALUE ------------------------------------ ----------- ------------ hi_shared_memory_address integer 0 memory_max_target big integer 0 memory_target big integer 0 shared_memory_address integer 0 ===================PRD Parameters ============================ PROD> show parameter sga_ NAME TYPE VALUE ------------------------------------ ----------- ---------------------------- sga_max_size big integer 10G sga_target big integer 8G PROD> show parameter pga_ NAME TYPE VALUE ------------------------------------ ----------- ---------------------------- pga_aggregate_target big integer 6435M PROD> show parameter db_cache_size NAME TYPE VALUE ------------------------------------ ----------- ---------------------------- db_cache_size big integer 0 PROD> show parameter shared_pool NAME TYPE VALUE ------------------------------------ ----------- ---------------------------- shared_pool_reserved_size big integer 291923558 shared_pool_size big integer 0 PROD> show parameter memory NAME TYPE VALUE ------------------------------------ ----------- ---------------------------- hi_shared_memory_address integer 0 memory_max_target big integer 0 memory_target big integer 0 shared_memory_address integer 0 PROD>
This output told me that we have everything same in terms of static parameters, lets check what the small table threshold —> using Tanel Poders pd script for hidden parameters (needs sys access)
QA> @pd small_table_threshold NAME VALUE ISDEFAULT DESCRIPTION ---------------------------------------- ------------------------------ --------- ------------------------------------------------------- _small_table_threshold 15598 TRUE lower threshold level of table size for direct reads PROD> @pd small_table_threshold NAME VALUE ISDEFAULT DESCRIPTION ---------------------------------------- ------------------------------ --------- ------------------------------------------------------ _small_table_threshold 5830 TRUE lower threshold level of table size for direct reads
Lets check if the small_table_threshold*5 (check Tanels answer on the oracle_l question ) is bigger than number of blocks of the BATCH_TBS which is probably the decision maker for favouring direct path reads
=============================QA ==================================== ---number of blocks of BATCH_TB table QA> select blocks from dba_segments where segment_name='BATCH_TB'; BLOCKS ---------- 60416 ---5 times small_table_threshold QA> select 15598 * 5 from dual; 15598*5 ---------- 77990 =============================PRD ==================================== ---number of blocks of BATCH_TB table PRD> select blocks from dba_segments where segment_name='BATCH_TB'; BLOCKS ---------- 61440 ---5 times small_table_threshold PRD> select 5830 * 5 from dual; 5830*5 ---------- 29150
As you can see we find the root cause but I don’t think developer really cares about the root cause. All he need is a an answer Why they hell this code is working on QA and how can we avoid this problem ?
To answer his questions we need to look at the formula of small_table_threshold which is %2 of the buffer cache size. Which means our buffer cache is bigger on QA. Lets check
=============================QA ==================================== QA> select 0.02*current_size/8192 from v$sga_dynamic_components 2 where component='DEFAULT buffer cache'; 0.02*CURRENT_SIZE/8192 ---------------------- 15482.88 QA> select component,current_size/1024/1024 current_size_mb 2 from v$sga_dynamic_components; COMPONENT CURRENT_SIZE_MB --------------------------- --------------- shared pool 1888 large pool 32 java pool 160 streams pool 0 DEFAULT buffer cache 6048 KEEP buffer cache 0 RECYCLE buffer cache 0 DEFAULT 2K buffer cache 0 DEFAULT 4K buffer cache 0 DEFAULT 8K buffer cache 0 DEFAULT 16K buffer cache 0 DEFAULT 32K buffer cache 0 Shared IO Pool 0 ASM Buffer Cache 0 14 rows selected. =============================PRD ==================================== PRD> select 0.02*current_size/8192 from v$sga_dynamic_components 2 where component='DEFAULT buffer cache'; 0.02*CURRENT_SIZE/8192 ---------------------- 5324.8 PRD> select component,current_size/1024/1024 current_size_mb 2 from v$sga_dynamic_components; COMPONENT CURRENT_SIZE_MB ---------------------------- --------------- shared pool 5952 large pool 64 java pool 32 streams pool 0 DEFAULT buffer cache 2080 KEEP buffer cache 0 RECYCLE buffer cache 0 DEFAULT 2K buffer cache 0 DEFAULT 4K buffer cache 0 DEFAULT 8K buffer cache 0 DEFAULT 16K buffer cache 0 DEFAULT 32K buffer cache 0 Shared IO Pool 0 ASM Buffer Cache 0
As you can see %2 of buffer cache formula is true and PRD buffer cache is 3 times less than QA buffer cache.
Why this can be ? Here is the answer
QA> select count(*) from v$sql;
COUNT(*)
———-
26135
PRD> select count(*) from v$sql;
COUNT(*)
———-
68711
We are running 3 times less number of sqls on QA which leads us having 1.8GB shared pool enough on QA but not enough on PRD.
This causes shrink of buffer cache and makes us think we have visually “SAME” environment but end up realistically “DIFFERENT” environments.
My solution for the problem was creating a better index on the table which needed to be there anyway (cannot touch system wide PRD setting):)
The real and long term solution to avoid these kind of QA-PRD different problems is in my opinion having bare minimums by setting db_cache_size and shared_pool_size
and not letting Oracle to decide to shrink below real production levels.
Hope you enjoyed as much as I enjoyed to diagnose it.
!!Make sure you read all the links in this post they are all informative
Note: I needed to stop my upgrade series but will continue as soon as possible 😦 Really sorry if someone is waiting an input from me.
mon_sqlid2 script is like follow
set long 10000000 set longchunksize 10000000 select dbms_sqltune.report_sql_monitor (sql_id=>'&1') from dual;
pd script to get hidden parameters (From Tanel Poder)
col pd_name head NAME for a40 col pd_value head VALUE for a30 column pd_descr heading DESCRIPTION format a55 word_wrap select n.ksppinm pd_name, c.ksppstvl pd_value, ksppstdf isdefault,n.ksppdesc pd_descr from x$ksppi n, x$ksppcv c where n.indx=c.indx and ( lower(n.ksppinm) like lower('&1') or lower(n.ksppdesc) like lower('&1') );
Coskan, I think this is an excellent blog entry as it takes the reader through a good diagnostic approach and is well written and displayed.
I noted that in your seemingly similar environments shared_pool_reserved_size is 278M on prod and only 67M on QA
What do you see as the benefit of setting sga_max_size as 10Gb but sga_target to only use 8Gb. Teh additional 2Gb is still grabbed by oracle and is therefore unusable to anybody else. I appreciate that you can grow the database parameters later, without a bounce, but to my mind you might as well use that extra 2Gb in the buffer cache for now.
regards
John
Comment by John — June 24, 2011 @ 12:53 pm
John, Thank you very much for stopping by and your comments. The shared_pool_reserved_size is not set explicitly in our case.
QA> show spparameter shared
SID NAME TYPE VALUE
——– —————————– ———– —————————-
* hi_shared_memory_address integer
* max_shared_servers integer
* shared_memory_address integer
* shared_pool_reserved_size big integer
* shared_pool_size big integer
* shared_server_sessions integer
* shared_servers integer
it normally should have been auto adjusted to its default which is 5% of the value of SHARED_POOL_SIZE but a little higher in our case which is I am not sure why.
When it comes to 10GB-8GB is just for the flexibility to grow “online” in case of a problem and secure the additional 2GB space to this database. I always find it more practical to give resources gradually to the end users so we can track down the real usage and not letting them waste the resources. When you have resource somebody will use it but real question is should they 🙂
Comment by coskan gundogar — June 24, 2011 @ 1:15 pm
RAC specs are exactly same plus all parameters are same ?
Ignore the parameters behind the curtain, bwahaha.
Excellent example! I especially liked the part where the spread of sqls made such a difference. That’s the kind of trap a load test could easily miss.
Comment by jgarry — June 24, 2011 @ 11:28 pm
Nice diagnostic approach. Thanks for sharing it 🙂
Comment by Anand — July 6, 2011 @ 2:11 am
Excellent approach, thanks for sharing. waiting for other series on upgrade.
For dbms_Sqltune.report_sql_monitor, it requires additional parameters to report sqls for other sessions. But here you simply passed sql_id. Did these sqls were run from your session only ?
Regards,
Jagjeet Singh
Comment by Jagjeet Singht — July 6, 2011 @ 3:59 pm
Hi Jagjeet,
You are right it normally asks for other parameters as well if you want to see the runtime of a sql for a spesific session. At this occasion I am sure that this is the one I run so I only pass the sql_id but if you want to see session id spesific then better you also pass sql_id and sid together like below
undefine sql_id session_id
set long 10000000
set longchunksize 10000000
select dbms_sqltune.report_sql_monitor (sql_id=>’&sql_id’,session_id=>&session_id) from dual;
Comment by Coskan Gundogar — July 10, 2011 @ 11:09 am
Nice point Coskan.
Enjoyed reading ….
Damir Vadas
Comment by Damir Vadas — July 13, 2011 @ 10:10 pm
It is frustrating if you try to log on to your site, and the server is
down. Businesses looking for web hosting services would typically already have a their own domain name or register one when signing up for hosting.
In other words it’s time to turn the system off of autopilot and
decide how you want to get to your destination.
Comment by 3hostcoupons.com — December 5, 2015 @ 10:24 am