Coskan’s Approach to Oracle

June 24, 2011

Plan is same, Wait profile is different ?

Filed under: Performance — coskan @ 10:42 am

Here comes the nice one, What would you check when you are on 3 node RAC where 4 different database are balanced with services and Developer of one of the DBs come and asked you why their newly deployed batch is 3 times slower on PRD comparing to QA box which was snap copied (storage snap) from production box. RAC specs are exactly same plus all parameters are same ?

PRD runtime plan


PROD> @mon_sqlid2  52dr7m8nnx5mk

DBMS_SQLTUNE.REPORT_SQL_MONITOR(SQL_ID=>'52DR7M8NNX5MK')
----------------------------------------------------------------------------------------------------------------------------------------------------------
SQL Monitoring Report

SQL Text
------------------------------
select PKEY, ISIN from BATCH_TB where BATCH = :1 order by PKEY

Global Information
------------------------------
 Status              :  DONE (ALL ROWS)
 Instance ID         :  2
 Session             :  BATCH_APP (414:13998)
 SQL ID              :  52dr7m8nnx5mk
 SQL Execution ID    :  33554695
 Execution Started   :  06/14/2011 13:25:52
 First Refresh Time  :  06/14/2011 13:25:56
 Last Refresh Time   :  06/14/2011 13:26:08
 Duration            :  16s
 Module/Action       :  JDBC Thin Client/-
 Service             :  PRD
 Program             :  JDBC Thin Client
 Fetch Calls         :  1

Binds
========================================================================================================================
| Name | Position |  Type  |                                           Value                                           |
========================================================================================================================
| :1   |        1 | NUMBER | 37755                                                                                     |
========================================================================================================================

Global Stats
=========================================================================================
| Elapsed |   Cpu   |    IO    | Application |  Other   | Fetch | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) | Calls |  Gets  | Reqs | Bytes |
=========================================================================================
|      16 |    0.62 |       16 |        0.01 |     0.03 |     1 |  60425 | 3874 | 469MB |
=========================================================================================

SQL Plan Monitoring Details (Plan Hash Value=4053290152)
====================================================================================================================================================
| Id |      Operation       |   Name   |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Read | Read  | Activity |    Activity Detail    |
|    |                      |          | (Estim) |       | Active(s) | Active |       | (Actual) | Reqs | Bytes |   (%)    |      (# samples)      |
====================================================================================================================================================
|  0 | SELECT STATEMENT     |          |         |       |           |        |     1 |          |      |       |          |                       |
|  1 |   SORT ORDER BY      |          |     347 | 13216 |           |        |     1 |          |      |       |          |                       |
|  2 |    TABLE ACCESS FULL | BATCH_TB |     347 | 13215 |        17 |     +1 |     1 |        0 | 2996 | 370MB |   100.00 | direct path read (17) |
====================================================================================================================================================

QA runtime plan



QA> @mon_sqlid2 52dr7m8nnx5mk

DBMS_SQLTUNE.REPORT_SQL_MONITOR(SQL_ID=>'52DR7M8NNX5MK')
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL Monitoring Report

SQL Text
------------------------------
select PKEY, ISIN from BATCH_TB where BATCH = :1 order by PKEY

Global Information
------------------------------
 Status              :  DONE (ALL ROWS)
 Instance ID         :  3
 Session             :  BATCH_APP (621:853)
 SQL ID              :  52dr7m8nnx5mk
 SQL Execution ID    :  50331656
 Execution Started   :  06/14/2011 11:57:46
 First Refresh Time  :  06/14/2011 11:57:50
 Last Refresh Time   :  06/14/2011 11:57:52
 Duration            :  6s
 Module/Action       :  JDBC Thin Client/-
 Service             :  QA
 Program             :  JDBC Thin Client
 Fetch Calls         :  1

Binds
========================================================================================================================
| Name | Position |  Type  |                                           Value                                           |
========================================================================================================================
| :1   |        1 | NUMBER | 37755                                                                                     |
========================================================================================================================

Global Stats
===========================================================================
| Elapsed |   Cpu   |    IO    | Cluster  | Fetch | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) | Waits(s) | Calls |  Gets  | Reqs | Bytes |
===========================================================================
|    6.22 |    1.56 |     4.14 |     0.53 |     1 |  59621 |  966 | 107MB |
===========================================================================

SQL Plan Monitoring Details (Plan Hash Value=4053290152)
=========================================================================================================================================================
| Id |      Operation       |   Name   |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Read | Read  | Activity |      Activity Detail       |
|    |                      |          | (Estim) |       | Active(s) | Active |       | (Actual) | Reqs | Bytes |   (%)    |        (# samples)         |
=========================================================================================================================================================
|  0 | SELECT STATEMENT     |          |         |       |           |        |     1 |          |      |       |          |                            |
|  1 |   SORT ORDER BY      |          |     338 | 12544 |           |        |     1 |          |      |       |          |                            |
|  2 |    TABLE ACCESS FULL | BATCH_TB |     338 | 12543 |         6 |     +1 |     1 |        0 |  435 |  47MB |   100.00 | Cpu (2)                    |
|    |                      |          |         |       |           |        |       |          |      |       |          | db file scattered read (4) |
=========================================================================================================================================================

(I hope you like the posh sql monitoring output which I really enjoy to use. I have to admit, I wasn’t the big fan of sql monitoring on GUI-just because of the speed of OEM- till I discovered the sqlplus options like the ones above, and now I can’t stop myself from using it and it is very good alternative to trace which is expensive by all means on PRD )

From the ouput above, you can see PRD database is doing direct path reads and QA database is using db_file_scattered reads and PRD is 3 times slower for each run. Somehow oracle favours doing 11G direct path implementation for full table scan (Links for direct path Doug Burns,Alex Fatkulin, Dion Cho ) on PROD. If you read the links you can see that this has to do something with the memory the buffer cache and small table threshold (for more information about this threshold Charles Hooper-great discussion with Jonathan Lewis on comments, Tanel Poder ).

Initially when I checked the output, I saw something weird which is against our policy for this DB. We configured the user services to run on Node-2 only and somehow on QA query runs on Node-3 (instance id), I double checked this with developer and we found out that on test app server they are not using services as service name but instead they use instance name, so he is able to run query on node-3 on QA. I thought this might be the case since Oracle instance is not highly active in terms of memory usage on node-3 so Oracle might favour using expensive buffer cache reads instead of direct path reads. This was a goodish as initial shot but ended up disappointment because when I test the same query on node-2 results were again same. Wasted 5 important minutes 🙂

After losing 5 minutes on initial shot, now it is time to be more scientific, Lets check what the “documented” memory parameters are

===================QA Parameters ============================

QA> show parameter sga_

NAME                                 TYPE        VALUE
------------------------------------ ----------- -------------
sga_max_size                         big integer 10G
sga_target                           big integer 8G
QA> show parameter pga_

NAME                                 TYPE        VALUE
------------------------------------ ----------- -------------
pga_aggregate_target                 big integer 6435M
QA> show parameter db_cache_size

NAME                                 TYPE        VALUE
------------------------------------ ----------- -------------
db_cache_size                        big integer 0
QA> show parameter shared_pool

NAME                                 TYPE        VALUE
------------------------------------ ----------- -------------
shared_pool_reserved_size            big integer 70464307
shared_pool_size                     big integer 0
QA> show parameter memory

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------
hi_shared_memory_address             integer     0
memory_max_target                    big integer 0
memory_target                        big integer 0
shared_memory_address                integer     0




===================PRD Parameters ============================

PROD> show parameter sga_

NAME                                 TYPE        VALUE
------------------------------------ ----------- ----------------------------
sga_max_size                         big integer 10G
sga_target                           big integer 8G
PROD> show parameter pga_

NAME                                 TYPE        VALUE
------------------------------------ ----------- ----------------------------
pga_aggregate_target                 big integer 6435M

PROD> show parameter db_cache_size

NAME                                 TYPE        VALUE
------------------------------------ ----------- ----------------------------
db_cache_size                        big integer 0
PROD> show parameter shared_pool

NAME                                 TYPE        VALUE
------------------------------------ ----------- ----------------------------
shared_pool_reserved_size            big integer 291923558
shared_pool_size                     big integer 0
PROD> show parameter memory

NAME                                 TYPE        VALUE
------------------------------------ ----------- ----------------------------
hi_shared_memory_address             integer     0
memory_max_target                    big integer 0
memory_target                        big integer 0
shared_memory_address                integer     0
PROD>


This output told me that we have everything same in terms of static parameters, lets check what the small table threshold —> using Tanel Poders pd script for hidden parameters (needs sys access)


QA> @pd small_table_threshold

NAME                                     VALUE                          ISDEFAULT DESCRIPTION
---------------------------------------- ------------------------------ --------- -------------------------------------------------------
_small_table_threshold                   15598                          TRUE      lower threshold level of table size for direct reads


PROD> @pd small_table_threshold

NAME                                     VALUE                          ISDEFAULT DESCRIPTION
---------------------------------------- ------------------------------ --------- ------------------------------------------------------
_small_table_threshold                   5830                           TRUE      lower threshold level of table size for direct reads

Lets check if the small_table_threshold*5 (check Tanels answer on the oracle_l question ) is bigger than number of blocks of the BATCH_TBS which is probably the decision maker for favouring direct path reads

=============================QA ====================================

---number of blocks of BATCH_TB table 

QA> select blocks from dba_segments where segment_name='BATCH_TB';

    BLOCKS
----------
     60416


---5 times small_table_threshold 

QA> select 15598 * 5 from dual;

   15598*5
----------
     77990



=============================PRD ====================================

---number of blocks of BATCH_TB table 

PRD> select blocks from dba_segments where segment_name='BATCH_TB';

    BLOCKS
----------
     61440


---5 times small_table_threshold 

PRD> select 5830 * 5 from dual;

    5830*5
----------
     29150

As you can see we find the root cause but I don’t think developer really cares about the root cause. All he need is a an answer Why they hell this code is working on QA and how can we avoid this problem ?

To answer his questions we need to look at the formula of small_table_threshold which is %2 of the buffer cache size. Which means our buffer cache is bigger on QA. Lets check

=============================QA ====================================


QA> select 0.02*current_size/8192 from v$sga_dynamic_components 
  2 where component='DEFAULT buffer cache';

0.02*CURRENT_SIZE/8192
----------------------
              15482.88



QA>  select component,current_size/1024/1024 current_size_mb
  2  from v$sga_dynamic_components;

COMPONENT                   CURRENT_SIZE_MB
--------------------------- ---------------
shared pool                            1888
large pool                               32
java pool                               160
streams pool                              0
DEFAULT buffer cache                   6048
KEEP buffer cache                         0
RECYCLE buffer cache                      0
DEFAULT 2K buffer cache                   0
DEFAULT 4K buffer cache                   0
DEFAULT 8K buffer cache                   0
DEFAULT 16K buffer cache                  0
DEFAULT 32K buffer cache                  0
Shared IO Pool                            0
ASM Buffer Cache                          0

14 rows selected.


=============================PRD ====================================


PRD> select 0.02*current_size/8192 from v$sga_dynamic_components
  2  where component='DEFAULT buffer cache';

0.02*CURRENT_SIZE/8192
----------------------
                5324.8

PRD> select component,current_size/1024/1024 current_size_mb
  2  from v$sga_dynamic_components;

COMPONENT                    CURRENT_SIZE_MB
---------------------------- ---------------
shared pool                             5952
large pool                                64
java pool                                 32
streams pool                               0
DEFAULT buffer cache                    2080
KEEP buffer cache                          0
RECYCLE buffer cache                       0
DEFAULT 2K buffer cache                    0
DEFAULT 4K buffer cache                    0
DEFAULT 8K buffer cache                    0
DEFAULT 16K buffer cache                   0
DEFAULT 32K buffer cache                   0
Shared IO Pool                             0
ASM Buffer Cache                           0


As you can see %2 of buffer cache formula is true and PRD buffer cache is 3 times less than QA buffer cache.
Why this can be ? Here is the answer

QA> select count(*) from v$sql;

COUNT(*)
———-
26135

PRD> select count(*) from v$sql;

COUNT(*)
———-
68711

We are running 3 times less number of sqls on QA which leads us having 1.8GB shared pool enough on QA but not enough on PRD.
This causes shrink of buffer cache and makes us think we have visually “SAME” environment but end up realistically “DIFFERENT” environments.

My solution for the problem was creating a better index on the table which needed to be there anyway (cannot touch system wide PRD setting):)
The real and long term solution to avoid these kind of QA-PRD different problems is in my opinion having bare minimums by setting db_cache_size and shared_pool_size
and not letting Oracle to decide to shrink below real production levels.

Hope you enjoyed as much as I enjoyed to diagnose it.

!!Make sure you read all the links in this post they are all informative

Note: I needed to stop my upgrade series but will continue as soon as possible 😦 Really sorry if someone is waiting an input from me.

mon_sqlid2 script is like follow

 set long 10000000
 set longchunksize 10000000
 select dbms_sqltune.report_sql_monitor (sql_id=>'&1') from dual;

pd script to get hidden parameters (From Tanel Poder)

col pd_name head NAME for a40
col pd_value head VALUE for a30
column pd_descr heading DESCRIPTION format a55 word_wrap

select n.ksppinm pd_name, c.ksppstvl pd_value, ksppstdf isdefault,n.ksppdesc pd_descr
from x$ksppi n, x$ksppcv c
where n.indx=c.indx
and (
   lower(n.ksppinm) like lower('&1') 
   or lower(n.ksppdesc) like lower('&1')
);

8 Comments »

  1. Coskan, I think this is an excellent blog entry as it takes the reader through a good diagnostic approach and is well written and displayed.

    I noted that in your seemingly similar environments shared_pool_reserved_size is 278M on prod and only 67M on QA

    What do you see as the benefit of setting sga_max_size as 10Gb but sga_target to only use 8Gb. Teh additional 2Gb is still grabbed by oracle and is therefore unusable to anybody else. I appreciate that you can grow the database parameters later, without a bounce, but to my mind you might as well use that extra 2Gb in the buffer cache for now.
    regards
    John

    Comment by John — June 24, 2011 @ 12:53 pm

  2. John, Thank you very much for stopping by and your comments. The shared_pool_reserved_size is not set explicitly in our case.

    QA> show spparameter shared

    SID NAME TYPE VALUE
    ——– —————————– ———– —————————-
    * hi_shared_memory_address integer
    * max_shared_servers integer
    * shared_memory_address integer
    * shared_pool_reserved_size big integer
    * shared_pool_size big integer
    * shared_server_sessions integer
    * shared_servers integer

    it normally should have been auto adjusted to its default which is 5% of the value of SHARED_POOL_SIZE but a little higher in our case which is I am not sure why.

    When it comes to 10GB-8GB is just for the flexibility to grow “online” in case of a problem and secure the additional 2GB space to this database. I always find it more practical to give resources gradually to the end users so we can track down the real usage and not letting them waste the resources. When you have resource somebody will use it but real question is should they 🙂

    Comment by coskan gundogar — June 24, 2011 @ 1:15 pm

  3. RAC specs are exactly same plus all parameters are same ?

    Ignore the parameters behind the curtain, bwahaha.

    Excellent example! I especially liked the part where the spread of sqls made such a difference. That’s the kind of trap a load test could easily miss.

    Comment by jgarry — June 24, 2011 @ 11:28 pm

  4. Nice diagnostic approach. Thanks for sharing it 🙂

    Comment by Anand — July 6, 2011 @ 2:11 am

  5. Excellent approach, thanks for sharing. waiting for other series on upgrade.

    For dbms_Sqltune.report_sql_monitor, it requires additional parameters to report sqls for other sessions. But here you simply passed sql_id. Did these sqls were run from your session only ?

    Regards,
    Jagjeet Singh

    Comment by Jagjeet Singht — July 6, 2011 @ 3:59 pm

    • Hi Jagjeet,

      You are right it normally asks for other parameters as well if you want to see the runtime of a sql for a spesific session. At this occasion I am sure that this is the one I run so I only pass the sql_id but if you want to see session id spesific then better you also pass sql_id and sid together like below

      undefine sql_id session_id
      set long 10000000
      set longchunksize 10000000
      select dbms_sqltune.report_sql_monitor (sql_id=>’&sql_id’,session_id=>&session_id) from dual;

      Comment by Coskan Gundogar — July 10, 2011 @ 11:09 am

  6. Nice point Coskan.
    Enjoyed reading ….

    Damir Vadas

    Comment by Damir Vadas — July 13, 2011 @ 10:10 pm

  7. It is frustrating if you try to log on to your site, and the server is
    down. Businesses looking for web hosting services would typically already have a their own domain name or register one when signing up for hosting.

    In other words it’s time to turn the system off of autopilot and
    decide how you want to get to your destination.

    Comment by 3hostcoupons.com — December 5, 2015 @ 10:24 am


RSS feed for comments on this post. TrackBack URI

Leave a comment

Create a free website or blog at WordPress.com.