Coskan’s Approach to Oracle

January 29, 2010

Working with statspack-part-1b-Solution/Answer/another Question

Filed under: Diary, Performance — coskan @ 6:58 pm

<—-Part 1a

In part 1a I was trying to find a problem to troubleshoot; I spotted 2 possible problems, drilled down into one of them and got stuck. Below is the list of what I did until I got stuck and started spending more time on research than the problem actually needed, because I forgot the most important and easiest step, which I will explain in a moment.

What I did so far:

  • found the possible bottleneck on the graph
  • cross-checked the bottleneck with DB time
  • checked statspack and found “row cache lock”, “log file switch (checkpoint incomplete)” and “buffer busy” waits as queue time
  • checked whether the Dictionary Cache statistics gave a clue – no luck, at least for me
  • checked whether the DB generated more redo than the redo logs could handle – redo activity was way lower than during actual office hours
  • checked the segments for buffer busy waits and found that only 1 segment caused all the waits – but buffer busy waits and row cache lock are not related, are they?

What else do I have in my hand? A skipped snapshot slot: one of the statspack snaps could not be gathered.

What should I have done before going off to read and learn more about those top events (not a waste of time for learning, but a waste of time in a pressure situation)?

The answer: I should have checked the alert log, especially when my information about this historic problem was so limited. I was on call that week and I “assumed” that nothing was wrong with the DB because I did not get called. Assumptions without proof can be a big problem and a big time waster in your troubleshooting practice. After 4 hours of searching the net for causes of row cache lock (which is not a common event), I decided to check the alert log because of that skipped snapshot. What did I find?

Wed Jan 20 21:47:09 2010
Errors in file j:\data\oracle\trace\prod01\bdump\prod01_arc1_7976.trc:
ora-19816: Message 19816 not found; No message file for product=RDBMS, facility=ORA; arguments: [db_recovery_file_dest]
ora-27044: Message 27044 not found; No message file for product=RDBMS, facility=ORA
OSD-04008: WriteFile() failure, unable to write to file
O/S-Error: (OS 112) There is not enough space on the disk.
....
....
....
....
Wed Jan 20 22:28:48 2010
kcrrdmx: Successful archiving of previously failed ORL
Archiver process freed from errors. No longer stopped
Wed Jan 20 22:28:49 2010

As you see, I found that the database had hung because the disk was full. Why was it full? It looks like a mistake, or maybe not. We do exports on the same disk as db_recovery_file_dest, and probably one of my team mates took an extra export during the day and forgot to delete it. With the amount of redo generated that day it is unlikely that we could fill the logical limit, and we did not get any 85% warning in the alert log that day. Why did I not get called? Because the issue lasted 45 minutes and the technical guys probably did not check Big Brother monitoring at the time :) I have no clue how it got resolved.
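As a side note, when the recovery area is the suspect it is worth separating the logical limit from the physical disk. A minimal sketch for checking the logical usage (this only shows consumption against db_recovery_file_dest_size, not the free space on the underlying disk, which is what actually ran out here):

-- Logical usage of the flash recovery area (10g views)
select name,
       space_limit / 1024 / 1024       as space_limit_mb,
       space_used / 1024 / 1024        as space_used_mb,
       space_reclaimable / 1024 / 1024 as reclaimable_mb,
       number_of_files
from   v$recovery_file_dest;

-- Breakdown by file type, as percentages of the logical limit
select file_type, percent_space_used, percent_space_reclaimable, number_of_files
from   v$flash_recovery_area_usage;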

I asked myself: what does this have to do with “row cache lock”? Why not “log file switch (archiving needed)”?

The definition of the expected wait event fits the alert log, so it should be that event – or should it?

log file switch (archiving needed)

Waiting for a log switch because the log that the LGWR will be switching into has not been archived yet. Check the alert file to make sure that archiving has not stopped due to a failed archive write. To speed archiving, consider adding more archive processes or putting the archive files on striped disks.

Wait Time: 1 second

Parameters: None

Let’s check how it was:

SQL> @sp_evtrends
Please enter the ORACLE_SID value: prod01
Please enter the report begin date: sysdate-11
Please enter the report end date:sysdate-8

Some useful database statistics to search upon:
------------------------------------------------------------
        ARCH random i/o
        ARCH sequential i/o
	....
	.....
	.....
What statistic do you want to analyze? log file switch (archiving needed)

...
...
...
Daily trends for "log file switch (archiving needed)"...

Day    Hour   Statistic Name                         Secs Waited Percentage of total over all hours for each day
------ ------ ------------------------------ ------------------- ------------------------------------------------
17:00  log file switch (archiving needed)                0.00
18:00  log file switch (archiving needed)                0.00
19:00  log file switch (archiving needed)                0.00
20:00  log file switch (archiving needed)                0.00
21:00  log file switch (archiving needed)                0.00
22:00  log file switch (archiving needed)                0.00
23:00  log file switch (archiving needed)                0.00

How come? The alert log says I was getting “log file switch (archiving needed)”, so why doesn’t my system event data show it? Let’s check the cumulative waits on this event since the server was started (12/30/2009).

SQL> select * from v$system_event where event like 'log file switch (archiving needed)';

EVENT                                                            TOTAL_WAITS TOTAL_TIMEOUTS TIME_WAITED AVERAGE_WAIT TIME_WAITED_MICRO   EVENT_ID WAIT_CLASS_ID WAIT_CLASS# WAIT_CLASS
---------------------------------------------------------------- ----------- -------------- ----------- ------------ ----------------- ---------- ------------- ----------- ---------------
log file switch (archiving needed)                                    132442          80465     8062918        60.88        8.0629E+10  681532028    3290255840           2 Configuration

It says my DB waited on this event before, but when?

SQL> @sp_event_delta2 "log file switch (archiving needed)"

   SNAP_ID SNAP_TIME         EVENT                                                            TOTAL_WAITS TOTAL_TIMEOUTS TIME_WAITED_MICRO
---------- ----------------- ---------------------------------------------------------------- ----------- -------------- -----------------
      7517 20100108 06:07:52 log file switch (archiving needed)                                        27              5           7022081
      7766 20100113 06:18:58 log file switch (archiving needed)                                         4              0            194101
      7831 20100114 13:49:06 log file switch (archiving needed)                                        16              0           5062268
      7892 20100115 21:05:16 log file switch (archiving needed)                                    132395          80460        8.0617E+10

I have 4 more day/time slots for this event, and it looks like I had a similar issue on 15/01/2010. Let’s check further for that day.

SQL> @sp_systime2
Please enter the ORACLE_SID value: prod01
Please enter the report begin date: sysdate-15
Please enter the report end date:sysdate-13

                                                                        Total             Cum
             Service                                                  Seconds    % of    % of
Day          or Wait  Name                                              Spent   Total   Total
------------ -------- ----------------------------------- ------------------- ------- -------

14-JAN       Service  SQL execution                                 32,871.29   39.24   39.24
             Wait     db file sequential read                       14,875.81   17.76   56.99
             Service  Recursive SQL execution                       13,203.42   15.76   72.75
             Wait     log file parallel write                        3,381.76    4.04   76.79
             Wait     db file scattered read                         3,069.28    3.66   80.45
             Wait     log file sync                                  2,651.77    3.17   83.62
             Wait     db file parallel write                         2,330.67    2.78   86.40
             Wait     Backup: sbtwrite2                              2,323.22    2.77   89.17
             Wait     enq: TX - row lock contention                  1,515.07    1.81   90.98
             Wait     Log archive I/O                                1,387.98    1.66   92.64
             Service  Parsing SQL                                    1,138.64    1.36   94.00
             Wait     log file sequential read                         814.92    0.97   94.97

15-JAN       Wait     log file switch (archiving needed)            80,616.90   31.14   31.14
             Wait     buffer busy waits                             41,401.25   15.99   47.13
             Wait     row cache lock                                39,873.16   15.40   62.53
             Service  SQL execution                                 37,210.97   14.37   76.90
             Wait     db file sequential read                       15,267.63    5.90   82.79
             Service  Recursive SQL execution                       13,166.82    5.09   87.88
             Wait     db file scattered read                         5,490.75    2.12   90.00
             Wait     log file parallel write                        5,157.94    1.99   91.99
             Wait     log file sync                                  3,793.75    1.47   93.46
             Wait     db file parallel write                         3,447.89    1.33   94.79
             Wait     enq: TX - row lock contention                  3,175.70    1.23   96.02
             Wait     Log archive I/O                                2,231.86    0.86   96.88

....
....
...
                                                                              Total             Cum
                   Service                                                  Seconds    % of    % of
Day          Hour  or Wait  Name                                              Spent   Total   Total
------------ ----- -------- ----------------------------------- ------------------- ------- -------
15-JAN      19:00 Service  SQL execution                                  2,193.28   34.66   34.66
                   Wait     db file sequential read                        1,423.59   22.50   57.16
                   Wait     log file parallel write                          543.65    8.59   65.75
                   Wait     db file parallel write                           445.41    7.04   72.79
                   Wait     log file sequential read                         344.37    5.44   78.24
                   Wait     db file scattered read                           304.99    4.82   83.06
         *** 21:00 Wait     log file switch (archiving needed)            80,616.90   48.82   48.82
                   Wait     buffer busy waits                             41,336.81   25.03   73.86
                   Wait     row cache lock                                39,873.16   24.15   98.01
                   Service  SQL execution                                  1,637.31    0.99   99.00
                   Wait     db file sequential read                          359.08    0.22   99.22
                   Service  Recursive SQL execution                          325.17    0.20   99.41
             22:00 Service  SQL execution                                  3,763.74   53.76   53.76
                   Wait     db file sequential read                        1,475.86   21.08   74.84
                   Wait     db file scattered read                           947.84   13.54   88.38
                   Wait     db file parallel write                           313.39    4.48   92.86
                   Service  Recursive SQL execution                          273.72    3.91   96.77

Between 20:00 and 21:00 (attention!! there is no 20:00 data, which means the problem is not between 21:00 and 22:00 but between 20:00 and 21:00),
log file switch (archiving needed) is 48 percent of our response time, which is exactly what you would expect if there was an archiver hang
problem on that day.

Let’s cross-check whether there was a snapshot at 20:00.

SQL> @sp_stats_snap sysdate-14 sysdate-13

SNAP_ID SNAP_TIME
---------- -----------------
...
...
      7890 20100115 19:19:18
      7891 20100115 19:49:18
      7892 20100115 21:05:16  ----where is 20:00
      7893 20100115 21:35:18
      7894 20100115 22:05:18
      7895 20100115 22:35:18
      7896 20100115 23:05:18
      7897 20100115 23:35:18

The reason our report shows the waits between 21:00 and 22:00 is that the query uses the time the snapshot was gathered, so it attributes those waits to 21:00.
Tim wrote the queries for hourly intervals, so they can sometimes mislead you in this kind of situation; be careful and pay attention.
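A rough sketch of the underlying idea (not Tim’s actual script): when computing deltas from the statspack base tables yourself, keeping both interval boundaries visible makes it obvious which period a wait really belongs to.

-- Per-snapshot deltas for one event, with explicit interval boundaries.
-- Note: deltas go negative across an instance restart; Tim's scripts handle
-- that case, this sketch does not.
select s.snap_id,
       lag(s.snap_time) over (order by s.snap_id)  as interval_begin,
       s.snap_time                                 as interval_end,
       e.time_waited_micro
         - lag(e.time_waited_micro) over (order by s.snap_id) as delta_time_waited_micro
from   perfstat.stats$snapshot     s,
       perfstat.stats$system_event e
where  e.snap_id         = s.snap_id
and    e.dbid            = s.dbid
and    e.instance_number = s.instance_number
and    e.event           = 'log file switch (archiving needed)'
order  by s.snap_id;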

The alert log also confirms it: the problem started around 20:00 and ended just after 21:00.

Fri Jan 15 20:00:45 2010
Errors in file j:\data\oracle\trace\prod01\bdump\prod01_arc1_7976.trc:
ORA-19816: Message 19816 not found; No message file for product=RDBMS, facility=ORA; arguments: [db_recovery_file_dest]
ORA-27044: Message 27044 not found; No message file for product=RDBMS, facility=ORA
OSD-04008: WriteFile() failure, unable to write to file
O/S-Error: (OS 112) There is not enough space on the disk.
...
...
...
...
kcrrdmx: Successful archiving of previously failed ORL
Archiver process freed from errors. No longer stopped
Fri Jan 15 21:05:15 2010

So, for exactly the same issue (or maybe not), our DB did not cry about “log file switch (archiving needed)” this time. Why?

I am thinking about possible answers, but I need to test them; meanwhile, maybe you want to share your ideas
before I write the third part of my answer?

Today I kept my promise about giving the reason, but I have to close an answering post with another question,
which is the best part of troubleshooting: it never ends.

Moral of the story: “Check the alert log”.
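If you do not remember where the alert log lives on a given box, a quick way to find it on 10g (where it is written to background_dump_dest as alert_<sid>.log) is simply:

select value as background_dump_dest
from   v$parameter
where  name = 'background_dump_dest';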

January 27, 2010

Working with statspack-part-1a-Diagnosis

Filed under: Basics, Performance — coskan @ 6:24 pm

Part 1b—->

Note: this post is a bit long and might be boring if you are already good with statspack, but there is a question at the end if you would like to answer it.

Yesterday I decided to look for something to troubleshoot without waiting for any user complaints, so I can write something useful here :)
The primary reason for this practice was to get used to interpreting statspack reports to find possible bottlenecks;
the secondary reason was that I was bored with being underutilized.
I always feel comfortable troubleshooting session-wide problems, non-historic performance problems, or historic problems with the Diagnostics Pack,
but I am not that experienced with statspack.
In my current environment I don’t have the Diagnostics Pack option, so I have to make good friends with statspack,
and the best way to become a good friend is to spend more time with your new friend, as I am going to explain.

Characteristics of the DB I will work on:
daily load is between 08:00 and 19:00
export at 21:00, running for roughly 1 hour
DB backup starts at 23:00

There are a couple of ways to spot possible bottlenecks without generating individual reports.

The first one is from Tanel Poder and his famous Perfsheet. This tool is already configured and ready to be used with statspack.
The query he uses to generate the report is built on the same idea as the statspack delta views of Tim Gorman. Let’s start.

The first screen is the login screen (you need to activate the macros).

The inputs for the statspack report are:
Data Source – TNS entry of the DB
Username/Password – a user with read privileges on the statspack schema objects
Which Instance – instance name (lower case, or modify the related SQL in the Queries sheet with the upper function)
Which Query – Statspack CPU+Wait+Stats Hourly
From Date – start date
To Date – end date
Then press “Show me”. You will get warnings during graph generation saying the PivotTable report will not fit on the sheet… Just press OK until the warnings end.
Your graph might be empty on Office 2007. Just go to the Report Filter and select WAIT from EVENT_TYPES to see any abnormality on wait events.

The first graph is the WAIT graph, like the one below. As you see, I was able to spot a possible problem on 20/01/2010 at the first try.

Let’s make it more visible by adding a top 10 filter to the Legend Fields section.

Another option is to focus on the date where the abnormality happened. Let’s select only the 20th. FYI, the top 10 filter is still on.

Without the top 10 filter the graph is slightly different, but it still gives us what we need.

Let’s focus only on the time interval between 18:00 and 23:00.

As you see, there is an abnormal increase in row cache lock, log file switch and buffer busy waits.

Let’s double-check with system statistics. If you select all hours, all available dates and all events, your graph will be a bit misleading because of very high metrics, like below.

What can we do? DB time is what we want to decrease when troubleshooting, so let’s focus on it by filtering the Legend Fields on DB time.

We have 2 peaks, which is a bit confusing given the initial assumption about the problem timing. Let’s focus on the 20th again.

There are two peaks, between 13:00-15:00 and 21:00-23:00. We have an interesting situation here in terms of DB time: from the wait event perspective the abnormality is between 20:00 and 22:00, but from the DB time perspective we have one more time slot, between 13:00 and 15:00.

At this point I quit Perfsheet. I spotted what I needed for the initial problem chase; now I can go through the other options before gathering the report. (These options are much easier than statspack report generation, which is why I did not go directly to statspack.)

The second option is from Tim Gorman himself: sp_systime, which I find very useful for overall response time analysis. (I needed to change the script to work between two dates instead of the days-before-sysdate option; a rough illustration of the change is below.)
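Conceptually the change is just swapping the “last N days” predicate for explicit date bounds against STATS$SNAPSHOT; an illustrative fragment (not Tim’s actual code) would be something like:

-- Illustrative only: &&begin_date / &&end_date accept expressions such as
-- sysdate-8 or to_date('20100119','YYYYMMDD')
select snap_id, snap_time
from   perfstat.stats$snapshot
where  snap_time >= &&begin_date
and    snap_time <  &&end_date
order  by snap_id;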

Here is the output.

The first part is the overall response time for each day.


SQL> @sp_systime2
Please enter the ORACLE_SID value: PROD1
Please enter the report begin date: sysdate-8
Please enter the report end date:sysdate-5

                                                                        Total             Cum
             Service                                                  Seconds    % of    % of
Day          or Wait  Name                                              Spent   Total   Total
------------ -------- ----------------------------------- ------------------- ------- -------
19-JAN       Service  SQL execution                                115,326.24   78.78   78.78
             Wait     db file sequential read                        7,697.33    5.26   84.04
             Wait     db file scattered read                         6,213.01    4.24   88.29
             Service  Recursive SQL execution                        4,638.75    3.17   91.46
             Wait     Backup: sbtwrite2                              2,671.85    1.83   93.28
             Wait     Backup: sbtbackup                              2,081.53    1.42   94.70
             Wait     log file parallel write                        1,423.58    0.97   95.67
             Service  Parsing SQL                                    1,109.49    0.76   96.43
             Wait     db file parallel write                           913.09    0.62   97.06
             Wait     log file sync                                    761.78    0.52   97.58
             Wait     Datapump dump file I/O                           631.13    0.43   98.01
             Wait     Log archive I/O                                  549.08    0.38   98.38

20-JAN       Service  Recursive SQL execution                      255,597.67   65.42   65.42
             Wait     row cache lock                                48,098.17   12.31   77.74
             Wait     log file switch (checkpoint incompl          41,613.30   10.65   88.39
             Wait     buffer busy waits                             13,474.25    3.45   91.84
             Wait     db file sequential read                        9,788.51    2.51   94.34
             Wait     db file scattered read                         6,831.08    1.75   96.09
             Wait     Backup: sbtwrite2                              3,512.51    0.90   96.99
             Wait     log file parallel write                        1,897.38    0.49   97.48
             Wait     Datapump dump file I/O                         1,674.20    0.43   97.90
             Wait     db file parallel write                         1,389.77    0.36   98.26
             Wait     log file sync                                  1,073.14    0.27   98.53
             Service  Parsing SQL                                      964.58    0.25   98.78

21-JAN       Service  SQL execution                                 33,045.26   44.91   44.91
             Service  Recursive SQL execution                       12,785.72   17.38   62.28
             Wait     db file sequential read                        8,296.99   11.28   73.56
             Wait     db file scattered read                         6,731.96    9.15   82.71
             Wait     Backup: sbtwrite2                              3,340.23    4.54   87.25
             Wait     log file parallel write                        1,856.99    2.52   89.77
             Wait     db file parallel write                         1,359.19    1.85   91.62
             Wait     log file sync                                  1,131.31    1.54   93.16
             Service  Parsing SQL                                    1,011.46    1.37   94.53
             Wait     enq: TX - row lock contention                    861.24    1.17   95.70
             Wait     Log archive I/O                                  686.01    0.93   96.63
             Wait     Backup: sbtbackup                                402.52    0.55   97.18

If you check the top 5, you will easily spot that the row cache lock, log file switch and buffer busy waits are different from the other days.

The second part of the report is the hourly response time for each day. I truncated the output, but the full version is here ***

 12:00 Service  SQL execution                                  6,268.92   76.66   76.66
       Wait     db file sequential read                          634.26    7.76   84.42
       Wait     db file scattered read                           344.02    4.21   88.63
       Service  Recursive SQL execution                          230.23    2.82   91.44
       Wait     log file parallel write                          147.58    1.80   93.25
       Wait     log file sync                                    129.94    1.59   94.84
 13:00 Service  SQL execution                                  6,942.33   83.97   83.97
       Wait     db file sequential read                          375.15    4.54   88.51
       Wait     db file scattered read                           207.74    2.51   91.02
       Service  Recursive SQL execution                          200.86    2.43   93.45
       Wait     enq: TX - row lock contention                       162.83    1.97   95.42
       Wait     log file parallel write                           94.08    1.14   96.56
 14:00 Service  Recursive SQL execution                      241,820.91   99.35   99.35
       Wait     db file scattered read                           574.84    0.24   99.59
       Wait     db file sequential read                          515.74    0.21   99.80
       Wait     log file parallel write                           97.09    0.04   99.84
       Wait     enq: TX - row lock contention                        87.19    0.04   99.88
       Service  Parsing SQL                                       81.25    0.03   99.91
 15:00 Service  SQL execution                                  3,620.32   62.09   62.09
       Service  Recursive SQL execution                        1,210.42   20.76   82.86
       Wait     db file sequential read                          328.60    5.64   88.49
       Wait     db file scattered read                           294.73    5.06   93.55
       Service  Parsing SQL                                       73.42    1.26   94.81
       Wait     enq: TX - row lock contention                        72.03    1.24   96.04
 16:00 Service  SQL execution                                  3,667.78   70.86   70.86
       Wait     db file sequential read                          429.87    8.30   79.16
       Service  Recursive SQL execution                          371.24    7.17   86.33
       Wait     db file scattered read                           215.28    4.16   90.49
       Wait     enq: TX - row lock contention                       144.67    2.79   93.29
       Wait     log file parallel write                          100.81    1.95   95.24
 17:00 Service  SQL execution                                  2,904.54   70.63   70.63
       Wait     db file sequential read                          397.42    9.66   80.29
       Service  Recursive SQL execution                          202.13    4.92   85.21
       Wait     db file scattered read                           138.29    3.36   88.57
       Wait     log file sync                                     94.87    2.31   90.88
       Wait     log file parallel write                           90.94    2.21   93.09
 18:00 Service  SQL execution                                  1,791.86   65.17   65.17
       Wait     db file sequential read                          302.02   10.98   76.15
       Service  Recursive SQL execution                          176.04    6.40   82.56
       Wait     db file scattered read                           151.83    5.52   88.08
       Wait     log file parallel write                           80.44    2.93   91.00
       Wait     log file sync                                     63.46    2.31   93.31
 19:00 Service  SQL execution                                  2,261.88   44.86   44.86
       Wait     db file sequential read                          758.41   15.04   59.91
       Wait     db file scattered read                           540.35   10.72   70.62
       Wait     db file parallel write                           272.36    5.40   76.03
       Wait     log file parallel write                          269.47    5.34   81.37
       Wait     log file sequential read                         154.58    3.07   84.44
 20:00 Service  SQL execution                                  3,517.30   73.29   73.29
       Wait     db file sequential read                          467.34    9.74   83.03
       Wait     db file scattered read                           251.47    5.24   88.27
       Wait     log file parallel write                          161.71    3.37   91.64
       Wait     db file parallel write                            92.73    1.93   93.57
       Wait     Log archive I/O                                   90.24    1.88   95.45
 21:00 Service  SQL execution                                  2,875.04   45.72   45.72
       Wait     Datapump dump file I/O                         1,504.28   23.92   69.64
       Wait     db file sequential read                          873.68   13.89   83.53
       Wait     db file scattered read                           417.53    6.64   90.17
       Wait     log file parallel write                          130.15    2.07   92.24
       Wait     db file parallel write                           106.51    1.69   93.93
 22:00 Wait     row cache lock                                48,097.61   41.83   41.83
       Wait     log file switch (checkpoint incompl           41,594.75   36.17   78.00
       Wait     buffer busy waits                             13,454.89   11.70   89.70
       Service  Recursive SQL execution                        8,958.04    7.79   97.49
       Wait     db file sequential read                        1,485.98    1.29   98.79
       Wait     db file scattered read                           856.65    0.75   99.53
 23:00 Wait     Backup: sbtwrite2                              2,801.73   76.03   76.03
       Wait     Backup: sbtbackup                                352.22    9.56   85.59
       Wait     db file scattered read                           140.16    3.80   89.39
       Wait     Backup: sbtclose2                                 93.25    2.53   91.92
       Wait     db file sequential read                           78.52    2.13   94.05
       Service  SQL execution                                     65.77    1.78   95.84

What do we see here? The graph told us there might be a possible problem between 13:00-15:00 and 21:00-23:00.

First, focus on 13:00-15:00.

 13:00 Service  SQL execution                                  6,942.33   83.97   83.97
       Wait     db file sequential read                          375.15    4.54   88.51
       Wait     db file scattered read                           207.74    2.51   91.02
       Service  Recursive SQL execution                          200.86    2.43   93.45
       Wait     enq: TX - row lock contention                       162.83    1.97   95.42
       Wait     log file parallel write                           94.08    1.14   96.56
 14:00 Service  Recursive SQL execution                      241,820.91   99.35   99.35
       Wait     db file scattered read                           574.84    0.24   99.59
       Wait     db file sequential read                          515.74    0.21   99.80
       Wait     log file parallel write                           97.09    0.04   99.84
       Wait     enq: TX - row lock contention                        87.19    0.04   99.88
       Service  Parsing SQL                                       81.25    0.03   99.91

Second, focus on 21:00-23:00.

 21:00 Service  SQL execution                                  2,875.04   45.72   45.72
       Wait     Datapump dump file I/O                         1,504.28   23.92   69.64
       Wait     db file sequential read                          873.68   13.89   83.53
       Wait     db file scattered read                           417.53    6.64   90.17
       Wait     log file parallel write                          130.15    2.07   92.24
       Wait     db file parallel write                           106.51    1.69   93.93
 22:00 Wait     row cache lock                                48,097.61   41.83   41.83
       Wait     log file switch (checkpoint incompl           41,594.75   36.17   78.00
       Wait     buffer busy waits                             13,454.89   11.70   89.70
       Service  Recursive SQL execution                        8,958.04    7.79   97.49
       Wait     db file sequential read                        1,485.98    1.29   98.79
       Wait     db file scattered read                           856.65    0.75   99.53

They have nothing in common, but they tell us a lot. The first interval spent 99% of its response time on Recursive SQL execution, which means no waiting: it was on CPU doing something we don’t know about yet.
The second one spent 41% of its time on row cache lock and 36% on log file switch. Now we are ready for statspack to see what we can find, but
before that I want to show 2 more scripts whose output can also be obtained from Perfsheet (like DB time).

The first script is Tim Gorman’s sp_evtrends, which shows the trends of events. There are two problems. Sometimes an unnecessarily high number of * characters is shown on the screen, but by comparing it with the seconds waited this can easily be ignored. The other is that if you focus on a single day the * graph can be misleading, as it is with the Excel graphs, so it is better to focus on 5-10 days.

The script asks for the db_name, begin date, end date and the statistic to analyze.

SQL> @sp_evtrends
Please enter the ORACLE_SID value: PROD1
Please enter the report begin date: sysdate-8
Please enter the report end date:sysdate-6

Some useful database statistics to search upon:

Name
------------------------------------------------------------
ARCH random i/o
ARCH sequential i/o
ARCH wait for archivelog lock
	....
	....
	....

Daily trends for "row cache lock"...

What statistic do you want to analyze? row cache lock

Day    Statistic Name                         Secs Waited Percentage of total over all days
------ ------------------------------ ------------------- ------------------------------------------------------------
19-JAN row cache lock                                0.00

20-JAN row cache lock                          480,981.73 ************************************************************

Daily/hourly trends for "row cache lock"...

Day    Hour   Statistic Name                         Secs Waited Percentage of total over all hours for each day
------ ------ ------------------------------ ------------------- ------------------------------------------------------------
19-JAN 00:00  row cache lock                                0.00
       01:00  row cache lock                                0.00
       02:00  row cache lock                                0.00
       03:00  row cache lock                                0.00
       04:00  row cache lock                                0.00
       05:00  row cache lock                                0.00
       06:00  row cache lock                                0.00
       07:00  row cache lock                                0.00
       08:00  row cache lock                                0.00
       09:00  row cache lock                                0.00
       10:00  row cache lock                                0.00
       11:00  row cache lock                                0.00
       12:00  row cache lock                                0.00
       13:00  row cache lock                                0.00
       14:00  row cache lock                                0.00
       15:00  row cache lock                                0.00
       16:00  row cache lock                                0.00
       17:00  row cache lock                                0.00
       18:00  row cache lock                                0.00
       19:00  row cache lock                                0.00
       20:00  row cache lock                                0.00
       21:00  row cache lock                                0.00
       22:00  row cache lock                                0.00 ************************************************************
       23:00  row cache lock                                0.00

20-JAN 00:00  row cache lock                                0.00
       01:00  row cache lock                                0.00
       02:00  row cache lock                                0.00
       03:00  row cache lock                                0.00
       04:00  row cache lock                                0.00
       05:00  row cache lock                                0.00
       06:00  row cache lock                                0.00
       07:00  row cache lock                                0.00
       08:00  row cache lock                                0.00
       09:00  row cache lock                                0.00
       10:00  row cache lock                                0.00
       11:00  row cache lock                                0.00
       12:00  row cache lock                                0.00
       13:00  row cache lock                                0.00
       14:00  row cache lock                                0.00
       15:00  row cache lock                                0.00
       16:00  row cache lock                                5.65
       17:00  row cache lock                                0.00
       18:00  row cache lock                                0.00
       19:00  row cache lock                                0.00
       20:00  row cache lock                                0.00
       21:00  row cache lock                                0.00
       22:00  row cache lock                          480,976.09 ************************************************************
       23:00  row cache lock                                0.00

As you see, it shows the same number of * characters for the 19th, which is wrong when you check it against the time spent, so the abnormality should be spotted by looking at both the time spent and the number of * characters.

The second script is a modified version for the sys_time_model statistics, called sp_sys_time_trends.

SQL> @sp_sys_time_trends
Please enter the ORACLE_SID value: PROD1
Please enter the report begin date: sysdate-8
Please enter the report end date:sysdate-6

Some useful database statistics to search upon:

Name
------------------------------------------------------------
DB CPU
DB time
Java execution elapsed time
PL/SQL compilation elapsed time
PL/SQL execution elapsed time
RMAN cpu time (backup/restore)
background cpu time
background elapsed time
connection management call elapsed time
failed parse (out of shared memory) elapsed time
failed parse elapsed time
hard parse (bind mismatch) elapsed time
hard parse (sharing criteria) elapsed time
hard parse elapsed time
inbound PL/SQL rpc elapsed time
parse time elapsed
repeated bind elapsed time
sequence load elapsed time
sql execute elapsed time
What statistic do you want to analyze? DB time

Daily trends for "DB time"...

Day    Statistic Name                      VALUE Percentage of total over all days
------ ------------------------------ ---------- ------------------------------------------------------------
19-JAN DB time                        1408469.51 **************

20-JAN DB time                        4797565.14 **********************************************

Daily/hourly trends for "DB time"...

Day    Hour   Statistic Name                      VALUE Percentage of total over all hours for each day
------ ------ ------------------------------ ---------- ------------------------------------------------------------
19-JAN 00:00  DB time                        18267.2492 *
01:00  DB time                        57907.3551 **
02:00  DB time                        64889.8266 ***
03:00  DB time                        37988.8748 **
04:00  DB time                        39093.3842 **
05:00  DB time                        39267.5194 **
06:00  DB time                        56103.5928 **
07:00  DB time                        43312.1147 **
08:00  DB time                        50542.4821 **
09:00  DB time                        62883.2236 ***
10:00  DB time                        89721.9315 ****
11:00  DB time                        86621.3671 ****
12:00  DB time                        78163.5806 ***
13:00  DB time                        64324.9537 ***
14:00  DB time                        78970.2241 ***
15:00  DB time                        78037.7742 ***
16:00  DB time                        78215.1336 ***
17:00  DB time                          69875.99 ***
18:00  DB time                        61075.9801 ***
19:00  DB time                        47658.5181 **
20:00  DB time                         41372.216 **
21:00  DB time                        57569.6918 **
22:00  DB time                        65593.3167 ***
23:00  DB time                        41013.2088 **

20-JAN 00:00  DB time                        39789.4329
01:00  DB time                        76274.4764 *
02:00  DB time                        39564.2615
03:00  DB time                        37327.7534
04:00  DB time                        38174.4725
05:00  DB time                         38335.113
06:00  DB time                        55688.7588 *
07:00  DB time                        39039.2476
08:00  DB time                        45155.1028 *
09:00  DB time                        62292.3892 *
10:00  DB time                        73688.6751 *
11:00  DB time                        86222.6032 *
12:00  DB time                        79400.9817 *
13:00  DB time                        81021.4711 *
14:00  DB time                        2498694.98 *******************************
15:00  DB time                        57824.5047 *
16:00  DB time                        50252.1229 *
17:00  DB time                        39474.4176
18:00  DB time                         25728.465
19:00  DB time                         40950.006 *
20:00  DB time                        45258.7127 *
21:00  DB time                        66186.8379 *
22:00  DB time                        1177471.46 ***************
23:00  DB time                        3748.88236

As you can spot, just like in the Excel graph, we have increased DB time in the 14:00-15:00 and 22:00-23:00 periods.

Now it is time to get the statspack report and dig further.
First I want to focus on the row cache problem; if I have space I will talk about the recursive SQL problem, maybe as a second part.
We normally collect snapshots every 30 minutes, but the available snaps are as below.

                          Snap
Snap Id   Snap Started    Level
8135 20 Jan 2010 21:05     7
8136 20 Jan 2010 21:35     7
8137 20 Jan 2010 22:28     7
8138 20 Jan 2010 22:58     7

If you look carefully you will see that the snap at 22:00 is missing. There must have been something wrong on the DB which prevented statspack
from gathering it, but let’s assume that you could not spot it for now, as I couldn’t – a deadly mistake that increased the troubleshooting time.
I generated the statspack report between 21:35 and 22:28.
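For completeness, the report itself comes from the standard spreport.sql. As far as I remember, defining the snap ids and report name up front skips the interactive prompts on 10.2, roughly like this:

define begin_snap=8136
define end_snap=8137
define report_name=sp_8136_8137.txt
@?/rdbms/admin/spreport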

Here is the statspack summary (the full statspack report is here; the SQLs are edited a bit to preserve privacy).

STATSPACK report for

Database    DB Id    Instance     Inst Num Startup Time    Release     RAC
~~~~~~~~ ----------- ------------ -------- --------------- ----------- ---
          1814528001 PROD1               1 30-Dec-09 01:59 10.2.0.4.0  NO

Host Name: MACHINE1 Num CPUs: 8 Phys Memory (MB): 32,763
~~~~

Snapshot       Snap Id     Snap Time      Sessions Curs/Sess Comment
~~~~~~~~    ---------- ------------------ -------- --------- -------------------
Begin Snap:       8136 20-Jan-10 21:35:59      128       7.2
  End Snap:       8137 20-Jan-10 22:28:50      189       5.4
   Elapsed:               52.85 (mins)

Cache Sizes                       Begin        End
~~~~~~~~~~~                  ---------- ----------
               Buffer Cache:     4,128M     4,032M  Std Block Size:         8K
           Shared Pool Size:     6,032M     6,128M      Log Buffer:    13,913K

Load Profile                            Per Second       Per Transaction
~~~~~~~~~~~~                       ---------------       ---------------
                  Redo size:            126,609.46            124,954.43
              Logical reads:             10,300.31             10,165.66
              Block changes:                904.39                892.57
             Physical reads:                375.25                370.34
            Physical writes:                 17.54                 17.31
                 User calls:                 15.88                 15.67
                     Parses:                  9.22                  9.09
                Hard parses:                  1.08                  1.06
                      Sorts:                 60.58                 59.79
                     Logons:                  0.13                  0.13
                   Executes:                327.42                323.14
               Transactions:                  1.01

  % Blocks changed per Read:    8.78    Recursive Call %:    98.49
 Rollback per transaction %:    1.28       Rows per Sort:    28.80

Instance Efficiency Percentages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            Buffer Nowait %:   99.96       Redo NoWait %:   95.11
            Buffer  Hit   %:   99.73    In-memory Sort %:  100.00
            Library Hit   %:   99.10        Soft Parse %:   88.32
         Execute to Parse %:   97.19         Latch Hit %:   99.91
Parse CPU to Parse Elapsd %:   64.77    % Non-Parse CPU:   99.51

 Shared Pool Statistics        Begin   End
                               ------  ------
             Memory Usage %:   74.42   72.87
    % SQL with executions>1:   33.97   33.40
  % Memory for SQL w/exec>1:   56.11   55.38

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
row cache lock                                  16,524      48,098   2911   45.6
log file switch (checkpoint incomplete)         68,137      41,595    610   39.5
buffer busy waits                               13,462      13,455    999   12.8
CPU time                                                     1,472           1.4
db file sequential read                        723,098         446      1     .4
          -------------------------------------------------------------
Host CPU  (CPUs: 8)
~~~~~~~~              Load Average
                      Begin     End      User  System    Idle     WIO     WCPU
                    ------- -------   ------- ------- ------- ------- --------
                                         3.69    2.12   94.19

Instance CPU
~~~~~~~~~~~~
              % of total CPU for Instance:    3.30
              % of busy  CPU for Instance:   56.81
  %DB time waiting for CPU - Resource Mgr:

Memory Statistics                       Begin          End
~~~~~~~~~~~~~~~~~                ------------ ------------
                  Host Mem (MB):     32,762.6     32,762.6
                   SGA use (MB):     10,240.0     10,240.0
                   PGA use (MB):        776.7        787.4
    % Host Mem used for SGA+PGA:         33.6         33.7
          -------------------------------------------------------------

Time Model System Stats DB/Inst: PROD1/prod1 Snaps: 8136-8137
-> Ordered by % of DB time desc, Statistic name

Statistic                                       Time (s) % of DB time
----------------------------------- -------------------- ------------
sql execute elapsed time                       101,089.5        100.0
DB CPU                                             821.2           .8
PL/SQL execution elapsed time                      620.9           .6
parse time elapsed                                  22.4           .0
hard parse elapsed time                             21.2           .0
PL/SQL compilation elapsed time                      1.7           .0
connection management call elapsed                   0.2           .0
hard parse (sharing criteria) elaps                  0.1           .0
repeated bind elapsed time                           0.0           .0
hard parse (bind mismatch) elapsed                   0.0           .0
failed parse elapsed time                            0.0           .0
sequence load elapsed time                           0.0           .0
DB time                                        101,090.9
background elapsed time                          6,382.7
background cpu time                                 16.6

Wait Events DB/Inst: PROD1/prod1 Snaps: 8380-8382
-> s - second, cs - centisecond, ms - millisecond, us - microsecond
-> %Timeouts: value of 0 indicates value was < .5%. Value of null is truly 0
-> Only events with Total Wait Time (s) >= .001 are shown
-> ordered by Total Wait Time desc, Waits desc (idle events last)

                                                                    Avg
                                                %Time Total Wait   wait    Waits
Event                                    Waits  -outs   Time (s)   (ms)     /txn
--------------------------------- ------------ ------ ---------- ------ --------
row cache lock                          16,524     97     48,098   2911      5.1
log file switch (checkpoint incom       68,137     61     41,595    610     21.2
buffer busy waits                       13,462    100     13,455    999      4.2
..
..
..

Wait Event Histogram DB/Inst: PROD1/prod1 Snaps: 8136-8137
-> Total Waits - units: K is 1000, M is 1000000, G is 1000000000
-> % of Waits - column heading: <=1s is truly <1024ms, >1s is truly >=1024ms
-> % of Waits - value: .0 indicates value was <.05%, null is truly 0
-> Ordered by Event (idle events last)

                           Total ----------------- % of Waits ------------------
Event                      Waits  <1ms  <2ms  <4ms  <8ms <16ms <32ms  <=1s   >1s
-------------------------- ----- ----- ----- ----- ----- ----- ----- ----- -----
row cache lock               16K   2.9                                  .0  97.1
log file switch (checkpoin   66K  21.0   2.7  12.8   2.1    .3        61.1
buffer busy waits            13K    .0                            .0 100.0

As you see, the average wait time for row cache lock is nearly 3 seconds, with 97% of the waits over 1 second, so it looks like this is the problem.

We have already spotted row cache lock, but where do we look next?

The documentation says:

row cache lock

The session is trying to get a data dictionary lock. 

Wait Time: Wait up to 60 seconds.

Parameter	Description
cache id	The CACHE# column value in the V$ROWCACHE view
mode	See "mode"
request	The pipe timer set by the user
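Had I caught the problem live, the cache id parameter (P1 of the wait) maps straight to CACHE# in V$ROWCACHE, so a quick look like the sketch below would have shown which dictionary cache the sessions were fighting over; after the fact, all I have is the statspack section.

-- Which dictionary cache is suffering? (cumulative figures since startup)
select cache#, parameter, gets, getmisses, modifications
from   v$rowcache
order  by getmisses desc;

-- Or, for sessions currently stuck on the wait:
select p1 as cache#, count(*) as sessions
from   v$session_wait
where  event = 'row cache lock'
group  by p1;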

In the statspack report we have a section called “Dictionary Cache Stats”. Let’s check what we have there.

Dictionary Cache Stats DB/Inst: PROD1/prod1 Snaps: 8136-8137
->"Pct Misses" should be very low (< 2% in most cases)
->"Final Usage" is the number of cache entries being used in End Snapshot

                                   Get    Pct    Scan   Pct      Mod      Final
Cache                         Requests   Miss    Reqs  Miss     Reqs      Usage
------------------------- ------------ ------ ------- ----- -------- ----------
dc_awr_control                      18    0.0       0              0          1
dc_database_links                  107    0.0       0              0          1
dc_files                            13    0.0       0              0         13
dc_global_oids                  21,571    0.1       0              2        255
dc_histogram_data               73,396    2.2       0              0      3,904
dc_histogram_defs              279,541    1.4       0              0      7,652
dc_object_grants                   102   11.8       0              0        124
dc_object_ids                1,050,311    0.1       0              5      2,879
dc_objects                      14,052    1.3       0             31      2,605
dc_profiles                         74    0.0       0              0          1
dc_rollback_segments             4,147    0.0       0             64        300
dc_segments                     50,349    0.2       0            121      2,415
dc_sequences                        10   10.0       0             10         13
dc_table_scns                        1  100.0       0              0          0
dc_tablespace_quotas               289    0.0       0              0          9
dc_tablespaces                  66,671    0.0       0              0         11
dc_usernames                    18,888    0.1       0              0         53
dc_users                        99,804    0.0       0              0         71
outstanding_alerts                  26   15.4       0              4         24
          -------------------------------------------------------------

From my old shared pool post I remember that Pct Miss is important for dictionary cache tuning.
I did the comparison with normal times, and the only thing I spotted was an increase in Pct Miss for
dc_object_grants and dc_sequences, with fewer get requests; but without specific session activity this is as far as
I can go with this event (a rough sketch of how that comparison can be scripted is below). So I moved on to the log file switch event to see if I could find something.
Definition **** log file switch (checkpoint incomplete)
The log file switch wait indicates that the process is waiting for the log file switch to complete,
but the log file switch is not possible because the checkpoint process for that log file
has not completed. You may see this event when the redo log files are sized too small.
Wait Parameters Wait parameters are not used for log file switch (checkpoint incomplete).
Wait Time One second
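Since the definition points at undersized redo logs, a first sanity check is simply the current log configuration (nothing interval-based, just what the instance is running with):

select group#, thread#, bytes/1024/1024 as size_mb, members, status
from   v$log
order  by group#;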

This is an event we see occasionally, but not with the amount of redo we generated in that time interval. Let’s see how many log switches were done that day.

SQL> @gen_redo

DAY            COUNT#       MIN#       MAX# DAILY_AVG_MB
---------- ---------- ---------- ---------- ------------
2010-01-26        790      60361      61150        50560
2010-01-25        722      59639      60360        46208
2010-01-24        469      59170      59638        30016
2010-01-23        291      58879      59169        18624
2010-01-22        651      58228      58878        41664
2010-01-21        394      57834      58227        25216
2010-01-20        473      57361      57833        30272

10 rows selected.

Session altered.

DAY        00    01    02    03    04    05    06    07    08    09    10    11    12    13    14    15    16    17    18    19    20    21    22    23
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
20/01/2010    0    50     0     0     0     1    56     0     1     9    17    43    37    27     8     8    14    12     9   101    57    15     7     1
21/01/2010    0    52     0     1     0     0    56     0     1     6    13    12    12    12    17    10     4    11     6   100    58    18     4     1
22/01/2010    0    51     0     0     1     0    58     0     1     3    11    27    14    18    82    25    17    60   104   100    57    18     4     0
23/01/2010    1    51     2     1     0     0    54     1     0     0     0     0     1     0     0     0     1     0     0   100    57    18     3     1
24/01/2010    0    52     1     3     3     3    67     8     7     1     0     1     0     0   136    53    16     1     0    76    20    18     1     2
25/01/2010    0    53     0     0     0     1    59     1     1   120    45    89    40     4    80    17    16    78     6    90    11     8     3     0
26/01/2010    0    53    12    39    13     0    57     0     1    40    28    63    38   142     4    99    59     5    17    82    16    19     2     1

8 rows selected.

It looks like nothing specific happened; the number of log switches looks innocent.
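I am not posting gen_redo itself, but a simplified per-hour switch count can be pulled straight from V$LOG_HISTORY, roughly like this (limited to whatever history the controlfile still retains):

select to_char(first_time, 'YYYY-MM-DD') as day,
       to_char(first_time, 'HH24')       as hour,
       count(*)                          as switches
from   v$log_history
where  first_time >= trunc(sysdate) - 7
group  by to_char(first_time, 'YYYY-MM-DD'), to_char(first_time, 'HH24')
order  by 1, 2;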

What is next? Buffer busy waits look like the third event; they had little effect, but they might still tell us something.

Let’s check the trends.

SQL> @sp_evtrends
Please enter the ORACLE_SID value: PROD1
Please enter the report begin date: sysdate-8
Please enter the report end date:sysdate

Daily trends for "buffer busy waits"...

Day    Statistic Name                         Secs Waited Percentage of total over all days
------ ------------------------------ ------------------- ------------------------------------------------------------
19-JAN buffer busy waits                           127.95

20-JAN buffer busy waits                       134,742.54 **********************************************************

21-JAN buffer busy waits                           205.92

22-JAN buffer busy waits                           553.43

23-JAN buffer busy waits                             0.31

24-JAN buffer busy waits                            32.49

25-JAN buffer busy waits                         1,449.90 *

26-JAN buffer busy waits                         1,131.49

Daily/hourly trends for "DB time"... (truncated)

Day    Hour   Statistic Name                      VALUE Percentage of total over all hours for each day
------ ------ ------------------------------ ---------- ------------------------------------------------------------

20-JAN 00:00  buffer busy waits                             0.00
       01:00  buffer busy waits                             0.00
       02:00  buffer busy waits                             0.00
       03:00  buffer busy waits                             0.00
       04:00  buffer busy waits                             0.00
       05:00  buffer busy waits                             0.00
       06:00  buffer busy waits                             0.00
       07:00  buffer busy waits                             0.00
       08:00  buffer busy waits                             0.00
       09:00  buffer busy waits                             4.86
       10:00  buffer busy waits                             6.03
       11:00  buffer busy waits                            43.35
       12:00  buffer busy waits                            24.02
       13:00  buffer busy waits                            20.27
       14:00  buffer busy waits                            16.45
       15:00  buffer busy waits                            24.45
       16:00  buffer busy waits                            27.31
       17:00  buffer busy waits                            23.86
       18:00  buffer busy waits                             2.78
       19:00  buffer busy waits                             0.24
       20:00  buffer busy waits                             0.00
       21:00  buffer busy waits                             0.00
       22:00  buffer busy waits                       134,548.91 ************************************************************
       23:00  buffer busy waits                             0.00

It looks like our DB normally doesn’t spend time on this event, which is good.

We have a nice section in the statspack report for buffer busy waits.


Segments by Buffer Busy Waits DB/Inst: PROD1/prod1 Snaps: 8136-8137
-> End Segment Buffer Busy Waits Threshold:       100

                                                                    Buffer
                                           Subobject    Obj.          Busy   Pct
Owner      Tablespace Object Name              Name     Type         Waits Total
---------- ---------- -------------------- ------------ ----- ------------ -----
APP_USER   USERS      ULG                               TABLE        9,560 100.0
APP_USER   USERS      PK_ULG                            INDEX            1    .0
          -------------------------------------------------------------

All of the waits happened on the same object. Before going further, I checked the trend with another Tim Gorman script, sp_buffer_busy (again edited for the time interval, and to output only segments with buffer busy waits > 0).

I won’t post the output, but it says this table showed up only 3 more times, with a wait total of less than 30, over a 10-day period. If you check the actual statspack report for SQLs,
the ULG table does not show up in any of them, which again is a bit strange.
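If the instance has been up across the period of interest, a cumulative cross-check (figures since startup only, so far rougher than the statspack deltas) can also be done against V$SEGMENT_STATISTICS:

select owner, object_name, subobject_name, object_type, value as buffer_busy_waits
from   v$segment_statistics
where  statistic_name = 'buffer busy waits'
and    value > 0
order  by value desc;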

Maybe I can find something else if I get the report for the next half-hour period:

                          Snap
Snap Id   Snap Started    Level
8137 20 Jan 2010 22:28     7
8138 20 Jan 2010 22:58     7

From the top 5 timed events:

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
db file sequential read                        613,887       1,040      2   42.0
db file scattered read                         266,197         853      3   34.5
CPU time                                                       418          16.9
db file parallel write                          45,244          86      2    3.5
log file parallel write                          4,449          26      6    1.1
          -------------------------------------------------------------

Our problem disappeared from the top 5 list and DB time = 16,656.2, which is very small compared to the previous value of 101,090.9.
The problem is gone, but what actually happened?

I am pretty sure that there wasn’t any change on that day. The backup ran successfully after 23:00 and the export finished before 9:30, so what can this be?

I know the answer, but whoever comes up with a good idea, please be my guest and comment. If you are good enough, I promise to buy beer.

The answer will be here as another post on Friday. If I have time I will also show what happened during the other DB time peak on the same day.

Note:
With this post I did not re-invent the wheel; I tried to show you the freely available tools you can use and modify for your own fast troubleshooting. As Tanel says, right tool, right time. These guys share their tools to help you; my advice is to use them when you need them.
My script library consists of many scripts from Tanel Poder, Riyaj Shamsudeen, Tim Gorman, Kerry Osborne, Dion Cho, Craig Shallahamer and many others, plus my own.
Thank you very much to everyone who shares their tools with us.

How much time did I spend on diagnosis? Everything on this page took 10 minutes in total to run. The rest is your experience and your skill at searching metalink and the internet.
For me, getting to this last point took only 1.5 hours.

List of scripts I used;

September 10, 2009

Oracle Performance Firefighting by Craig Shallahamer

Filed under: Book Reviews, Diary, Performance — coskan @ 3:42 pm

Today I am going to write about an amazing book, Oracle Performance Firefighting by Craig Shallahamer, which was released in July 2009. After all the papers I had read from Craig, my fingers were crossed for the release date and I ordered the book for our company immediately after release. The reason I was so excited about the book is that I have always felt uncomfortable with overall system performance tuning, and this is the area Craig is very good at.

During the review, I will try to cover what this book promises and delivers by going through all the chapters one by one. Overall, this book is not about SQL tuning. It is about how to find the problem and about possible solutions to common problems (in addition to tuning your code as a first option), by explaining how Oracle works in the most problematic areas like buffer cache, shared pool, redo, undo and especially latches and mutexes.

Before starting to go through the chapters, I want to say that the book is very well organized and the way Craig teaches avoids boredom with numbers and math. I could not put the book down, even on holiday. The way he covers very hard to understand topics makes you curious about the next one. One more thing: the use of the word OraPub (his company) is kept to a minimum. While reading Optimizing Oracle Performance by Cary Millsap and Jeffrey Holt, I really hated the word Hotsos because it was on nearly every page, which was very annoying. If you are curious, that does not happen much in this book.

Let’s start going through the chapters (chapters with * are the ones that really need attention).

1- Methods and madness: This is the introduction to how to be a firefighter from both a methodic and a holistic perspective. Craig explains what you need during Oracle firefighting and how you can build a methodology. He covers the basics of analysing the system as a whole under the name of the OraPub 3 Circle Analysis. When I first read about the OraPub 3 Circle Analysis I did not expect much, but after a few chapters I admit that it works quite well. The most interesting topic for me was how to write the summary of your firefighting story. That is something I need to work on.

2- Listening to Oracle’s Pain: This chapter explains how to understand what Oracle is crying about by using the Oracle Wait Interface and ORTA (Oracle response time analysis). It covers how Oracle kernel developers instrumented their code to create the Oracle Wait Interface. Most of it wasn’t new to me, but it was still nice to revisit.

3- Serialization Control (*): This chapter is where the book starts to rock at explaining what is going on in Oracle, by telling how latches, mutexes and locks work and differ from each other. After finishing this chapter I believed the book was already worth the money I paid, and there were 6 more chapters to go. Craig definitely knows how Oracle works, and better than just knowing, he knows how to explain it, like latches for dummies. It is really hard not to remember or not to understand what he is explaining.

4- Identifying and Understanding Operating System Contention: This chapter covers how to use unix/linux tools to gather information from the operating system for the operating system circle of the OraPub 3 Circle Analysis. Apart from the tools, it is very good on how to talk to and convince network, OS and storage guys in case of a problem without getting on their nerves. I particularly liked the wording Craig uses, like “there is memory pressure” instead of “paging” :)

5- Oracle Performance Diagnosis: This chapter goes deep into how to gather overall performance related data from Oracle and how to interpret the data you gathered without falling into compulsive tuning disorder. He covers the wait event myths section, which was very interesting, and he tries to explain why profiling is not “always” the best approach. Making the most out of DBMS_MONITOR is also covered well. Craig also gives internals about ASH (Active Session History): how it works, why it is good and how to use it. Craig also shares a simple trick for learning Oracle internals, which is reading Oracle patents; search for oracle at (http://www.faqs.org/patents/asnl/op).

6- Oracle Buffer Cache Internals (*): Until I read this book, I never dreamed that one day I would really, fully understand and be able to explain to others how the Oracle buffer cache, shared pool and latches work, despite all the official documentation and the books I had read so far. This and the following 2 chapters gave me that opportunity, and I want to thank Craig very much for explaining these topics this well. The diagrams and the way he explains them are totally amazing. The chapter is not just about explaining the buffer cache, it also covers how to tune it. Knowing the internals of the buffer cache can be the only reason you need to buy the book, because it is nearly the main thing for optimal performance. I especially liked the latch and enqueue subtopics.

7- Oracle Shared Pool Internals (*): This is another reason that makes this book unique. I had already written my own compilation about the shared pool in a blog post, but this chapter taught me many more things, like latches and mutexes in the shared pool, in-memory undo, and how to tune them. This chapter is also a must for proper performance firefighting.

8- Oracle Redo Management Internals (*): In this chapter Craig explains how redo management works in Oracle and how to tune redo related issues, which is again a very important performance topic. He goes over every redo related problem and gives possible solutions. Learning about the dangerous commit_write parameter was completely new to me.

9- Oracle Performance Analysis (*): In this chapter every topic so far is integrated into a possible problem scenario. Before going into the scenario, Craig covers response, service and queue time calculations and prepares us for it. He then starts an overall system performance tuning exercise with the OraPub 3 Circle Analysis by analysing Oracle, OS and application together. He goes through the analysis three times, and this gives the reader clues about when and where to stop (or not to stop) tuning. This chapter depends heavily on math, and to be honest it is really easy to understand.

Now it is time to talk about what I don’t like about the book. Craig introduces and explains too many internal parameters and gives them as an option in case there is a problem. I loved learning them, because they are very helpful when you talk with Oracle support and are part of understanding Oracle, but I would prefer he said more about not using them before asking Oracle. He says not to use them a couple of times, but that is still not enough for me. The second thing is that we don’t have the option to download the test case code Craig mentions, apart from the OraPub monitoring kit. Third, there is no index at the end of the book. Although I don’t use indexes that much, it is still nice to have one. One last thing: because it is only available from OraPub, you might wait a bit if you are ordering from outside the US like me. I wish there was an e-book option, which I always find easy, but probably it is to avoid piracy.

After the negative part, let’s come to the conclusion. This book is very informative, very well written and, in my humble opinion, a must on every DBA’s desk, and I suggest it to every DBA. If you are experienced, have read many performance tuning books and feel comfortable with performance tuning, but still struggle to understand latches, mutexes, buffer cache, shared pool or overall performance tuning, and you want to fully understand them, Oracle Performance Firefighting is totally for you. If you are new to Oracle performance tuning, this book is a must for understanding how Oracle works, but please be careful about hidden/unsupported parameters.

My only regret is not ordering 2 copies, one for the company and one for me. :) I hope you will like the book as much as I did.

Many thanks to Craig Shallahamer for bringing together deep knowledge and experience about Oracle performance tuning and sharing it in this very special book.

May 20, 2009

TOP by Christian Antognini

Filed under: Book Reviews, Diary, Performance — coskan @ 4:29 pm

I finally finished “reading” Troubleshooting Oracle Performance by Christian Antognini.

The book has already been reviewed in blog posts by many other bloggers like Cary Millsap, Jonathan Lewis (they also wrote forewords for the book), Jason Arneil and Tonguc Yilmaz.

Now it is my turn to say something about this invaluable book.

First of all, this review is based on what I read, because I read the book mostly on the train and I have not yet tried 90% of what is written in it. After I go over the book again and try all the scripts, I will write more posts about it, in what is going to be a “what I learned from TOP by Christian Antognini” series.

The book is very well structured and ordered; you don’t lose yourself in references to upcoming topics that much. It is not written by a native English speaker, so you don’t struggle with the language; it is as simple and easy to understand as it can be. Everything he explains is supported by a proof. The best part of this book, compared with the other two performance related bibles (Cost Based Oracle Fundamentals and Optimizing Oracle Performance, which are also very much recommended), is that you never get lost in the math. I graduated as a Computer Engineer, but sometimes I really struggle to understand where all those numbers come from when I look at the other books. The way Christian explains the math in TOP is very understandable, and there is no way you can complain about too much math in this book.

Apart from all these general readability points, TOP gives you a method for designing a better application and a method for solving most of the common problems on Oracle databases. It covers all available supported Oracle versions, so you also get the chance to see the differences in Oracle troubleshooting and optimal design from 9i to 11g.

What TOP doesn’t give is magic. If you expect magical internal ways of troubleshooting Oracle, this book is not right for you. You need to look at other definitive books :)

Christian covers nearly everything. The only problem is that he doesn’t give a step by step approach like Tanel does. It would be excellent if Christian added one more chapter giving a step by step approach in case of a problem. Although the last chapter does something similar, I still think it is a bit small.

I have a dream that Christian and Tanel work on a book together :) It would be great to have a book which covers nearly everything you can face while troubleshooting Oracle. Coincidentally their masterclasses at UKOUG were one after the other, and while reading this book I attended Tanel’s class, so I found the chance to compare them twice. My conclusion is that their work deserves to be combined. Just feedback from the audience :)

One last word about this book: it is an excellent piece of work. I recommend it to everyone who wants to learn about troubleshooting Oracle performance and optimal Oracle design for better performance.

Many thanks go to Christian for sharing his work with the community.

April 30, 2009

What I learned from AOT by Tanel Poder

Filed under: Basics, Blogroll, Diary, Performance — coskan @ 8:28 pm

People who know me personally also know that I am a very big fan of Tanel Poder and his work. I have been following him since he started blogging in 2007 and, after reading all his stuff, at the UKOUG 2008 Conference I found the chance to attend his masterclass on Advanced Oracle Troubleshooting. When I saw how he uses the things he writes about on his blog, my whole approach to troubleshooting Oracle changed, but something was still missing. The blog and the masterclass session weren’t enough for me; I wanted more and more every time I read a post from him. God must have heard my wishes and sent him to Edinburgh for his Advanced Oracle Troubleshooting seminar, which gave me the chance to fill in the missing part.

After nearly 8 Oracle trainings (3 of them celebrity seminars), I can personally say: if you love Oracle, if you love performance tuning and troubleshooting, if you are looking for a troubleshooting method, if you want to learn something systematic and also practical, this is the best one you can attend to find what you are looking for. What makes this training unique is not just his methodology and knowledge, it is also his teaching skill. I don’t really enjoy sessions by instructors who are strict about their agenda and don’t allow questions before they finish the topic, but Tanel was always open to questions, and I think he is aware that off-topic questions can lead into an upcoming topic, so he doesn’t say “I will cover it in a later chapter” (which might then not be covered because of the time limit); he answers them all. I asked as many questions as I could and none of them remained unanswered, and not just by talk but also by proof. Maybe he couldn’t completely finish all the chapters, but he covered nearly all of them in the middle of other chapters by allowing questions. To get more benefit from this training, I strongly recommend you read all the posts on his blog and try them before you come. Believe me, you will focus more and benefit as much as I did. One more confession: this was the only training in which I never lost concentration :)

After all this voluntary marketing, let me tell you tiny bits of what I learned from this young (it was a big shock to hear that his IT experience is nearly 18 years at his age; I think that’s why Jonathan Lewis said he saw his own youth in him :) ), 100% geek and genius evangelist.

* If the session state is WAITING it means waiting; otherwise, no matter what is written in the state column of v$session, it is on the CPU. As simple as that. (sw script)

* In an ideal world, if the session is running, session stats (snapper script) will show you some increasing stats.

* For short snapping intervals with the snapper script, DB time and DB CPU might show unrealistic values; the ideal is 5 or more, I think, not less.

* Oracle does not immediately write a line to an event trace; it does it when the call finishes.

* On Solaris, kill -STOP SPID suspends the process without killing it, and you can resume it again with prun. Check the manual page for your operating system.

* Use wireshark, snoop (Solaris) or tcpdump (Linux) to check network related issues.

* Oracle might lie about a process but the OS won’t. What the OS says in the process stack is always the most reliable.

* sed.sql is a simple but life- and time-saving script to check the P1, P2, P3 value definitions of a wait event.

*procmon is a life saver tool for dodgy errors with “unknown error” message on Microsoft OS.

* In a dump trace file you can ignore the function calls at the top of the stack; they are related to the dumping process itself.

* If you see the sqlid and sqltext but no plan for a running sql, don’t be shocked; there can be an ongoing hard parse.

* In vmstat on unix/solaris, the sr (scan rate) column needs to be 0 for a healthy system.

* On linux, swappiness is an important memory setting which isn’t officially referenced in online documents or metalink.

* It is good practice to take a dump of a process before you kill it, for finding the problem later.

* There is a hang analysis tool in EM. It had never caught my attention.

* v$wait_chains is similar to a hang analyze dump, but it is available only from 11g onwards.

* Finding latch related problems is not a big issue any more after latchprof and latchprofx. I won’t say anything here; everything is on his blog.

* The root block is the next block after the header block; useful info for cache buffers chains latch issues. (It might be a leaf block on very tiny tables.)

* Quest Spotlight has a session _spin_count autotuner in it (see Tanel’s comment below).

* To drink more by working less, you should be practical and script the things you use most, in addition to his scripts.

*there is a tool called trcsess which can be used to bring tracefiles together.

* He also says it is pointless to check statspack for something that is still going on.
I have personally used statspack/awr only 3 times, and one of those was at a job interview. Some stupid guy, before doing an rman online backup of a 24/7 system, tried to test me with what is wrong with that statspack report. It was totally crap.
To be honest, purely or initially statspack-oriented problem solving is something like “guess from your .ss” type problem solving. Thanks Tanel for this term :) His perfsheet tool makes statspack more useful for checking “what happened last week” type problems on a graphical interface. Having your own ASH, if you don’t have Diagnostic Pack, is far better than getting lost in system wide reports.

* If you can’t sleep on the plane, sleep enough before the flight, fly during office hours and work during the flight. A systematic approach from a man who flies nearly twice a week.

* If you are googling about Oracle internal functions, add .ch or .ru domains for some better results (yes, you need google translate to understand the pages).

* BEST ONE: I asked him how he learns all this interesting stuff. He gave me a perfect clue: sign up for the metalink headlines mailing list. Bugs, docs, they are all there as soon as they are created. Perfect resource.

There are two things about his approach I don’t agree with. First, I want to see the SQL first for a session I am troubleshooting. If I know the system this gives me a basic idea; it can also lead the wrong way, but the SQL is still worth seeing before starting the other steps.
The “write one script that runs on all versions” idea is another thing I am not very keen to follow. I prefer to have 3 different script homes to benefit from the new columns added in each version. The only thing I need to do is call the right home in a bat file before I connect.

These were just my tiny highlights from a big training. If you need more, just attend his class; you won’t regret it.

I heard that he will come to London for a seminar, possibly after UKOUG 2009. If I were you, I wouldn’t miss it.

Thank you very much Tanel for sharing your knowledge with us. You are the rising sun over Oracle Performance World.

One special thanks goes to Thomas C. Presslie from PISEC, who was the organizer of this event. The venue, the environment, the whole training package and his hospitality were wonderful. I wish I could come to the upcoming ones :(

March 23, 2009

Query Result Cache on Encrypted Tablespace on 11G

Filed under: Performance — coskan @ 10:45 am

Question: Will the query result cache work on encrypted tablespaces?

I asked this question to Tom Kyte after his All About Encryption presentation at UKOUG 2008. His answer was ” I think it will work but you need to try”

After 4 months, here is the answer to my question;

 
SQL> ---Encrypted tablespace creation
SQL> CREATE TABLESPACE encryptedtst
  2  DATAFILE 'C:\ORACLE\PRODUCT\ORADATA\ORACOS\encryptedtst_01.dbf' SIZE 128K
  3  AUTOEXTEND ON NEXT 64K
  4  ENCRYPTION USING 'AES256'
  5  DEFAULT STORAGE(ENCRYPT);

Tablespace created.

SQL> connect hr/hr;
Connected.

SQL> alter table employees enable row movement;

Table altered.

SQL> alter table employees
  2  move tablespace encryptedtst;

Table altered.

SQL> select manager_id,sum(salary) from employees group by manager_id;

MANAGER_ID SUM(SALARY)
---------- -----------
       100      155400
                 27000
       123       25900
       120       22100
	..	...

19 rows selected.

Elapsed: 00:00:00.04

Execution Plan
----------------------------------------------------------
Plan hash value: 1192169904

--------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |    18 |   144 |     4  (25)| 00:00:01 |
|   1 |  HASH GROUP BY     |           |    18 |   144 |     4  (25)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| EMPLOYEES |   107 |   856 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------

SQL> select /*+ result_cache*/manager_id,sum(salary) from employees group by manager_id;

MANAGER_ID SUM(SALARY)
---------- -----------
       100      155400
                 27000
       123       25900
       120       22100
	..	...

19 rows selected.

Elapsed: 00:00:00.06

Execution Plan
----------------------------------------------------------
Plan hash value: 1192169904

--------------------------------------------------------------------------------------------------
| Id  | Operation           | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |                            |    18 |   144 |     4  (25)| 00:00:01 |
|   1 |  RESULT CACHE       | 1u45jnmw74kd60brbww9gr7ss8 |       |       |            |          |
|   2 |   HASH GROUP BY     |                            |    18 |   144 |     4  (25)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| EMPLOYEES                  |   107 |   856 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------

Result Cache Information (identified by operation id):
------------------------------------------------------

   1 - column-count=2; dependencies=(HR.EMPLOYEES); name="select /*+ result_cache*/ manager_id,sum(salary) from employees group by manager_id"

SQL> select /*+ result_cache*/ manager_id,sum(salary) from employees group by manager_id;

MANAGER_ID SUM(SALARY)
---------- -----------
       100      155400
                 27000
       123       25900
       120       22100
	..	...

19 rows selected.

Elapsed: 00:00:00.04

Execution Plan
----------------------------------------------------------
Plan hash value: 1192169904

--------------------------------------------------------------------------------------------------
| Id  | Operation           | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |                            |    18 |   144 |     4  (25)| 00:00:01 |
|   1 |  RESULT CACHE       | 1u45jnmw74kd60brbww9gr7ss8 |       |       |            |          |
|   2 |   HASH GROUP BY     |                            |    18 |   144 |     4  (25)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| EMPLOYEES                  |   107 |   856 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------

Result Cache Information (identified by operation id):
------------------------------------------------------

   1 - column-count=2; dependencies=(HR.EMPLOYEES); name="select /*+ result_cache*/ manager_id,sum(salary) from employees group by manager_id"

SQL>


SQL>--Check if it is really cached
SQL> select name from v$result_cache_objects;

NAME
-----------------------------------------------------------------------------------------------
HR.EMPLOYEES
select /*+ result_cache*/ manager_id,sum(salary) from employees group by manager_id


As you can see from the execution plan and v$result_cache_objects, the result is cached.
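
If you want to cross-check the counters as well, v$result_cache_statistics should show a successful creation and a subsequent find for this statement. A small sketch only; the statistic names are taken from the 11g documentation, so verify them on your version:

select name, value
  from v$result_cache_statistics
 where name in ('Create Count Success', 'Find Count');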

The conclusion is: Oracle did not forget to combine these two new features. Blocks from encrypted tablespaces are kept unencrypted in the SGA for better performance, so the query result cache works on them as well.

September 14, 2007

What I learned about shared pool management

Filed under: Performance — coskan @ 11:00 pm

Update on 07/06/11: Please also watch this video tutorial from Tanel Poder after you read this article

Nowadays I am working on the performance tuning guide for 10gR2, and I found it worthwhile to write about shared pool management after reading the metalink note Note:396940.1 Troubleshooting and Diagnosing ORA-4031 Error.

Note: This blog entry is my summary of the references at the bottom; to be honest, I wrote this long entry for myself so that I don’t get lost among those references again :)

The shared pool is used to cache different types of data, such as the textual and executable forms of PL/SQL blocks and SQL statements, dictionary cache data, and other data. If you use the shared pool effectively you can reduce resource consumption in at least four ways:

  1. Parse overhead is avoided if the SQL statement is already in the shared pool. This saves CPU resources on the host and elapsed time for the end user.
  2. Latching resource usage is significantly reduced, which results in greater scalability.
  3. Shared pool memory requirements are reduced, because all applications use the same pool of SQL statements and dictionary resources.
  4. I/O resources are saved, because dictionary elements that are in the shared pool do not require disk access.

The main components of the shared pool are the library cache (executable forms of SQL cursors, PL/SQL programs, and Java classes) and the dictionary cache (usernames, segment information, profile data, tablespace information, and sequence numbers). The library cache stores the executable (parsed or compiled) form of recently referenced SQL and PL/SQL code. The dictionary cache stores data referenced from the data dictionary. These caches are managed by an LRU algorithm to age out memory structures that have not been reused over time. Allocation of memory from the shared pool is performed in chunks. This allows large objects (over 5k) to be loaded into the cache without requiring a single contiguous area, hence reducing the possibility of running out of contiguous memory due to fragmentation. Starting with 9i, the shared pool divides its shared memory area into subpools. Each subpool has its own free list buckets (containing pointers to memory chunks within the subpool), memory structure entries, and LRU list. This architecture is designed to increase the throughput of the shared pool, since each subpool is now protected by its own pool child latch. This means there is no longer contention in the shared pool for a single latch as in earlier versions.
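
If you are curious how many subpools your own instance actually uses, the hidden parameter _kghdsidx_count holds the subpool count. A quick sketch, in the same style as the hidden parameter queries later in this post (query only, unsupported, so treat the result as informational):

select a.ksppinm  "Parameter",
       c.ksppstvl "Instance Value"
from   sys.x$ksppi a, sys.x$ksppsv c
where  a.indx = c.indx
and    a.ksppinm = '_kghdsidx_count';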

Infrequently, Java, PL/SQL, or SQL cursors may make allocations out of the shared pool that are larger than 5k then Oracle must search for and free enough memory to satisfy this request. This operation could conceivably hold the latch resource for detectable periods of time, causing minor disruption to other concurrent attempts at memory allocation. To allow these allocations to occur most efficiently, Oracle segregates a small amount of the shared pool. This memory is used if the shared pool does not have enough space. The segregated area of the shared pool is called the reserved pool which is also divided into subpools. Smaller objects will not fragment the reserved list, helping to ensure the reserved list will have large contiguous chunks of memory. Once the memory allocated from the reserved list is freed, it returns to the reserved list.

With the automatic shared memory management (ASMM) option available in 10g, which is activated by setting the SGA_TARGET parameter to a value greater than 0, the shared pool size is managed by Oracle within the limits of SGA_TARGET and the other SGA components.

After these explanations, let’s look at how to manage the shared pool with ASMM.

1- Using Shared Pool Effectively

Avoid hard parsing by:

  • using bind variables instead of literal values in your queries. The script below can be used to find SQLs which use literals:
    SELECT substr(sql_text,1,40) "SQL",
     count(*) ,
     sum(executions) "TotExecs"
     FROM v$sqlarea
     WHERE executions < 5
     GROUP BY substr(sql_text,1,40)
     HAVING count(*) > 30
     ORDER BY 2
     ;
    
  • Standardizing naming conventions for bind variables and spacing conventions for SQL statements and PL/SQL blocks.
  • Considering the use of stored procedures whenever possible, because their statements are stored in the shared pool and can be reused.
  • Preventing users from changing the optimization approach and goal for their individual sessions.
  • Reducing the number of entries in the dictionary cache by explicitly qualifying the segment owner rather than using public synonyms (e.g. SELECT employee_id FROM hr.employees WHERE department_id = :dept_id;), and by connecting to the database through a single user ID rather than individual user IDs, because reducing the number of distinct user IDs also reduces the load on the dictionary cache.
  • Using PL/SQL packages when your system has thousands of users, each with individual user sign-on and public synonyms because a package is executed as the owner, rather than the caller, which reduces the dictionary cache load considerably.
  • Avoiding DDL during peak hours, because DDL operations invalidate the dependent SQL statements and cause hard parsing when they are called again.
  • Caching sequence numbers by using the CACHE keyword of the CREATE/ALTER SEQUENCE clause to reduce the frequency of dictionary cache locks (a small example follows after this list).
  • Trying to avoid closing rapidly executed cursors.
  • Checking for hash values which map to different literals. The query below should return no rows; otherwise there is a possibility of a bug:
     SELECT hash_value, count(*)
    FROM v$sqlarea
    GROUP BY hash_value
    HAVING count(*) > 5
    ;
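
As a small example for the sequence caching advice above (the sequence name is made up; pick a cache size that matches your insert rate):

-- cache 1000 values per dictionary update instead of the default 20
CREATE SEQUENCE hr.employees_seq CACHE 1000;

-- or raise the cache of an existing sequence
ALTER SEQUENCE hr.employees_seq CACHE 1000;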
    

3- Identify which sqls are using lots of shared memory

SELECT substr(sql_text,1,20) "Stmt", count(*),
 sum(sharable_mem) "Mem",
 sum(users_opening) "Open",
 sum(executions) "Exec"
 FROM v$sql
 GROUP BY substr(sql_text,1,20)
 HAVING sum(sharable_mem) > 1426063 -- 10% of shared pool size
 ;

---Dynamic version 
 SELECT substr(sql_text,1,20) "Stmt", count(*),
  sum(sharable_mem)/1024/1024 "Mem",
  sum(users_opening) "Open",
  sum(executions) "Exec"
  FROM v$sql
  GROUP BY substr(sql_text,1,20)
  HAVING sum(sharable_mem) > (select current_size*0.1 from v$sga_dynamic_components where component='shared pool');--10 percent

Stmt                 COUNT(*)        Mem  Open  Exec
-------------------- -------- ---------- ----- -----
/* OracleOEM */ SEL        18    1445971     2    54

This should show if there are similar literal statements, or multiple versions of a statement, which account for a large portion of the memory in the shared pool.

4- Identify which allocations are causing shared pool objects to be aged out

spool ageout.txt
SELECT * FROM x$ksmlru
WHERE ksmlrnum>0;
spool off

This select returns no more than 10 rows and then erases the contents of the X$KSMLRU table so be sure to SPOOL the output. The X$KSMLRU table shows which memory allocations have caused the MOST memory chunks to be thrown out of the shared pool since it was last queried. This is sometimes useful to help identify sessions or statements which are continually causing space to be requested.

5- Why are there multiple child cursors?

V$SQL_SHARED_CURSOR explains why a particular child cursor is not shared with existing child cursors. Each column identifies a specific reason why the cursor cannot be shared.

SELECT SA.SQL_TEXT,SA.VERSION_COUNT,SS.*
FROM V$SQLAREA SA,V$SQL_SHARED_CURSOR SS
WHERE SA.ADDRESS=SS.ADDRESS
AND SA.VERSION_COUNT > 5
ORDER BY SA.VERSION_COUNT 

6- Monitor Shared Pool sizing operations

You can see the shrink and grow operations in the V$SGA_RESIZE_OPS dynamic view, and you can guess why these operations were needed by looking at the SQL running around the resize times.

column initial_size format 999999999999999
column target_size format 999999999999999
column final_size format 999999999999999

select to_char(end_time, 'dd-Mon-yyyy hh24:mi') end, oper_type, initial_size,
 target_size, final_size from V$SGA_RESIZE_OPS
 where component='shared pool'
 order by end;
END                OPER_TYPE  INITIAL_SIZE  TARGET_SIZE  FINAL_SIZE
-----------------  ---------  ------------  -----------  ----------
12-Sep-2007 19:05  STATIC                0    134217728   134217728
12-Sep-2007 22:01  SHRINK        134217728    130023424   130023424
13-Sep-2007 11:35  SHRINK        130023424    125829120   125829120
13-Sep-2007 11:36  SHRINK        125829120    121634816   121634816
13-Sep-2007 22:08  GROW          121634816    125829120   125829120
13-Sep-2007 22:09  GROW          125829120    130023424   130023424
13-Sep-2007 22:10  GROW          130023424    134217728   134217728
13-Sep-2007 22:12  GROW          134217728    138412032   138412032
14-Sep-2007 09:49  GROW          138412032    142606336   142606336
14-Sep-2007 16:13  GROW          142606336    146800640   146800640

7- Minimum Size of Shared Pool

Current size of the shared pool;

column bytes format 999999999999999
select bytes,bytes/1024/1024/1024 size_gb from v$sgainfo where name='Shared Pool Size';

You can use the sizing advice from the view v$shared_pool_advice. This view displays information about the estimated parse time in the shared pool for different pool sizes, ranging from 10% to 200% of the current shared pool size. This can give you an idea for sizing the SGA, and indirectly the shared pool, with the help of ASMM.

select * from V$SHARED_POOL_ADVICE;
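
select * works, but a slightly more focused projection (column names as documented for v$shared_pool_advice in 10.2; verify them on your version) is easier to read:

select shared_pool_size_for_estimate,
       shared_pool_size_factor,
       estd_lc_time_saved,
       estd_lc_load_time
from   v$shared_pool_advice
order  by shared_pool_size_for_estimate;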

Suggested minimum shared pool size;

column cr_shared_pool_size format 99999999
column sum_obj_size format 99999999
column sum_sql_size format 99999999
column sum_user_size format 99999999
column min_shared_pool format 9999999


select cr_shared_pool_size,
sum_obj_size, sum_sql_size,
sum_user_size,
(sum_obj_size + sum_sql_size+sum_user_size)* 1.3 min_shared_pool
from (select sum(sharable_mem) sum_obj_size
from v$db_object_cache where type<> 'CURSOR'),
(select sum(sharable_mem) sum_sql_size from v$sqlarea),
(select sum(250*users_opening) sum_user_size from v$sqlarea),
(select to_Number(b.ksppstvl) cr_shared_pool_size
 from x$ksppi a, x$ksppcv b, x$ksppsv c
 where a.indx = b.indx and a.indx = c.indx
 and a.ksppinm ='__shared_pool_size' );

CR_SHARED_POOL_SIZE  SUM_OBJ_SIZE  SUM_SQL_SIZE  SUM_USER_SIZE  MIN_SHARED_POOL
-------------------  ------------  ------------  -------------  ---------------
          146800640       9520659      25660770          11750       45751132,7

You should set shared_pool_size to at least this suggested minimum to avoid shrink operations by ASMM:

SQL>alter system set shared_pool_size=73M;
System altered.

8- How much free memory in SGA is available for shared pool and how to interpret the free memory

First of all, find the free memory in the shared pool. If you have free memory you can relax, but if you don’t, go to the step below.

 SELECT * FROM V$SGASTAT
 WHERE NAME = 'free memory'
 AND POOL = 'shared pool';

The X$KSMSP view shows the breakdown of memory in the SGA. You can run this query to build trend information on memory usage in the SGA. Remember, the ‘free’ class in this query is not specific to the shared pool, but covers the whole SGA. Don’t use the script below when the DB is under load; check out Jonathan Lewis’s experiences with this view here.

SELECT KSMCHCLS CLASS, COUNT(KSMCHCLS) NUM, SUM(KSMCHSIZ) SIZ,
To_char( ((SUM(KSMCHSIZ)/COUNT(KSMCHCLS)/1024)),'999,999.00')||'k' "AVG SIZE"
FROM X$KSMSP GROUP BY KSMCHCLS;


CLASS    NUM        SIZ        AVG SIZE 
-------- ---------- ---------- ------------ 
R-free           12    8059200      655.86k <<= Reserved List 
R-freea          24        960         .04k <<= Reserved List 
free            331  151736448      447.67k <<= Free Memory 
freeabl        4768    7514504        1.54k <<= Memory for user/system 
                                                    processing 
perm              2   30765848   15,022.39k <<= Memory allocated to the
                                                  system 
recr           3577    3248864         .89k <<= Memory for user/system
                                                  processing 

Watch for trends using these guidelines:

a) if ‘free’ memory is low (less than 5mb or so) you may need to increase the shared_pool_size and shared_pool_reserved_size. You should expect ‘free’ memory to increase and decrease over time. Seeing trends where ‘free’ memory decreases consistently is not necessarily a problem, but seeing consistent spikes up and down could be a problem.
b) if ‘freeable’ or ‘perm’ memory continually grows then it is possible you are seeing a memory bug.
c) if ‘freeabl’ and ‘recr’ memory classes are always huge, this indicates that you have a lot of cursor info stored that is not releasing.
d) if ‘free’ memory is huge but you are still getting 4031 errors, the problem is likely reloads and invalids in the library cache causing fragmentation.
!!!!!!!!!! The note says that this query can hang the database on HP platforms.

To see the free memory chunks in detail, use the script below:

select KSMCHIDX "SubPool", 'sga heap('||KSMCHIDX||',0)'sga_heap,ksmchcom ChunkComment,
decode(round(ksmchsiz/1000),0,'0-1K', 1,'1-2K', 2,'2-3K',3,'3-4K',
4,'4-5K',5,'5-6k',6,'6-7k',7,'7-8k',8,
'8-9k', 9,'9-10k','> 10K') "size",
count(*),ksmchcls Status, sum(ksmchsiz) Bytes
from x$ksmsp
where KSMCHCOM = 'free memory'
group by ksmchidx, ksmchcls,
'sga heap('||KSMCHIDX||',0)',ksmchcom, ksmchcls,decode(round(ksmchsiz/1000),0,'0-1K',
 1,'1-2K', 2,'2-3K', 3,'3-4K',4,'4-5K',5,'5-6k',6,
 '6-7k',7,'7-8k',8,'8-9k', 9,'9-10k','> 10K')
 order by 1;
SubPool  SGA_HEAP       CHUNKCOMMENT  size   COUNT(*)  STATUS     BYTES
-------  -------------  ------------  -----  --------  ------  --------
      1  sga heap(1,0)  free memory   > 10K        34  R-free   7238192
      1  sga heap(1,0)  free memory   3-4K          2  free        6284
      1  sga heap(1,0)  free memory   > 10K       241  free    35707400
      1  sga heap(1,0)  free memory   8-9k          1  free        7712
      1  sga heap(1,0)  free memory   2-3K          4  free        6752
      1  sga heap(1,0)  free memory   0-1K       2090  free      133288
      1  sga heap(1,0)  free memory   9-10k        21  free      188676
      1  sga heap(1,0)  free memory   1-2K         30  free       25868

If you see a lack of large chunks, it is possible that you will face ORA-04031 in the near future.

9- Is library_cache or dictionary_cache utilization satisfactory?

The statistics below are cumulative since the start of the instance. You should take interval statistics (deltas between two points in time) to interpret these values for performance issues.

  • Library Cache Stats
SELECT NAMESPACE, PINS, PINHITS, RELOADS, INVALIDATIONS
FROM V$LIBRARYCACHE
ORDER BY NAMESPACE;

NAMESPACE            PINS  PINHITS  RELOADS  INVALIDATIONS
--------------- --------- -------- -------- --------------
BODY                72782    72582       49              0
CLUSTER              1175     1161        3              0
INDEX                2800     2023       42              0
JAVA DATA               0        0        0              0
JAVA RESOURCE           0        0        0              0
JAVA SOURCE             0        0        0              0
OBJECT                  0        0        0              0
PIPE                    0        0        0              0
SQL AREA           563349   541678     2069            342
TABLE/PROCEDURE    175850   165318     2005              0
TRIGGER              6923     6802       34              0

High invalidations indicate a parsing problem in that namespace; high reloads indicate a sizing problem which causes objects to age out.

  • Library cache hit ratio;

SQL> SELECT SUM(PINHITS)/SUM(PINS) FROM V$LIBRARYCACHE;

SUM(PINHITS)/SUM(PINS)
----------------------
            .964764429

A low hit ratio is an indication of a sizing or caching problem.

  • Dictionary cache stats
 SELECT PARAMETER
, SUM(GETS)
, SUM(GETMISSES)
, 100*SUM(GETS - GETMISSES) / SUM(GETS) PCT_SUCC_GETS
, SUM(MODIFICATIONS) UPDATES
FROM V$ROWCACHE
WHERE GETS > 0
GROUP BY PARAMETER;
PARAMETER              SUM(GETS)  SUM(GETMISSES)  PCT_SUCC_GETS  UPDATES
--------------------  ----------  --------------  -------------  -------
dc_constraints                99              35     64,6464646       99
dc_tablespaces             90104              14     99,9844624        0
dc_tablespace_quotas          13               3     76,9230769        0
dc_awr_control              1351               2     99,8519615      121
dc_object_grants             867             174     79,9307958        0
dc_histogram_data          52053            6181     88,1255643     3047
dc_rollback_segments       55098              92     99,8330248      263
dc_sequences                 100              27             73      100
dc_usernames                6632              33     99,5024125        0
dc_segments                23404            2466     89,4633396      331
dc_objects                 37434            3776     89,9129134      358
dc_histogram_defs          65987           16796     74,5465016     3280
dc_table_scns                  8               8              0        0
dc_users                  171638             105     99,9388247        0
outstanding_alerts          1674              58     96,5352449       66
dc_files                      80              10           87,5        0
dc_object_ids             134005            2646     98,0254468      123
dc_global_oids             52337             185     99,6465216        0
dc_profiles                 1962               4     99,7961264        0

High updates together with a low pct_succ_gets can be a clue to performance problems when accessing that dictionary object. For frequently accessed dictionary caches, the ratio of total GETMISSES to total GETS should be less than 10% or 15%, depending on the application. If this ratio is higher and every previous check is OK, then you should consider increasing the shared pool size.

  • Dictionary cache hit ratio;
SQL>SELECT (SUM(GETS - GETMISSES - FIXED)) / SUM(GETS) "ROW CACHE" FROM V$ROWCACHE;

           ROW CACHE
--------------------
,9516921886454345524

Low hit ratio is an indication of a sizing problem.

10- Are there any objects candidate for library cache pinning ?

Having objects pinned will reduce fragmentation and the chances of encountering the ORA-04031 error. Objects causing a large number of other objects to be flushed out of the shared pool are candidates to be pinned into the shared pool using the dbms_shared_pool.keep procedure. You can check the x$ksmlru fixed table to see the candidates. This table keeps track of the objects and the corresponding number of objects flushed out of the shared pool to allocate space for the load. These objects are stored and flushed out based on the Least Recently Used (LRU) algorithm. Because this is a fixed table, once you query it Oracle will automatically reset its contents, so first insert the contents into a temporary table like below:

CREATE TABLE LRU_TMP AS SELECT * FROM X$KSMLRU;

INSERT INTO LRU_TMP SELECT * FROM X$KSMLRU;

Use the LRU_TMP table for analysis. You can use the query below to see more information on candidate code in the library cache.

SELECT USERNAME, KSMLRCOM, KSMLRHON, KSMLRNUM, KSMLRSIZ, SQL_TEXT
FROM V$SQLAREA A, LRU_TMP K, V$SESSION S
WHERE KSMLRSIZ > 3000
AND A.ADDRESS=S.SQL_ADDRESS AND A.HASH_VALUE = S.SQL_HASH_VALUE
AND SADDR=KSMLRSES;

You can see the candidates to pin with the query below:

COL STORED_OBJECT FORMAT A40;
COL SQ_EXECUTIONS FORMAT 999,999;
SELECT /*+ ORDERED USE_HASH(D) USE_HASH(C) */
O.KGLNAOWN||'.'||O.KGLNAOBJ STORED_OBJECT,
SUM(C.KGLHDEXC) SQL_EXECUTIONS
FROM SYS.X$KGLOB O, SYS.X$KGLRD D, SYS.X$KGLCURSOR C
WHERE
O.INST_ID = USERENV('INSTANCE') AND
D.INST_ID = USERENV('INSTANCE') AND
C.INST_ID = USERENV('INSTANCE') AND
O.KGLOBTYP IN (7, 8, 9, 11, 12) AND
D.KGLHDCDR = O.KGLHDADR AND
C.KGLHDPAR = D.KGLRDHDL
GROUP BY O.KGLNAOWN, O.KGLNAOBJ
HAVING SUM(C.KGLHDEXC) > 0
ORDER BY 2 DESC;

You should pin the objects you find immediately after each restart of the instance. You can pin an object with the DBMS_SHARED_POOL package like below:

EXECUTE DBMS_SHARED_POOL.KEEP('OWNER.TRIGGER', 'R');
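
Because pinning does not survive an instance restart, one common approach is an AFTER STARTUP trigger. This is only a sketch; the trigger name and the object list are made up and should come from the candidate queries above:

CREATE OR REPLACE TRIGGER sys.trg_pin_shared_pool_objects
AFTER STARTUP ON DATABASE
BEGIN
  -- 'P' = package/procedure/function, 'R' = trigger, 'Q' = sequence
  DBMS_SHARED_POOL.KEEP('SYS.STANDARD',      'P');
  DBMS_SHARED_POOL.KEEP('SYS.DBMS_STANDARD', 'P');
  -- add the objects identified by your own candidate analysis
END;
/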

11- Is my Reserved Area sized properly?

An ORA-04031 error referencing large failed requests indicates that the reserved area is too fragmented.

col free_space for 999,999,999,999 head "TOTAL FREE"
col avg_free_size for 999,999,999,999 head "AVERAGE|CHUNK SIZE"
col free_count for 999,999,999,999 head "COUNT"
col request_misses for 999,999,999,999 head "REQUEST|MISSES"
col request_failures for 999,999,999,999 head "REQUEST|FAILURES"
col max_free_size for 999,999,999,999 head "LARGEST CHUNK"

select free_space, avg_free_size, free_count, max_free_size, request_misses, request_failures
from v$shared_pool_reserved;
                       AVERAGE                                    REQUEST   REQUEST
      TOTAL FREE    CHUNK SIZE      COUNT    LARGEST CHUNK         MISSES  FAILURES
---------------- -------------  ---------  ---------------  -------------  --------
       7,238,192       212,888         34          212,888              0         0

The reserved pool is small when:

REQUEST_FAILURES > 0 (and increasing)

The DBA should increase shared_pool_reserved_size and shared_pool_size together.

It is possible that too much memory has been allocated to the reserved list.
If:

REQUEST_MISSES = 0 or not increasing
FREE_MEMORY > 50% of shared_pool_reserved_size

then the DBA should decrease shared_pool_reserved_size.

You can also use the hidden and unsupported parameter “_shared_pool_reserved_pct” to control the reserved pool. This parameter controls the percentage of the shared pool allocated to the reserved pool. By default it is 5% of the shared pool, and if you use ASMM for memory management you can set this value higher, like 10, so the reserved pool is sized dynamically. When you set the parameter you will see that the shared_pool_reserved_size parameter is adjusted to the new setting.

The parameter cannot be modified while the instance is running. You can use the query below to see the current value:

select a.ksppinm "Parameter",
 b.ksppstvl "Session Value",
 c.ksppstvl "Instance Value"
 from sys.x$ksppi a, sys.x$ksppcv b, sys.x$ksppsv c
 where a.indx = b.indx and a.indx = c.indx
 and a.ksppinm = '_shared_pool_reserved_pct';
Parameter                   Session Value  Instance Value
--------------------------  -------------  --------------
_shared_pool_reserved_pct   10             10

12- Is there any fragmentation in the shared pool?

The primary problem is that free memory in the shared pool becomes fragmented into small pieces over time. Any attempt to allocate a large piece of memory in the shared pool will then cause a large number of objects in the library cache to be flushed out, and may result in an ORA-04031 out of shared memory error. But how do we detect fragmentation?

  • The occurrence of an ORA-04031 error. Before this error is signalled, memory is freed from unnecessary objects and merged. The error only occurs when there is still not a large enough contiguous piece of free memory after this cleaning process. There may be a very large amount of total free memory in the shared pool, just not enough contiguous memory.
  • Using the X$KSMLRU internal fixed table. We talked about this view before for tracking age-out operations; it can also be used to identify what is causing the large allocations. The KSMLRSIZ column of this table shows the amount of contiguous memory being allocated. Values over around 5K start to be a problem, values over 10K are a serious problem, and values over 20K are very serious problems. Anything less than 5K should not be a problem. Again, be careful to spool the result when you query this table:
    spool ageout.txt
    select * from x$ksmlru where ksmlrsiz > 5000;
    spool off
    

After finding the result you should do the followings to correct fragmentation

  • Keep object by pinning them as we discussed above
  • Use bind variables as we discussed before
  • Eliminate large anonymous PL/SQL block. Large anonymous PL/SQL blocks should be turned into small anonymous PL/SQL blocks that call packaged functions. The packages should be ‘kept’ in memory. To view candidates

    select sql_text from v$sqlarea
    where command_type = 47 -- command type for anonymous block
    and length(sql_text) > 500;

Fallacies about solving shared pool fragmentation

  • Free memory in the shared pool prevents fragmentation. This is not true, because free memory is more properly thought of as ‘wasted memory’. You would rather see this value low than very high. In fact, a high value of free memory is sometimes a symptom that a lot of objects have been aged out of the shared pool and therefore the system is experiencing fragmentation problems.
  • Flushing the shared pool frequently solves fragmentation and improves performance. This is also incorrect, because executing this statement causes a big performance hit and does nothing to improve fragmentation. You lose your cached cursors when you flush, and they will be hard parsed the next time they are called, with high CPU consumption.

13- Using related database parameters

  • CURSOR_SHARING: Setting this parameter to SIMILAR can solve hard parse problems caused by the use of literals, but it can have side effects, mostly on DSS environments and systems which use stored outlines.
  • CURSOR_SPACE_FOR_TIME: This parameter specifies whether a cursor can be deallocated from the library cache to make room for a new SQL statement. CURSOR_SPACE_FOR_TIME has the following meanings:
    • If CURSOR_SPACE_FOR_TIME is set to false (the default), then a cursor can be deallocated from the library cache regardless of whether application cursors associated with its SQL statement are open. In this case, Oracle must verify that the cursor containing the SQL statement is in the library cache.
    • If CURSOR_SPACE_FOR_TIME is set to true, then a cursor can be deallocated only when all application cursors associated with its statement are closed. In this case, Oracle need not verify that a cursor is in the cache, because it cannot be deallocated while an application cursor associated with it is open.

You must be sure that the shared pool is large enough for the workload; otherwise performance will be badly affected and ORA-4031 will eventually be signalled.

  • OPEN_CURSORS: This parameter sets the upper bound for the number of cursors a session can have open. If you size it correctly, cached cursors can stay open and won’t have to be closed to let new cursors open.
  • PROCESSES / SESSIONS: You can review the high water mark for sessions and processes in the V$RESOURCE_LIMIT view (see the sketch after this list). If the hard-coded values for these parameters are much higher than the high water mark information, consider decreasing the parameter settings to free up some memory in the shared pool for other uses.
  • SESSION_CACHED_CURSORS: When a cursor is closed, Oracle divorces all association between the session and the library cache state. If no other session has the same cursor open, the library cache object and its heaps are unpinned and available for an LRU operation. The parameter SESSION_CACHED_CURSORS controls the number of cursors “soft” closed, much like the cached PL/SQL cursors. Oracle checks the library cache to determine whether more than three parse requests have been issued on a given statement. If so, then Oracle assumes that the session cursor associated with the statement should be cached and moves the cursor into the session cursor cache. Subsequent requests to parse that SQL statement by the same session then find the cursor in the session cursor cache. To determine whether the session cursor cache is sufficiently large for your instance, you can examine the session statistic session cursor cache hits in the V$SYSSTAT view. This statistic counts the number of times a parse call found a cursor in the session cursor cache. If this statistic is a relatively low percentage of the total parse call count for the session, then consider setting SESSION_CACHED_CURSORS to a larger value. Steve Adams also wrote useful queries to find the usage and the maximum cacheable cursors: session_cursor_cache.sql. (A small sketch of the basic check follows after this list.)
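
Two quick checks for the last two bullets above, as mentioned in the list (a sketch using the standard v$resource_limit and v$sysstat views):

-- high water marks for processes and sessions versus their limits
select resource_name, current_utilization, max_utilization, limit_value
from   v$resource_limit
where  resource_name in ('processes', 'sessions');

-- how often parse calls were satisfied from the session cursor cache,
-- compared with the total number of parse calls (cumulative, instance wide)
select name, value
from   v$sysstat
where  name in ('session cursor cache hits', 'parse count (total)');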

This was a long article, and if you see anything wrong or suspicious please feel free to comment for correction.

Update 07/06/11: Please also check out sgastatx.sql by Tanel Poder

All of the queries are tested on Oracle 10.2.0.3 for Windows

Code Depot of The Queries (All scripts are taken from metalink notes and official documentation)
References :

Oracle® Database Performance Tuning Guide 10g Release 2 (10.2)

Metalink Notes

Note:396940.1 Troubleshooting and Diagnosing ORA-4031 Error

Note:146599.1 Diagnosing and Resolving Error ORA-04031

Note:61623.1 Resolving Shared Pool Fragmentation In Oracle7

Note:62143.1 Understanding and Tuning the Shared Pool

Note:1012047.6 How To Pin Objects in Your Shared Pool

Note:274496.1 ora-7445 and ora-4031 in 9.2.0.5 and 10g if SESSION_CACHED_CURSORS is used

http://www.ixora.com Oracle Advanced Performance Tuning Scripts

Mailing list threads from Oracle-l

http://www.freelists.org/archives/oracle-l/08-2007/msg00975.html

Update 09/12/2011
Some extra useful metalink note

Interpreting the automatically generated ORA-4031 diagnostic trace. [ID 809560.1]
