Coskan’s Approach to Oracle

January 29, 2010

Working with statspack-part-1b-Solution/Answer/another Question

Filed under: Diary, Performance — coskan @ 6:58 pm

<—-Part 1a

In part 1a I was trying to find a problem to troubleshoot and I spotted 2 possible problems and drilled down into one of them and stuck. Below is the list of what I did till I stuck and started to spend time on doing more research than it actually needs because I forgot the most important and easy step which I will explain in a moment.

What I did so far,

  • found the possible bottleneck on graph
  • cross-checked the bottleneck with DB time
  • checked the statspack and found “row cache lock” “log file switch (checkpoint incomplete)”, “buffer busy” waits as queue time
  • checked if Dictionary Cache statistics gives clue – No luck -at least for me
  • checked if DB did more redo than redo size can handle- redo activity was way lower than actual office hours
  • checked the segments for buffer busy waits and found only 1 segment caused all the wait but buffer busy waits and row cache lock are not related are they ?

What else I have in my hand ? I have a skipped snapshot slot. One of the statspack snaps could not be gathered.

What should I have done, before going and reading and learning more (which wasn’t a waste of time for learning but will be a waste of time in pressure situation) about those top events ?

Answer was I should have checked the alert log especially when my information is minimum with this historic problems. I was on call that week and I “assumed” that nothing was wrong with DB because I did not get called. Assumptions without proofs can be big problem and big time waster for your troubleshooting practices. After 4 hours of searching on the net for causes of row cache lock (which is not a common event), I decided to check alert log because of that skipped snapshot. What I found ?

Wed Jan 20 21:47:09 2010
Errors in file j:\data\oracle\trace\prod01\bdump\prod01_arc1_7976.trc:
ora-19816: Message 19816 not found; No message file for product=RDBMS, facility=ORA; arguments: [db_recovery_file_dest]
ora-27044: Message 27044 not found; No message file for product=RDBMS, facility=ORA
OSD-04008: WriteFile() failure, unable to write to file
O/S-Error: (OS 112) There is not enough space on the disk.
....
....
....
....
Wed Jan 20 22:28:48 2010
kcrrdmx: Successful archiving of previously failed ORL
Archiver process freed from errors. No longer stopped
Wed Jan 20 22:28:49 2010

As you see, I found that the database was hanged because the disk was full . Why was it full ? Looks like a mistake maybe not. It might be because we do exports on the same disk with db_file_recovery_area and probably One of my team mates took extra export during the day and forgot to delete it. With the redo amount generated that day it is unlikely that we can fill the logical limit and we did not get any %85 warning on that day on the alert log. Why did not I get called, because the issue lasted 45 minutes and probably technical guys did not check Big Brother monitoring that time:) I have no clue how it solved

I asked myself what is it to do with “row cache lock” why not “log file switch (archiving needed)”

Definition of the expected wait event fits with the alert log so it should be that event or should it ?

log file switch (archiving needed)

Waiting for a log switch because the log that the LGWR will be switching into has not been archived yet. Check the alert file to make sure that archiving has not stopped due to a failed archive write. To speed archiving, consider adding more archive processes or putting the archive files on striped disks.

Wait Time: 1 second

Parameters: None

Let’s check how it was;

SQL> @sp_evtrends
Please enter the ORACLE_SID value: prod01
Please enter the report begin date: sysdate-11
Please enter the report end date:sysdate-8

Some useful database statistics to search upon:
------------------------------------------------------------
        ARCH random i/o
        ARCH sequential i/o
	....
	.....
	.....
What statistic do you want to analyze? log file switch (archiving needed)

...
...
...
Daily trends for "log file switch (archiving needed)"...

Day    Hour   Statistic Name                         Secs Waited Percentage of total over all hours for each day
------ ------ ------------------------------ ------------------- ------------------------------------------------
17:00  log file switch (archiving needed)                0.00
18:00  log file switch (archiving needed)                0.00
19:00  log file switch (archiving needed)                0.00
20:00  log file switch (archiving needed)                0.00
21:00  log file switch (archiving needed)                0.00
22:00  log file switch (archiving needed)                0.00
23:00  log file switch (archiving needed)                0.00

How come ? Alert log says I was getting “log file switch (archiving needed)” but why my system event doesn’t show it? Lets check cumulative wait on this event since server started (12/30/2009)

SQL> select * from v$system_event where event like 'log file switch (archiving needed)';

EVENT                                                            TOTAL_WAITS TOTAL_TIMEOUTS TIME_WAITED AVERAGE_WAIT TIME_WAITED_MICRO   EVENT_ID WAIT_CLASS_ID WAIT_CLASS# WAIT_CLASS
---------------------------------------------------------------- ----------- -------------- ----------- ------------ ----------------- ---------- ------------- ----------- ---------------
log file switch (archiving needed)                                    132442          80465     8062918        60.88        8.0629E+10  681532028    3290255840           2 Configuration

It says my DB waited on this event before but when ?

SQL> @sp_event_delta2 "log file switch (archiving needed)"

   SNAP_ID SNAP_TIME         EVENT                                                            TOTAL_WAITS TOTAL_TIMEOUTS TIME_WAITED_MICRO
---------- ----------------- ---------------------------------------------------------------- ----------- -------------- -----------------
      7517 20100108 06:07:52 log file switch (archiving needed)                                        27              5           7022081
      7766 20100113 06:18:58 log file switch (archiving needed)                                         4              0            194101
      7831 20100114 13:49:06 log file switch (archiving needed)                                        16              0           5062268
      7892 20100115 21:05:16 log file switch (archiving needed)                                    132395          80460        8.0617E+10

I have 4 more day and time slots for this event and looks like on 15/01/2010 I had a similar issue. Lets check further for that day.

SQL> @sp_systime2
Please enter the ORACLE_SID value: prod01
Please enter the report begin date: sysdate-15
Please enter the report end date:sysdate-13

                                                                        Total             Cum
             Service                                                  Seconds    % of    % of
Day          or Wait  Name                                              Spent   Total   Total
------------ -------- ----------------------------------- ------------------- ------- -------

14-JAN       Service  SQL execution                                 32,871.29   39.24   39.24
             Wait     db file sequential read                       14,875.81   17.76   56.99
             Service  Recursive SQL execution                       13,203.42   15.76   72.75
             Wait     log file parallel write                        3,381.76    4.04   76.79
             Wait     db file scattered read                         3,069.28    3.66   80.45
             Wait     log file sync                                  2,651.77    3.17   83.62
             Wait     db file parallel write                         2,330.67    2.78   86.40
             Wait     Backup: sbtwrite2                              2,323.22    2.77   89.17
             Wait     enq: TX - row lock contention                  1,515.07    1.81   90.98
             Wait     Log archive I/O                                1,387.98    1.66   92.64
             Service  Parsing SQL                                    1,138.64    1.36   94.00
             Wait     log file sequential read                         814.92    0.97   94.97

15-JAN       Wait     log file switch (archiving needed)            80,616.90   31.14   31.14
             Wait     buffer busy waits                             41,401.25   15.99   47.13
             Wait     row cache lock                                39,873.16   15.40   62.53
             Service  SQL execution                                 37,210.97   14.37   76.90
             Wait     db file sequential read                       15,267.63    5.90   82.79
             Service  Recursive SQL execution                       13,166.82    5.09   87.88
             Wait     db file scattered read                         5,490.75    2.12   90.00
             Wait     log file parallel write                        5,157.94    1.99   91.99
             Wait     log file sync                                  3,793.75    1.47   93.46
             Wait     db file parallel write                         3,447.89    1.33   94.79
             Wait     enq: TX - row lock contention                  3,175.70    1.23   96.02
             Wait     Log archive I/O                                2,231.86    0.86   96.88

....
....
...
                                                                              Total             Cum
                   Service                                                  Seconds    % of    % of
Day          Hour  or Wait  Name                                              Spent   Total   Total
------------ ----- -------- ----------------------------------- ------------------- ------- -------
15-JAN      19:00 Service  SQL execution                                  2,193.28   34.66   34.66
                   Wait     db file sequential read                        1,423.59   22.50   57.16
                   Wait     log file parallel write                          543.65    8.59   65.75
                   Wait     db file parallel write                           445.41    7.04   72.79
                   Wait     log file sequential read                         344.37    5.44   78.24
                   Wait     db file scattered read                           304.99    4.82   83.06
         *** 21:00 Wait     log file switch (archiving needed)            80,616.90   48.82   48.82
                   Wait     buffer busy waits                             41,336.81   25.03   73.86
                   Wait     row cache lock                                39,873.16   24.15   98.01
                   Service  SQL execution                                  1,637.31    0.99   99.00
                   Wait     db file sequential read                          359.08    0.22   99.22
                   Service  Recursive SQL execution                          325.17    0.20   99.41
             22:00 Service  SQL execution                                  3,763.74   53.76   53.76
                   Wait     db file sequential read                        1,475.86   21.08   74.84
                   Wait     db file scattered read                           947.84   13.54   88.38
                   Wait     db file parallel write                           313.39    4.48   92.86
                   Service  Recursive SQL execution                          273.72    3.91   96.77

Between 20-21 (attention!! there is no 20:00 data which means problem is not between 21-22 but 20-21)
log file switch (archiving needed) is 48 percent of our response time. Which is very expectable if there was an archiver hang
problem on that day.

Lets cross-check if there was a snapshot at 20

SQL> @sp_stats_snap sysdate-14 sysdate-13

SNAP_ID SNAP_TIME
---------- -----------------
...
...
      7890 20100115 19:19:18
      7891 20100115 19:49:18
      7892 20100115 21:05:16  ----where is 20:00
      7893 20100115 21:35:18
      7894 20100115 22:05:18
      7895 20100115 22:35:18
      7896 20100115 23:05:18
      7897 20100115 23:35:18

The reason our report shows the waits between 21-22 because query checks the time when the snapshot gathered so it assumes those happened at 21,
Tim wrote the queries for hourly interval so it sometimes mislead you in this kind of situations be carefull and pay more attention.

Alert log also says the problem started around 20:05 and ended at 21.

Fri Jan 15 20:00:45 2010
Errors in file j:\data\oracle\trace\prod01\bdump\prod01_arc1_7976.trc:
ORA-19816: Message 19816 not found; No message file for product=RDBMS, facility=ORA; arguments: [db_recovery_file_dest]
ORA-27044: Message 27044 not found; No message file for product=RDBMS, facility=ORA
OSD-04008: WriteFile() failure, unable to write to file
O/S-Error: (OS 112) There is not enough space on the disk.
...
...
...
...
kcrrdmx: Successful archiving of previously failed ORL
Archiver process freed from errors. No longer stopped
Fri Jan 15 21:05:15 2010

So for the exactly same issue (maybe not ) our DB did not cry for “log file switch (archiving needed)” why ?

I am thinking about possible answers but I need to test meanwhile maybe you want to share your ideas
before I wrote the 3th part for my answer?

Today I kept my promise about giving the reason but I have to close an answering post with another question
which is the best part of troubleshooting, it never ends.

Moral of the story, “Check the alert log”

Advertisement

Create a free website or blog at WordPress.com.