In part 1a I was trying to find a problem to troubleshoot; I spotted two possible problems, drilled down into one of them, and got stuck. Below is the list of what I did until I got stuck and started to spend more time on research than the problem actually needed, because I forgot the most important and easiest step, which I will explain in a moment.
What I did so far:
- found the possible bottleneck on the graph
- cross-checked the bottleneck with DB time
- checked statspack and found “row cache lock”, “log file switch (checkpoint incomplete)” and “buffer busy waits” as queue time
- checked whether the Dictionary Cache statistics gave a clue – no luck, at least for me
- checked whether the DB generated more redo than the redo log size could handle – redo activity was way lower than during actual office hours
- checked the segments for buffer busy waits and found that only one segment caused all the waits (see the sketch below) – but buffer busy waits and row cache lock are not related, are they?
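For reference, here is a minimal sketch of that segment check, assuming 10g’s v$segment_statistics view (statspack’s segment statistics section reports the same counters per snapshot interval):

-- top segments by buffer busy waits accumulated since instance startup
select owner, object_name, object_type, value
from   v$segment_statistics
where  statistic_name = 'buffer busy waits'
and    value > 0
order  by value desc;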
What else did I have in hand? A skipped snapshot slot: one of the statspack snaps could not be gathered.
What should I have done before going off reading and learning more about those top events (not a waste of time for learning, but a waste of time in a pressure situation)?
The answer: I should have checked the alert log, especially since my information about this historic problem was minimal. I was on call that week and I “assumed” that nothing was wrong with the DB because I did not get called. Assumptions without proof can be a big problem and a big time waster in your troubleshooting practice. After 4 hours of searching the net for causes of row cache lock (which is not a common event), I decided to check the alert log because of that skipped snapshot. What did I find?
Wed Jan 20 21:47:09 2010
Errors in file j:\data\oracle\trace\prod01\bdump\prod01_arc1_7976.trc:
ORA-19816: Message 19816 not found; No message file for product=RDBMS, facility=ORA; arguments: [db_recovery_file_dest]
ORA-27044: Message 27044 not found; No message file for product=RDBMS, facility=ORA
OSD-04008: WriteFile() failure, unable to write to file
O/S-Error: (OS 112) There is not enough space on the disk.
....
....
....
....
Wed Jan 20 22:28:48 2010
kcrrdmx: Successful archiving of previously failed ORL
Archiver process freed from errors. No longer stopped
Wed Jan 20 22:28:49 2010
As you can see, I found that the database had hung because the disk was full. Why was it full? Looks like a mistake, maybe not. It might be because we do exports on the same disk as db_recovery_file_dest, and probably one of my team mates took an extra export during the day and forgot to delete it. With the redo amount generated that day, it is unlikely that we could fill the logical limit, and we did not get any 85% warning in the alert log that day. Why did I not get called? Because the issue lasted 45 minutes and the technical guys probably did not check the Big Brother monitoring at that time :) I have no clue how it got resolved.
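As a side note, here is a quick sketch of how to check how close the recovery area is to its logical limit (v$recovery_file_dest is a standard 10g view; note that it tracks the logical space_limit, not the physical free space on the disk, which is exactly why an unrelated export file can fill the disk without any warning from Oracle):

-- recovery area usage against its logical limit (not the physical disk)
select name,
       space_limit/1024/1024                limit_mb,
       space_used/1024/1024                 used_mb,
       round(space_used/space_limit*100, 1) pct_used
from   v$recovery_file_dest;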
I asked myself: what does this have to do with “row cache lock”? Why not “log file switch (archiving needed)”?
The definition of the expected wait event fits the alert log, so it should be that event. Or should it?
log file switch (archiving needed)
Waiting for a log switch because the log that the LGWR will be switching into has not been archived yet. Check the alert file to make sure that archiving has not stopped due to a failed archive write. To speed archiving, consider adding more archive processes or putting the archive files on striped disks.
Wait Time: 1 second
Parameters: None
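By the way, if you want to see all the flavours of log file switch waits and their wait classes on your own system, v$event_name lists them (a quick sketch):

-- list all log file switch wait events and their wait classes
select name, wait_class
from   v$event_name
where  name like 'log file switch%';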
Let’s check how it was:
SQL> @sp_evtrends
Please enter the ORACLE_SID value: prod01
Please enter the report begin date: sysdate-11
Please enter the report end date: sysdate-8

Some useful database statistics to search upon:
------------------------------------------------------------
ARCH random i/o
ARCH sequential i/o
....
.....
.....
What statistic do you want to analyze? log file switch (archiving needed)
...
...
...
Daily trends for "log file switch (archiving needed)"...

Day    Hour   Statistic Name                             Secs Waited Percentage of total over all hours for each day
------ ------ ------------------------------------ ------------------- ------------------------------------------------
       17:00  log file switch (archiving needed)                  0.00
       18:00  log file switch (archiving needed)                  0.00
       19:00  log file switch (archiving needed)                  0.00
       20:00  log file switch (archiving needed)                  0.00
       21:00  log file switch (archiving needed)                  0.00
       22:00  log file switch (archiving needed)                  0.00
       23:00  log file switch (archiving needed)                  0.00
How come? The alert log says I was getting “log file switch (archiving needed)”, so why doesn’t my system event trend show it? Let’s check the cumulative waits on this event since the server started (12/30/2009).
SQL> select * from v$system_event where event like 'log file switch (archiving needed)';

EVENT                               TOTAL_WAITS TOTAL_TIMEOUTS TIME_WAITED AVERAGE_WAIT TIME_WAITED_MICRO   EVENT_ID WAIT_CLASS_ID WAIT_CLASS# WAIT_CLASS
----------------------------------- ----------- -------------- ----------- ------------ ----------------- ---------- ------------- ----------- ---------------
log file switch (archiving needed)       132442          80465     8062918        60.88        8.0629E+10  681532028    3290255840           2 Configuration
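A quick sanity check on the units here (my own arithmetic, not part of the output): TIME_WAITED is in centiseconds, so 8,062,918 cs is about 80,629 seconds, which agrees with TIME_WAITED_MICRO = 8.0629E+10 microseconds. That is roughly 22 hours of waiting on this event accumulated since startup.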
It says my DB waited on this event before, but when?
SQL> @sp_event_delta2 "log file switch (archiving needed)"

   SNAP_ID SNAP_TIME         EVENT                               TOTAL_WAITS TOTAL_TIMEOUTS TIME_WAITED_MICRO
---------- ----------------- ----------------------------------- ----------- -------------- -----------------
      7517 20100108 06:07:52 log file switch (archiving needed)           27              5           7022081
      7766 20100113 06:18:58 log file switch (archiving needed)            4              0            194101
      7831 20100114 13:49:06 log file switch (archiving needed)           16              0           5062268
      7892 20100115 21:05:16 log file switch (archiving needed)       132395          80460        8.0617E+10
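sp_event_delta2 is one of Tim’s scripts, so I will not reproduce it here, but the idea behind it is simply to diff the cumulative counters between consecutive statspack snapshots. A minimal sketch of that idea, assuming the standard statspack tables (this is not Tim’s actual script):

-- delta of a cumulative wait event between consecutive statspack snapshots
select s.snap_id,
       to_char(s.snap_time, 'yyyymmdd hh24:mi:ss') snap_time,
       e.event,
       e.total_waits
         - lag(e.total_waits) over (order by s.snap_id)       delta_waits,
       e.time_waited_micro
         - lag(e.time_waited_micro) over (order by s.snap_id) delta_micro
from   stats$system_event e,
       stats$snapshot     s
where  e.snap_id         = s.snap_id
and    e.dbid            = s.dbid
and    e.instance_number = s.instance_number
and    e.event           = 'log file switch (archiving needed)'
order  by s.snap_id;
-- note: the counters reset at instance restart, so a negative delta marks a bounce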
I have four more day and time slots for this event, and it looks like I had a similar issue on 15/01/2010. Let’s check further into that day.
SQL> @sp_systime2
Please enter the ORACLE_SID value: prod01
Please enter the report begin date: sysdate-15
Please enter the report end date: sysdate-13

                                                                 Total                   Cum
             Service                                             Seconds         % of    % of
Day          or Wait  Name                                       Spent           Total   Total
------------ -------- ----------------------------------- ------------------- ------- -------
14-JAN       Service  SQL execution                                 32,871.29   39.24   39.24
             Wait     db file sequential read                       14,875.81   17.76   56.99
             Service  Recursive SQL execution                       13,203.42   15.76   72.75
             Wait     log file parallel write                        3,381.76    4.04   76.79
             Wait     db file scattered read                         3,069.28    3.66   80.45
             Wait     log file sync                                  2,651.77    3.17   83.62
             Wait     db file parallel write                         2,330.67    2.78   86.40
             Wait     Backup: sbtwrite2                              2,323.22    2.77   89.17
             Wait     enq: TX - row lock contention                  1,515.07    1.81   90.98
             Wait     Log archive I/O                                1,387.98    1.66   92.64
             Service  Parsing SQL                                    1,138.64    1.36   94.00
             Wait     log file sequential read                         814.92    0.97   94.97
15-JAN       Wait     log file switch (archiving needed)            80,616.90   31.14   31.14
             Wait     buffer busy waits                             41,401.25   15.99   47.13
             Wait     row cache lock                                39,873.16   15.40   62.53
             Service  SQL execution                                 37,210.97   14.37   76.90
             Wait     db file sequential read                       15,267.63    5.90   82.79
             Service  Recursive SQL execution                       13,166.82    5.09   87.88
             Wait     db file scattered read                         5,490.75    2.12   90.00
             Wait     log file parallel write                        5,157.94    1.99   91.99
             Wait     log file sync                                  3,793.75    1.47   93.46
             Wait     db file parallel write                         3,447.89    1.33   94.79
             Wait     enq: TX - row lock contention                  3,175.70    1.23   96.02
             Wait     Log archive I/O                                2,231.86    0.86   96.88
....
....
...

                                                                       Total                   Cum
                   Service                                             Seconds         % of    % of
Day          Hour  or Wait  Name                                       Spent           Total   Total
------------ ----- -------- ----------------------------------- ------------------- ------- -------
15-JAN       19:00 Service  SQL execution                                  2,193.28   34.66   34.66
                   Wait     db file sequential read                        1,423.59   22.50   57.16
                   Wait     log file parallel write                          543.65    8.59   65.75
                   Wait     db file parallel write                           445.41    7.04   72.79
                   Wait     log file sequential read                         344.37    5.44   78.24
                   Wait     db file scattered read                           304.99    4.82   83.06
***          21:00 Wait     log file switch (archiving needed)            80,616.90   48.82   48.82
                   Wait     buffer busy waits                             41,336.81   25.03   73.86
                   Wait     row cache lock                                39,873.16   24.15   98.01
                   Service  SQL execution                                  1,637.31    0.99   99.00
                   Wait     db file sequential read                          359.08    0.22   99.22
                   Service  Recursive SQL execution                          325.17    0.20   99.41
             22:00 Service  SQL execution                                  3,763.74   53.76   53.76
                   Wait     db file sequential read                        1,475.86   21.08   74.84
                   Wait     db file scattered read                           947.84   13.54   88.38
                   Wait     db file parallel write                           313.39    4.48   92.86
                   Service  Recursive SQL execution                          273.72    3.91   96.77
The problem is between 20:00 and 21:00 (attention!! there is no 20:00 data, which means the waits reported under 21:00 actually belong to 20:00-21:00, not 21:00-22:00): log file switch (archiving needed) is 48 percent of our response time, which is exactly what you would expect if there was an archiver hang problem that day.
Let’s cross-check whether there was a snapshot at 20:00.
SQL> @sp_stats_snap sysdate-14 sysdate-13

   SNAP_ID SNAP_TIME
---------- -----------------
...
...
      7890 20100115 19:19:18
      7891 20100115 19:49:18
      7892 20100115 21:05:16   ----where is 20:00?
      7893 20100115 21:35:18
      7894 20100115 22:05:18
      7895 20100115 22:35:18
      7896 20100115 23:05:18
      7897 20100115 23:35:18
The reason our report shows the waits between 21:00 and 22:00 is that the query uses the time the snapshot was gathered, so it assumes those waits happened at 21:00. Tim wrote the queries for hourly intervals, so they can sometimes mislead you in this kind of situation; be careful and pay attention. (A quick way to spot such gaps directly is sketched below.)
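Since the skipped snapshot was the real clue here, a minimal sketch of the gap check, assuming the half-hourly snapshot schedule this system uses:

-- find gaps between consecutive statspack snapshots
select snap_id,
       to_char(snap_time, 'yyyymmdd hh24:mi:ss')                   snap_time,
       round((snap_time - lag(snap_time) over (order by snap_id))
             * 24 * 60)                                            gap_minutes
from   stats$snapshot
order  by snap_id;
-- rows where gap_minutes is noticeably larger than the 30-minute
-- interval mark snapshots that never ran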
The alert log also says the problem started just after 20:00 and ended just after 21:00.
Fri Jan 15 20:00:45 2010
Errors in file j:\data\oracle\trace\prod01\bdump\prod01_arc1_7976.trc:
ORA-19816: Message 19816 not found; No message file for product=RDBMS, facility=ORA; arguments: [db_recovery_file_dest]
ORA-27044: Message 27044 not found; No message file for product=RDBMS, facility=ORA
OSD-04008: WriteFile() failure, unable to write to file
O/S-Error: (OS 112) There is not enough space on the disk.
...
...
...
...
kcrrdmx: Successful archiving of previously failed ORL
Archiver process freed from errors. No longer stopped
Fri Jan 15 21:05:15 2010
So for what looks like exactly the same issue (maybe not), our DB did not cry “log file switch (archiving needed)”. Why?
I am thinking about possible answers, but I need to test them; meanwhile, maybe you want to share your ideas before I write the third part with my answer?
Today I kept my promise about giving the reason, but I have to close an answering post with another question, which is the best part of troubleshooting: it never ends.
Moral of the story: “Check the alert log.”