This is what happened on 11.2.0.2 upgrade from 10.2.0.5
Picture is taken from production (tool used is for this picture is my recent favourite Database Time Monitor from Dominic Giles – creator of swingbench) today, which was upgraded to 11.2.0.2 on last saturday.
What happens here is I think self-explanatory, our system and underlying I/O is hammered with direct path reads.
Explaining this will not satisfy the customer so question is Why it is happening ?
1- It is obvious that upgrade was not tested good enough
2- Enough research about behaviour change has not been made for 10g to 11g migration
All these are political answers, real technical answer is below
Once I checked the system there were roughly 9 distinct sqlids running with many sessions, all waiting on the same event.
The sqls were more or less centralised on querying Users table with filtering email where there is no index and there has never been an index. Why wasn’t system suffering from this lack of index when we are on 10GR2?
It is because 11G is now using direct path reads whenever it can for full table scans (for more information check the references at the bottom of this table). This means when we run this query and if the table is eligible for direct path reads it will use direct path reads from disk. When you query with 2-3 sessions this is ok but when you query with 20 sessions then it becomes disaster like here and your disk run queue gets bigger and bigger
What was happening on 10GR2? Because Oracle was using scattered reads and cache when it can we were doing more logical I/O without hammering disk subsystem and this contention never revealed and because this application is third party no index was created by us even there was unnecessary logical I/O.
Because we only functionality and vendor certifies the production on 11G we just upgraded the database.
Let me replicate the issue with simple test case on 11G by first with default direct path read behaviour and then reverting back to pre 11G behaviour by setting 10949 event at system level (ref: Dion Cho)
Test Case
—Running 20 sessions of simple select each for 10 times. Total 200 runs. (Idea taken from –Andrey S. Nikolaev –)
create table test_direct as select * from dba_source; ---single_thread.sql --- 20 lines of the code below select * from test_directpath2 where name like 'TEST%'; ---multi_runner.sql set echo off set verify off set head off: define parallelism="&1" spool many_threads.sql select 'host sqlplus hr/hr @single_thread.sql &' from dba_objects where rownum <= ¶llelism; spool off @many_threads.sql ---Running @multi_runner 20
Default 11G
Pre-10G behaviour
during the issue-Only slight increase during the caching then all LIO
After the issue happened – It lasted only 1 minute
As you can see we have a sudden CPU increased due to high number of logical reads in short time but no long term I/O load, once oracle caches it burns CPU instead waiting on I/O. Meanwhile I think DB was in maintenance window so resource manager kicked in to avoid hammering CPU–resmgr:cpu quantum event.
Here are the other comparisons between two runs
Session waits – 11G
HR@ORACOS> @swact SID USERN STATE EVENT SEQ# SEC_IN_STATE SQL_ID CHN P1 P2 P3 P1TRANSL BLCKNG_SID BLCKNG_SID_S BLCKNG_INST ------- ----- ------- ---------------------------------------- ---------- ------------ ------------- ---- ------------------------- ------------------------- ------------------------- ------------------------------------------ ---------- ------------ ----------- 15 HR WAITING direct path read 606 0 ba40q4n8w1myy 0 file number= 10 first dba= 493120 block cnt= 64 UNKNOWN 134 HR WAITING direct path read 444 0 ba40q4n8w1myy 0 file number= 10 first dba= 493314 block cnt= 62 UNKNOWN 75 HR WAITING direct path read 483 0 ba40q4n8w1myy 0 file number= 10 first dba= 494656 block cnt= 64 UNKNOWN 200 HR WAITING direct path read 491 0 ba40q4n8w1myy 0 file number= 10 first dba= 494912 block cnt= 64 UNKNOWN 196 HR WAITING direct path read 493 0 ba40q4n8w1myy 0 file number= 10 first dba= 494912 block cnt= 64 UNKNOWN 12 HR WAITING direct path read 504 0 ba40q4n8w1myy 0 file number= 10 first dba= 495234 block cnt= 62 UNKNOWN 132 HR WAITING direct path read 507 0 ba40q4n8w1myy 0 file number= 10 first dba= 495234 block cnt= 62 UNKNOWN 13 HR WAITING direct path read 515 0 ba40q4n8w1myy 0 file number= 10 first dba= 495617 block cnt= 63 UNKNOWN 201 HR WAITING direct path read 532 0 ba40q4n8w1myy 0 file number= 10 first dba= 496192 block cnt= 64 UNKNOWN 137 HR WAITING direct path read 547 0 ba40q4n8w1myy 0 file number= 10 first dba= 496704 block cnt= 64 UNKNOWN 136 HR WAITING direct path read 453 0 ba40q4n8w1myy 0 file number= 13 first dba= 100032 block cnt= 64 UNKNOWN 202 HR WAITING direct path read 464 0 ba40q4n8w1myy 0 file number= 13 first dba= 100354 block cnt= 62 UNKNOWN 135 HR WAITING direct path read 472 0 ba40q4n8w1myy 0 file number= 13 first dba= 100610 block cnt= 62 UNKNOWN 199 HR WAITING direct path read 485 0 ba40q4n8w1myy 0 file number= 13 first dba= 101056 block cnt= 64 UNKNOWN 71 HR WAITING direct path read 502 0 ba40q4n8w1myy 0 file number= 13 first dba= 101506 block cnt= 62 UNKNOWN 16 HR WAITING direct path read 505 0 ba40q4n8w1myy 0 file number= 13 first dba= 101696 block cnt= 64 UNKNOWN 73 HR WAITING direct path read 521 0 ba40q4n8w1myy 0 file number= 13 first dba= 102208 block cnt= 64 UNKNOWN 133 HR WAITING direct path read 591 0 ba40q4n8w1myy 0 file number= 13 first dba= 98946 block cnt= 62 UNKNOWN 74 HR WAITING direct path read 432 0 ba40q4n8w1myy 0 file number= 13 first dba= 99330 block cnt= 62 UNKNOWN 72 HR WAITING direct path read 608 0 ba40q4n8w1myy 0 file number= 13 first dba= 99520 block cnt= 64 UNKNOWN 192 HR WORKING On CPU / runqueue 43 0 90sp00nt0sspn 0 driver id= #bytes= 1 0 NOT IN WAIT 0x0000000054435000 21 rows selected.
Session waits – Pre 11G
HR@ORACOS> @swact SID USERN STATE EVENT SEQ# SEC_IN_STATE SQL_ID CHN P1 P2 P3 P1TRANSL BLCKNG_SID BLCKNG_SID_S BLCKNG_INST ------- ----- ------- ---------------------------------------- ---------- ------------ ------------- ---- ------------------------- ------------------------- ------------------------- ------------------------------------------ ---------- ------------ ----------- 136 WAITING jobq slave wait 10 0 0 0 0 0 NO HOLDER 20 WAITING jobq slave wait 10 0 0 0 0 0 NO HOLDER 23 HR WAITING resmgr:cpu quantum 53 0 ba40q4n8w1myy 0 location= 3 = 0 = 0 UNKNOWN 25 HR WAITING resmgr:cpu quantum 60 0 ba40q4n8w1myy 0 location= 3 = 0 = 0 UNKNOWN 22 HR WAITING resmgr:cpu quantum 53 0 ba40q4n8w1myy 0 location= 3 = 0 = 0 UNKNOWN 210 HR WAITING resmgr:cpu quantum 50 0 ba40q4n8w1myy 0 location= 3 = 0 = 0 UNKNOWN 14 HR WAITING resmgr:cpu quantum 56 0 ba40q4n8w1myy 0 location= 3 = 0 = 0 UNKNOWN 208 HR WAITING resmgr:cpu quantum 53 0 ba40q4n8w1myy 0 location= 3 = 0 = 0 UNKNOWN 206 HR WAITING resmgr:cpu quantum 54 0 ba40q4n8w1myy 0 location= 3 = 0 = 0 UNKNOWN 144 HR WAITING resmgr:cpu quantum 58 0 ba40q4n8w1myy 0 location= 3 = 0 = 0 UNKNOWN 141 HR WAITING resmgr:cpu quantum 54 0 ba40q4n8w1myy 0 location= 3 = 0 = 0 UNKNOWN 140 HR WAITING resmgr:cpu quantum 50 0 ba40q4n8w1myy 0 location= 3 = 0 = 0 UNKNOWN 74 HR WAITING resmgr:cpu quantum 55 0 ba40q4n8w1myy 0 location= 3 = 0 = 0 UNKNOWN 81 HR WAITING resmgr:cpu quantum 56 0 ba40q4n8w1myy 0 location= 3 = 0 = 0 UNKNOWN 85 HR WAITING resmgr:cpu quantum 56 0 ba40q4n8w1myy 0 location= 3 = 0 = 0 UNKNOWN 132 HR WAITING resmgr:cpu quantum 54 0 ba40q4n8w1myy 0 location= 3 = 0 = 0 UNKNOWN 137 HR WAITING resmgr:cpu quantum 57 0 ba40q4n8w1myy 0 location= 3 = 0 = 0 UNKNOWN 73 HR WORKING On CPU / runqueue 506 0 90sp00nt0sspn 0 driver id= #bytes= 1 0 NOT IN WAIT 0x0000000054435000 80 HR WORKING On CPU / runqueue 56 0 ba40q4n8w1myy 0 driver id= #bytes= 1 0 NOT IN WAIT 0x0000000062657100 199 HR WORKING On CPU / runqueue 52 0 ba40q4n8w1myy 0 location= 3 = 0 = 0 NOT IN WAIT 82 HR WORKING On CPU / runqueue 57 0 ba40q4n8w1myy 0 location= 3 = 0 = 0 NOT IN WAIT 13 HR WORKING On CPU / runqueue 56 0 ba40q4n8w1myy 0 location= 3 = 0 = 0 NOT IN WAIT 205 HR WORKING On CPU / runqueue 54 0 ba40q4n8w1myy 0 location= 3 = 0 = 0 NOT IN WAIT 23 rows selected.
SQL Stats -Pre-11G
HR@ORACOS> @sqlid ba40q4n8w1myy SQL_ID HASH_VALUE SQL_TEXT ------------- ---------- ------------------------------------------------------------------------------------------------------------------------------------------------------ ba40q4n8w1myy 297848798 select * from test_directpath2 where name like 'TEST%' INS CH# UEX PLAN_HASH OPT_COST PARSES H_PARSES EXECUTIONS FETCHES ROWS_PROCESSED CPU_MS ELA_MS AVG_ELA LIOS PIOS ---------- ---- ---- ---------- -------- ---------- ---------- ---------- ---------- -------------- ---------- ---------- ---------- ---------- ---------- 1 0 0 3436878245 2147 200 1 200 200 0 52000 5333438.4 26.667192 2103470 2102605
SQL Stats -Pre-11G –Check Elapsed times and PIO difference
HR@ORACOS> @sqlid ba40q4n8w1myy SQL_ID HASH_VALUE SQL_TEXT ------------- ---------- ------------------------------------------------------------------------------------------------------------------------------------------------------ ba40q4n8w1myy 297848798 select * from test_directpath2 where name like 'TEST%' INS CH# UEX PLAN_HASH OPT_COST PARSES H_PARSES EXECUTIONS FETCHES ROWS_PROCESSED CPU_MS ELA_MS AVG_ELA LIOS PIOS ---------- ---- ---- ---------- -------- ---------- ---------- ---------- ---------- -------------- ---------- ---------- ---------- ---------- ---------- 1 0 0 3436878245 2147 200 2 200 200 0 16120 102802.028 .51401014 2104800 10513 HR@ORACOS>
Snapper -11G
HR@ORACOS> @sn 10 15 Sampling... -- Session Snapper v3.10 by Tanel Poder @ E2SN ( http://tech.e2sn.com ) ---------------------------------------------------------------------------------------------------------------------- SID, USERNAME , TYPE, STATISTIC , DELTA, HDELTA/SEC, %TIME, GRAPH ---------------------------------------------------------------------------------------------------------------------- 15, HR , STAT, session logical reads , 3338, 333.8, ===>check the difference 15, HR , STAT, user I/O wait time , 1040, 104, 15, HR , STAT, non-idle wait time , 1039, 103.9, 15, HR , STAT, non-idle wait count , 53, 5.3, 15, HR , STAT, physical read total IO requests , 54, 5.4, 15, HR , STAT, physical read total multi block requests, 54, 5.4, 15, HR , STAT, physical read total bytes , 27869184, 2.79M, ===>check the difference 15, HR , STAT, cell physical IO interconnect bytes , 27869184, 2.79M, 15, HR , STAT, consistent gets , 3338, 333.8, 15, HR , STAT, consistent gets direct , 3338, 333.8, 15, HR , STAT, physical reads , 3338, 333.8, 15, HR , STAT, physical reads direct , 3338, 333.8, 15, HR , STAT, physical read IO requests , 53, 5.3, 15, HR , STAT, physical read bytes , 27344896, 2.73M, 15, HR , STAT, file io wait time , 1010, 101, 15, HR , STAT, Number of read IOs issued , 53, 5.3, 15, HR , STAT, no work - consistent read gets , 3338, 333.8, 15, HR , STAT, table scan rows gotten , 164188, 16.42k, 15, HR , STAT, table scan blocks gotten , 3338, 333.8, 15, HR , TIME, DB CPU , 70000, 7ms, .7%, |@ | 15, HR , TIME, sql execute elapsed time , 10107298, 1.01s, 101.1%, |@@@@@@@@@@| 15, HR , TIME, DB time , 10107298, 1.01s, 101.1%, |@@@@@@@@@@| 15, HR , WAIT, direct path read , 9535205, 953.52ms, 95.4%, |@@@@@@@@@@| -- End of Stats snap 1, end=2011-08-16 00:02:14, seconds=10 ----------------------------------------------------------------------- Active% | SQL_ID | EVENT | WAIT_CLASS ----------------------------------------------------------------------- 99% | ba40q4n8w1myy | direct path read | User I/O 1% | ba40q4n8w1myy | ON CPU | ON CPU -- End of ASH snap 1, end=2011-08-16 00:02:14, seconds=10, samples_taken=99
Snapper – Pre-11G –check logical I/O per second
HR@ORACOS> @sn 30 192 Sampling... -- Session Snapper v3.10 by Tanel Poder @ E2SN ( http://tech.e2sn.com ) ---------------------------------------------------------------------------------------------------------------------- SID, USERNAME , TYPE, STATISTIC , DELTA, HDELTA/SEC, %TIME, GRAPH ---------------------------------------------------------------------------------------------------------------------- 192, HR , STAT, opened cursors cumulative , 77, 2.57, 192, HR , STAT, opened cursors current , -1, -.03, 192, HR , STAT, user calls , 116, 3.87, 192, HR , STAT, pinned cursors current , 1, .03, 192, HR , STAT, session logical reads , 406695, 13.56k, ===>check the difference 192, HR , STAT, CPU used when call started , 300, 10, 192, HR , STAT, CPU used by this session , 300, 10, 192, HR , STAT, DB time , 1926, 64.2, 192, HR , STAT, concurrency wait time , 6, .2, 192, HR , STAT, scheduler wait time , 1471, 49.03, 192, HR , STAT, non-idle wait time , 1479, 49.3, 192, HR , STAT, non-idle wait count , 201, 6.7, 192, HR , STAT, enqueue requests , 1, .03, 192, HR , STAT, enqueue releases , 1, .03, 192, HR , STAT, consistent gets , 406695, 13.56k, 192, HR , STAT, consistent gets from cache , 406695, 13.56k, 192, HR , STAT, consistent gets from cache (fastpath) , 406695, 13.56k, 192, HR , STAT, calls to kcmgcs , 426, 14.2, 192, HR , STAT, calls to get snapshot scn: kcmgss , 39, 1.3, 192, HR , STAT, no work - consistent read gets , 406269, 13.54k, 192, HR , STAT, table scans (long tables) , 39, 1.3, 192, HR , STAT, table scan rows gotten , 24488652, 816.29k, 192, HR , STAT, table scan blocks gotten , 406269, 13.54k, 192, HR , STAT, session cursor cache hits , 76, 2.53, 192, HR , STAT, parse count (total) , 77, 2.57, 192, HR , STAT, execute count , 77, 2.57, 192, HR , STAT, bytes sent via SQL*Net to client , 24169, 805.63, 192, HR , STAT, bytes received via SQL*Net from client , 19066, 635.53, 192, HR , STAT, SQL*Net roundtrips to/from client , 76, 2.53, 192, HR , TIME, parse time elapsed , 2510, 83.67us, .0%, | | 192, HR , TIME, PL/SQL execution elapsed time , 4570, 152.33us, .0%, | | 192, HR , TIME, DB CPU , 3160000, 105.33ms, 10.5%, |@@ | 192, HR , TIME, sql execute elapsed time , 19257156, 641.91ms, 64.2%, |@@@@@@@ | 192, HR , TIME, DB time , 19757917, 658.6ms, 65.9%, |@@@@@@@ | 192, HR , WAIT, Disk file operations I/O , 38, 1.27us, .0%, | | 192, HR , WAIT, latch: cache buffers chains , 68244, 2.27ms, .2%, | | 192, HR , WAIT, resmgr:cpu quantum , 14801390, 493.38ms, 49.3%, |@@@@@ | 192, HR , WAIT, SQL*Net message to client , 309, 10.3us, .0%, | | 192, HR , WAIT, SQL*Net message from client , 9924100, 330.8ms, 33.1%, |@@@@ | 192, HR , WAIT, events in waitclass Other , 11340, 378us, .0%, | | -- End of Stats snap 1, end=2011-08-16 00:20:32, seconds=30 ----------------------------------------------------------------------- Active% | SQL_ID | EVENT | WAIT_CLASS ----------------------------------------------------------------------- 43% | ba40q4n8w1myy | resmgr:cpu quantum | Scheduler 20% | ba40q4n8w1myy | ON CPU | ON CPU -- End of ASH snap 1, end=2011-08-16 00:20:32, seconds=30, samples_taken=87
I/O Stats – 11G–high runque size
coskan@coskan-lucid:~$ sar -d 1 Linux 2.6.32-33-generic (coskan-lucid) 16/08/11 _x86_64_ (4 CPU) 00:00:55 DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util 00:00:56 dev8-0 132.00 127200.00 104.00 964.42 21.18 130.15 7.58 100.00 00:00:56 dev8-16 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 00:00:56 dev8-32 105.00 1680.00 34.00 16.32 0.68 6.38 6.00 63.00 00:00:56 dev8-48 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 00:00:56 dev8-64 84.00 3240.00 8400.00 138.57 79.66 109.88 9.64 81.00 00:00:56 DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util 00:00:57 dev8-0 131.00 128320.00 64.00 980.03 22.76 136.87 7.63 100.00 00:00:57 dev8-16 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 00:00:57 dev8-32 158.00 2640.00 0.00 16.71 0.79 5.00 5.00 79.00 00:00:57 dev8-48 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 00:00:57 dev8-64 56.00 0.00 13440.00 240.00 144.52 1049.64 17.86 100.00
I/O Stats – PRe-11G —no runque
00:11:57 DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util 00:11:58 dev8-0 2.00 0.00 32.00 16.00 0.03 15.00 15.00 3.00 00:11:58 dev8-16 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 00:11:58 dev8-32 10.00 160.00 64.00 22.40 0.04 4.00 4.00 4.00 00:11:58 dev8-48 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 00:11:58 dev8-64 275.00 2464.00 4736.00 26.18 144.05 193.24 3.64 100.00 00:11:58 DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util 00:11:59 dev8-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 00:11:59 dev8-16 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 00:11:59 dev8-32 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 00:11:59 dev8-48 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 00:11:59 dev8-64 440.00 3344.00 2984.00 14.38 146.16 295.02 2.27 100.00
There are two solutions in this case one is creating an index to avoid full table scan or disabling direct path read at system level which is not right at all.
Moral Of the story is TEST YOUR SYSTEM WITH SAME/SIMILAR LOAD BEFORE YOU UPGRADE !!!!!
Being certified on a platform won’t mean it will run fine
If you havent read already, these are reference posts about direct path reads on 11GR2
http://afatkulin.blogspot.com/2009/01/11g-adaptive-direct-path-reads-what-is.html
http://oraclue.com/2009/07/17/direct-path-reads-and-serial-table-scans-in-11g/
http://oracledoug.com/serendipity/index.php?/archives/1321-11g-and-direct-path-reads.html