Cursor Mutex X Wait Cause Hang

系统出现cursor: mutex X等待导致实例HANG死
Posted on 2012/09/24 by yangtingkun
客户数据库出现明显的cursor: mutex X等待,尝试通过ALTER SYSTEM KILL SESSION的方式释放等待的会话,结果导致整个实例出现HANG死的情况。
数据库版本为10.2.0.4,检查数据库问题发生时刻的AWR报告,发现cursor: mutex X等待在AWR报告中并不明显,不但没有出现在TOP 5里,而在整个等待事件的排名中也排在非常靠后的位置,且该事件的总等待时间为0,这与客户反馈的大约70多个会话持续等待cursor: mutex X事件的描述极为不符。
导致这个现象的原因可能有两种,一种是虽然从V$SESSION中看到当前的等待是cursor :mutex X事件,但是该事件已经结束,因此会话的等待时间并未记录并汇总到这个事件的等待时间中;另外一种情况是Oracle对于这个事件的统计值存在错误。
继续分析AWR报告,可以发现VERSION COUNT报告中,最高的两个SQL已经远远超过了正常的范围,达到了13W以上,毫无疑问这肯定不正常的情况所致:

Version Count    Executions    SQL Id    SQL Module    SQL Text
150,649        7q1tdx7ysmg6f
    ** SQL Text Not Available **
134,787        fa3vfvs859n5n
    ** SQL Text Not Available **
10,732        gr0wk1tf0npm4
    select count(:”SYS_B_0″) as ro…
6,547        6r91qu3knm3vz
    ** SQL Text Not Available **
5,888        36ft4hmuru5d2
    ** SQL Text Not Available **

排在第3和第4位的VERSION COUNT达到5000以上就已经不太正常了,何况最高的两条语句的值已经超过了13W,不用说这两条语句会占用大量的共享池空间:
Sharable Mem (b)    Executions    % Total    SQL Id    SQL Module    SQL Text
3,082,390,224        28.84    7q1tdx7ysmg6f
    ** SQL Text Not Available **
2,902,320,992        27.16    fa3vfvs859n5n
    ** SQL Text Not Available **
174,014,440        1.63    gr0wk1tf0npm4
    select count(:”SYS_B_0″) as ro…
126,804,120        1.19    36ft4hmuru5d2
    ** SQL Text Not Available **
106,961,832        1.00    6r91qu3knm3vz
    ** SQL Text Not Available **

果然不出所料,每个问题SQL占用的共享池空间都达到了3G左右。
显然导致当前cursor: mutex X的问题肯定和这两个SQL直接相关,最后通过ASH视图验证一下这个判断:
SQL> SELECT to_char(sample_time, 'yyyy-mm-dd hh24:mi:ss.ff'), session_id, event_name, sql_id
  2  FROM wrh$_active_session_history ash, wrh$_event_name en
  3  WHERE ash.dbid = 1944025784
  4  AND en.dbid = 1944025784
  5  AND ash.event_id = en.event_id(+)
  6  AND event_name LIKE '%mutex%'
  7  AND sample_timeselect sql_id, COUNT(*)
  2  FROM wrh$_active_session_history ash, wrh$_event_name en
  3  WHERE ash.dbid = 1944025784
  4  AND en.dbid = 1944025784
  5  AND ash.event_id = en.event_id(+)
  6  AND event_name LIKE '%mutex%'
  7  AND sample_time
SQL> select to_char(sample_time, 'yyyy-mm-dd hh24:mi:ss.ff'), session_id, event_name, sql_id
2 from wrh$_active_session_history ash, wrh$_event_name en
3 where ash.dbid = 1944025784
4 and en.dbid = 1944025784
5 and ash.event_id = en.event_id(+)
6 and event_name like '%mutex%'
7 and sample_time <= to_timestamp('20120908', 'yyyymmdd')
8 order by 1;
 
TO_CHAR(SAMPLE_TIME,'YYYY-MM- SESSION_ID EVENT_NAME SQL_ID
----------------------------- ---------- --------------- -------------
2012-09-07 10:00:15.044 882 cursor: mutex X fa3vfvs859n5n
2012-09-07 10:00:25.053 882 cursor: mutex X fa3vfvs859n5n
2012-09-07 10:00:35.061 882 cursor: mutex X 7q1tdx7ysmg6f
2012-09-07 10:00:45.069 882 cursor: mutex X fa3vfvs859n5n
2012-09-07 10:00:55.078 882 cursor: mutex X 7q1tdx7ysmg6f
2012-09-07 10:01:05.086 882 cursor: mutex X 7q1tdx7ysmg6f
2012-09-07 10:01:15.093 882 cursor: mutex X 7q1tdx7ysmg6f
2012-09-07 10:01:25.101 882 cursor: mutex X 7q1tdx7ysmg6f
2012-09-07 10:01:35.109 882 cursor: mutex X fa3vfvs859n5n
2012-09-07 10:01:45.117 882 cursor: mutex X fa3vfvs859n5n
2012-09-07 10:01:55.126 882 cursor: mutex X fa3vfvs859n5n
2012-09-07 10:02:05.134 983 cursor: mutex X 7q1tdx7ysmg6f
2012-09-07 10:02:15.141 983 cursor: mutex X fa3vfvs859n5n
2012-09-07 10:02:25.149 983 cursor: mutex X 7q1tdx7ysmg6f
2012-09-07 10:02:35.157 983 cursor: mutex X
2012-09-07 10:02:45.164 983 cursor: mutex X 7q1tdx7ysmg6f
2012-09-07 10:02:55.173 983 cursor: mutex X 7q1tdx7ysmg6f
2012-09-07 10:03:05.181 983 cursor: mutex X 7q1tdx7ysmg6f
2012-09-07 10:03:15.188 983 cursor: mutex X 7q1tdx7ysmg6f
2012-09-07 10:03:25.196 983 cursor: mutex X 7q1tdx7ysmg6f
2012-09-07 10:03:35.204 983 cursor: mutex X 7q1tdx7ysmg6f
2012-09-07 10:03:45.211 983 cursor: mutex X 7q1tdx7ysmg6f
2012-09-07 10:03:55.220 983 cursor: mutex X 7q1tdx7ysmg6f
2012-09-07 10:04:05.228 983 cursor: mutex X 7q1tdx7ysmg6f
.
.
.
2012-09-07 11:59:02.738 887 cursor: mutex X fa3vfvs859n5n
2012-09-07 11:59:12.748 887 cursor: mutex X 7q1tdx7ysmg6f
2012-09-07 11:59:22.757 887 cursor: mutex X fa3vfvs859n5n
2012-09-07 11:59:32.767 887 cursor: mutex X 7q1tdx7ysmg6f
 
428 rows selected.
 
SQL> select sql_id, count(*)
2 from wrh$_active_session_history ash, wrh$_event_name en
3 where ash.dbid = 1944025784
4 and en.dbid = 1944025784
5 and ash.event_id = en.event_id(+)
6 and event_name like '%mutex%'
7 and sample_time <= to_timestamp('20120908', 'yyyymmdd')
8 group by sql_id;
 
SQL_ID COUNT(*)
------------- ----------
1
7q1tdx7ysmg6f 232
fa3vfvs859n5n 190
43tcru5bkgdvc 5

很明显,导致系统出现cursor: mutex X等待的就是AWR报告中看到的两个语句,正是这两个语句不断的产生新的VERSION,导致了cursor: mutex X事件,并占用越来越多的共享池空间,最终导致整个数据库出现了HANG死的情况。

根据分析导致这个问题的最直接的原因是初始化参数CURSOR_SHARING设置为SIMILAR。而CURSOR_SHARING设置为SIMILAR从而导致VERSION COUNT过大的问题已经有过很多案例了,对于当前的数据库而言,要想彻底避免这个问题,除了升级到最高的10.2.0.5.6以外,就是通过改变CURSOR_SHARING的设置。显然EXACT是最佳方案,但是这必须同时修改程序使用绑定变量,才能避免由于参数修改带来的硬解析的问题。此外修改为FORCE也是一个选择,不过这需要在测试环境中进行验证,避免此修改带来的性能问题和其他BUG的风险。

http://yangtingkun.itpub.net/post/468/529552