# Expert Oracle Practices # Troubleshooting Latch Contention

I am reading Expert Oracle Practices: Oracle Database Administration from the Oak Table. I’d like to summarize and share what I learned.

This chapter, Troubleshooting Latch Contention by Riyaj Shamsudeen, is very helpful, both for understanding the theory behind latches and for troubleshooting them in the real world.

What are latches? And enqueues?
Latches are a serialization mechanism to protect critical resources in the system global area (SGA) from concurrent modifications. A latch is implemented as a few bytes of memory. The value of a latch’s memory area indicates whether that latch is available or not.

Latches are handled exclusively by the Oracle RDBMS engine. Users have no ability to control latch operations directly.

Latches are very different from enqueues (also termed locks). Enqueues are used to lock objects or rows to avoid simultaneous conflicting operations on tables or rows. Enqueues have their own complex data structures, and the resources in question can be locked in one of six available modes. In contrast, latches can be acquired only in either share mode or exclusive mode. Another difference is that enqueues have queuing structures, and they are acquired in a strict serial fashion. By contrast, latches have no queuing structures.

Latch classes:
Latches can be grouped into two classes: solitaire and non-solitaire latches.
A solitaire latch is a single latch that allows only one process at a time to modify the critical resource it protects.
A non-solitaire latch group consists of a parent latch and many child latches.

The cache buffers chains latch (CBC latch) is a typical example of a non-solitaire latch. There are thousands of child latches for this parent latch.

Views about latches
Latches are externalized in dynamic views such as v$latch, v$latch_children, and v$latch_parent. The view v$latch_children maintains statistics for latches at the child level, while v$latch maintains aggregated statistics for all child latches, and v$latch_parent maintains statistics at the parent level.

Rem Example of solitaire latches
col name format A35
select name, latch# , count(*) from v$latch_children group by name, latch#
having count(*) =1 order by 1;
NAME LATCH# COUNT(*)
----------------------------------- ---------- ----------
TLCR meta context 126 1
channel handle pool latch 33 1
ges resource scan list 51 1
granule operation 42 1
redo allocation 115 1

Rem Example of non-solitaire latches
select name, latch# , count(*) from v$latch_children group by name, latch#
having count(*) >1 order by 1;
NAME LATCH# COUNT(*)
----------------------------------- ---------- ----------
KCL freelist parent latch 120 10
KCL gc element parent latch 118 250
KCL name table parent latch 119 832
KJC message pool free list 58 3
KJCT flow control latch 66 33
…

What’s spin & sleep? Why?
Before we dive into these modes, let’s take a brief look at sleeps and spins. The algorithms presented in the following sections use spins and sleeps when a latch is not available. If a latch is not immediately available, the process might need to sleep before trying to acquire the latch again. Sleeping means voluntarily giving up CPU cycles for a specific time-out period, and it results in costly context switches. Typically, latches are held for very short durations on the order of a few nanoseconds, and in many cases it is better to avoid the context switches. Thus, Oracle Database code makes a preset number of attempts to acquire a latch before going to sleep. This repeated attempting is known as spinning.

Statistics of spins and sleeps
The spin_gets value indicates the number of times a process was able to acquire a latch by spinning alone, without the need to sleep. The sleeps statistic indicates the number of sleeps encountered while trying to acquire a latch. As the value of sleeps increases, performance worsens.
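
A quick way to look at these statistics is to query v$latch directly. The query below is my own sketch (not from the chapter) that computes rough miss and sleep ratios per latch; the column names come from the standard v$latch view.

-- Sketch: latches with the most sleeps, with simple miss/sleep ratios
col name format A35
select name, gets, misses, spin_gets, sleeps,
       round(misses * 100 / nullif(gets,0), 2) miss_pct,
       round(sleeps * 100 / nullif(gets,0), 2) sleep_pct
from   v$latch
where  gets > 0
order  by sleeps desc;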

What’s Immediate Mode
In the immediate mode of latch acquisition, Oracle code tries to acquire any of the available child latches (of a given latch family) without waiting, and it tries only the last child latch in willing-to-wait mode. For example, redo copy latches must be held before copying redo records to the log buffer. There are several redo copy child latches configured in an SGA, and it is sufficient to hold any one of them to copy redo records to the log buffer. Oracle Database’s internal code tries to acquire those child latches in a round-robin fashion. If none of the child latches is available, then the code tries to acquire the last child latch in willing-to-wait mode.

A pseudocode algorithm to illustrate immediate-mode latch acquisition

Try to acquire first child latch selected pseudo randomly.
If latch acquired
  Increment gets counter
  Exit loop and proceed to modify critical resource
Else
  Step X: Try to acquire next child latch in a round-robin fashion.
  If latch is available then
    Increment immediate_gets counter
    Exit loop and proceed to modify critical resource.
  Else
    Increment immediate_misses counter
    Go to step X until all but one child latch has been tried.
  End
  If all the child latches except one have been tried, then
    Try that last child latch in willing-to-wait mode.
    While (latch is not available)
    Loop
      if latch acquired
        increment gets counter
        exit loop and proceed to modify critical resource.
      Else
        For ( req_count=0;
        req_count < spin_count && latch not acquired;
        req_count++
        )
        Loop
          Spin and try to acquire latch;
        End loop;
        If the latch is acquired
          Increment misses counter
          Increment spin_gets counter
          exit loop and proceed to modify the critical resource
        else
          while (latch is not available)
          loop
            Sleep (sleep duration increases slowly until reaching
            _max_exponential_sleep).
            Increment sleeps counter.
          End loop;
          Increment misses counter.
          Exit loop and proceed to modify the critical resource.
        End if
      End if
    End loop;
    Increment gets counter;
    Exit loop and proceed to modify the critical resource
  End if
End if

What’s Willing-to-Wait Mode
Willing-to-wait mode involves the practice of requesting a latch and waiting until the latch becomes available. Successful attempts to acquire the latch are termed gets. Failed attempts are termed misses.

For example, before modifying a cache buffers chain in the SGA, the child latch protecting that hash chain must be held. (Here the resource being protected is protected by only that one latch, so the process must wait for that specific latch.)

Latch acquisition in willing-to-wait mode

Try to acquire the latch in willing-to-wait mode.
While (latch is not available)
Loop
  if latch acquired
    increment gets counter.
    Exit loop and proceed to modify critical resource.
  Else
    Increment misses counter.
    For ( req_count=0;
    req_count < spin_count && latch not acquired;
    req_count++
    )
    Loop
      Spin and try to acquire latch;
    End loop;
    If the latch is acquired
      Increment spin_gets counter.
      Increment misses counter.
      exit loop and proceed to modify critical resource
    else
      while (latch is not available)
      loop
        Sleep (sleep duration increases slowly until
        reaching _max_exponential_sleep).
        Increment sleeps counter.
        Try to acquire the latch.
      End loop;
      Increment misses counter.
      Exit loop and proceed to modify critical resource.
    End if
  End if
  Try to acquire the latch;
End loop;
Exit loop and proceed to modify critical resource

What’s Latch-Wait Posting Mode?
The third method of latch acquisition relies on latch wait posting. This method uses more sleeps, thus reducing spin. Excessive spin can lead to high CPU usage, and latch wait posting effectively addresses that problem. When the holder of a long wait latch is ready to release the latch, it will scan through the waiters list and post the processes waiting for the latch to indicate latch availability. The waiting processes largely sleep until the latch is available and posted.

Latches fall into two general categories:
short-wait latches, acquired using immediate mode or willing-to-wait mode
long-wait latches, which also use spinning, but processes go to sleep while waiting for them, reducing the CPU usage associated with latch spin. (I am not sure whether this is still true in 11g.)

For example, a cache buffers chains latch is a short wait latch, meaning that most probably this latch is held for a very short time.

The library cache latch is a long-wait latch. It can take longer to search for a suitable object in a library cache hash chain, and so the library cache latch may be held for a longer time.

What’s the Cache Buffers Chains (CBC) Latch?
Cache buffers chains (CBC) latch contention is the most common cause of latch contention in the real world.
The buffer cache is split into multiple data block buffers. Every data block buffer has an associated buffer header pointing to that buffer. These buffer headers (externalized in x$bh and in the dynamic view v$bh) hang from hash chains: multiple buffer headers are chained together to form a hash chain, with a hash bucket at the head of the chain. The buffer header associated with a given database block can be chained only to one specific hash chain.

How is a database block read into the buffer cache by Oracle Database code?
1. Identify the database block to be read from the disk.
2. Check to see if the block is already in the buffer cache: apply a hash function over the DBA (data block address) of the block to identify the hash bucket. If the buffer is in the buffer cache, it will be in the hash chain associated with that bucket.
3. Acquire the CBC latch protecting that hash chain. Walk the hash chain to see if that block with a specific version can be found.
4. If a block is found in the hash chain, use that block. If the block is not found, find a free buffer in the buffer cache, delink the buffer header for that buffer from its current chain, and link that buffer header to the hash chain under the protection of the CBC latch. Pin the buffer header, release the latch child, and read the block into this buffer.

Causes of CBC Latch Contention?
1. Longer cache buffers chains.
Even if the processes are trying to access completely different objects, because the same latch protects all buffer headers linked in a hash chain, that child latch can inadvertently become a point of contention. This problem is mostly resolved from Oracle Database version 10g onward, since the number of hash buckets configured by default increased dramatically in that release. The number of buffer cache hash buckets is configured by _db_block_hash_buckets, and the number of CBC latches by _db_block_hash_latches.
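
To check the current values of these underscore parameters, you can query the x$ksppi and x$ksppcv fixed tables as SYS. This is a common DBA sketch, not from the chapter; treat it as read-only diagnostics.

-- Sketch (run as SYS): current values of the CBC-related hidden parameters
select i.ksppinm parameter, v.ksppstvl value
from   x$ksppi i, x$ksppcv v
where  i.indx = v.indx
and    i.ksppinm in ('_db_block_hash_buckets', '_db_block_hash_latches');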

2. Heavy concurrent access of an index root block or branch blocks.
In a tight, nested-loops join based upon a unique or primary key, the associated index’s root block must be accessed very frequently.
Consider the access path shown in Listing 12-8. For each row from the outer row source at step 130, probe the inner table through a unique key column. To find a specific join key, the database engine will access the root block of the RCV_SHIPMENT_HEADERS_U1 index. Next, the engine
will access branch blocks, and then leaf blocks. If the query is executed by hundreds of concurrent processes, then each of those processes will access the root/branch/leaf blocks thousands of times per second. That heavy concurrent access can result in numerous processes trying to access the hash chain looking for the root block of the index, in turn leading to CBC latch contention.

Listing 12-8. A tight, nested-loops join
| 129 | NESTED LOOPS | | 1 |
| 130 |   TABLE ACCESS BY INDEX ROWID | RCV_TRANSACTIONS_INTERFACE |1339 |
|*131 | INDEX RANGE SCAN | CUS_RCV_TXNS_INTERFACE_C3 |1339 |
|*132 |   TABLE ACCESS BY INDEX ROWID | RCV_SHIPMENT_HEADERS | 1 |
|*133 |    INDEX UNIQUE SCAN | RCV_SHIPMENT_HEADERS_U1 | 1 |

3. Heavy concurrent access of a leaf block.
Such access typically happens if concurrent inserts are made into a nonpartitioned table with a monotonically increasing sequence of primary or unique keys. Sequence-based values will be in a strict numeric order. Since primary or unique indexes are populated with these sequence values, recently generated values tend to be populated in the same leaf block. CBC latches protecting the hash chains of those buffers will be heavily accessed, leading to CBC latch contention.

4. Heavily accessed small table.
In a nested loops join method, if the inner row source is a small table, and if it is not indexed, then for every row from the outer row source, the inner row source will be accessed using a full table scan. The result can be much higher concurrent access to the few blocks used by that table, which in turn may lead to CBC latch contention for the latches protecting that small table’s blocks.

A heavily accessed small table
| 129 | NESTED LOOPS | | 1 |
| 130 |  TABLE ACCESS BY INDEX ROWID | RCV_TRANSACTIONS_INTERFACE |1339 |
|*131 |    INDEX RANGE SCAN | CUS_RCV_TXNS_INTERFACE_T3 |1339 |
|*132 |  FULL TABLE SCAN | MTL_PARAMETERS | 1 |

5. High consistent block generation.
When blocks are being modified while long-running queries access them, Oracle must create buffer clones consistent with the querying SELECT statement. Buffer clones are created by applying undo records to a copy of the buffers being cloned, creating consistent versions of the blocks, known as consistent read (CR) block generation. The process of cloning buffers can result in higher access to the undo header block and to undo blocks.
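
If you suspect this cause, one way to look for it (my own sketch, not from the chapter) is to count consistent-read clones per block in v$bh; blocks with many CR copies are candidates for hot-block analysis.

-- Sketch: blocks with many consistent read (CR) clones in the buffer cache
select objd, file#, block#, count(*) cr_copies
from   v$bh
where  lower(status) = 'cr'
group  by objd, file#, block#
having count(*) > 5
order  by cr_copies desc;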

6. Many sessions performing full table scans on one or a few tables
This means that many processes will compete for the same blocks or range of blocks, and that competition can in turn lead to CBC latch contention.

How-to Analyze CBC Latch Contention?
Step 1: Review the Distribution of Gets
To analyze CBC latch contention, you need to find whether the latch contention is widespread across many child latches or is limited to a smaller set of child latches. You can use the view v$latch_children to determine which is the case.

Querying to see the distribution of child latches
Select * from (
select child#, gets, misses,sleeps,
rank() over (partition by latch# order by gets desc ) rnk_gets,
rank() over (partition by latch# order by misses desc )rnk_misses,
rank() over (partition by latch# order by sleeps desc ) rnk_sleeps
from v$latch_children where name = 'cache buffers chains'
)
where rnk_gets <=20 and rnk_misses <=20 and rnk_sleeps <=20 order by rnk_gets / 

Step 2: Identify the SQL Statements Involved
Your next step is to identify the SQL statements involved in the latch contention. Use the SQL in Listing 12-10 to show the top SQL statements suffering from latch contention. Keep in mind that any given SQL statement identified by the query in Listing 12-10 could be merely a victim of latch contention, and not a root cause.

Listing 12-10. Querying to see Latch Free events and SQL hash values
select substr(w.event, 1, 28) event, s.sql_hash_value, count(*)
from v$session_wait w, v$session s, v$process p
where s.sid=w.sid
and p.addr = s.paddr
and s.username is not null
and w.event not like '%pipe%'
and w.event not like 'SQL*%'
group by substr(w.event, 1, 28), sql_hash_value
order by 3
/

EVENT SQL_HASH_VALUE COUNT(*)
------------------------------ -------------- ----------
enqueue 3740270 1
enqueue 747790152 1
...
latch free 622474477 3
latch free 1509082258 58

You need to drill down to identify the specific latch type at the root of the contention. Column p2 in v$session_wait identifies the latch# in the v$latch dynamic performance view.

Listing 12-11. Finding the latch type and SQL hash value
Set lines 160 pages 100
Column event format A35
Column name format A35
select x.event, x.sql_hash_value,
case when x.event like 'latch%' then
l.name
else ' '
end name,
x.cnt from (
select substr(w.event, 1, 28) event, s.sql_hash_value, w.p2,count(*) cnt
from v$session_wait w, v$session s, v$process p
where s.sid=w.sid
and p.addr = s.paddr
and s.username is not null
and w.event not like '%pipe%'
and w.event not like 'SQL*%'
group by substr(w.event, 1, 28), sql_hash_value,w.p2
) x,
v$latch l
where
x.p2 = l.latch#(+)
order by cnt
/
EVENT SQL_HASH_VALUE NAME CNT
----------------------------------- -------------- ------------------------- ----
…
Latch free 1509082258 cache buffers chains 56

From Oracle Database version 10g onward, Active Session History (ASH) provides sufficient information to identify SQL statements suffering from CBC latch contention.

Listing 12-12. Querying Active Session History to identify SQL_IDs
select event, sql_id, sql_child_number, count(*) cnt from v$active_session_history where event like 'latch%' and sample_time > sysdate-(1/24)
group by event, sql_id, sql_child_number
order by 4 desc
/

It is possible that another process is holding the latches in Listing 12-12 and thus not allowing SQL from the application in question to acquire those latches. This scenario is rare but needs to be analyzed. You need to verify that the hash value acquiring or holding the latch matches the hash value from Listing 12-11. You should execute the SQL in Listing 12-13 repeatedly (at least 10 times) to verify that the hash
values match.

Listing 12-13. Querying for SQL hash values acquiring or holding a latch
select s.sql_hash_value, lh.* from
v$latchholder lh, v$session s where
lh.sid=s.sid
order by s.sql_hash_value
/
SQL_HASH_VALUE PID SID LADDR NAME GETS
-------------- ------- ---------- ---------------- ------------------- ----------
0 418 10766 0000000ABBF5E3E8 library cache pin 16775103
1509082258 418 10766 0000000ACA8A1158 cache buffers chains 461959

Step 3: Identify Objects Causing the Contention
The script in Listing 12-14 queries to find objects that are hanging from the hash chain protected by those child latches. Output is sorted in part by the tch column, placing the more suspect objects first in the output. Objects with higher tch values are problematic and most probably causing the latch contention. You may need to execute the query a few times to get consistent data while latch contention is occurring. In the example output, you can see that the inv.mtl_onhand_quantities table is causing the contention.

Listing 12-14. Querying to find objects involved in latch contention
with bh_lc as
(select
lc.addr, lc.child#, lc.gets, lc.misses, lc.immediate_gets, lc.immediate_misses,
lc.spin_gets, lc.sleeps,
bh.hladdr, bh.tch tch, bh.file#, bh.dbablk, bh.class, bh.state, bh.obj
from
v$session_wait sw,
v$latchname ld,
v$latch_children lc,
x$bh bh
where lc.addr =sw.p1raw
and sw.p2= ld.latch#
and ld.name='cache buffers chains'
and lower(sw.event) like '%latch%'
and bh.hladdr=lc.addr
)
select bh_lc.hladdr, bh_lc.tch, o.owner, o.object_name, o.object_type,
bh_lc.child#,
bh_lc.gets, bh_lc.misses, bh_lc.immediate_gets,
bh_lc.immediate_misses, spin_gets, sleeps
from
bh_lc, dba_objects o
where bh_lc.obj = o.data_object_id(+)
order by 1,2 desc
/
HLADDR TCH OWNER OBJECT_NAME OBJECT_TYP
---------------- --- ------- ------------------------------ ----------...
0000001029B81010 203 INV MTL_ONHAND_QUANTITIES_DETAIL TABLE...
0000001029B81010 81 ONT OE_ORDER_HOLDS_ALL_C3 INDEX...
...

Step 4: Review Execution Plan of Problem Statements
In step 2, SQL statements causing latch contention were identified. In step 3 objects involved in CBC latch contention were identified. Data from steps 2 and 3 can be correlated and matched. This correlation can be done by reviewing the execution plans of the SQL statements from step 2.
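
One convenient way to review those plans (10g onward) is dbms_xplan.display_cursor using the sql_id and child number from step 2. A simple sketch, where &sql_id and &child_no are the values you identified:

-- Sketch: display the execution plan for a SQL_ID found in step 2
select * from table(dbms_xplan.display_cursor('&sql_id', &child_no, 'TYPICAL'));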

How-to Resolve CBC Latch Contention?
1. Eliminate Full Table Scan on Small Tables
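The usual fix is to give the optimizer an indexed access path into the small, heavily probed inner table so that each outer row does not trigger a full scan. A hypothetical sketch; the index, table, and column names are only illustrative:

-- Hypothetical sketch: index the join column of the small, heavily probed table
create index mtl_parameters_n1 on mtl_parameters (organization_id);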

2. Eliminate Full Index Scan on Small Indices
Similar to full scans on small tables, full index scans on smaller indices also will result in many concurrent processes accessing a small set of buffers, in turn leading to CBC latch contention. Eliminate full index scans with proper indexing techniques such as careful selection of index columns and choosing proper index types (such as btree or bitmap indexes).

3. Eliminate Leaf Block Contention with Partitioning
Processes that insert rows with primary key values generated from a sequence will insert rows in the rightmost leaf block of the associated primary or unique index btree. All concurrent inserts will suffer from CBC latch contention accessing that leaf block. The root cause here is that synthetic keys are inserted in just a few leaf blocks of index.
To eliminate leaf block contention, you can hash partition the involved table on the primary key.
Also, if a table cannot be partitioned, you may be able to convert the primary key and unique key indices into hash-partitioned indices over the nonpartitioned table, as shown in the sketch below.
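
For example, a global hash-partitioned index can be created on the key column even when the underlying table stays nonpartitioned. A sketch with illustrative names and an arbitrary partition count:

-- Sketch: hash-partitioned global index over a nonpartitioned table
create index orders_pk_idx on orders (order_id)
  global partition by hash (order_id) partitions 16;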

4. Favor Hash Joins Over Tightly-Nested Loop Joins
If the number of rows from the outer row source in a nested loops (NL) join is very high, and the inner probed table is small, processes can contend for the latches that protect the small inner table. Sometimes you can resolve this contention by converting the nested loops join to a hash join.
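
If the optimizer does not choose a hash join on its own, you can test one with hints. A sketch; the table names and aliases are illustrative, and the hint only makes sense if the join really is the problem:

-- Sketch: force a hash join instead of a nested loops join for testing
select /*+ leading(big) use_hash(small) */ big.col1, small.col2
from   big_table big, small_table small
where  big.join_col = small.join_col;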

5. Tune Inefficient Indices
Table blocks are accessed for each row returned from an index scan. Excessive accesses to the table blocks can cause CBC latch contention. When too many blocks are scanned with index range scans (because of index inefficiency), concurrent processes will contend for the same set of buffers and latches, and that leads to CBC latch contention.

6. Reduce CPU Usage
If CPU usage is very high in the database server, that can also lead to latch contention. Ironically, high CPU usage is also a side effect of latch contention. Knowing whether high CPU usage is the cause or the result of contention can be a challenge. If you suspect latch contention from high CPU usage, you should reduce CPU usage and then measure for latch contention again.

What’s the Shared Pool Latch?
Shared pool latches are held while allocating or releasing space from the shared pool. Essentially, shared pool child latches protect changes to the shared pool free lists.
Prior to Oracle version 9i, the shared pool was just one big heap protected by just one shared pool latch, and it was a major contributor to latch contention. But from Oracle version 9i onward, the shared pool is split into multiple subheaps, and each subheap is protected by a shared pool child latch. This change has reduced the incidence of trouble with shared pool latch contention. Still, it is important to understand the problem of shared pool latch contention.

Structures in the Shared Pool
Each of the subheaps in the shared pool is further split into smaller subheaps. The fixed table x$ksmss externalizes shared pool subheaps.

Listing 12-20. Query to list shared pool subheaps
select ksmdsidx, count(*) , sum(ksmsslen) from x$ksmss
where ksmdsidx>0
group by ksmdsidx
order by 1
/
KSMDSIDX COUNT(*) SUM(KSMSSLEN)
---------- ---------- -------------
1 229 268435536
2 237 268435520
3 241 268435520
4 235 268435512
5 234 268439728
6 247 268435616
7 248 268435552

Listing 12-21. Query showing the number of shared pool latches
select count(*) from v$latch_children where name='shared pool'
COUNT(*)
----------
7

The number of subheaps (and so the number of shared pool latches) is dependent upon initialization factors such as shared pool size, granule size, a few underscore parameters, the Oracle Database version, and even a few OS parameters.

Shared Pool Free Lists
Oracle’s internal algorithm tries to find a chunk at least as big as the request, and the search for a free chunk is not cheap.
From Oracle Database release 9i onward, these free chunks of memory are grouped by their size and arranged as free lists from a hash bucket. This arrangement of free chunks grouped by size improves free space search.

Shared Pool Reserved Free Lists
A small percentage of the shared pool is set aside as the shared pool reserved area. Each subheap of the shared pool allocates a fraction of itself as part of the shared pool reserved area. And even though the shared pool reserved area has separate free lists, it shares the same latch with the non-reserved area.
If there is no free space in the main part of the shared pool, and if a requested chunk size is above a threshold determined by the _shared_pool_reserved_min_alloc initialization parameter, then the chunk is allocated from the shared pool reserved area.
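
The health of the reserved area is externalized in v$shared_pool_reserved; nonzero request_misses or request_failures suggest pressure on the reserved free lists. A simple sketch, not from the chapter:

-- Sketch: check pressure on the shared pool reserved area
select free_space, avg_free_size, free_count,
       request_misses, request_failures, last_failure_size
from   v$shared_pool_reserved;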

Causes of Shared Pool Latch Contention?
Shared pool latch contention is almost always due to shared pool fragmentation. If the shared pool is fragmented, then the shared pool latch must be held longer during the search for a free chunk.
1. Shared pool fragmentation
While a process is trying to allocate a chunk big enough to satisfy an allocation request, recreatable or freeable chunks can be flushed off the shared pool, and those chunks need to be coalesced to create a bigger chunk. That coalescing might involve walking the shared pool freelists many times and flushing chunks until suitable chunks can be found or created. This activity is performed under the protection of the shared pool child latch, leading to latch contention.

2. Incorrect configuration
If a pool is incorrectly or inadequately configured, then chunks may be allocated from the shared pool instead of, say, the large pool. For example, if the large pool is not configured, then RMAN communication buffers or parallel query slave communication buffers may
be allocated from the shared pool. Allocation of these buffers can deplete and fragment shared pool free space, leading to shared pool latch contention.

3. Reduced shared pool subheaps.
Multiple shared pool subheaps are very important to performance, since it is critical to have multiple shared pool latches. Each shared pool subheap is protected by a shared pool child latch. If the number of shared pool subheaps is reduced, then the number of shared pool child latches also will be reduced.

4. High CPU usage in the server.
If CPU usage is very high in the server, then the process holding a latch might not get enough CPU cycles and go to sleep prematurely before completing the code path. That sleep can result in the process holding the latch longer than is ideal. Fortunately, this problem has been solved in many platforms through the use of facilities to prevent a process from being preempted on the CPU. Such an option is referred to as the no_preempt option. But the no_preempt option is not available in all platforms, and without such an option it becomes important to reduce CPU usage to resolve latch contention.

5. Too-frequently flushed shared pool.
Flushing the shared pool too often can lead to latch contention. This typically happens in databases upgraded from Oracle Database version 8i or earlier. Flushing the shared pool involves flushing unpinned, recreatable chunks. Latches must be held during the time in which the flushing occurs. The result is increased activity on the shared pool latches.

How-to Analyze Shared Pool Latch Contention?
Step 1: Review Distribution Among Child Latches
Check whether the distribution among various child latches of the shared pool is uniform. If the distribution of gets and misses is skewed, meaning that a few child latches have much more demand than others in the shared pool, it may mean that those few subheaps are undergoing intense pressure.

Listing 12-24 shows a situation in which just a few shared-pool child latches are acquired much
more frequently than other child latches. Such a situation suggests higher activity in a few shared pool
subheaps.

Listing 12-24. Skew in gets among shared pool child latches
select child#, gets , misses, immediate_gets, immediate_misses
from v$latch_children where name='shared pool';
CHILD# GETS MISSES IMMEDIATE_GETS IMMEDIATE_MISSES
---------- ---------- ---------- -------------- ----------------
7 280 0 0 0
6 280 0 0 0
5 280 0 0 0
4 280 0 0 0
3 12907260 1035751 0 0
2 15153322 1276786 0 0
1 12826219 1154988 0 0
7 rows selected.

If the gets among children are skewed, then it is important to understand the type of chunks allocated in those subheaps.

You can delve into the problem in at least two ways. Statspack and AWR reports have a section that will show the increase/decrease in various shared pool area sizes.

Listing 12-25. Statspack output showing SGA differences
SGA breakdown difference for DB: APDB Instance: APDB3 Snaps: 242425 -242426
Pool Name Begin value End value % Diff
------ ------------------------------ ---------------- ---------------- -------
shared library cache 44,448,832 44,463,992 0.03
shared messages 2,080,000 2,080,000 0.00
shared miscellaneous 40,601,264 38,964,144 -4.03
shared parameters 188,648 188,648 0.00
shared partitioning d 228,600 228,600 0.00
shared pl/sql source 3,288 3,288 0.00
shared processes 13,760,000 13,760,000 0.00
shared qmps connections 4,842,200 4,842,200 0.00

Another method of researching the problem is to query the fixed table x$ksmss. It can be queried to see which area has grown bigger. The SQL in Listing 12-26 prints the top 20 areas that are consuming space in the shared pool.
You may need to run a query or report multiple times. In some cases, it may be necessary to monitor shared pool areas with a custom script to detect abnormal growth.

Listing 12-26. Query against x$ksmss to see which area is bigger
select ksmdsidx,ksmssnam, size_area from(
select ksmdsidx, ksmssnam,sum(ksmsslen) size_area from x$ksmss
where ksmssnam!='free memory'
group by ksmdsidx, ksmssnam
order by 3 desc
)
Where rownum<21; 

Step 2: Inspect Shared Pool Fragmentation
In most cases, shared pool latch contention is caused by shared pool free list fragmentation.
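
One way to inspect fragmentation (a common DBA sketch, not from the chapter) is to summarize free chunks per subheap from x$ksmsp. Be careful: querying x$ksmsp on a busy system is itself expensive and takes shared pool latches, so run it sparingly and as SYS.

-- Sketch (run as SYS, sparingly): free chunk counts and sizes per shared pool subheap
select ksmchidx subheap, ksmchcls class, count(*) chunks,
       sum(ksmchsiz) total_bytes, max(ksmchsiz) largest_chunk
from   x$ksmsp
where  ksmchcls in ('free', 'R-free')
group  by ksmchidx, ksmchcls
order  by 1, 2;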

Step 3: Review Objects Causing Flushing
The fixed table x$ksmlru keeps track of the most recent flushes of the shared pool. If there is no chunk big enough to accommodate an allocation request, then chunks that can be recreated or freed might need to be flushed. This object flushing also will increase the activity on the shared pool latches. In Listing 12-30, column KSMLRNUM indicates the number of items flushed to accommodate an allocation request. KSMLRCOM shows the allocation comment for the allocation request. Reviewing that column will provide valuable clues about why there is higher activity on shared pool latches.

 Listing 12-30. Query to identify objects that are flushing the shared pool
Set lines 160 pages 100
Spool ksmlru.lst
Select * from x$ksmlru order by ksmlrnum;
Spool off

Interestingly, querying x$ksmlru will remove rows from that fixed table. Thus, it is better to spool the output to a file.

Step 4: Identify SQL Statements Using Literal Values
Another common reason for shared pool fragmentation, probably the most common, is repeated hard parsing of SQL statements due to the use of literal values in those statements. However, most of these SQL statements will have the same execution plan, and hence the same plan hash value. The SQL in Listing 12-31 uses that fact to identify SQL statements that do not use bind variables.

Listing 12-31. Query to see the top 20 SQL statements that use literal values
select * from (
select plan_hash_value, count(distinct(hash_value)), sum(executions),
sum(parse_calls)
from v$sql
group by plan_hash_value
having count(distinct(hash_value)) > 10
order by 2 desc
) where rownum<21;
PLAN_HASH_VALUE COUNT(DISTINCT(HASH_VALUE)) SUM(EXECUTIONS) SUM(PARSE_CALLS)
--------------- --------------------------- --------------- ----------------
511709263 4939 4939 4939
1438036477 4907 4907 4907
3532438726 4814 4814 4814
...

How-to Resolve Shared Pool Latch Contention?
1. Avoid Unnecessary Heaps
Configure the large pool (and other appropriate pools) so that allocations such as RMAN communication buffers and parallel query message buffers do not have to come from the shared pool.
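
For example, if RMAN or parallel query buffers are landing in the shared pool because no large pool is configured, sizing a large pool moves those allocations out. A sketch; the 256M value is arbitrary and must be sized for your own system:

-- Sketch: configure a large pool so RMAN/PQ buffers are not carved from the shared pool
alter system set large_pool_size = 256M scope=both;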

2. Avoid and Reduce Fragmentation by Sharing SQL Statements
SQL statements that are executed very frequently with literal values must be converted to use bind variables.
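
For illustration, the difference is simply whether the literal appears in the SQL text. A hypothetical sketch; the table and column names are made up:

-- Hard parsed repeatedly: every distinct literal creates a new parent cursor
select order_total from orders where order_id = 1234;

-- Shared: one cursor reused across executions with different bind values
variable order_id number
exec :order_id := 1234;
select order_total from orders where order_id = :order_id;

As a stopgap when the application cannot be changed, the cursor_sharing parameter can be set to FORCE, but fixing the SQL is the better long-term answer.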

3. Avoid Setting _kghdsidx_count to 1
If the _kghdsidx_count parameter is set to 1, Oracle allocates just one shared pool subheap protected by just one shared pool latch. This means that the shared pool latch becomes a solitaire latch, and that can lead to shared pool latch contention.

4. Avoid Flushing the Shared Pool
You can flush the shared pool with the alter system flush shared_pool command. Flushing a shared pool throws away recreatable and freeable chunks. These chunks must be added to shared pool free lists under the protection of a shared pool child latch. Flushing the shared pool will increase activity against shared pool latches artificially and can lead to shared pool latch contention.
Sometimes flushing the shared pool can improve performance. However, if you find that flushing the shared pool improves performance, the correct action is not to continue to flush, but to find the root cause of the fragmentation and resolve that.

5. Avoid Shared Pool Reserved Free List Fragmentation
The shared pool reserved free list is searched only if there is not enough contiguous space in the main free list, and if an allocation request is larger than the value specified by the _shared_pool_reserved_min_alloc parameter. Up until version 11g, this parameter defaults to 4200. The
default value is sufficient in many cases. Setting the value of the parameter lower can lead to fragmentation in the reserved pool, in turn leading to latch contention.

What’s the Library Cache Latch?
The library cache holds parsed representations of SQL statements, execution heaps of procedures, functions, and packages, and many other items, such as table definitions.

The library cache is organized as an array of hash buckets. Hash chains are hanging from these hash buckets. Library cache objects are attached to these hash chains, and these objects point to various heaps in the shared pool. Changes and inspections to these library cache hash chains are protected by library cache latches. The library cache latch is a non-solitaire latch, and there are many child latches protecting changes to the hash chains.

The parsing step involves too many operations to discuss in detail, but the following is a subset of what happens:
1. First, a unique hash value is created from the SQL text, or from the text of the library cache object.
2. Using that hash value, a library cache bucket number is derived.
3. That library cache bucket is protected by a library cache child latch. The parsing process holds that latch while searching the hash chain for a match on the hash value (together with a few other criteria).
4. If a match is found, then the parsed representation of the SQL statement (or other object) exists, and no additional hard parsing is needed.
5. If there is no match for a hash value, then an additional hard parsing step must be completed. A new library cache object must be added to the hash chain with pointers to various heaps for the object being parsed. Any change to library cache hash chains is done under the protection of a library cache child latch. Hard parsing can lead to contention for those child latches.

Causes of Library Cache Latch Contention?
Library cache latch contention is almost always caused by excessive parsing or by nonsharable child cursors. Use of literal values in SQL is the most common cause of library cache and shared pool latch contention.
1. Excessive parsing:
Caused by literal values in SQL
2. Side effects of hard parsing:
If there is no free space available to accommodate an incoming allocation request, existing cursor heaps might need to be flushed from the shared pool/library cache. This increases activity against shared pool and library cache latches
3. Creation of excessive child cursors:
That lack of sharing will result in longer hash chains, with many objects hanging from any one chain. Processes searching through these longer hash chains increase activity against library cache child latches.
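
To see which parent cursors have grown an unusual number of children, and why they are not being shared, you can start from v$sqlarea and drill into v$sql_shared_cursor. A sketch, not from the chapter; the threshold of 20 is arbitrary:

-- Sketch: parent cursors with the most child cursors
select sql_id, version_count, executions, parse_calls
from   v$sqlarea
where  version_count > 20
order  by version_count desc;

-- For a given sql_id, v$sql_shared_cursor shows the mismatch reasons ('Y' columns)
select * from v$sql_shared_cursor where sql_id = '&sql_id';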

How-to Analyze Library Cache Latch Contention?
The process for analyzing library cache latch contention is almost identical to analyzing shared pool latch contention.

How-to Resolve Library Cache Latch Contention?
1. Use Bind Variables
It’s especially important to use bind variables for statements that have a high number of executions.

2. Avoid Flushing the Shared Pool
Unpinned recreatable chunks from the library cache are thrown away during a shared pool flush operation. Flushing artificially induces a higher number of hard parse calls.

3. Adjust the session_cached_cursors Parameter
If a cursor is found in the session cursor cache, then the parsing step is cheaper than otherwise. Increasing this parameter comes at a cost, since cursors that are in the session cursor cache are pinned in the shared pool, and this can result in higher space usage in the shared pool.
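
To judge whether increasing the parameter is worthwhile, compare session cursor cache hits with total parse calls; if the hit count is low relative to parses, a bigger cache may help. A sketch, not from the chapter:

-- Sketch: how often the session cursor cache is satisfying parse calls
select name, value
from   v$sysstat
where  name in ('session cursor cache hits', 'parse count (total)');

-- Example change (static at instance level; takes effect after restart)
alter system set session_cached_cursors = 100 scope=spfile;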

4. Adjust the cursor_space_for_time Parameter to True
If there is a cursor open pointing to a cursor heap, then that cursor heap is less likely to be flushed out. Retaining the cursor heap improves the performance of subsequent executions of that SQL statement at the cost of increased shared pool usage.
As of Oracle Database version 10.2.0.5 and 11.1.0.7, this parameter is deprecated.

5. Control and Limit Histograms
If your application is collecting histograms on all columns, then rethink that practice and narrow the scope of collection to only those columns that matter. This avoids generating unnecessary plan variations, and thus extra child cursors, for the same SQL.
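
With dbms_stats you can default to no histograms and collect them only on the columns where skew actually matters. A sketch; the owner, table, and column names are illustrative:

-- Sketch: no histograms by default, a histogram only on the skewed STATUS column
begin
  dbms_stats.gather_table_stats(
    ownname    => 'APPS',
    tabname    => 'ORDERS',
    method_opt => 'FOR ALL COLUMNS SIZE 1 FOR COLUMNS STATUS SIZE 254');
end;
/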

Library Cache Latches and Mutexes
From Oracle version 10g onward, most library cache latches have been replaced by mutexes.
Library cache objects are locked and pinned under the protection of a mutex associated with the parent cursor. Still, library cache hash chains are protected by library cache latches, and having longer hash chains can result in library cache latch contention.
Version 11g improved this area further. Each bucket in the library cache has its own mutex, and scanning through the library cache hash chain is also performed under the protection of a mutex governing that specific bucket.

What’s Enqueue Hash Chains Latch?
Enqueue hash chains latches protect the structures handling enqueues, commonly known as locks. A higher amount of enqueue activity can lead to enqueue hash chains latch contention.

For example, if a process is trying to acquire a table level lock (lock type TM), then the process allocates a resource structure having
the combination of that table’s object_id and lock type as a resource key.

Following is a summary of the operational details of the enqueue latching mechanism:
1. A hashing function is applied on a string formed from the combination of resource identifier and lock type. For example, if the table scott.emp is to be locked, and that table has an object ID of 6509, and the lock type is TM (table lock), then a hashing function is applied over the string “(TM, 6509)”.
2. The result of the hashing function uniquely identifies a hash bucket to which the resource will always be chained. Changes to or inspection of the hash chain are protected by the enqueue hash chains child latches. So, while holding an enqueue hash chains child latch, that hash chain is searched to see if that resource is already allocated.
3. If the resource exists in the hash chain, then the process tries to see whether the resource is locked in a compatible mode with the lock request mode (assuming that there are no other waiters). If the resource is available or if the resource is locked in compatible mode, then a locking structure is added to the holders’ queue.
4. If there is another process waiting to lock the resource, then a locking structure is added to the waiters’ queue, and the process waits until the resource is available.
5. If the resource does not exist in the hash chain at all, then a new resource structure is added to the enqueue hash chain. A locking structure is also added to the holders’ queue.
6. Changes to the enqueue hash chain are protected by enqueue hash chains latches.

If there is an enormous amount of locking activity, typically in the case of high-end OLTP applications, it is conceivable that there will be a higher amount of activity against enqueue structures and enqueue hash chains child latches. This leads to enqueue hash chains latch contention.

Causes of Enqueue Hash Chains Latch Contention?
1. Excessive locking activity:
OLTP applications tend to have thousands of short transactions per second. Thousands of such short transactions will result in excessive activity against resource structures, lock structures, and enqueue hash chains latches.

2. Deadlock detection algorithm:
Oracle’s deadlock detection algorithm is invoked periodically, and it holds a parent enqueue hash chains latch (thereby holding all child latches) while searching for deadlocks between processes. If there are numerous processes, locks, and resource combinations in the database concurrently, then the deadlock detection algorithm can take long enough to result in contention for enqueue hash chains latches.

How-to Analyze Enqueue Hash Chains Latch Contention?
Step 1: Review Distribution Among Child Latches.

Listing 12-35. Skew among enqueue hash chains child latches
select * from (
select child#, gets, misses, sleeps
from v$latch_children where name like 'enqueue hash%'
order by gets desc )
where rownum <=20;
CHILD# GETS MISSES SLEEPS
---------- ---------- ---------- ----------
28 12297672 211726 0
20 284429 2151 1
...

Step 2: Identify the Enqueue Causing Latch Contention
An appropriate action is to enable SQL Trace at level 8 (with waits) on a process suffering from latch contention. You can also get similar information from the wait event columns in v$session; for an enqueue wait, p2 is the object_id.
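
From 10g onward, dbms_monitor is the supported way to switch on level 8 style tracing (waits included) for another session. A sketch; &sid and &serial are the sid and serial# of the session you are investigating:

-- Sketch: trace another session with wait events included (equivalent of level 8)
exec dbms_monitor.session_trace_enable(session_id => &sid, serial_num => &serial, waits => TRUE, binds => FALSE);
-- ... reproduce the problem, then switch tracing off:
exec dbms_monitor.session_trace_disable(session_id => &sid, serial_num => &serial);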

Listing 12-36. An example of SQL trace output lines (version 9.2.0.8)
WAIT #1: nam='enqueue' ela= 301 p1=1414332420 p2=9086 p3=0
WAIT #1: nam='latch free' ela= 1 p1=-1437405580 p2=19 p3=0
WAIT #1: nam='latch free' ela= 1 p1=-1437405580 p2=19 p3=1
WAIT #1: nam='latch free' ela= 1 p1=-1437405580 p2=19 p3=2

From the analysis in this section we can safely conclude that latches protecting TM enqueues are causing performance issues. We also know that the object_id causing trouble is 9086. Querying dba_objects for that ID will reveal the object_name and owner. You need to repeat this exercise several times to confirm your analysis.
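
The dba_objects lookup mentioned above is straightforward; 9086 is the object_id seen in the trace lines.

-- Look up the object behind the TM enqueue seen in the trace (p2 = object_id)
select owner, object_name, object_type
from   dba_objects
where  object_id = 9086;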

Step 3: Identify Lock Types with High Gets

Listing 12-39. Top 20 lock types by activity
-- v$enqueue_statistics top 20 only..
select * from (
select eq_name, eq_type, total_req#, total_wait#, succ_req#, failed_req#
from v$enqueue_statistics
order by total_Req# desc)
where rownum<21
/
EQ_NAME EQ TOTAL_REQ# TOTAL_WAIT# SUCC_REQ# FAILED_REQ#
------------------------- -- ---------- ----------- ---------- -----------
DML TM 18554 17630 18155 400
Session Migration SE 17709 0 17725 0
...

How-to Resolve Enqueue Hash Chains Latch Contention?
1. Avoid Excessive Short Transactions
If possible, modify the code to avoid or reduce this type of activity, and try grouping small transactions into bigger ones.
Note that worrying about short transactions only matters if the lock type behind the contention is TM or TX.

2. Disable Table Level Locks
In some cases, it may be beneficial to avoid table-level locks

Listing 12-40. Disabling table level lock
Alter table emp disable table lock;
alter table backup.emp add (n2 number)
*
ERROR at line 1:
ORA-00069: cannot acquire lock -- table locks disabled for EMP
Alter table emp enable table lock;

Disabling table level locks means that there can be no DDL statements on the target table. But in a production OLTP database suffering from latch contention, not being able to issue DDLs against a table is a very small price to pay to resolve the problem. Of course, you can always enable locks during maintenance on that table

3. Reduce or Avoid Activity Against the Lock Type Causing Contention
Reduce or avoid activity against the lock type underlying the contention. For example, if the lock type is DX, try to reduce distributed transactions by redesigning the application architecture.

Advanced Topics for Latch Contention Problems
There are a few exceptions and bugs that can cause excessive latch contention. This section tries to list them, but for informational and debugging purposes only.

v$latch_parent
The v$latch_parent view maintains statistics at the parent latch level. In some cases, if there are many child latches, then the RDBMS code will need to acquire all the child latches. For example, deadlock detection code holds the parent latch of enqueue hash chains during execution of the deadlock detection algorithm, and no child latch can be acquired by any other process until that parent latch is released.

spin_count
The parameter spin_count determines how long a process will spin on the CPU trying to acquire a latch before going to sleep. This parameter affects the behavior of all latches. Adjusting spin_count is rarely a good solution; in most cases it is unnecessary, and there are only a few situations in which it might need to be adjusted.

_latch_classes and _latch_class_N
If it is not possible to resolve the root cause of the latch contention, then from Oracle Database version 9iR2 onward the _latch_classes and _latch_class_n parameters can be used to modify the spin count for just a few latches. For example, if you want to increase the spin count for library cache latches, you can use the method demonstrated in Listing 12-41.
In the listing, which is from Oracle Database version 10.2.0.4, the spin count for _latch_class_1 is set to 18000, and the _latch_classes parameter specifies that latch number 214 belongs to latch class 1. Latch number 214 is the library cache latch.

Listing 12-41. _latch_classes and _latch_class_1 parameter changes for library cache latches
Select latch#, name from v$latch where name='library cache';
LATCH# NAME
---------- --------------------------------------------------
214 library cache
*._latch_class_1=18000
*._latch_classes='214:1'

_latch_wait_posting and _enable_reliable_latch_waits
The parameter _latch_wait_posting (obsolete as of 11g) controls how long-wait latches, such as library cache latches, are handled. By default, this parameter is set to 1. If a latch is not available after a certain period, the process sleeps until posted by the latch holder.
If _enable_reliable_latch_waits is set to true, then all latch posts are considered reliable. A process waiting for a latch will go to sleep and not wake up until posted. The result is reduced CPU usage, tending to reduce the symptoms of latch contention.

Summary
As always, it is best to understand the root cause of a problem, in this case a latch contention problem, and to resolve that root cause. It is
always better to resolve a root cause than to apply a band-aid.

I read this chapter at least 3 times. I believe it’s very helpful for understanding what’s going on in Oracle, even if sometimes we cannot solve the problem from the DB side alone. This is the last chapter I will write about # Expert Oracle Practices #.
