BUG: parallel transaction recovery

Recently, one of our DB got crashed when SMON was recovering a killed transaction using fast-start parallel rollback. Later I found there were several serious bugs related to parallel transaction rollback on metalink. Some of them are “claimed” to be fixed and some are not. In our case, oracle support said we hit a bug which should be fixed in our binary. Nevertheless, the db crashed in real case. As I said, it’s “claimed”.

In OLTP system like this one, applications don’t run large transactions normally. For batch tasks running by DBA, most of time it doesn’t matter how quickly it get recovered, such as this create temp table case.  I don’t come to a real case that we really need fast transaction recovery. So my suggestion is to disable this feature  by setting fast_start_parallel_rollback to false. This is also the workaround for many of following bugs:

Bug Fixed(Claimed) Description
9233544 ORA-600 [15709] during parallel rollback
7293156, ORA-600 [2023] by Parallel Transaction Rollback when applying Multi-block undo Head-piece / Tail-piece
6954722,,, OERI[15709] in SMON in RAC / instance crash
2186174, Excessive REDO generation from parallel rollback
1418532 Parallel transaction recovery may be abandoned
1294783, Parallel transaction recovery slave may dump (in KTUGRU)

In addition, there are some described issues in oracle documents, such as 100%CPU, hang, ORA-600 or crash, etc:

Parallel Rollback may hang database, Parallel query servers get 100% cpu [ID 144332.1]
Database Hangs Because SMON is taking 100% CPU doing transaction recovery [ID 414242.1]
SMON may fail with ORA-00600 [15709] Errors Crashing the Instance [ID 736348.1]
Bug 9233544 – ORA-600 [15709] during parallel rollback [ID 9233544.8]
Instance Crashing With Ora-600 [15789] And Ora-474 [ID 1094645.1]
SMON Gets Terminated with ORA-600 [KDOLKR-1] [ID 419371.1]
Bug 7293156 – ORA-600 [2023] by Parallel Transaction Rollback when applying Multi-block undo Head-piece / Tail-piece [ID 7293156.8]

I did some tests to get better understanding  when a transaction is rolling back by SMON or itself:

  • 1.”Crtl+c” a transaction:
  • No new record in v$FAST_START_TRANSACTIONS (when smon rollback sessions, new records will be added).
    I saw v$transaction.USED_UREC was decreasing which means the session itself was rolling back.

  • 2.Kill a transaction’s server process when fast_start_parallel_rollback is LOW (default)
  • After several seconds, I saw a new record in v$FAST_START_TRANSACTIONS which means the transaction was rolling back by SMON.
    By “select * from V$FAST_START_SERVERS”, I saw oracle started 16 processes but only 1 was “RECOVERING”, the others were”IDLE” in my test case.
    At the same time, v$FAST_START_TRANSACTIONS.RCVSERVERS was 1, while once the recovery was done. It became 16.
    I saw “SMON: Parallel transaction recovery tried” in alert log.
    No record in v$transaction.USED_UREC as the session didn’t exist anymore.

  • 3.When setting the fast_start_parallel_rollback to FALSE and kill a transaction’s server process
  • I saw a new record in v$FAST_START_TRANSACTIONS.
    I saw “select * from V$FAST_START_SERVERS” is always no record and v$FAST_START_TRANSACTIONS.RCVSERVERS is 0 from start to end.
    NO “SMON: Parallel transaction recovery tried” added to alert log.
    It clearly showed that it was recovered by SMON but not parallel transaction recovery.

  • All in all, when a transaction is “ctrl_c”, the transaction will recover by its own session, not SMON. When it rollback by itself, it’s not related to the parameter fast_start_parallel_rollback.
    When a transaction is terminated by killing its process or system crash, the transaction will be rollback by SMON.
    Whether SMON recovers the transaction by parallel processes or not depends on the setting of fast_start_parallel_rollback. By default it’s LOW, meaning 2*CPU, and it limited by parallel_max_servers as well. So the recovery process is min(parallel_max_servers, 2*CPU) by default.
  • Why not DBA wait a “ctrl+c” session to rollback by itself, but kill the process instead? Because sometimes one don’t know how long the rollback will take and suspect it’s hang or not in progress. To address this puzzle, I create a script to check the rollback progress, either by SMON or the session itself: How-to check rollback transactions and give an ETA

    A few notes:

  • SMON will not recover a ended session immediately. It has its own cycle. You may need to wait a while to observe that. In my test, it’s less than 1 minutes.
  • Parameter recovery_parallelism is for “parallel recovery” which has different effects from fast_start_parallel_rollback. Essentially, fast-start parallel rollback is to transaction recovery what parallel recovery is to cache recovery. See more Cache Recovery & Transaction Recovery
  • If you update parameter fast_start_parallel_rollback during a rollback, the rollback will START OVER. See more Fast Parallel Rollback and Large Transactions
  • Advertisements

    About Alex Zeng
    I would be very happy if this blog can help you. I appreciate every honest comments. Please forgive me if I'm too busy to reply your comments in time.

    Leave a Reply

    Fill in your details below or click an icon to log in:

    WordPress.com Logo

    You are commenting using your WordPress.com account. Log Out /  Change )

    Google+ photo

    You are commenting using your Google+ account. Log Out /  Change )

    Twitter picture

    You are commenting using your Twitter account. Log Out /  Change )

    Facebook photo

    You are commenting using your Facebook account. Log Out /  Change )


    Connecting to %s

    %d bloggers like this: