AutoUpgrade may run into a timeout when GI is lower than 19.10.0

The other day, I received a question from a colleague about the risk of having GI and database not on the same RU. A long while ago I blogged about it: we recommend that you keep them in sync, but you don’t have to. At the same time, I received a wish from Ernst Leber to post something on the blog about an issue he and his colleague hit when upgrading to 19c in a RAC environment. AutoUpgrade hung. He titled his email “not an AutoUpgrade problem!”, and I agree. Still, the issue they saw is worth writing about. And thanks, Ernst!

What’s the issue?

Ernst and his colleague attempted a RAC database upgrade with AutoUpgrade. But it hung. So what happened?

Actually, they ran into a known issue: BUG 29580769 – LNX-193-AGENT: “SRVCTL MODIFY ASM -COUNT 3” COULD HANG AS LONG AS 11 MINS, AND CRS ORAAGENT.BIN WOULD COREDUMP AT THE SAME TIME.

Luckily they quickly found the MOS note giving advice on this issue:

The issue is that srvctl modify asm -count all hangs for 10 minutes. The instance connections aren’t closed, and as a result, the shutdown hangs. The workaround is to kill it with a shutdown abort in SQL*Plus.
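
If you hit the hang, the kill could look like this (a minimal sketch; the SID orcl1 is just a placeholder for the hanging instance, and you connect to it locally as SYSDBA):

$ export ORACLE_SID=orcl1     # placeholder: the SID of the hanging instance on this node
$ sqlplus / as sysdba
SQL> shutdown abort
SQL> exit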

When is this fixed?

For most of you, this won’t be an issue since it has been fixed with the 19.10.0 Release Update. The fix is included in 19.10.0 and all following Grid Infrastructure RUs. And this is the reason for my lengthy opening paragraph. No blame to anybody, but if Grid Infrastructure had been patched to 19.10.0 or 19.11.0 (or newer, in case you read this article after mid-July 2021), the issue wouldn’t have happened. Just saying: this is why you may want to consider patching GI on the same schedule as your database.

What is the workaround?

Now back to Ernst and his recommendation for the database upgrade in case you have GI 19.9.0 or lower. This is actually the same recommendation you can read in MOS Note 2645911.1; a small scripted sketch follows after the list:

  1. srvctl stop instance -db orcl -node node1
  2. Confirm that the instance on node1 is stopped and that the command from step 1 has finished
  3. srvctl stop instance -db orcl -node node2
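
If you want to script this, it could look like the following (just a sketch; the database name orcl and the node names node1/node2 are taken from the example above, and srvctl status database is only there to confirm each instance really is down before you continue):

$ srvctl stop instance -db orcl -node node1
$ srvctl status database -db orcl     # the instance on node1 must report "is not running"
$ srvctl stop instance -db orcl -node node2
$ srvctl status database -db orcl     # now the second instance must be down as well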

Ok, so far so good. But why does this appear on the Upgrade Blog?

In this case, AutoUpgrade will start with the analyze phase, followed by the fixups phase. It will collect stats and do all the other things AutoUpgrade does to make your life easier. But when it enters the drain phase, it will give you an error as it catches the timeout. AutoUpgrade initiates the srvctl commands, but as the shutdown hangs, AutoUpgrade won’t progress for 10 minutes either.
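
Just as a reminder of that flow, a minimal config file and invocation could look like this (all paths, the SID orcl and the config file name UP19.cfg are placeholders for your environment; the deploy job runs through the analyze, fixups and drain stages on its own):

$ cat UP19.cfg
global.autoupg_log_dir=/home/oracle/autoupgrade
upg1.sid=orcl
upg1.source_home=/u01/app/oracle/product/12.2.0.1/dbhome_1
upg1.target_home=/u01/app/oracle/product/19.0.0/dbhome_1
upg1.log_dir=/home/oracle/autoupgrade/orcl

$ java -jar autoupgrade.jar -config UP19.cfg -mode deploy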

And in this case you may sit around and wonder. Or blame AutoUpgrade for doing nothing and just giving you a timeout error.

So the ultimate workaround is to have Grid Infrastructure patched to 19.10.0 or 19.11.0, or to one of the matching RURs which contain the fix as well.
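
If you are not sure which RU your Grid Infrastructure home is on, a quick check could look like this (the GI home path is just a placeholder for your environment):

$ export ORACLE_HOME=/u01/app/19.0.0/grid     # placeholder: your Grid Infrastructure home
$ $ORACLE_HOME/OPatch/opatch lspatches

The Release Update (or the matching RUR) listed there should be 19.10.0 or newer.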

Additional Information – Data Guard

Ernst and his colleague sent me another question: does this issue affect a Data Guard switchover as well? I wasn’t sure, so they tried it out themselves. And unfortunately, the answer is “yes”. So you need to patch GI to 19.10.0 or higher to avoid such issues, too.

8 minutes into the switchover, they received this message from the Broker:

DGMGRL> switchover to to19DG
Performing switchover NOW, please wait...
New primary database "to19dg" is opening...
Oracle Clusterware is restarting database "to19" ...
Unable to connect to database using to19
ORA-12514: TNS:listener does not currently know of service requested in connect descriptor

Failed.

Ensure Oracle Clusterware successfully restarted database "to19" before proceeding

You need to patch.

–Mike
