How CPU IO Wait Became an ASM Issue

You know how it goes: you start the day with a simple plan, but the universe rewrites the script, and before you know it, daylight turns into overtime darkness.
It started quietly. No database alerts. No instance crash. No complaints from application users — yet.

It was just that the morning reports made a few systems look suspicious on CPU: they seemed abnormally utilized for that time of day.

CPU IO Wait

A high wa value indicates that the CPU is idle but blocked: it is waiting for slower storage devices or network I/O to finish rather than being overworked by computational tasks.
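As a quick illustration, the %iowait figure is derived from the kernel's per-CPU tick counters in /proc/stat, the same source iostat reads. A minimal sketch (the sample line is made up to roughly match the iostat numbers shown later, not taken from the incident host):

```python
# Compute the iowait percentage from a /proc/stat "cpu" line.
# Fields after "cpu": user nice system idle iowait irq softirq steal guest guest_nice

def iowait_pct(cpu_line: str) -> float:
    fields = [int(v) for v in cpu_line.split()[1:]]
    iowait = fields[4]          # 5th counter is time spent waiting for I/O
    total = sum(fields)
    return 100.0 * iowait / total

# Illustrative sample, scaled to mimic the ~68% iowait seen in iostat below
sample = "cpu 602 0 439 2155 6805 0 0 0 0 0"
print(f"{iowait_pct(sample):.1f}%")
```

On a live host you would read the first line of /proc/stat twice and diff the counters, since they are cumulative since boot.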

Now is probably a good time to introduce a bit of architecture: Oracle 19c SE2 HA, two hosts, two storage arrays accessed via iSCSI, managed by ASM (with ASMLib) in NORMAL redundancy. One of the arrays presents its resources over multiple paths. The graph above comes from the node on which PROD is running.

First Suspicious Symptom

Let’s start digging into the I/O layer to find out if the problem is there:

$ iostat -xz 1
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.02    0.00    4.39   68.05    0.00   21.55

Device    r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda      0.00    1.00      0.00      4.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     4.00   1.00   0.10
dm-3     0.00    1.00      0.00      4.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     4.00   1.00   0.10
sdb      2.00   40.00      1.00     68.50     0.00     3.00   0.00   6.98    0.50    0.35   0.01     0.50     1.71   0.81   3.40
sdc    242.00   39.00   4112.00     68.00     0.00     3.00   0.00   7.14    0.44    0.36   0.12    16.99     1.74   0.90  25.40

This is interesting: overall the I/O looks healthy, no particular disk has a high r_await or w_await, yet the CPU is obviously suffering from I/O wait.
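That sanity check is easy to automate. A small sketch that parses iostat -x device lines (column layout as in the output above; the 10 ms threshold is my own arbitrary choice) and flags disks with a high await:

```python
# Flag devices whose read or write latency (await, in ms) exceeds a threshold.
# Column names follow the iostat -xz header shown above.

HEADER = ("Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm "
          "r_await w_await aqu-sz rareq-sz wareq-sz svctm %util").split()

def slow_devices(lines, threshold_ms=10.0):
    slow = []
    for line in lines:
        row = dict(zip(HEADER, line.split()))
        if float(row["r_await"]) > threshold_ms or float(row["w_await"]) > threshold_ms:
            slow.append(row["Device"])
    return slow

sample = [
    "sdb 2.00 40.00 1.00 68.50 0.00 3.00 0.00 6.98 0.50 0.35 0.01 0.50 1.71 0.81 3.40",
    "sdc 242.00 39.00 4112.00 68.00 0.00 3.00 0.00 7.14 0.44 0.36 0.12 16.99 1.74 0.90 25.40",
]
print(slow_devices(sample))  # [] -> the disks that ARE responding look healthy
```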

Digging further — with iSCSI, the wait can be generated anywhere along the path from user space to storage: User space → Kernel → TCP → Storage.

$ ps -eo state,pid,comm,wchan:32 | grep ^D

D  876095 oracleasm-read- blkdev_direct_IO_simple

Process stuck in D state. Uninterruptible sleep. ASM is waiting for direct I/O — reads/writes that bypass the page cache and go straight to the block device. The process is blocked inside that kernel function, waiting for a block I/O operation to complete.
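The same hunt can be done without ps. A sketch that walks /proc directly and lists tasks in uninterruptible sleep together with the kernel function they are blocked in (Linux only):

```python
# List tasks in D state (uninterruptible sleep) with their wchan,
# roughly equivalent to: ps -eo state,pid,comm,wchan | grep ^D
import os

def d_state_tasks():
    blocked = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                stat = f.read()
            # comm may contain spaces, so locate it by its parentheses
            comm = stat[stat.index("(") + 1 : stat.rindex(")")]
            state = stat[stat.rindex(")") + 2]
            if state == "D":
                with open(f"/proc/{pid}/wchan") as f:
                    wchan = f.read().strip() or "-"
                blocked.append((int(pid), comm, wchan))
        except OSError:
            continue  # process exited (or access denied) while scanning
    return blocked

for pid, comm, wchan in d_state_tasks():
    print(f"D {pid:>8} {comm:<16} {wchan}")
```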

A few more checks:

$ multipath -ll

Hmmm no output.
No Ctrl+C.
Just a frozen terminal.

Why did multipath -ll hang instead of just erroring out?

$ lsblk
NAME        TYPE   SIZE FSTYPE      MOUNTPOINT
sda         disk   250G
├─sda1      part   600M vfat        /boot/efi
├─sda2      part     1G xfs         /boot
└─sda3      part 248.4G LVM2_member
  ├─ol-root lvm     15G xfs         /
  ├─ol-swap lvm     16G swap        [SWAP]
  ├─ol-home lvm     20G xfs         /home
  └─ol-u01  lvm  197.4G xfs         /u01
sdb         disk   150G
└─sdb1      part   150G oracleasm
sdc         disk  1000G
└─sdc1      part  1000G oracleasm
sdd         disk   150G
└─sdd1      part   150G oracleasm
sde         disk  1000G
└─sde1      part  1000G oracleasm
sdf         disk   150G
└─sdf1      part   150G oracleasm
sdg         disk  1000G
└─sdg1      part  1000G oracleasm
sr0         rom   1024M

I noticed that the list of devices in the iostat output was too short. Some devices were missing, and multipath wasn’t doing its job:

$ systemctl status multipathd.service
 multipathd.service - Device-Mapper Multipath Device Controller
   Loaded: loaded (/usr/lib/systemd/system/multipathd.service; disabled; vendor preset: enabled)
   Active: inactive (dead)

So multipathd was dead, and what’s more, it was disabled, meaning the system had been running in single-path mode since the last reboot.
Because the daemon was down, multipath -ll hung: it tried to connect to the daemon’s socket, which no longer existed, and its scan of block devices through ioctl hit the exact same stale iSCSI paths that already had oracleasm stuck in D state. So multipath -ll joined the queue of processes waiting for I/O that was never coming back. Welcome to the club.
But something must have changed recently, because suddenly some devices had disappeared:

$ dmesg -T | egrep -i "reject|I/O error|offline|timed out"

The output showed several devices going offline; here’s a snippet:

session recovery timed out after 120 secs
sd 34:0:0:2: rejecting I/O to offline device
blk_update_request: I/O error, dev sdd
blk_update_request: I/O error, dev sde
Buffer I/O error on dev sde
rejecting I/O to offline device

Those 120 seconds are not random: that’s the default value of node.session.timeo.replacement_timeout in /etc/iscsi/iscsid.conf. It defines how long the iSCSI layer waits for a session to recover before giving up and marking the device offline, which also explains exactly how long the D-state processes were stuck before the kernel started returning I/O errors.
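To check the effective value, a quick grep does the job, or a small sketch like this (it parses a string here so the example is self-contained; point it at /etc/iscsi/iscsid.conf on a real host):

```python
# Extract node.session.timeo.replacement_timeout from iscsid.conf text,
# falling back to the documented default of 120 seconds.
import re

def replacement_timeout(conf_text: str, default: int = 120) -> int:
    for line in conf_text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # commented-out settings don't count
        m = re.match(r"node\.session\.timeo\.replacement_timeout\s*=\s*(\d+)", line)
        if m:
            return int(m.group(1))
    return default

conf = """
# node.session.timeo.replacement_timeout = 30
node.session.timeo.replacement_timeout = 120
"""
print(replacement_timeout(conf))  # 120
```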

After contacting the infrastructure team, I was informed that firmware had been updated on the storage array controller. That made sense — some paths went offline on the array that presents its resources via multipath. Because the multipathd daemon was off, there was nothing to handle the failover, so those devices simply went offline on the OS side as well. So far the picture looked like this:

Oracle ASM

direct I/O

block device layer

multipathd: dead — no failover possible

iSCSI session (broken)

session recovery timeout (120s — default node.session.timeo.replacement_timeout)

task -> D state

CPU -> iowait

Trying to bring back order.

First things first: bring multipathd back to life on both nodes and refresh the iSCSI paths, since the OS was still trying to use stale ones. Meanwhile, ASM had already noticed something was wrong — each disk group was reporting one disk missing:

SQL> select name, group_number, label, path, mount_status, voting_file, failgroup, header_status, failgroup_type, state from v$asm_disk order by mount_status;

NAME       GROUP_NUMBER LABEL      PATH                                MOUNT_S V FAILGROUP       HEADER_STATU FAILGRO STATE
---------- ------------ ---------- ----------------------------------- ------- - --------------- ------------ ------- --------
RECO2                 2  RECO2     ORCL:RECO2                          CACHED  Y FG2             MEMBER       REGULAR NORMAL
RECO_0002             2            /mnt/votedisk/vote_disk             CACHED  Y FG3             MEMBER       QUORUM  NORMAL
DATA2                 1  DATA2     ORCL:DATA2                          CACHED  N FG2             MEMBER       REGULAR NORMAL
DATA1                 1                                                MISSING N FG1             UNKNOWN      REGULAR NORMAL
RECO1                 2                                                MISSING N FG1             UNKNOWN      REGULAR NORMAL

Taking advantage of the redundant storage array architecture, the plan was to simply log out and log back into the iSCSI sessions on both hosts, forcing the OS to rediscover the block device paths from the affected array. Since the ASMLib kernel driver is used to manage the disk devices, that layer needed refreshing too.

At this point, the HA architecture proved its worth — the database kept running without interruptions or data corruption even with one storage array effectively gone.

[root@host1 ~]# iscsiadm -m node -T Storage.Target-1 --logout
Logging out of session [sid: 2, target: Storage.Target-1, portal: 10.0.0.11,3260]
Logging out of session [sid: 3, target: Storage.Target-1, portal: 10.0.0.10,3260]
Logout of [sid: 2, target: Storage.Target-1, portal: 10.0.0.11,3260] successful.
Logout of [sid: 3, target: Storage.Target-1, portal: 10.0.0.10,3260] successful.

[root@host1 ~]# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Cleaning disk "DATA1"
Cleaning disk "RECO1"
Scanning system for ASM disks...

[root@host1 ~]# iscsiadm -m node -T Storage.Target-1 --login
Logging in to [iface: default, target: Storage.Target-1, portal: 10.0.0.11,3260]
Logging in to [iface: default, target: Storage.Target-1, portal: 10.0.0.10,3260]
Login to [iface: default, target: Storage.Target-1, portal: 10.0.0.11,3260] successful.
Login to [iface: default, target: Storage.Target-1, portal: 10.0.0.10,3260] successful.

[root@host1 ~]# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Scanning system for ASM disks...
Instantiating disk "RECO1"
Instantiating disk "DATA1"

Now I could try to online the missing disks at the ASM level:

SQL> ALTER DISKGROUP DATA ONLINE DISKS IN FAILGROUP FG1;

But at this point something unexpected happened:

ORA-15032: not all alterations performed
ORA-15075: disk DATA1 is not visible on instance number 1

Same for RECO:

ORA-15075: disk RECO1 is not visible on instance number 1

ASM alert log:

ORA-15186: ASMLIB error function = [asm_open(global)]
ORA-15025: could not open disk "ORCL:DATA1"

ORA-15186: ASMLIB error function = [asm_open(global)]
ORA-15025: could not open disk "ORCL:RECO1"

WARNING: Read Failed. group:0 disk:0 AU:0 offset:0 size:4096
path:Unknown disk
         incarnation:0xf0f2cbfa asynchronous result:'I/O error' ioreason:2818 why:11
         subsys:Unknown library krq:0x7fe6132e7cd0 bufp:0x7fe618836000 osderr1:0x434c5344 osderr2:0x0
         IO elapsed time: 0 usec Time waited on I/O: 0 usec
WARNING: Read Failed. group:0 disk:2 AU:0 offset:0 size:4096
path:Unknown disk
         incarnation:0xf0f2cbfc asynchronous result:'I/O error' ioreason:2818 why:11
         subsys:Unknown library krq:0x7fe6132e7380 bufp:0x7fe618832000 osderr1:0x434c5344 osderr2:0x0
         IO elapsed time: 0 usec Time waited on I/O: 0 usec

Hmm, this is strange. ASMLib showed the correct list of devices:

[root@host1 ~]# oracleasm listdisks
DATA1
DATA2
RECO1
RECO2

But checking what the ASM instance could actually see via kfod told a different story:

[root@host1 ~]# kfod disks=all
--------------------------------------------------------------------------------
 Disk          Size Header    Path                               User     Group
================================================================================
   1:        500 MB MEMBER    /mnt/votedisk/vote_disk            grid     asmdba
   2:    1023999 MB MEMBER    ORCL:DATA2
   3:     153599 MB MEMBER    ORCL:RECO2
--------------------------------------------------------------------------------

That made me scratch my head. I started thinking the disk headers might be corrupted and that I’d have to DROP the disks from the disk groups, clean or overwrite them, add them back, and resynchronize. But after a while I ran the same command on the other host and got completely different output: a clear mismatch between what the ASM instances could see on host1 and host2:

[root@host2 ~]# oracleasm listdisks
DATA1
DATA2
RECO1
RECO2

[root@host2 ~]# kfod disks=all
--------------------------------------------------------------------------------
 Disk          Size Path                                     User     Group
================================================================================
   1:        500 MB /mnt/votedisk/vote_disk                  grid     asmdba
   2:    1023999 MB ORCL:DATA1
   3:    1023999 MB ORCL:DATA2
   4:     153599 MB ORCL:RECO1
   5:     153599 MB ORCL:RECO2
--------------------------------------------------------------------------------

Root cause.

The root cause became clear when I listed the block device properties for the oracleasm disks at the OS level:

[root@host2 ~]# ls -l /dev/oracleasm/disks
total 0
brw-rw----. 1 grid asmadmin 252,  7 Apr 13 15:52 DATA1
brw-rw----. 1 grid asmadmin   8, 97 Apr 13 15:42 DATA2
brw-rw----. 1 grid asmadmin 252,  5 Apr 13 15:53 RECO1
brw-rw----. 1 grid asmadmin   8, 81 Apr 13 15:42 RECO2

[root@host1 ~]# ls -l /dev/oracleasm/disks
total 0
brw-rw----. 1 grid asmadmin 8, 65 Apr 13 15:58 DATA1
brw-rw----. 1 grid asmadmin 8, 97 Apr 13 15:58 DATA2
brw-rw----. 1 grid asmadmin 8, 17 Apr 13 15:58 RECO1
brw-rw----. 1 grid asmadmin 8, 81 Apr 13 15:58 RECO2

On host2, DATA1/RECO1 use major 252 (device-mapper/multipath dm-* devices), while on host1 they show as major 8 (raw sd* devices).

On host2 it’s as it should be:

  • DATA1 → 252,7 → /dev/mapper/mpatha1
  • RECO1 → 252,5 → /dev/mapper/mpathb1

But on host1:

  • DATA1 → 8,65 → /dev/sde1
  • RECO1 → 8,17 → /dev/sdb1

This means oracleasm on host1 is pointing at raw sd* paths instead of the multipath device. When oracleasm scandisks ran on host1, it likely found the disk header on the sd* device first, before the mpath alias was scanned. But why?
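The comparison above is easy to script. A sketch that pulls the major number out of those ls -l lines and classifies it; note that the device-mapper major is assigned dynamically (252 on these hosts, often 253 elsewhere), so check /proc/devices on your own system:

```python
# Classify an ASMLib device node by the major number in `ls -l` output.
# Major 8 is the sd (raw SCSI) driver; 252 was device-mapper on these hosts.

KNOWN_MAJORS = {8: "sd (raw SCSI path)", 252: "device-mapper (multipath)"}

def classify(ls_line: str):
    parts = ls_line.split()
    major = int(parts[4].rstrip(","))   # e.g. "252," -> 252
    name = parts[-1]
    return name, KNOWN_MAJORS.get(major, f"unknown major {major}")

for line in [
    "brw-rw----. 1 grid asmadmin 252,  7 Apr 13 15:52 DATA1",
    "brw-rw----. 1 grid asmadmin 8, 65 Apr 13 15:58 DATA1",
]:
    print(classify(line))
```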

The Fix

How does ASMLib know in which order to scan devices? There is a configuration file that controls this behaviour — /etc/sysconfig/oracleasm:

ORACLEASM_SCANORDER="mpath sd"

Looks reasonable: one array presents resources via multiple paths, the other uses a single path, so mpath sd seemed like a sensible configuration. The issue likely has many solutions and workarounds; I picked a simple one. ASMLib could miss mpath because mpath* names are userspace aliases: symlinks created by udev rules that appear only after the Device Mapper device already exists. The dm-* devices, on the other hand, are the actual kernel block devices, available immediately once multipathd assembles the multipath map. During disk discovery, depending on udev timing, ASMLib could lose the race and grab the sd* device before the mpath* alias was ready.

To be precise about what each prefix in ORACLEASM_SCANORDER actually does:

  • dm — scans /dev/dm-* directly, i.e. the real kernel Device Mapper block devices; available as soon as multipathd creates the map
  • mpath — scans /dev/mapper/mpath* symlinks; created by udev rules and may appear with a slight delay or inconsistent ordering
  • sd — raw SCSI block devices, one per physical path; exactly what you don’t want ASMLib to pick when multipath is in play, but because of the second single-path storage array it’s here

That is how different nodes came to select different devices at scan time, and ASM treated them as different disks because, from its perspective, they were. Putting dm first in the scan order ensures ASMLib finds the Device Mapper device before it ever reaches the sd* devices underneath it.
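To make the race concrete, here is a toy model (my own simplification, not ASMLib source): the scanner walks the prefixes in ORACLEASM_SCANORDER and claims the first device whose name matches.

```python
# Toy model of scan-order selection: return the first visible device
# matching any prefix, in ORACLEASM_SCANORDER priority order.

def claim_device(scanorder, visible_devices):
    for prefix in scanorder.split():
        for dev in visible_devices:
            if dev.startswith(prefix):
                return dev
    return None

# Snapshot during the race: the mpath* udev alias doesn't exist yet,
# but the kernel dm-* node and the raw sd* paths are already visible.
early = ["dm-7", "sde1", "sdb1"]
print(claim_device("mpath sd", early))     # sde1 -> the bug
print(claim_device("dm mpath sd", early))  # dm-7 -> the fix
```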

Adding dm to the scan order did the job:

ORACLEASM_SCANORDER="dm mpath sd"

After rescanning disks via oracleasm, the major numbers in /dev/oracleasm/disks were correct — pointing to the mpath devices. kfod started showing all disks on both hosts. No need to DROP anything. Just bring them online:

ALTER DISKGROUP DATA ONLINE DISKS IN FAILGROUP FG1;
ALTER DISKGROUP RECO ONLINE DISKS IN FAILGROUP FG1;

Final state:

NAME       GROUP_NUMBER LABEL      PATH                                MOUNT_S V FAILGROUP       HEADER_STATU FAILGRO STATE
---------- ------------ ---------- ----------------------------------- ------- - --------------- ------------ ------- --------
DATA2                 1 DATA2      ORCL:DATA2                          CACHED  N FG2             MEMBER       REGULAR NORMAL
DATA1                 1 DATA1      ORCL:DATA1                          CACHED  N FG1             MEMBER       REGULAR NORMAL
RECO_0002             2            /mnt/votedisk/vote_disk             CACHED  Y FG3             MEMBER       QUORUM  NORMAL
RECO1                 2 RECO1      ORCL:RECO1                          CACHED  Y FG1             MEMBER       REGULAR NORMAL
RECO2                 2 RECO2      ORCL:RECO2                          CACHED  Y FG2             MEMBER       REGULAR NORMAL

Conclusion

There’s more than one way to skin this cat. An alternative fix would have been to present the single-path array’s disks through multipath as well — a single-path multipath device is still a valid multipath device. That would make the entire ASM disk landscape uniform: everything goes through /dev/dm-*, ORACLEASM_SCANORDER becomes a non-issue, and the race condition with udev has nowhere to hide. The downside is that it requires a properly configured multipath.conf and — more importantly — a multipathd that actually stays running. Which, as we’ve seen, is not something you want to take for granted.
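For completeness, that alternative boils down to a multipath.conf along these lines (a sketch; the WWID and alias are placeholders):

```
# /etc/multipath.conf (sketch -- WWID and alias below are placeholders)
defaults {
    user_friendly_names yes
    find_multipaths     no    # claim devices even when only one path exists
}
multipaths {
    multipath {
        wwid  360000000000000000000000000000001
        alias asm_data1
    }
}
```

With find_multipaths set to no, multipathd builds a map for every non-blacklisted path, so even the single-path array's LUNs end up behind /dev/dm-* devices.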
Another option worth mentioning is replacing ASMLib altogether with udev rules. udev can manage device naming, ownership, and permissions for ASM disks just fine without the kernel module, and it’s the preferred approach on newer systems, where ASMLib is no longer actively developed. The catch in this particular situation: migrating from ASMLib to udev rules is not an operation I’d want to perform online. So that one goes on the backlog.
The fix applied was pragmatic — one config line, no downtime, no DROP DISK drama. Sometimes that’s exactly the right call.

ASMLib, udev rules, and ASM Filter Driver (AFD) each occupy the same territory. But which one belongs in your architecture? That’s a different story.
