BKUP_API and disappearing Incremental Backups
- Radoslaw Kut
- Exadata CC , Backup
- 08 Oct, 2024
Some time ago, I noticed that the bkup_api hadn’t completed any Incremental Backup for one of the databases on Exadata Cloud@Customer for several days. More importantly, it failed to notify of any backup errors or issues. The regularly scheduled Incremental Backups, which should have run at 23:27, were simply missing — no record of their attempts appeared in the RMAN JOBS list. Surprisingly, though, ARCHIVELOG backups continued to complete successfully.
SQL> select SESSION_KEY, INPUT_TYPE, STATUS,
to_char(START_TIME,'dd-mm-yyyy hh24:mi:ss') as RMAN_Bkup_start_time,
to_char(END_TIME,'dd-mm-yyyy hh24:mi:ss') as RMAN_Bkup_end_time,
round(elapsed_seconds/60,1) as MIN from V$RMAN_BACKUP_JOB_DETAILS
where start_time between sysdate - interval '2' day and sysdate
order by START_TIME asc;
SESSION_KEY INPUT_TYPE STATUS RMAN_BKUP_START_TIME RMAN_BKUP_END_TIME MIN
______________ _____________ ____________ _______________________ ___________________ _____
655589 ARCHIVELOG COMPLETED 10-04-2024 10:58:37 10-04-2024 10:59:09 0.5
655604 ARCHIVELOG COMPLETED 10-04-2024 11:58:33 10-04-2024 11:59:08 0.6
655619 ARCHIVELOG COMPLETED 10-04-2024 12:58:28 10-04-2024 12:59:02 0.6
655634 ARCHIVELOG COMPLETED 10-04-2024 13:58:25 10-04-2024 13:58:59 0.6
655649 ARCHIVELOG COMPLETED 10-04-2024 14:58:28 10-04-2024 14:59:11 0.7
655664 ARCHIVELOG COMPLETED 10-04-2024 15:58:23 10-04-2024 15:58:57 0.6
655679 ARCHIVELOG COMPLETED 10-04-2024 16:58:14 10-04-2024 16:58:45 0.5
655694 ARCHIVELOG COMPLETED 10-04-2024 17:58:11 10-04-2024 17:58:29 0.3
655709 ARCHIVELOG COMPLETED 10-04-2024 18:58:12 10-04-2024 18:58:51 0.6
655724 ARCHIVELOG COMPLETED 10-04-2024 19:58:16 10-04-2024 19:58:46 0.5
655739 ARCHIVELOG COMPLETED 10-04-2024 20:58:15 10-04-2024 20:58:48 0.5
655754 ARCHIVELOG COMPLETED 10-04-2024 21:58:17 10-04-2024 21:58:40 0.4
655769 ARCHIVELOG COMPLETED 10-04-2024 22:58:11 10-04-2024 22:58:43 0.5
655793 ARCHIVELOG COMPLETED 10-04-2024 23:58:13 10-04-2024 23:58:35 0.4
655808 ARCHIVELOG COMPLETED 11-04-2024 00:58:18 11-04-2024 00:58:52 0.6
655823 ARCHIVELOG COMPLETED 11-04-2024 01:58:17 11-04-2024 01:58:56 0.7
655838 ARCHIVELOG COMPLETED 11-04-2024 02:58:09 11-04-2024 02:58:33 0.4
655853 ARCHIVELOG COMPLETED 11-04-2024 03:58:18 11-04-2024 03:58:50 0.5
655868 ARCHIVELOG COMPLETED 11-04-2024 04:58:09 11-04-2024 04:58:40 0.5
655883 ARCHIVELOG COMPLETED 11-04-2024 05:58:07 11-04-2024 05:58:46 0.6
655898 ARCHIVELOG COMPLETED 11-04-2024 06:58:15 11-04-2024 06:58:47 0.5
655913 ARCHIVELOG COMPLETED 11-04-2024 07:58:31 11-04-2024 07:59:23 0.8
655928 ARCHIVELOG COMPLETED 11-04-2024 08:58:26 11-04-2024 08:59:07 0.7
655943 ARCHIVELOG COMPLETED 11-04-2024 09:58:31 11-04-2024 09:59:13 0.7
655958 ARCHIVELOG COMPLETED 11-04-2024 10:58:31 11-04-2024 10:59:06 0.6
655973 ARCHIVELOG COMPLETED 11-04-2024 11:58:32 11-04-2024 11:59:08 0.6
655988 ARCHIVELOG COMPLETED 11-04-2024 12:58:26 11-04-2024 12:58:52 0.4
656003 ARCHIVELOG COMPLETED 11-04-2024 13:58:44 11-04-2024 13:59:23 0.6
656018 ARCHIVELOG COMPLETED 11-04-2024 14:58:25 11-04-2024 14:59:06 0.7
656033 ARCHIVELOG COMPLETED 11-04-2024 15:58:25 11-04-2024 15:59:06 0.7
656048 ARCHIVELOG COMPLETED 11-04-2024 16:58:20 11-04-2024 16:58:58 0.6
656063 ARCHIVELOG COMPLETED 11-04-2024 17:58:11 11-04-2024 17:58:42 0.5
Checking directly with the bkup_api for backup action status, it reported that the last Incremental Backup job had finished — but a week ago:
$ /var/opt/oracle/bkup_api/bkup_api bkup_status --dbname=orcl
DBaaS Backup API V1.5 @2016 Multi-Oracle home
-> This node is not master. Running on Elastic configuration.
-> Action : bkup_status
-> logfile: /var/opt/oracle/log/orcl/bkup_api_log/bkup_api_2024-04-12-10:24:59.log
Warning: unable to get current configuration of: catalog
* Current backup settings:
* Last registered Bkup: 04-03 21:29 API::55643:: Starting dbaas backup process
* Bkup state: finished
**************************************************
* API History: API steps
API:: NEW PROCESS 55643
API:: Starting dbaas backup process
API:: Your new dbaas backup tag is TAG20240403T233342
API:: BKUP COMPLETE YOUR BKUP TAG TAG20240403T233342
*************************************************
* Backup steps
-> API:: invoked with args : -dbname=orcl -uuid=e901fa14f20011ee8bee00163e161fe3
-> API:: Oracle database state is up and running
-> API:: DB instance: orcl
-> API:: Validating the backup repository ......
-> API:: All backup pieces are ok
-> API:: Performing Incremental Backup to Cloud Storage
-> API:: Executing rman instructions
-> API:: ....... OK
-> API:: Incremental Backup to Cloud Storage is Completed
-> API:: Clean MOTD.
-> API:: Starting backup of config files
-> API:: at time: 2024-04-03:23:33:42
-> API:: Determining the oracle database id
-> API:: DBID: 3793387371
-> API:: Creating directories to store config files
-> API:: Enabling RAC exclusions for config files.
-> API:: Compressing config files into tar files
-> API:: Uploading config files to cloud storage
-> API:: Removing temporary location /var/opt/oracle/log/orcl/obkup/e901fa14f20011ee8bee00163e161fe3.
-> API:: Completed at time: 2024-04-03:23:36:32
-> API:: at time: 2024-04-03:23:36:32
-> API:: Config files backup ended successfully
-> API:: Clean MOTD.
-> API:: All requested tasks are completed
Ok, so if the last backup is indeed finished, why new jobs are not starting ?
To investigate further, I tried manually starting a backup, replicating the automated job setup on Exa C@C:
$ /var/opt/oracle/bkup_api/bkup_api bkup_start --cron --dbname=orcl
DBaaS Backup API V1.5 @2016 Multi-Oracle home
-> This node is not master. Running on Elastic configuration.
-> Action : bkup_start
-> logfile: /var/opt/oracle/log/orcl/bkup_api_log/bkup_api_2024-04-12-10:25:18.log
UUID 32189946f8a611eea08d00163e161fe3 for this backup
One action at a time
Request e901fa14f20011ee8bee00163e161fe3 with PID 55643 was running
Please start a new backup request after e901fa14f20011ee8bee00163e161fe3 is completed
It appears the bkup_api believes a backup job is still running under PID 55643, even though the previous command indicated that the job had already completed. Let’s check what is running under PID 55643, is it really a backup process?
$ ps -ef | grep 55643
root 55643 2 0 Apr04 ? 00:00:00 [acfsslg0]
Hmm that’s not any bakup process nor any rman process.
But indeed PID with this number is “taken” and it’s clearly there.
This lead me to conclusion that bkup_api when starting new backup job, checks if a previous backup is finished by looking for the PID used in the last backup job. If that PID is still active, it assumes the backup is ongoing. In this case bkup_api will not start any new incremental backup till PID: 55643 is findable on the system.
But why this PID is active in the system? It should be gone after finishing previous backup job!
After some thought, I realized that the recent quarterly infrastructure update for Exadata C@C had taken place. After DB Node reboot, system reused this PID for a different process. Consequently, bkup_api refuses to start a new incremental backup, mistakenly assuming the previous one is still running.
Fixing the Issue
One possible solution is to simply kill PID 55643. However it doesn’t look like a good idea to kill current PID 55643, as this is LGWR process used by ACFS. Most probably it will create additional issues. This is not a solution which leads me to a happy afternoon.
To fix this we have to find the source from where bkup_api takes the information about last backup session. Tracing bkup_api execution, I discovered interesting findings: bkup_api is using a database as a repository to keep all the information about backup jobs for every Oracle Database on which it operates - concept is of course not new (RMAN Catalog) but in this case - SQLite is being used. This DB is located at:
/var/opt/oracle/dbaas_acfs/orcl/opc/db/api.db
Let’s just simply open that database, query it and check if we can find informations about backup jobs, the best it would be to get the record which holds info about our annoying backup job and messy PID.
$ sqlite3 /var/opt/oracle/dbaas_acfs/orcl/opc/db/api.db
Let’s check for existing tables:
sqlite> .tables
archives bkp_status dropped jobs locks requests
Ok. LOCKS table looks promising:
sqlite> PRAGMA table_info(locks);
0|uuid |text|0||0
1|dbname|text|0||0
2|pid |text|0||0
3|type |text|0||0
Bingo, it has a column with slick name - PID. Let’s check what this table holds:
sqlite> select * from locks;
e901fa14f20011ee8bee00163e161fe3|orcl|55643|create-backup-incremental
Hmm only one record and to our delight it seems to be exactly as we concluded. Table LOCKS holds information about “ongoing” backups jobs and we have the entry confirming that job with PID 55643 is still tied to an incremental backup.
If we want to use a hammer, this moment looks like a perfect opportunity to do so:
sqlite> delete from locks where pid='55643';
sqlite> select * from locks;
sqlite>
Ok SQLite, thank you with this.
It’s time to test if the backup would now start:
$ /var/opt/oracle/bkup_api/bkup_api bkup_start --cron --dbname=orcl
DBaaS Backup API V1.5 @2016 Multi-Oracle home
-> This node is not master. Running on Elastic configuration.
-> Action : bkup_start
-> logfile: /var/opt/oracle/log/orcl/bkup_api_log/bkup_api_2024-04-12-11:08:04.log
UUID 2bbbf86cf8ac11eeab2d00163e161fe3 for this backup
** process started with PID: 129076
** see log file for monitor progress
Nice! The Incremental Backup started without complaining. It finished correctly and Exa C@C “automatisations” were back to normal state.
Afternoon was saved - this time.