0%

记录一次意外断电导致的Oracle数据库故障

故障背景

年前机房供电系统故障,意外断电导致数据库服务器连不上,由于不是核心业务数据库(两个礼拜连不上也没人找我);上班后就慢慢折腾。接上机ipmi,接显示器看了下,服务器硬件正常,系统卡在了经典的Ctrl+D的启动界面,于是我们就从这里开始填坑……

修复xfs文件系统

当时没截图,大概长这个样子

01

输入root密码进入系统后,查看了/etc/fstab

1
2
3
4
5
6
7
8
9
10
11
12
13
[root@xbqy-02 ~]# cat /etc/fstab 

#
# /etc/fstab
# Created by anaconda on Mon Aug 5 20:47:30 2019
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/centos-root / xfs defaults 0 0
UUID=c79e1466-7fc4-4434-863f-d1f6b1c2f3e1 /boot xfs defaults 0 0
/dev/mapper/centos-swap swap swap defaults 0 0
UUID=1ce3f41e-cd87-4e86-b069-fd1bb6d744d5 /data xfs defaults 0 0

使用df -h看到/data目录没有被挂载,由于/data 是挂载了另外一个raid组(sdb),使用fdisk -l命令查看,看不到sdb,由此判定应该是xfs文件系统损坏导致的。

使用xfs_repair /dev/sdb1 修复报错,于是大胆的使用了 xfs_repair -L /dev/sdb1 修复了文件系统,可以正常进入系统。(这里不重要,简单提一下就好)

多看日志

1
2
3
4
5
SYS@orcl> select value from v$diag_info where name ='Diag Trace';

VALUE
------------------------------------------------------------------------------------------------------------------------
/u01/app/oracle/diag/rdbms/orcl/orcl/trace

默认orcl实例的日志文件为:/u01/app/oracle/diag/rdbms/orcl/orcl/trace/alert_orcl.log

处理ORA-01110故障

修改好xfs文件系统后,启动oracle,报错如下

01

由此可见,是启动控制文件control01.ctl和control02.ctl记录的版本号不一致造成的,control01.ctl的记录更新,所有我决定备份02后,复制一份01位02,操作如下。

1
2
3
cd /u01/app/oracle/fast_recovery_area/orcl
mv control02.ctl control02.ctl.bak
cp /u01/app/oracle/oradata/orcl/control01.ctl /u01/app/oracle/fast_recovery_area/orcl/control02.ctl

再次启动,出现如下错误,如图所示,可以启动的mount阶段,数据库无法open。

01

1
2
3
4
5
SYS@orcl> alter database open;
alter database open
*
ERROR at line 1:
ORA-01589: must use RESETLOGS or NORESETLOGS option for database open
1
2
3
4
5
6
SYS@orcl> alter database open resetlogs;
alter database open resetlogs
*
ERROR at line 1:
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '/u01/app/oracle/oradata/orcl/system01.dbf'
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
SYS@orcl> select v1.group#, member, sequence#, first_change# from v$log v1, v$logfile v2 where v1.group# = v2.group#;

GROUP#
----------
MEMBER
------------------------------------------------------------------------------------------------------------------------
SEQUENCE# FIRST_CHANGE#
---------- -------------
1
/u01/app/oracle/oradata/orcl/redo01.log
769420 395590662

3
/u01/app/oracle/oradata/orcl/redo03.log
769419 395590635

2
/u01/app/oracle/oradata/orcl/redo02.log
769421 395590691
1
2
3
4
5
6
7
8
SYS@orcl> recover database using backup controlfile until cancel;
ORA-00279: change 395590635 generated at 02/06/2021 16:09:01 needed for thread 1
ORA-00289: suggestion : /u01/app/oracle/oradata/orcl/archivelog/arch_5c1051e2_1_1016792676_769419.log
ORA-00280: change 395590635 for thread 1 is in sequence #769419


Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
CANCEL

可以看到#769419对应的redo文件是redo03.log

我们尝试恢复一下

1
2
3
4
5
6
7
8
9
10
11
12
SYS@orcl> recover database using backup controlfile until cancel;
ORA-00279: change 395590635 generated at 02/06/2021 16:09:01 needed for thread 1
ORA-00289: suggestion : /u01/app/oracle/oradata/orcl/archivelog/arch_5c1051e2_1_1016792676_769419.log
ORA-00280: change 395590635 for thread 1 is in sequence #769419


Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
/u01/app/oracle/oradata/orcl/redo03.log
ORA-00279: change 395590635 generated at 02/06/2021 16:09:01 needed for thread 1
ORA-00289: suggestion : /u01/app/oracle/oradata/orcl/archivelog/arch_5c1051e2_1_1016792676_769419.log
ORA-00280: change 395590635 for thread 1 is in sequence #769419
ORA-00278: log file '/u01/app/oracle/oradata/orcl/redo03.log' no longer needed for this recovery

尝试后,这种方法无法完成恢复。

我们尝试创建一个pfile文件,先把库拉起来,然后手动切换redo日志

1
2
3
4
5
6
7
8
9
10
SYS@orcl> create pfile from spfile;

File created.

SYS@orcl> shutdown immediate;
ORA-01109: database not open


Database dismounted.
ORACLE instance shut down.

编辑pfile文件,添加下边两个隐含参数

1
2
3
4
#修改隐含参数

*._allow_resetlogs_corruption=true
*._allow_error_simulation=true
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
SYS@orcl> startup pfile='/u01/app/oracle/product/11.2.0/db_1/dbs/initorcl.ora' mount;
ORACLE instance started.

Total System Global Area 5.0107E+10 bytes
Fixed Size 2264856 bytes
Variable Size 3.2212E+10 bytes
Database Buffers 1.7851E+10 bytes
Redo Buffers 41463808 bytes
Database mounted.
SYS@orcl> select * from v$log;
SYS@orcl> select * from v$log;

GROUP# THREAD# SEQUENCE# BYTES BLOCKSIZE MEMBERS ARC STATUS FIRST_CHANGE# FIRST_TIME
---------- ---------- ---------- ---------- ---------- ---------- --- ---------------- ------------- ------------
NEXT_CHANGE# NEXT_TIME
------------ ------------
1 1 769420 52428800 512 1 YES ACTIVE 395590662 06-FEB-21
395590691 06-FEB-21

3 1 769419 52428800 512 1 YES ACTIVE 395590635 06-FEB-21
395590662 06-FEB-21

2 1 769421 52428800 512 1 NO CURRENT 395590691 06-FEB-21
2.8147E+14

SYS@orcl> alter system switch logfile;

System altered.

处理ORA-01012故障

数据库启动正常后,过了一会儿发现自己又停掉了

1
2
3
4
5
6
7
8
9
10
11
SYS@orcl> select status from v$instance;
select status from v$instance
*
ERROR at line 1:
ORA-01012: not logged on
Process ID: 0
Session ID: 0 Serial number: 0


SYS@orcl> conn / as sysdba
Connected to an idle instance.

停掉监听

1
lsnrctl stop
1
2
3
4
5
ps -ef|grep ora_dbw0_$Oracle_SID

kill -9 pid ;

SYS@orcl> startup

ps:至此,本地oracle数据库故障算是基本处理结束了,本文只是记录了这一过程,很多操作都是网上找的,不是很清楚原理,写的也不是很清楚,勿喷。