Fault Analysis and Handling for a WCDMA Network Management Platform

Zhao Jinchuang (赵金闯)

Nanbeiqiao (南北桥), 2014, Issue 2 (2014-03-24)

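[Abstract] This paper describes the functions of the hardware and operating system of a WCDMA network management platform built by Ericsson, together with the fault-handling process. It focuses on how hardware faults and zpool faults were handled.

[Keywords] SUN M5000; SUN x4600 M2; SUN stk2540; zpool; cpu

CLC number: G4  Document code: A  DOI: 10.3969/j.issn.1672-0407.2014.02.160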

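The WCDMA network management platform is a specialized platform for traffic analysis, data monitoring, and data collection across the province's WCDMA network. Because of these specialized requirements, it runs on a dedicated system platform. The underlying hardware consists of midrange servers, storage arrays, and a tape library, connected as direct-attached storage (DAS) for data transfer, which meets the needs of the business systems.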

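Main System Components and Functions

System components

Main hardware

One SUN M5000 midrange server, one SUN x4600 M2 server, and six SUN stk2540 storage arrays.

Connections

The SUN M5000 is directly attached to two SUN stk2540 arrays; the SUN x4600 M2 is directly attached to four SUN stk2540 arrays.

Storage configuration and RAID levels

Each SUN stk2540 holds twelve 300 GB disks. On the arrays attached to the SUN M5000, 5 disks in each array form a RAID 5 group and 2 disks serve as hot spares, with 4 volumes created in total. On the arrays attached to the SUN x4600 M2, each of the 12 disks is configured as a RAID 0 volume, giving 12 volumes per array and 48 volumes across the four arrays. On the SUN x4600 host, these 48 volumes are placed into the ZFS pool eniq_sp_1, and the corresponding volumes of each pair of SUN stk2540 arrays are combined as RAID 1 (that is, a mirror).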
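The mirrored layout can be illustrated with a short sketch. The article does not show how eniq_sp_1 was originally created, so the following is only an assumption about what an equivalent `zpool create` command could look like. The device names are borrowed from the c1t0dN / c4t0dN pairs that appear in the zpool status output in Case 1; that output lists 12 mirror pairs, and how the remaining volumes were laid out is not shown. The helper name `mirrored_pool_command` is hypothetical, and the script only prints the command.

```python
def mirrored_pool_command(pool, side_a, side_b):
    """Return a `zpool create` command that mirrors side_a[i] with side_b[i]."""
    if len(side_a) != len(side_b):
        raise ValueError("both sides need the same number of volumes")
    parts = ["zpool", "create", pool]
    for a, b in zip(side_a, side_b):
        # Each "mirror a b" group becomes one two-way mirror vdev.
        parts += ["mirror", a, b]
    return " ".join(parts)


# Assumed device names: one volume per disk on each of two arrays,
# following the c1t0dN / c4t0dN pattern seen in the Case 1 output.
side_a = [f"c1t0d{n}" for n in range(12)]
side_b = [f"c4t0d{n}" for n in range(12)]

command = mirrored_pool_command("eniq_sp_1", side_a, side_b)
print(command)
```

Pairing volume N on one array with volume N on the other means that losing a whole array, or the link to it, degrades every mirror but leaves one good copy of each.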

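Main functions

With the components described above, the SUN M5000 and SUN x4600 M2 hosts and the SUN stk2540 arrays form two hardware platforms.

The SUN M5000 is directly attached to two SUN stk2540 arrays and is configured with 130 GB of physical memory, 8 CPUs (64 virtual CPUs) at 2.4 GHz, and 4 TB of disk space, which gives the services good performance and ample room for data.

The SUN x4600 M2 is directly attached to four SUN stk2540 arrays and is configured with 160 GB of physical memory, 8 CPUs (32 virtual CPUs) at 2.6 GHz, and 6 TB of disk space, which gives the services good performance and ample room for data.

Together, the two hardware platforms keep the WCDMA network management platform running normally.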

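Fault Analysis and Handling for the WCDMA Network Management Platform

This section analyzes several typical cases that I encountered in practice:

Case 1: zpool resilvering in a loop

Fault description:

On the host, the zpool state was DEGRADED; the pool was resilvering in a loop and reporting errors, and every disk on one of the links was in the REMOVED state. On the array, one virtual disk was reported as failed, although the underlying physical disks were healthy.

Fault analysis and resolution: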

#: zpool status -v
  pool: eniq_sp_1
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.SUN.com/msg/ZFS-8000-8A
  scan: resilvered 41.2M in 6h41m with 0 errors on Fri Feb 14 19:13:07 2014
config:

        NAME              STATE     READ WRITE CKSUM
        eniq_sp_1         DEGRADED     5     0     0
          mirror-0        DEGRADED     0     0     0
            c1t0d0        ONLINE       0     0     0
            replacing-1   UNAVAIL      9   273     2  insufficient replicas
              c4t0d0/old  FAULTED      0     0     0  corrupted data
              c4t0d0      REMOVED      0     0     0
          mirror-1        DEGRADED     0     0     0
            c1t0d1        ONLINE       0     0     0
            c4t0d1        REMOVED      0     0     0
          mirror-2        DEGRADED     0     0     0
            c1t0d2        ONLINE       0     0     0
            c4t0d2        REMOVED      0     0     0
          mirror-3        DEGRADED     0     0     0
            c1t0d3        ONLINE       0     0     0
            c4t0d3        REMOVED      0     0     0
          mirror-4        DEGRADED     0     0     0
            c1t0d4        ONLINE       0     0     0
            c4t0d4        FAULTED      0     0     0  too many errors
          mirror-5        DEGRADED     0     0     0
            c1t0d5        ONLINE       0     0     0
            c4t0d5        REMOVED      0     0     0
          mirror-6        DEGRADED     5     0     0
            c1t0d6        ONLINE       5     0     0
            c4t0d6        REMOVED      0     0     0
          mirror-7        DEGRADED     0     0     0
            c1t0d7        ONLINE       0     0     0
            c4t0d7        REMOVED      0     0     0
          mirror-8        DEGRADED     0     0     0
            c1t0d8        ONLINE       0     0     0
            c4t0d8        REMOVED      0     0     0
          mirror-9        DEGRADED     0     0     0
            c1t0d9        ONLINE       0     0     0
            c4t0d9        REMOVED      0     0     0
          mirror-10       DEGRADED     0     0     0
            c1t0d10       ONLINE       0     0     0
            c4t0d10       REMOVED      0     0     0
          mirror-11       DEGRADED     0     0     0
            c1t0d11       ONLINE       0     0     0
            c4t0d11       REMOVED      0     0     0

errors: Permanent errors have been detected in the following files:

        /eniq/database/dwh_main_dbspace/dbspace_dir_9/main_9

Based on the errors above and the status LEDs on the host's Fibre Channel HBA, the fault was traced to the host HBA.
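One way to see why the status output points at a shared path rather than at individual disks is to group the unhealthy devices by controller. The following is a minimal sketch, not part of the original procedure: it parses an excerpt of the config section shown in this case and collects every disk that is not ONLINE under its cN prefix. The function name `unhealthy_by_controller` and the regular expression are illustrative assumptions.

```python
import re

# Excerpt of the config section from the zpool status output in this case.
SAMPLE = """\
NAME STATE READ WRITE CKSUM
eniq_sp_1 DEGRADED 5 0 0
mirror-0 DEGRADED 0 0 0
c1t0d0 ONLINE 0 0 0
replacing-1 UNAVAIL 9 273 2 insufficient replicas
c4t0d0/old FAULTED 0 0 0 corrupted data
c4t0d0 REMOVED 0 0 0
mirror-4 DEGRADED 0 0 0
c1t0d4 ONLINE 0 0 0
c4t0d4 FAULTED 0 0 0 too many errors
"""

# Matches disk lines such as "c4t0d0 REMOVED ..." or "c4t0d0/old FAULTED ...".
DISK = re.compile(r"^\s*(c\d+t\d+d\d+\S*)\s+(\S+)")


def unhealthy_by_controller(status_text):
    """Group disks whose state is not ONLINE by controller (the cN prefix)."""
    result = {}
    for line in status_text.splitlines():
        m = DISK.match(line)
        if m and m.group(2) != "ONLINE":
            controller = m.group(1).split("t", 1)[0]
            result.setdefault(controller, []).append((m.group(1), m.group(2)))
    return result


bad = unhealthy_by_controller(SAMPLE)
print(bad)
```

In this excerpt every unhealthy device sits on controller c4 while the c1 side stays ONLINE, which is consistent with a fault on the path through one HBA.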

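Re-create the volume on the array.

Stop the services on the host and shut the host down so that the host HBA can be replaced.

cucc-eniq01(root) #: sync; sync; sync; init 5

Replace the HBA and restart the host.

-> start /SYS

cucc-eniq01(root) #: devfsadm

After these steps the host could see the disks on the new link, but the logical device names had changed and the zpool state could not be queried. The host had to be rebooted at this point so that the zpool could recover automatically.

cucc-eniq01(root) #: sync; sync; sync; init 6

After the reboot the zpool state could be queried again. The disks on the old link had been replaced in the pool by the disks on the new link, but two disks (c4t0d0 and c4t0d4) had not been replaced automatically and had to be replaced manually from the command line.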
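The manual step is a detach of the stale device followed by an attach of the new device to the healthy side of the same mirror. A minimal sketch that prints this command pair for each leftover device is shown here. It assumes that the new path differs from the old one only in the controller number (c4 becoming c10, as in this case); the helper name `replacement_commands` is hypothetical, and the script only prints commands without running them.

```python
def replacement_commands(pool, stale, healthy_peer, new_device):
    """Return the detach/attach pair for one mirror member."""
    return [
        f"zpool detach {pool} {stale}",
        f"zpool attach {pool} {healthy_peer} {new_device}",
    ]


# (stale device, healthy mirror peer, device on the new path)
leftovers = [
    ("c4t0d4", "c1t0d4", "c10t0d4"),
    ("c4t0d0", "c1t0d0", "c10t0d0"),
]

commands = [
    cmd
    for item in leftovers
    for cmd in replacement_commands("eniq_sp_1", *item)
]
print("\n".join(commands))
```

Attaching to the healthy peer (the c1 device) tells ZFS which mirror the new disk joins, after which a resilver copies the data across.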

cucc-eniq01(root) #: zpool detach eniq_sp_1 c4t0d4
cucc-eniq01(root) #: zpool attach eniq_sp_1 c1t0d4 c10t0d4
cucc-eniq01(root) #: zpool detach eniq_sp_1 c4t0d0
cucc-eniq01(root) #: zpool attach eniq_sp_1 c1t0d0 c10t0d0
cucc-eniq01(root) #: zpool detach eniq_sp_1 c4t0d0

cucc-eniq01(root) #: zpool status
  pool: eniq_sp_1
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Feb 19 01:05:25 2014
    20.4G scanned out of 4.66T at 255M/s, 5h17m to go
    1.76G resilvered, 0.43% done
config:

        NAME              STATE     READ WRITE CKSUM
        eniq_sp_1         ONLINE       0     0     0
          mirror-0        ONLINE       0     0     0
            c1t0d0        ONLINE       0     0     0
            c10t0d0       ONLINE       0     0     0  (resilvering)
          mirror-1        ONLINE       0     0     0
            c1t0d1        ONLINE       0     0     0
            c10t0d1       ONLINE       0     0     0  (resilvering)
          mirror-2        ONLINE       0     0     0
            c1t0d2        ONLINE       0     0     0
            c10t0d2       ONLINE       0     0     0  (resilvering)
          mirror-3        ONLINE       0     0     0
            c1t0d3        ONLINE       0     0     0
            c10t0d3       ONLINE       0     0     0  (resilvering)
          mirror-4        ONLINE       0     0     0
            c1t0d4        ONLINE       0     0     0
            c10t0d4       ONLINE       0     0     0  (resilvering)
          mirror-5        ONLINE       0     0     0
            c1t0d5        ONLINE       0     0     0
            c10t0d5       ONLINE       0     0     0  (resilvering)
          mirror-6        ONLINE       0     0     0
            c1t0d6        ONLINE       0     0     0
            c10t0d6       ONLINE       0     0     0  (resilvering)
          mirror-7        ONLINE       0     0     0
            c1t0d7        ONLINE       0     0     0
            c10t0d7       ONLINE       0     0     0  (resilvering)
          mirror-8        ONLINE       0     0     0
            c1t0d8        ONLINE       0     0     0
            c10t0d8       ONLINE       0     0     0  (resilvering)
          mirror-9        ONLINE       0     0     0
            c1t0d9        ONLINE       0     0     0
            c10t0d9       ONLINE       0     0     0  (resilvering)
          mirror-10       ONLINE       0     0     0
            c1t0d10       ONLINE       0     0     0
            c10t0d10      ONLINE       0     0     0  (resilvering)
          mirror-11       ONLINE       0     0     0
            c1t0d11       ONLINE       0     0     0
            c10t0d11      ONLINE       0     0     0  (resilvering)

After these steps, the zpool returned to a normal running state.
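The time-remaining figure in the resilver progress line can be cross-checked with simple arithmetic: remaining data divided by the scan rate. The following sketch does this for the progress line quoted in this case. It assumes the K/M/G/T suffixes are binary units (powers of 1024), and the helper names `to_bytes` and `resilver_eta_seconds` are illustrative.

```python
import re

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(text):
    """Convert a size such as '20.4G' to bytes, assuming binary units."""
    return float(text[:-1]) * UNITS[text[-1]]


def resilver_eta_seconds(line):
    """Estimate the remaining seconds from a 'scanned out of' progress line."""
    m = re.search(
        r"([\d.]+[KMGT]) scanned out of ([\d.]+[KMGT]) at ([\d.]+[KMGT])/s",
        line,
    )
    if not m:
        raise ValueError("not a scan progress line")
    done, total, rate = (to_bytes(g) for g in m.groups())
    return (total - done) / rate


line = "20.4G scanned out of 4.66T at 255M/s, 5h17m to go"
eta = resilver_eta_seconds(line)
hours, rem = divmod(int(eta), 3600)
print(f"{hours}h{rem // 60:02d}m")
```

The estimate lands within about a minute of the 5h17m that zpool itself reported, so the reported figure is essentially remaining bytes over the current rate and will move as the rate changes.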

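Case 2: SUN M5000 CPU board hardware fault

Fault description:

The alarm LED on the SUN M5000 came on, but services continued to run normally.

Fault analysis and resolution:

Log in to the XSCF card and check the hardware information: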

XSCF> showstatus
    MBU_B Status:Normal;
*       CPUM#1-CHIP#0 Status:Degraded;
*       CPUM#1-CHIP#1 Status:Faulted;

XSCF> showhardconf
SPARC Enterprise M5000;
    + Serial:BEF0908D65; Operator_Panel_Switch:Locked;
    + Power_Supply_System:Single; SCF-ID:XSCF#0;
    + System_Power:On; System_Phase:Cabinet Power On;
    Domain#0 Domain_Status:Running;
    MBU_B Status:Normal; Ver:0201h; Serial:BE09071A62 ;
        + FRU-Part-Number:CF00541-0478 07 /541-0478-07 ;
        + Memory_Size:128 GB;
        CPUM#0-CHIP#0 Status:Normal; Ver:0401h; Serial:PP084200N1 ;
            + FRU-Part-Number:CA06761-D202 D0 /375-3568-04 ;
            + Freq:2.400 GHz; Type:32;
            + Core:4; Strand:2;
        CPUM#0-CHIP#1 Status:Normal; Ver:0401h; Serial:PP084200N1 ;
            + FRU-Part-Number:CA06761-D202 D0 /375-3568-04 ;
            + Freq:2.400 GHz; Type:32;
            + Core:4; Strand:2;
*       CPUM#1-CHIP#0 Status:Degraded; Ver:0401h; Serial:PP090603DL ;
            + FRU-Part-Number:CA06761-D202 E0 /375-3568-05 ;
            + Freq:2.400 GHz; Type:32;
            + Core:4; Strand:2;
*       CPUM#1-CHIP#1 Status:Faulted; Ver:0401h; Serial:PP090603DL ;
            + FRU-Part-Number:CA06761-D202 E0 /375-3568-05 ;
            + Freq:2.400 GHz; Type:32;
            + Core:4; Strand:2;

Based on this information, the host's CPU board CPUM#1 was judged to be faulty.
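The same conclusion can be reached mechanically by grouping the non-Normal components by CPU module. The following is a minimal sketch, assuming the `Name Status:Value;` line shape seen in the showstatus output of this case; the function name `failed_boards` and the regular expression are illustrative and not part of any XSCF tooling.

```python
import re

# The showstatus output from this case; "*" marks components with a problem.
SAMPLE = """\
MBU_B Status:Normal;
* CPUM#1-CHIP#0 Status:Degraded;
* CPUM#1-CHIP#1 Status:Faulted;
"""

# Optional "*" marker, then the component name, then "Status:<value>;".
COMPONENT = re.compile(r"^\*?\s*(\S+)\s+Status:([^;]+);")


def failed_boards(text):
    """Map each CPU module to the chips on it that are not Normal."""
    boards = {}
    for line in text.splitlines():
        m = COMPONENT.match(line.strip())
        if not m or m.group(2) == "Normal":
            continue
        board, _, chip = m.group(1).partition("-")
        boards.setdefault(board, []).append((chip, m.group(2)))
    return boards


boards = failed_boards(SAMPLE)
print(boards)
```

Both affected chips belong to CPUM#1, and in the showhardconf output they share one serial number (PP090603DL), so the module is handled as a single replaceable unit.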

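Stop the services and shut down the host operating system:

wcuccmas1o{root} # sync
wcuccmas1o{root} # sync
wcuccmas1o{root} # init 5

XSCF> showdomainstatus -a
DID   Domain Status
00    Powered Off
01    -
02    -
03    -

Confirm that the host has shut down, then unplug its power cords:

XSCF> showdomainstatus -a
DID   Domain Status
00    Powered Off
01    -
02    -
03    -

Replace CPUM#1 following the manual and the accompanying figure.

After the replacement, power the system on, check the hardware, and boot the host operating system.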

XSCF> showstatus
No failures found in System Initialization.

XSCF> showhardconf
SPARC Enterprise M5000;
    + Serial:BEF0908D65; Operator_Panel_Switch:Locked;
    + Power_Supply_System:Single; SCF-ID:XSCF#0;
    + System_Power:Off; System_Phase:Cabinet Power Off;
    Domain#0 Domain_Status:Powered Off;
    MBU_B Status:Normal; Ver:0201h; Serial:BE09071A62 ;
        + FRU-Part-Number:CF00541-0478 07 /541-0478-07 ;
        + Memory_Size:128 GB;
        CPUM#0-CHIP#0 Status:Normal; Ver:0401h; Serial:PP084200N1 ;
            + FRU-Part-Number:CA06761-D202 D0 /375-3568-04 ;
            + Freq:2.400 GHz; Type:32;
            + Core:4; Strand:2;
        CPUM#0-CHIP#1 Status:Normal; Ver:0401h; Serial:PP084200N1 ;
            + FRU-Part-Number:CA06761-D202 D0 /375-3568-04 ;
            + Freq:2.400 GHz; Type:32;
            + Core:4; Strand:2;
        CPUM#1-CHIP#0 Status:Normal; Ver:0401h; Serial:PP084402BJ ;
            + FRU-Part-Number:CA06761-D202 D0 /375-3568-04 ;
            + Freq:2.400 GHz; Type:32;
            + Core:4; Strand:2;
        CPUM#1-CHIP#1 Status:Normal; Ver:0401h; Serial:PP084402BJ ;
            + FRU-Part-Number:CA06761-D202 D0 /375-3568-04 ;
            + Freq:2.400 GHz; Type:32;
            + Core:4; Strand:2;

XSCF> poweron -d 0
DomainIDs to power on:00
Continue? [y|n] :y
00 :Powering on

*Note*
 This command only issues the instruction to power-on.
 The result of the instruction can be checked by the "showlogs power".

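This completes the handling of the SUN M5000 CPU board fault.

Conclusion

The WCDMA network management platform is currently running stably. Over time, however, the number of service metrics the platform handles keeps growing, so the demands on the hardware platform and on storage capacity will keep rising. The whole platform will be upgraded as business needs require.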
