Criteria and Considerations in Software Failure Modes and Effects Analysis (SFMEA)

2018-02-13 08:08

WU Lijin, HAN Xinyu, YAN Ran, TANG Longli

China Institute of Marine Technology and Economy, Beijing 100081, China

Abstract: Software FMEA methods are often applied improperly and practice is not standardized. To address this, the article clarifies the work targets of system-level and detailed-level SFMEA, proposes criteria for the general procedure and for detailed-level analysis, and sets out considerations for project management and for the specific analysis steps, so that SFMEA becomes specific, standardized and suited to engineering use.

Key words: system-level SFMEA; detailed-level SFMEA; criteria; software reliability

Introduction

Software failure modes and effects analysis (SFMEA) is a traditional reliability and safety analysis method for safety-critical software. It identifies failure modes, investigates their causes and analyzes their effects in order to find measures that eliminate the harmful effects. SFMEA is currently divided into system-level and detailed-level analysis. In practice, however, SFMEA is not carried out smoothly. On the one hand, developers lack reliability and safety knowledge because there is no appropriate guidance document. FMEA standards such as IEC60812, GB7826 and GJB1391 contain little SFMEA content, and their procedures and methods are not specific. On the other hand, SFMEA is time-consuming, and its implementation cost cannot be ignored for large-scale software. This paper therefore studies the implementation requirements of SFMEA, proposes engineering criteria and notes the precautions in specific steps, so that the method is workable, the steps are instructive and the implementation is standardized.

1 Implementation Requirements

1.1 Best time to carry out SFMEA

Depending on the analysis stage and object, SFMEA is divided into system-level and detailed-level SFMEA. The best time to carry out each is as follows.

System-level SFMEA should be carried out once the system structure design is complete and functions are being allocated to each module, that is, during software requirements analysis or preliminary design, because modifying the system structure costs less at this stage. In the requirements analysis stage, it is used to improve the requirements by identifying possible defects in function and performance. In the software design phase, it is used to reduce the risk of hardware or software failure by analyzing and evaluating the software architecture. The objects of system-level SFMEA are the high-level subsystems, modules or CSCIs designed at an early stage.

After the detailed design is complete, or during the coding phase, detailed-level SFMEA is used to validate the impact of software defects and its severity. Detailed-level SFMEA can verify whether the protective measures from the top-level design are effective, and it provides information for software integration and testing. It is applicable to critical systems with little protection in memory, communication and data processing. The best object of detailed-level SFMEA is a module described by its detailed design or by pseudo-code.

1.2 Targets of SFMEA

The targets of SFMEA are as follows: (1) to ascertain the main failure modes and the mechanisms leading to failure, which provides evidence for improvements in design and coding; (2) to determine critical requirements, modules or components, which supports the design and analysis of software reliability, testability, safety and so on; (3) to give guidelines for software testing, such as setting detection points and developing test targets; (4) to provide information and a decision-making basis for software reliability assessment or acceptance conditions, according to how far the improvement measures for each failure mode have been implemented; (5) to provide input to system risk assessment, such as failure rates, failure causes and control measures; (6) to help write the guidelines for system or software maintenance.

2 SFMEA Process Criteria

2.1 Universal process principles

This paper puts forward ten working principles for SFMEA, as follows.

(1) Principle of time

SFMEA is not an analysis and evaluation carried out after a failure has happened. It aims to analyze failures that may occur in the future for a design or an original product, in order to improve them. The earlier the stage, the smaller the cost.

(2) Principle of synergy

SFMEA should be coordinated with software design. The SFMEA results should reflect the current state of the technology.

(3) Principle of team

The analyst should come from the designers. In addition, software designers, testers and users should be on the team, so that views on all possible faults and risks can be exchanged fully.

(4) Principle of hierarchy

For each failure mode, the analysis should take its causes from the lower-level analysis and pass its effects to the higher-level analysis. SFMEA should be applied structurally and iteratively, from bottom to top.

(5) Principle of exhaustiveness

The effects of a failure should be analyzed according to the single-point principle: the failure mode under analysis is treated as the only failure, and every other part is assumed to be normal. On this basis the analyst should try to identify all possible failure modes and their causes and effects.

(6) Principle of consistency

The SFMEA work should be standardized to ensure that results are comparable. Prior to analysis, the chief software design unit should set uniform requirements and give the necessary instructions on the analysis level, the severity classification, the sources of failure rate data and the final analysis report.

(7) Principle of validity

SFMEA reports and forms should be correct and effective. When the activity is finished, the reports should be countersigned and checked carefully.

(8) Principle of traceability

The compensatory measures and design improvements should be tracked promptly to ensure that they are put into practice, rather than being left for remedy after a failure occurs.

(9) Principle of retroactivity

SFMEA is not a one-time job; the analysis should be repeated and tracked according to the analysis results and the implementation of improvement measures. Each version of the report should be kept for future reference and testing. SFMEA reports should be archived and should be traceable to the corresponding versions of the software requirements and design documentation.

(10) Principle of synthesis

SFMEA should be combined with other techniques, such as fault tree analysis (FTA), to fully identify the weaknesses of the software.

2.2 Detailed-level analysis criteria

Detailed-level SFMEA should comply not only with the universal process principles but also with the following criteria, which restrict the scope of analysis to improve efficiency.

(1) Only one failure mode should be analyzed at a time. This is the basic criterion of FMEA.

(2) Only the failure of variables should be analyzed, because the failure of an algorithm is reflected to a large extent in the failure of its variables.

(3) Focus on the input variables. The detailed-level SFMEA table lists only the analyzed variables.

(4) Output variables are not analyzed, because each output variable (except those associated with hardware) is used as an input variable of the module linked to it.

(5) Do not analyze variables output to the hardware. Such variables, for example those that write data to peripheral hardware devices, represent non-storage hardware, and their failure belongs to hardware FMEA.

(6) Do not analyze numeric values. A numeric value is assigned in the assembler field and will not fail in software.

(7) Pay more attention to the impact of logic variables. An incorrect value of a variable used to determine program logic will cause the program to execute in an unintended order or to skip useful code wrongly.

(8) Variables involved in calls to module functions should be analyzed thoroughly, because a failure that occurs when a function is called may cause a series of unexpected black-box operations in the system.

(9) If an input variable appears several times in the same function, it should be analyzed only once; if it appears in more than one function, it should be analyzed once for each function.
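
The scoping criteria above can be read as a filter over the variable uses found in a module. The following Python sketch is only an illustration under stated assumptions: the `VariableUse` record, its fields and the example names (`speed_cmd`, `mode_flag`, `pwm_out`, `MAX_RPM`) are hypothetical and do not come from any SFMEA standard or tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableUse:
    """One appearance of a variable inside one function (hypothetical record)."""
    name: str
    function: str
    role: str                        # "input", "output" or "constant"
    writes_hardware: bool = False    # value is written to a peripheral device
    is_logic: bool = False           # value steers a branch or loop
    is_call_argument: bool = False   # value is passed in a function call


def select_for_analysis(uses):
    """Keep input variables only, once per (function, variable) pair."""
    selected = []
    seen = set()
    for use in uses:
        if use.role == "constant":   # criterion (6): numeric values are skipped
            continue
        if use.writes_hardware:      # criterion (5): left to hardware FMEA
            continue
        if use.role != "input":      # criteria (3), (4): outputs are not listed
            continue
        key = (use.function, use.name)
        if key in seen:              # criterion (9): once per function
            continue
        seen.add(key)
        selected.append(use)
    # Criteria (7), (8): logic variables and call arguments are reviewed first.
    # The sort is stable, so the original order is kept within each group.
    selected.sort(key=lambda u: not (u.is_logic or u.is_call_argument))
    return selected


uses = [
    VariableUse("speed_cmd", "control_loop", "input"),
    VariableUse("speed_cmd", "control_loop", "input"),   # repeated in the same function
    VariableUse("speed_cmd", "logger", "input"),         # second function: analyzed again
    VariableUse("mode_flag", "control_loop", "input", is_logic=True),
    VariableUse("pwm_out", "control_loop", "output", writes_hardware=True),
    VariableUse("MAX_RPM", "control_loop", "constant"),
]
selected = select_for_analysis(uses)
for use in selected:
    print(use.function, use.name)
```

Putting logic variables and call arguments first is one possible way to express "more attention"; the criteria themselves do not prescribe an ordering.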

3 SFMEA Project Management

3.1 Team and responsibilities

An SFMEA implementation team should be organized to ensure that the work goes smoothly. Software designers are the main members of the team. Software quality or reliability staff, software verification or validation personnel, and the overall designer should also participate.

The following matters need attention in organization and implementation.

(1) Clarify the SFMEA work interface between the overall design units and the software development units.

(2) The quality assurance organization should supervise, control and support the SFMEA work.

(3) Establish a training system in the SFMEA team, including training of SFMEA staff by the overall designer and training of software designers by reliability personnel.

3.2 Work plans

An SFMEA work plan should be made that complies with the following requirements.

(1) SFMEA work should comply with the arrangements of system reliability engineering. The SFMEA plan should be coordinated with the functional FMEA and hardware FMEA work to avoid duplication of effort.

(2) The SFMEA plan should be incorporated into the software development plan. The project evaluation of the system-level or detailed-level SFMEA results marks the end of the corresponding SFMEA work stage.

(3) The SFMEA work plan needs to set the order of priority of the work items and needs to be confirmed by the overall design unit.

3.3 Conference evaluation

The SFMEA report is a key item in the development of critical software. The analyst should note the following points to improve the effectiveness of the SFMEA report.

(1) SFMEA evaluation participants should include requirements analysts, software designers, software testers, reliability staff and experts in the field.

(2) System-level and detailed-level evaluations should be carried out at each stage. During the analysis a simple internal review may suffice; after the final report is completed, a formal evaluation should be held.

(3) The evaluation should focus on the correctness of the SFMEA conclusions, the compliance of the report, the effectiveness of the measures and so on.

(4) Develop an assessment checklist or evaluation guidelines, and keep accumulating experience and revising them according to the evaluation procedures.

(5) If problems are found in the SFMEA results, the analyst’s comments should be confirmed promptly.

(6) Ensure that the evaluation aims to assess the validity of the SFMEA results, not to assess the analyst or the software designers.

4 Precautions in Specific Steps

4.1 Identification of safety-critical components

The objects of SFMEA are the safety-critical components. The analyst can consider the following aspects to decide whether a component is safety-critical.

1) Degree of control

If the software is deeply involved in hazard control for the system, it can be regarded as a safety-critical component.

2) Risk

If the software can lead to a risk, it is safety-critical software. Examples are software that identifies hazardous conditions, provides automatic safety control, provides critical information or inhibits dangerous events.

3) Complexity

The more complex the software is, the more critical it is. As safety-related requirements increase, the software becomes more complicated.

4) Real-time features

Real-time control of danger is a key factor in software safety: the software must detect risk and take measures before the danger occurs.

4.2 Selection of indenture level

The first step of SFMEA is to select the indenture level in accordance with actual needs, focusing on the key components or modules.

1) Determination of initial indenture level

The initial indenture level of analysis is often not the software itself but the safety-critical functions of the system that contains the software.

2) Determination of indenture level

Once the indenture level is determined, the boundary of system-level SFMEA is determined; factors beyond this boundary are classified as part of the software operating environment. The indenture level can be defined by the hierarchy of the software architecture.

(1) When the software under analysis is complicated, clarify the scope of SFMEA according to the technical responsibilities of, and the relationship between, the software development units and the overall system unit. The overall system unit should first define a system level as the initial indenture level, and give the development units the principles for dividing the lowest indenture level.

(2) The more indenture levels there are, the more SFMEA work there is to do. For software that uses a mature design or has well-proven reliability, maintainability and safety, the division can be coarser, with fewer levels; otherwise, it should be finer, with more levels.

(3) Each indenture level should have a clear definition (including function, fault criterion, etc.). When there are more than three indenture levels, each level should be analyzed from the bottom up until the initial indenture level is reached; only then is the software FMEA complete.

(4) The objects of system-level SFMEA must have a definite function. The point of system-level analysis is the function of the module, which has input variables and output variables. System-level analysis is concerned only with the failure modes of the function itself; how the variables contribute to the failure is analyzed in detailed-level SFMEA.

3) Determination of the minimum indenture level

The lowest level of division should at least reach the level that has a direct impact on serious system failures. The minimum indenture level is prescribed by the following principles.

The software unit at the lowest level should have data available for analysis, and should either have complete input or correspond to one or more software functions.

(1) When failure of a software module will directly lead to catastrophic (class I) or fatal (class II) consequences, the minimum indenture level should at least reach the level of that module.

(2) For failures that could lead to general (class III) or mild (class IV) consequences, set the minimum level at the units for which unit testing is required or expected.
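
The bottom-up order of analysis across indenture levels can be pictured as a post-order walk of the software hierarchy: every lower-level item is analyzed before the item that contains it, ending at the initial indenture level. The sketch below assumes the hierarchy is available as a simple parent-to-children mapping; the component names are invented for illustration.

```python
def bottom_up_order(tree, root):
    """Return the items of the hierarchy so that children precede their parent."""
    order = []

    def visit(node):
        for child in tree.get(node, []):
            visit(child)
        order.append(node)   # a node is listed only after all of its children

    visit(root)
    return order


# Hypothetical hierarchy: initial indenture level -> CSCIs -> units.
tree = {
    "flight_control_function": ["attitude_csci", "nav_csci"],
    "attitude_csci": ["sensor_read_unit", "control_law_unit"],
}
order = bottom_up_order(tree, "flight_control_function")
print(order)
```

The last entry of the returned list is always the initial indenture level, which matches the point at which the analysis is considered complete.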

4.3 Failure mode analysis

Different levels of SFMEA use different methods of failure mode analysis. System-level SFMEA determines all possible failure modes of a function from the functional description and its failure criteria; detailed-level SFMEA determines all possible failure modes of selected variables from the type and characteristics of each variable.

(1) Each failure mode of each module should be single and independent; combined failure modes should be avoided.

(2) Failure modes should be expressed correctly and clearly. Do not treat an input error as a system-level failure mode: it is the effect on the object of a failure in the input module, not a failure of the object itself.

(3) In system-level SFMEA, do not analyze internal variables; they are objects of detailed-level analysis, not function failures. In detailed-level SFMEA, do not analyze the algorithm; it can be regarded as a system-level function. The correct practice is to analyze how failures of the variables show up as failures of the algorithm, and not to regard the cause of a fault in the analyzed object as a failure mode.

(4) Software is composed of several functional modules, and system-level failure modes include both functional and performance failure modes, although functional failure modes predominate. From this point of view, system-level SFMEA does not differ in essence from hardware FMEA.

4.4 Failure cause analysis

For each failure mode, all possible causes should first be analyzed to establish whether the failure mode is likely to occur and to provide a basis for improvements. The following points should be noted in failure cause analysis.

(1) The causes of a software failure should first be sought within the software itself, primarily in design defects introduced during development. The SFMEA team can collect and maintain a “software defects table” appropriate to the actual project work.

(2) Considering the relationship between adjacent layers of the software architecture, a failure mode at a deeper layer may be the cause of a failure at a higher layer.

(3) When analyzing external interface factors, do not list every environmental and software interface factor; focus on the factors that have a significant effect on the failure mode.

(4) If a failure mode has two or more different causes, each cause should be listed separately in the “Failure Cause” column of the SFMEA table and analyzed independently.

(5) Distinguish failure causes from failure modes correctly. A failure mode is generally the observable manifestation of the failure; it results directly or indirectly from design defects or external factors. Failure causes should be described in precise engineering terms.

(6) For redundant software, the analyst should pay special attention to “common cause failure” and “common mode failure”.
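
Two of the points above lend themselves to a small worked illustration: each cause gets its own table row, and a failure mode found at a lower layer is carried upward as a cause at the next layer. The `Row` record, the function name and all example strings below are assumptions made for this sketch and are not a prescribed worksheet format.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One line of a hypothetical SFMEA worksheet."""
    component: str
    failure_mode: str
    cause: str


def rows_for_failure_mode(component, failure_mode, own_causes, lower_level_modes):
    """Build one row per cause; lower-level failure modes are listed as causes too."""
    causes = list(own_causes)
    for lower_component, lower_mode in lower_level_modes:
        causes.append(f"lower-level failure: {lower_component} / {lower_mode}")
    return [Row(component, failure_mode, cause) for cause in causes]


rows = rows_for_failure_mode(
    "attitude_csci",
    "attitude output incorrect",
    own_causes=[
        "design defect: wrong unit conversion",
        "interface: sensor frame dropped",
    ],
    lower_level_modes=[("sensor_read_unit", "stale value returned")],
)
for row in rows:
    print(row.component, "|", row.failure_mode, "|", row.cause)
```

Keeping one cause per row makes it possible to assign and track an improvement measure for each cause separately.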

4.5 Failure effect analysis

Failure effects are the consequences of failure modes; by analyzing them we determine the severity of the failure risk. The following points should be noted in failure effect analysis.

(1) When analyzing failure effects, make the relationship between levels clear, and follow the functional links between program modules rather than a simple structural relationship; in this respect software differs from hardware.

(2) When a component has many functions, the analyst must consider the possible failure modes of each function. Software failures include not only functional failures but also performance failures.

(3) For software that uses redundant design, standby design, or fault detection and protection design, these design measures should not be taken into account in the FMEA; the final effect of the software failure mode should be analyzed directly and its severity determined from that final effect. The SFMEA table should nevertheless indicate that these design measures have been taken for the failure mode.

(4) The effects of each failure can be considered from three aspects. The first is the functional effect on the software itself, on sibling modules or on the operation of higher-level software. The second is the effect on the task, such as the degree of mission success. The third is the effect on the environment, including hardware operation and personnel safety.

4.6 Severity levels

Severity is the degree of damage caused by a failure mode. It is determined by the final effect on the initial indenture level in terms of personnel casualties, mission failure, functional impact or environmental damage. Severity analysis requires attention to the following points.

(1) The severity definition in software FMEA should be consistent with the definitions in hardware FMEA and functional FMEA.

(2) The severity in SFMEA is defined by the worst potential consequence of the failure mode at the initial indenture level, rather than being classified differently at different levels.

(3) If the defect causes complete failure of a high-level function, its severity should be the same as that of the high-level function failure. If the high-level function does not fail completely, the severity should be lower than that of the high-level functional failure.

(4) The results of severity analysis should be used to guide the design as early as possible, so that the failure effects are eliminated or reduced through design improvement, or appropriate reliability criteria are determined.
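
Points (2) and (3) can be sketched as two small rules over the four severity classes, with class I the most severe and class IV the least. The text only says that a partial failure is rated "lower"; the sketch assumes, purely for illustration, that this means exactly one class lower. The function names are likewise hypothetical.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity classes; a smaller number means a more severe consequence."""
    I = 1
    II = 2
    III = 3
    IV = 4


def worst_case(severities):
    """Point (2): rate a failure mode by its worst potential consequence."""
    return min(severities)


def derived_severity(high_level, complete_failure):
    """Point (3): inherit the high-level severity on complete failure.

    Assumption for this sketch: a partial failure is rated one class less
    severe, and never less severe than class IV.
    """
    if complete_failure:
        return high_level
    return Severity(min(high_level + 1, Severity.IV))


print(worst_case([Severity.III, Severity.II]).name)
print(derived_severity(Severity.I, complete_failure=True).name)
print(derived_severity(Severity.I, complete_failure=False).name)
```

In a real project the step size for partial failures would come from the severity definitions shared with hardware FMEA and functional FMEA, not from a fixed rule like this one.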

4.7 Improvement measures

The aim of SFMEA is to detect potential problems early and to take corresponding measures to improve software reliability and safety. Improvements are considered in the following order.

1) Change the software design to eliminate the cause of the critical failure.

2) If the failure cause cannot be eliminated, strengthen the handling of abnormal conditions in the software (for example with fault-tolerant design techniques) to reduce the probability of failure.

3) Use fail-safe design techniques to reduce the severity of the failure effect.

4) When the software performs vital functions, the program should include self-checks to improve detectability and make failures easier to detect.

5) Through software testing, make sure that no critical defects exist in the software.

6) Use compensatory or protective measures to minimize or prevent the occurrence of danger. A failure mode that cannot be controlled by design improvements can be mitigated by compensatory measures such as specialized user training and maintenance measures.
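
The order of preference above can be kept as data so that, for a given failure mode, the feasible measures are always listed most-preferred first. This is a minimal sketch; the short keys such as `eliminate_cause` are names invented here, not terms from the paper.

```python
# Order of preference follows the list above; the keys are illustrative names.
MEASURES_IN_ORDER = [
    ("eliminate_cause", "change the design to eliminate the failure cause"),
    ("fault_tolerance", "strengthen handling of abnormal conditions"),
    ("fail_safe", "reduce the severity of the failure effect"),
    ("self_check", "improve detectability for vital functions"),
    ("testing", "confirm by test that no critical defect exists"),
    ("compensation", "compensatory or protective measures"),
]


def plan_measures(feasible):
    """Return the feasible measures, most preferred first."""
    return [key for key, _ in MEASURES_IN_ORDER if key in feasible]


plan = plan_measures({"compensation", "fail_safe", "testing"})
print(plan)
```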

4.8 Variable selection criterion in detailed-level SFMEA

Exhaustive analysis of all the variables in a module is unnecessary and unrealistic, and it obscures the focus of the analysis. In the software development process, different types of variables are defined for different needs, and they differ in role and importance. The following rules are used to select the variables to be analyzed.

(1) Global variables are used by more than one function. If a global variable fails, the failure has knock-on effects across the system, so the analyst needs to analyze global variables in greater depth.

(2) External control parameter variables, which directly affect the operation of the system, should be selected.

(3) Algorithm output variables should be selected, because the result of an algorithm may control the operation of the next module.

(4) Interface variables should be selected. These include software-to-software variables, such as those in function calls and interprocess communication; software-to-hardware variables, which control the operation of the hardware; and hardware-to-software variables, such as those that come from sensors. In practice, hardware input variables are frequently a source of problems.
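
As a rough illustration of these selection rules, the sketch below finds variables referenced by more than one function, which is how rule (1) characterizes global variables, and adds variables that the analyst has flagged under rules (2) to (4). The input structures, the function name and every example identifier are assumptions made for the sketch; in practice the reference map would come from the detailed design or from static analysis of the code.

```python
from collections import defaultdict


def select_variables(references, flagged):
    """Return {variable name: reason for selection}.

    references: function name -> set of variable names it uses
    flagged:    variable name -> category assigned by the analyst
    """
    users = defaultdict(set)
    for function, names in references.items():
        for name in names:
            users[name].add(function)

    selected = {}
    for name, functions in users.items():
        if len(functions) > 1:                       # rule (1): shared variables
            selected[name] = "shared by " + ", ".join(sorted(functions))
    for name, category in flagged.items():          # rules (2)-(4)
        selected.setdefault(name, category)
    return selected


references = {
    "control_loop": {"mode_flag", "gain", "tmp_sum"},
    "logger": {"mode_flag"},
    "sensor_task": {"adc_raw"},
}
flagged = {
    "gain": "external control parameter",
    "adc_raw": "interface (hardware to software)",
}
selected = select_variables(references, flagged)
for name in sorted(selected):
    print(name, "->", selected[name])
```

A purely local variable such as `tmp_sum` is left out, which keeps the worksheet focused on the variables most likely to propagate a failure.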