FMEA and FMEDA: Hardware Safety Analysis for ASIC and FPGA

Functional safety standards such as ISO 26262 (automotive) and IEC 61508 (industrial) require quantitative evidence that a hardware design meets its target failure rates. Two complementary techniques anchor this evidence: FMEA (Failure Mode and Effects Analysis), a qualitative bottom-up method, and FMEDA (Failure Mode, Effects and Diagnostic Analysis), a quantitative extension that computes failure rates and diagnostic coverage. For ASIC and FPGA designs targeting a safety integrity level (SIL) or automotive safety integrity level (ASIL), FMEDA is the backbone of the hardware safety case.

Quick Summary

FMEA: Qualitative. Ranks failure modes by Risk Priority Number (Severity x Occurrence x Detection).
FMEDA: Quantitative. Assigns FIT rates to each failure mode and classifies them as safe/dangerous, detected/undetected.
Outputs: Diagnostic Coverage (DC), SPFM, LFM, and PMHF, used to claim an ASIL or SIL target.

FMEA vs FMEDA: Two Methods, One Goal

FMEA originated in reliability engineering as a structured brainstorming method. Engineers identify each component, postulate how it can fail, trace the local and system-level effects, and rank the risk. It is qualitative and works well during early architecture exploration when failure rate data is not yet available.

FMEDA adds the quantitative layer demanded by IEC 61508 and ISO 26262. Every failure mode receives a numeric failure rate (in FIT), a failure mode distribution, and a classification that records whether it is dangerous and whether on-chip diagnostics can detect it before it causes a hazard. FMEDA is the method that produces the hardware architectural metrics auditors look for.

Key Differences

  • Direction: Shared rather than different. Both are bottom-up, starting at the component or gate level and propagating effects upward to the safety goal.
  • Quantification: FMEA uses ordinal scales (1-10); FMEDA uses physical failure rates and percentages.
  • Diagnostics: FMEDA explicitly models safety mechanisms (ECC, parity, lockstep, BIST) and credits them with diagnostic coverage.
  • Deliverable: FMEA yields a prioritized action list; FMEDA yields SPFM, LFM, and PMHF metrics for the safety case.

Failure Mode Identification

The analysis begins by decomposing the design into elementary parts. For an ASIC this means flip-flops, combinational logic cones, RAM bits, ROM, analog blocks, and I/O. For an FPGA it includes configuration memory (CRAM), block RAM, DSP slices, and routing. Each part is assigned a set of physical failure modes.

Common Hardware Failure Modes

  • Stuck-at: A node permanently driven to 0 or 1, typically from a permanent (hard) defect.
  • Bridging / short: Two nets unintentionally connected.
  • Open: A broken interconnect or via.
  • Soft errors (SEU): Bit flips in memory or registers caused by alpha particles or neutrons. They are the dominant transient failure mode in deep-submicron processes and in FPGA configuration memory.
  • Drift: Parametric shift in analog blocks (offset, gain, timing) due to aging or temperature.

Failure modes split into permanent (covered by the lambda value below) and transient (handled with a separate soft-error rate, or SER, often expressed in FIT/Mbit for memories).
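As a rough illustration of how a FIT/Mbit figure scales to a whole memory block, the sketch below multiplies a per-megabit soft-error rate by the memory size. The 100 FIT/Mbit rate and the 4 Mbit size are made-up placeholders, not vendor data, and the function name is hypothetical. A real analysis would also apply derating factors that are not modeled here.

```python
BITS_PER_MBIT = 1_000_000  # assumption: decimal megabit; some reports use 2**20

def memory_ser_fit(size_bits: int, fit_per_mbit: float) -> float:
    """Raw soft-error rate of a memory in FIT, before any derating or ECC credit."""
    return size_bits / BITS_PER_MBIT * fit_per_mbit

# Placeholder inputs for illustration only
raw_fit = memory_ser_fit(size_bits=4_000_000, fit_per_mbit=100.0)
print(f"Raw transient failure rate: {raw_fit:.1f} FIT")
```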

Failure Rates: FIT and Lambda

The base unit of FMEDA is the failure rate, denoted by the symbol λ (lambda). It is most commonly expressed in FIT (Failures In Time):

FIT Definition

1 FIT = 1 failure per 10^9 device-hours

λtotal = λS + λD   (safe failures + dangerous failures)

λD = λDD + λDU   (dangerous detected + dangerous undetected)

A part rated at 50 FIT is expected to suffer one failure every 20 million hours of operation, roughly 2,283 years per device. Because automotive and industrial fleets contain millions of devices operating for years, these small per-device numbers aggregate into meaningful field failure counts, which is exactly why the quantitative budget matters.
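The arithmetic behind these figures can be checked with a few lines of Python. This is a minimal sketch assuming a constant failure rate. The fleet size and operating hours are hypothetical values chosen only to show how per-device FIT aggregates, and the function names are illustrative.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 h, leap years ignored

def fit_to_mttf_hours(fit: float) -> float:
    """Mean time to failure in hours for a constant failure rate given in FIT."""
    return 1e9 / fit

def expected_fleet_failures(fit: float, devices: int, hours_per_device: float) -> float:
    """Expected failure count across a fleet, assuming a constant failure rate."""
    return fit * 1e-9 * devices * hours_per_device

mttf_h = fit_to_mttf_hours(50)
print(f"MTTF: {mttf_h:.0f} h, about {mttf_h / HOURS_PER_YEAR:.0f} years")

# Hypothetical fleet: 1 million devices, 8,000 operating hours each
fleet = expected_fleet_failures(50, devices=1_000_000, hours_per_device=8000)
print(f"Expected fleet failures: {fleet:.0f}")
```

Under these assumed fleet numbers, a part that fails once in about 2,283 years per device still yields on the order of 400 field failures.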

Base Failure Rate Sources

The base permanent failure rate of a die is derived from recognized reliability handbooks rather than guessed. The two most widely accepted sources in functional safety work are:

  • IEC 62380: A reliability prediction model for electronic components that accounts for die area, transistor count, package type, thermal cycling, and mission profile (ambient temperature, on/off cycling). It is the preferred source for semiconductor die in many ISO 26262 programs.
  • SN 29500 (Siemens Norm): Provides reference failure rates and stress-dependent conversion models for integrated circuits, discretes, and passives. Widely used in industrial IEC 61508 contexts.

Other accepted sources include the older MIL-HDBK-217F, the FIDES guide, and Telcordia SR-332. The chosen model converts a reference failure rate at reference conditions to the actual mission-profile conditions using temperature (Arrhenius) and electrical stress factors. For FPGAs, configuration-memory soft-error rates are taken from the vendor's device reliability report rather than a generic handbook.
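The temperature part of that conversion can be sketched with a plain Arrhenius acceleration factor weighted over a mission profile. This is a generic illustration, not the exact model of IEC 62380 or SN 29500, which define their own coefficients and additional stress factors. The activation energy, reference temperature, and mission profile below are assumed placeholder values, and the function names are hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K

def arrhenius_factor(t_use_c: float, t_ref_c: float, ea_ev: float) -> float:
    """Acceleration factor from reference to use temperature (both in Celsius)."""
    t_use_k = t_use_c + 273.15
    t_ref_k = t_ref_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_ref_k - 1.0 / t_use_k))

def mission_fit(ref_fit: float, t_ref_c: float, ea_ev: float, profile) -> float:
    """Scale a reference FIT by a time-weighted Arrhenius factor.

    profile: list of (time_fraction, junction_temp_c); fractions must sum to 1.
    """
    if not math.isclose(sum(frac for frac, _ in profile), 1.0):
        raise ValueError("mission profile time fractions must sum to 1")
    weighted = sum(frac * arrhenius_factor(temp, t_ref_c, ea_ev) for frac, temp in profile)
    return ref_fit * weighted

# Assumed values for illustration: 10 FIT at 55 C reference, Ea = 0.7 eV
profile = [(0.2, 25.0), (0.6, 85.0), (0.2, 125.0)]
print(f"Mission-profile failure rate: {mission_fit(10.0, 55.0, 0.7, profile):.1f} FIT")
```

Because the factor is exponential in temperature, the short time spent at the hottest phase of the profile tends to dominate the result, which is why the mission profile must be documented alongside the base rate.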

Failure Mode Distribution

Once the part-level λ is known, it must be apportioned across the part's failure modes. This is the failure mode distribution, expressed as a percentage that sums to 100%. The distribution is typically driven by silicon area: a sub-block occupying 30% of the die receives roughly 30% of the permanent failure rate, refined by gate density and known mode ratios.
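A first-pass, area-proportional split can be sketched as follows. The sub-block names, area shares, and the 100 FIT part-level rate are hypothetical, and the sketch omits the gate-density and mode-ratio refinements mentioned above.

```python
def apportion_by_area(part_fit: float, areas: dict) -> dict:
    """Split a part-level failure rate across sub-blocks in proportion to area."""
    total_area = sum(areas.values())
    if total_area <= 0:
        raise ValueError("total area must be positive")
    return {name: part_fit * area / total_area for name, area in areas.items()}

# Hypothetical floorplan: relative areas in any consistent unit (for example mm^2)
areas = {"cpu_core": 3.0, "sram": 5.0, "peripherals": 2.0}
for block, fit in apportion_by_area(100.0, areas).items():
    print(f"{block}: {fit:.1f} FIT")
```

Normalizing by the total area guarantees that the apportioned rates add back up to the part-level λ, so the distribution always sums to 100%.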

For each failure mode the analyst then asks two questions that drive the entire metric calculation:

  1. Does this failure mode have the potential to violate a safety goal? (safe vs dangerous)
  2. If dangerous, is there a safety mechanism that detects it in time? (detected vs undetected)

Safe vs Dangerous, Detected vs Undetected

Classifying every failure mode into one of four categories is the heart of FMEDA. A safe failure cannot, by itself, lead to a violation of the safety goal (for example, a fault in an unused register or a fault that the architecture inherently tolerates). A dangerous failure can. Among dangerous failures, a detected failure is caught by a safety mechanism (parity, ECC, lockstep comparison, watchdog, BIST) that moves the system to a safe state, while an undetected failure stays latent and can defeat the safety goal.

Classification | Symbol | Safety Impact | Diagnostic Status
Safe Detected | λSD | No safety goal violation | Detected by a mechanism
Safe Undetected | λSU | No safety goal violation | Not detected (acceptable)
Dangerous Detected | λDD | Could violate safety goal | Caught, system reaches safe state
Dangerous Undetected | λDU | Could violate safety goal | Latent - the critical concern

The goal of every safety mechanism is to shift failure rate from the λDU bucket into λDD. Minimizing λDU is what raises the metrics.
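One way to picture the four buckets is as a small table of failure modes that is summed into λSD, λSU, λDD, and λDU. The sketch below assumes each failure mode carries a FIT rate, a dangerous/safe flag, and the fraction of its rate that a safety mechanism detects. The record layout, mode names, and numbers are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    fit: float                # failure rate apportioned to this mode, in FIT
    dangerous: bool           # can it violate the safety goal?
    detected_fraction: float  # share caught by a safety mechanism, 0.0 to 1.0

def classify(modes) -> dict:
    """Sum failure rates into the four buckets SD, SU, DD, DU (all in FIT)."""
    buckets = {"SD": 0.0, "SU": 0.0, "DD": 0.0, "DU": 0.0}
    for mode in modes:
        prefix = "D" if mode.dangerous else "S"
        buckets[prefix + "D"] += mode.fit * mode.detected_fraction
        buckets[prefix + "U"] += mode.fit * (1.0 - mode.detected_fraction)
    return buckets

# Hypothetical rows for illustration
modes = [
    FailureMode("SRAM bit error, ECC protected", 40.0, True, 0.99),
    FailureMode("CPU logic fault, lockstep", 30.0, True, 0.99),
    FailureMode("Unused debug register fault", 10.0, False, 0.0),
    FailureMode("PLL drift, no monitor", 20.0, True, 0.0),
]
for bucket, fit in classify(modes).items():
    print(f"lambda_{bucket}: {fit:.1f} FIT")
```

In this made-up example the unmonitored PLL contributes almost all of λDU, which is the kind of result that points to where the next safety mechanism should go.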

Diagnostic Coverage (DC)

Diagnostic Coverage is the fraction of a category's dangerous failure rate that the safety mechanisms detect. It is computed per safety mechanism and per failure category, then rolled up.

Diagnostic Coverage

DC = λDD / λD = λDD / (λDD + λDU)

Equivalently: DC = 1 - (λDU / λD)

ISO 26262 part 5 provides reference DC claim levels for common mechanisms: typically low (60%), medium (90%), and high (99%). For example, a single-bit parity check on a RAM is typically credited with only low coverage, while SECDED ECC plus address-overlap monitoring approaches high coverage. Claims must be justified by the mechanism's actual detection capability, not assumed.
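The DC formula and the effect of the three claim levels can be illustrated with a short sketch. The function names and the 100 FIT dangerous failure rate are assumptions chosen for round numbers.

```python
def diagnostic_coverage(lambda_dd: float, lambda_du: float) -> float:
    """DC = lambda_DD / (lambda_DD + lambda_DU), returned as a fraction."""
    lambda_d = lambda_dd + lambda_du
    if lambda_d <= 0:
        raise ValueError("dangerous failure rate must be positive")
    return lambda_dd / lambda_d

def residual_undetected(lambda_d: float, dc: float) -> float:
    """Dangerous undetected rate left over after crediting a DC claim."""
    return lambda_d * (1.0 - dc)

# Hypothetical block with 100 FIT of dangerous failures
for label, dc in (("low", 0.60), ("medium", 0.90), ("high", 0.99)):
    print(f"{label:>6} DC {dc:.0%}: {residual_undetected(100.0, dc):.1f} FIT undetected")
```

Moving from a medium to a high claim cuts the undetected rate by a factor of ten in this example, which is why the justification for each claim level gets so much scrutiny.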

SPFM and LFM: ISO 26262 Architectural Metrics

The FMEDA results feed two hardware architectural metrics that gate the ASIL claim. The Single-Point Fault Metric (SPFM) measures robustness against single-point and residual faults. The Latent-Fault Metric (LFM) measures the ability to detect latent (multi-point) faults before a second fault occurs.

ISO 26262 Hardware Architectural Metrics

SPFM = 1 - (Σ λSPF + Σ λRF) / Σ λ

LFM = 1 - (Σ λMPF,latent) / (Σ λ - Σ λSPF - Σ λRF)

Where SPF = single-point fault, RF = residual fault, MPF = multi-point fault.
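As a worked example of the two formulas, the following sketch computes SPFM and LFM from summed failure rates. All input values are hypothetical and the function names are illustrative. In practice the sums run over the safety-related hardware elements in the FMEDA.

```python
def spfm(total: float, spf: float, rf: float) -> float:
    """Single-Point Fault Metric: 1 - (SPF + RF) / total, as a fraction."""
    return 1.0 - (spf + rf) / total

def lfm(total: float, spf: float, rf: float, mpf_latent: float) -> float:
    """Latent-Fault Metric: 1 - latent MPF / (total - SPF - RF), as a fraction."""
    return 1.0 - mpf_latent / (total - spf - rf)

# Hypothetical sums in FIT over all safety-related hardware elements
total, spf, rf, mpf_latent = 1000.0, 2.0, 6.0, 80.0

print(f"SPFM = {spfm(total, spf, rf):.2%}")
print(f"LFM  = {lfm(total, spf, rf, mpf_latent):.2%}")
```

With these inputs, SPFM is 1 - 8/1000 and LFM is 1 - 80/992. Note that the LFM denominator excludes the single-point and residual faults already counted by SPFM.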

Metric | ASIL B | ASIL C | ASIL D
SPFM | ≥ 90% | ≥ 97% | ≥ 99%
LFM | ≥ 60% | ≥ 80% | ≥ 90%
PMHF (per ISO 26262) | < 100 FIT | < 100 FIT | < 10 FIT

The third metric, PMHF (Probabilistic Metric for random Hardware Failures), is an absolute failure rate computed across the whole safety function and must fall below the target above. SPFM and LFM are relative percentages; PMHF is an absolute rate. All three must pass.
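A pass/fail check against those targets might look like the sketch below. The threshold values are copied from the table above; the function names and the example metric values are hypothetical, and a real assessment involves more than comparing three numbers.

```python
# Targets from the table above: SPFM and LFM as fractions, PMHF in FIT
ASIL_TARGETS = {
    "B": {"spfm": 0.90, "lfm": 0.60, "pmhf_fit": 100.0},
    "C": {"spfm": 0.97, "lfm": 0.80, "pmhf_fit": 100.0},
    "D": {"spfm": 0.99, "lfm": 0.90, "pmhf_fit": 10.0},
}

def check_asil(asil: str, spfm: float, lfm: float, pmhf_fit: float) -> dict:
    """Return a per-metric pass/fail map for the requested ASIL."""
    target = ASIL_TARGETS[asil]
    return {
        "spfm": spfm >= target["spfm"],
        "lfm": lfm >= target["lfm"],
        "pmhf": pmhf_fit < target["pmhf_fit"],
    }

def meets_asil(asil: str, spfm: float, lfm: float, pmhf_fit: float) -> bool:
    """All three metrics must pass."""
    return all(check_asil(asil, spfm, lfm, pmhf_fit).values())

# Hypothetical FMEDA results
print(check_asil("D", spfm=0.992, lfm=0.919, pmhf_fit=8.0))
print(meets_asil("D", spfm=0.980, lfm=0.919, pmhf_fit=8.0))
```

Returning the per-metric map rather than a single boolean makes it obvious which metric is blocking the claim when the check fails.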

RPN: Risk Prioritization in FMEA

Where FMEDA produces metrics, classical FMEA produces a Risk Priority Number (RPN) to rank which failure modes deserve mitigation effort first. Each failure mode is scored on three ordinal scales, usually 1 to 10.

Risk Priority Number

RPN = Severity (S) x Occurrence (O) x Detection (D)

Range: 1 (negligible) to 1000 (critical, urgent action required)

  • Severity (S): How serious is the effect of the failure on the system or user?
  • Occurrence (O): How likely is the failure mode to occur?
  • Detection (D): How likely is the failure to escape detection? A high score means it is hard to detect.

Note that the newer AIAG-VDA FMEA handbook replaces RPN with Action Priority (AP), a lookup table of S/O/D combinations, because raw RPN multiplication can mask a high-severity item behind low occurrence. For hardware safety work the AP or RPN ranking guides where design changes and added diagnostics give the most benefit, which then feeds back into the FMEDA.
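The masking effect is easy to reproduce. In the sketch below, the two failure modes and their scores are invented, and the severity threshold used to flag items is an illustrative stand-in, not the AIAG-VDA Action Priority table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FmeaItem:
    name: str
    severity: int    # 1 to 10
    occurrence: int  # 1 to 10
    detection: int   # 1 to 10, higher means harder to detect

    @property
    def rpn(self) -> int:
        """Risk Priority Number = S x O x D."""
        return self.severity * self.occurrence * self.detection

# Invented example rows
items = [
    FmeaItem("Lockstep comparator stuck-at", severity=10, occurrence=1, detection=3),
    FmeaItem("Status LED driver open", severity=4, occurrence=5, detection=4),
]

ranked = sorted(items, key=lambda item: item.rpn, reverse=True)
for item in ranked:
    print(f"RPN {item.rpn:4d}  S={item.severity:2d}  {item.name}")

# Illustrative guard (assumed threshold): surface high-severity items regardless of RPN
high_severity = [item.name for item in items if item.severity >= 9]
print("Review regardless of RPN:", high_severity)
```

Here the cosmetic LED fault outranks the maximum-severity comparator fault on raw RPN, which is the kind of inversion the Action Priority approach is meant to prevent.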

Implementation Best Practices

  1. Start the safety analysis early: Run a qualitative FMEA during architecture so safety mechanisms are designed in, not bolted on after RTL freeze.
  2. Use a recognized base-rate source: Anchor every λ in IEC 62380 or SN 29500 (or the vendor reliability report for FPGAs) and document the mission profile used.
  3. Derive failure mode distribution from layout: Apportion λ by actual silicon area and gate density rather than uniform guesses.
  4. Justify every DC claim: Tie each diagnostic-coverage figure to a specific mechanism and the failure modes it actually detects; avoid optimistic blanket claims.
  5. Separate permanent and transient analysis: Treat soft-error rate (SER) for memories and FPGA configuration RAM with its own scrubbing/ECC credit.
  6. Account for latent faults: Add latent-fault tests (boot-time and periodic BIST) so dual-point faults are revealed before a second fault accumulates.
  7. Verify diagnostics by fault injection: Confirm claimed DC with gate- or RTL-level fault campaigns rather than assumption.
  8. Keep the FMEDA a living document: Update it at every netlist and floorplan change; metrics shift with area and diagnostic changes.

Conclusion

FMEA and FMEDA together turn an abstract safety goal into auditable numbers. FMEA prioritizes risk qualitatively through the Risk Priority Number, while FMEDA quantifies it by assigning FIT rates, classifying each failure mode as safe or dangerous and detected or undetected, and rolling those into diagnostic coverage, SPFM, LFM, and PMHF.

For ASIC and FPGA teams, the discipline pays off twice: it produces the evidence certifiers demand, and it directs diagnostic resources (ECC, lockstep, and BIST) to exactly where they shift failure rate out of the dangerous-undetected bucket. Done early and kept current, it is the difference between a design that claims an ASIL and one that proves it.
