
AI Accelerator Architecture: Designing Custom NPUs for Edge AI


The Rise of Custom AI Accelerators

AI accelerators, also known as Neural Processing Units (NPUs), are specialized hardware designed to efficiently execute machine learning workloads. As AI moves from cloud to edge devices, the demand for custom AI accelerators has exploded. This guide covers the architectural principles, design considerations, and implementation challenges of building custom NPUs for edge AI applications.

Why Custom AI Accelerators?

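  • Performance: Often cited as 10-1000x faster than general-purpose CPUs for inference, depending on model and precision
  • Power Efficiency: Often cited as 10-100x better TOPS/W than GPUs on the workloads the silicon targets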
  • Latency: Real-time inference at the edge
  • Privacy: On-device processing, no cloud dependency
  • Cost: Optimized silicon for specific workloads

Understanding Neural Network Workloads

Before designing an accelerator, understand the computational patterns of neural networks:

Dominant Operations

Operation             Compute Pattern               % of Inference Time
Convolution (Conv2D)  Matrix multiply + accumulate  60-90%
Fully Connected       Matrix-vector multiply        5-20%
Pooling               Max/Average reduction         2-5%
Activation (ReLU)     Element-wise comparison       1-3%
Batch Normalization   Element-wise multiply-add     2-5%
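
To see why convolution dominates, it helps to count multiply-accumulate (MAC) operations per layer. The sketch below uses the standard MAC-count formulas for a dense Conv2D layer (batch size 1, no grouping) and a fully connected layer. The layer shapes are hypothetical and chosen only for illustration.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, k_h, k_w):
    """MACs for one dense Conv2D layer (batch size 1, no groups)."""
    # Every output pixel of every output channel needs c_in * k_h * k_w MACs.
    return h_out * w_out * c_out * c_in * k_h * k_w


def fully_connected_macs(n_in, n_out):
    """MACs for one fully connected layer (one input vector)."""
    return n_in * n_out


# Hypothetical shapes: a 3x3 conv on a 56x56x64 feature map, and a 2048->1000 FC.
conv = conv2d_macs(h_out=56, w_out=56, c_in=64, c_out=64, k_h=3, k_w=3)
fc = fully_connected_macs(n_in=2048, n_out=1000)
print(conv, fc)  # 115605504 2048000
```

With these shapes a single convolution layer needs over 50x the MACs of the classifier layer, which is why most accelerator area goes to the MAC array.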

Memory Bandwidth Challenge

For many networks, the limiting factor is memory bandwidth rather than raw compute. At FP32 (4 bytes per parameter), the weights alone are large:

  • ResNet-50: 25.6 million parameters ≈ 102 MB at FP32
  • BERT-Base: 110 million parameters = 440 MB at FP32
  • GPT-3: 175 billion parameters = 700 GB at FP32

Efficient accelerators maximize data reuse and minimize memory access.
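
The footprint figures above are just parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming decimal megabytes (1 MB = 10^6 bytes) and counting weights only (no activations or optimizer state); the function and table names are illustrative:

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_footprint_mb(num_params, precision="FP32"):
    """Weight storage in decimal megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e6


for name, params in [("ResNet-50", 25.6e6), ("BERT-Base", 110e6), ("GPT-3", 175e9)]:
    print(name, round(weight_footprint_mb(params), 1), "MB at FP32")
```

Calling `weight_footprint_mb(25.6e6, "INT8")` shows the same ResNet-50 weights dropping to about 25.6 MB, which is the motivation for the quantization section later in this guide.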

AI Accelerator Architectures

1. Systolic Array Architecture

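Systolic arrays, used in Google's TPU, are highly efficient for matrix operations. Each PE passes its operands to a neighboring PE every cycle, so data fetched once from memory is reused across the array: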

Weight Stationary Systolic Array (4x4 example):

        ───→ ───→ ───→ ───→  Activations flow right
       ↓    ↓    ↓    ↓
      ┌──┐ ┌──┐ ┌──┐ ┌──┐
      │PE│→│PE│→│PE│→│PE│   W0,0  W0,1  W0,2  W0,3
      └──┘ └──┘ └──┘ └──┘
       ↓    ↓    ↓    ↓
      ┌──┐ ┌──┐ ┌──┐ ┌──┐
      │PE│→│PE│→│PE│→│PE│   W1,0  W1,1  W1,2  W1,3
      └──┘ └──┘ └──┘ └──┘
       ↓    ↓    ↓    ↓
      ┌──┐ ┌──┐ ┌──┐ ┌──┐
      │PE│→│PE│→│PE│→│PE│   W2,0  W2,1  W2,2  W2,3
      └──┘ └──┘ └──┘ └──┘
       ↓    ↓    ↓    ↓
      ┌──┐ ┌──┐ ┌──┐ ┌──┐
      │PE│→│PE│→│PE│→│PE│   W3,0  W3,1  W3,2  W3,3
      └──┘ └──┘ └──┘ └──┘
       ↓    ↓    ↓    ↓
           Partial sums flow down
    

Advantages:

  • High data reuse (a weight loaded into a PE is reused for every activation that streams past it)
  • Regular dataflow, easy to pipeline
  • Efficient for large matrix multiplications
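
The dataflow in the diagram can be checked with a small cycle-level model. The sketch below is a behavioral simulation, not RTL: it assumes PE (k, n) holds weight W[k][n], activations enter each row from the left skewed by one cycle per row, and partial sums enter the top as zero and move down one PE per cycle. The function name and the skewing scheme are assumptions made for illustration.

```python
def systolic_matmul(A, W):
    """Compute Y = A x W on a K x N weight-stationary array.

    A is M x K (activations), W is K x N (weights held in the PEs).
    Returns (Y, cycles).
    """
    M, K, N = len(A), len(W), len(W[0])
    act = [[0] * N for _ in range(K)]   # activation register in each PE
    psum = [[0] * N for _ in range(K)]  # partial-sum register in each PE
    Y = [[0] * N for _ in range(M)]
    cycles = M + K + N - 2              # pipeline fill + stream + drain

    for t in range(cycles):
        new_act = [[0] * N for _ in range(K)]
        new_psum = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                if n == 0:
                    # Left edge: row k receives A[m][k] at cycle m + k (skewed).
                    m = t - k
                    a_in = A[m][k] if 0 <= m < M else 0
                else:
                    a_in = act[k][n - 1]          # from the PE to the left
                p_in = psum[k - 1][n] if k > 0 else 0  # from the PE above
                new_act[k][n] = a_in
                new_psum[k][n] = p_in + a_in * W[k][n]
        act, psum = new_act, new_psum
        # Bottom row: column n finishes output row m at cycle m + (K - 1) + n.
        for n in range(N):
            m = t - (K - 1) - n
            if 0 <= m < M:
                Y[m][n] = psum[K - 1][n]
    return Y, cycles


Y, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(Y, cycles)  # [[19, 22], [43, 50]] 4
```

Each weight is read from memory once and then used for all M activation rows, which is the reuse the first bullet describes.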

2. Dataflow Architecture

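Dataflow architectures are more flexible and can adapt to different layer shapes. They are usually classified by which operand stays resident in the PE:

  • Weight Stationary: Weights stay in the PE while activations flow through
  • Output Stationary: Each PE accumulates the partial sums for one output until it is complete
  • Row Stationary: Keeps a filter row and an input row in each PE to reuse weights, inputs, and partial sums together (used in Eyeriss)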
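
One way to picture the difference between the first two dataflows is as a change in loop order. The sketch below models a single PE's view and counts how often a weight is fetched from the buffer. It is a simplification: row stationary is not modeled, and the function names and the `weight_loads` counter are assumptions for illustration.

```python
def matmul_weight_stationary(W, xs):
    """W is K x N; xs is a list of M input vectors of length K."""
    K, N = len(W), len(W[0])
    ys = [[0] * N for _ in xs]
    weight_loads = 0
    for k in range(K):
        for n in range(N):
            w = W[k][n]            # fetched once, then held in the PE
            weight_loads += 1
            for m, x in enumerate(xs):
                ys[m][n] += x[k] * w   # partial sums move in and out instead
    return ys, weight_loads


def matmul_output_stationary(W, xs):
    """Same product, but each output accumulates locally until complete."""
    K, N = len(W), len(W[0])
    ys = []
    weight_loads = 0
    for x in xs:
        row = []
        for n in range(N):
            acc = 0                # held in the PE until the sum is finished
            for k in range(K):
                acc += x[k] * W[k][n]  # weight re-fetched for every output
                weight_loads += 1
            row.append(acc)
        ys.append(row)
    return ys, weight_loads


W = [[5, 6], [7, 8]]
xs = [[1, 2], [3, 4]]
print(matmul_weight_stationary(W, xs))  # ([[19, 22], [43, 50]], 4)
print(matmul_output_stationary(W, xs))  # ([[19, 22], [43, 50]], 8)
```

Both orders give the same result; what changes is which operand generates memory traffic. Weight stationary fetches K*N weights in total, while output stationary fetches M*K*N but never writes an unfinished partial sum.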

3. Near-Memory Computing

Place compute close to memory to reduce data movement:

  • Processing-in-Memory (PIM)
  • 3D stacked memory with logic layers
  • Analog computing in memory arrays

Quantization for Edge Deployment

Reduced precision is essential for edge AI accelerators:

Precision       Bits  Memory Reduction  Typical Accuracy Loss
FP32            32    Baseline          0%
FP16/BF16       16    2x                <0.1%
INT8            8     4x                <1%
INT4            4     8x                1-3%
Binary/Ternary  1-2   16-32x            5-15%

Quantization Techniques

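  • Post-Training Quantization (PTQ): Quantize an already trained model, typically using a small calibration set and no retraining
  • Quantization-Aware Training (QAT): Simulate quantization during training so the weights adapt to the reduced precision
  • Mixed Precision: Use different precisions for different layers, keeping sensitive layers at higher precision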
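
As a concrete instance of PTQ, the sketch below applies per-tensor symmetric INT8 quantization: one scale maps the largest absolute value to 127. This is one common scheme among several (per-channel and asymmetric variants also exist). The function names and sample weights are illustrative assumptions.

```python
def quantize_symmetric_int8(values):
    """Per-tensor symmetric quantization to the range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map INT8 codes back to approximate real values."""
    return [qi * scale for qi in q]


weights = [0.42, -1.27, 0.05, 0.9]   # hypothetical FP32 weights
q, scale = quantize_symmetric_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)                  # [42, -127, 5, 90]
print(max_err <= scale / 2 + 1e-12)  # True: error is at most half a step
```

The rounding error per weight is bounded by half the scale, so a tensor with a few large outliers gets a coarse step size for every other value. That is one reason per-channel scales and QAT are often used when plain PTQ loses too much accuracy.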

Processing Element (PE) Design

The PE is the fundamental compute unit in AI accelerators:

Basic MAC Unit

// INT8 MAC Unit with accumulator
module mac_unit (
  input  logic        clk,
  input  logic        rst_n,
  input  logic        enable,
  input  logic [7:0]  weight,      // INT8 weight
  input  logic [7:0]  activation,  // INT8 activation
  input  logic        clear_acc,
  output logic [31:0] accumulator
);

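  logic [15:0] product;

  // Signed 8x8 multiply. Both operands are cast with $signed, so they are
  // sign-extended to the 16-bit width of the assignment before multiplying.
  assign product = $signed(weight) * $signed(activation);

  // Asynchronous active-low reset; clear_acc takes priority over enable.
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n)
      accumulator <= '0;
    else if (clear_acc)
      accumulator <= '0;
    else if (enable)
      // Sign-extend the 16-bit product to the 32-bit accumulator width
      accumulator <= accumulator + {{16{product[15]}}, product};
  end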
  end

endmodule
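
When verifying a MAC like this, a software reference model is useful for generating expected values. The sketch below is one possible bit-accurate Python model of the module above, mirroring its priority (clear before enable) and its 32-bit wraparound. The class and helper names are hypothetical, and the model returns the accumulator as a signed integer rather than as the raw 32-bit pattern.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


class MacModel:
    """Behavioral model of the INT8 MAC: one call to step() is one clock edge."""

    def __init__(self):
        self.accumulator = 0  # state after reset

    def step(self, weight, activation, enable=True, clear_acc=False):
        if clear_acc:                       # clear has priority over enable
            self.accumulator = 0
        elif enable:
            product = to_signed(weight, 8) * to_signed(activation, 8)
            # Wrap to 32 bits exactly as the hardware register would.
            self.accumulator = to_signed(self.accumulator + product, 32)
        return self.accumulator


mac = MacModel()
print(mac.step(0xFF, 0x02))   # -1 * 2            -> -2
print(mac.step(0x80, 0x80))   # -128 * -128 added -> 16382
print(mac.step(0, 0, clear_acc=True))  # -> 0
```

The largest product magnitude is 128 * 128 = 2^14, so the 32-bit accumulator has room for roughly 2^17 worst-case accumulations before it can wrap. Whether that headroom is enough depends on the largest reduction dimension the accelerator must support.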
    

PE Features for Efficiency

  • Local SRAM: Weight and activation buffers
  • Activation Functions: ReLU, Sigmoid, Tanh in hardware
  • Pooling Support: Max/average pooling logic
  • Flexible Precision: Support INT8/INT4/Binary modes

Memory Hierarchy Design

The memory hierarchy is critical for achieving high PE utilization. Each level closer to the PEs is smaller but offers more bandwidth:

Typical Memory Hierarchy

┌─────────────────────────────────────────┐
│           External DRAM (GB)            │  ~10 GB/s
│         Weights, Activations            │
└─────────────────┬───────────────────────┘
                  │
┌─────────────────▼───────────────────────┐
│         Global Buffer (MB)              │  ~100 GB/s
│      Shared across all PEs              │
└─────────────────┬───────────────────────┘
                  │
┌─────────────────▼───────────────────────┐
│         PE Array Local SRAM (KB)        │  ~1 TB/s
│      Weight buffer, Activation buffer   │
└─────────────────┬───────────────────────┘
                  │
┌─────────────────▼───────────────────────┐
│         Register File (Bytes)           │  ~10 TB/s
│      Current operands                   │
└─────────────────────────────────────────┘
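
A simple roofline estimate shows why the levels matter: attainable throughput is the lower of peak compute and bandwidth times arithmetic intensity (operations per byte moved). The sketch below uses the approximate DRAM and global-buffer bandwidths from the diagram; the 4 TOPS peak and the 100 ops/byte intensity are hypothetical values picked for illustration.

```python
def attainable_tops(peak_tops, bandwidth_gb_s, ops_per_byte):
    """Roofline model: performance is capped by compute or by memory traffic."""
    # GB/s * ops/byte = Gops/s; divide by 1000 to express it in TOPS.
    memory_bound_tops = bandwidth_gb_s * ops_per_byte / 1000.0
    return min(peak_tops, memory_bound_tops)


PEAK_TOPS = 4.0        # hypothetical accelerator peak
INTENSITY = 100.0      # hypothetical ops per byte fetched

print(attainable_tops(PEAK_TOPS, 10.0, INTENSITY))   # fed from DRAM: 1.0
print(attainable_tops(PEAK_TOPS, 100.0, INTENSITY))  # fed from global buffer: 4.0
```

Under these assumptions the array reaches only a quarter of its peak when operands stream straight from DRAM, and full peak once the working set is served from the global buffer. Raising data reuse (ops per byte) has the same effect as raising bandwidth.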
    

Tiling Strategy

Large layers must be tiled to fit in on-chip memory:

  • Output Tiling: Compute partial output feature maps
  • Input Tiling: Process input in spatial chunks
  • Channel Tiling: Process subset of channels at a time
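
The three tiling styles correspond to blocking the loops of the underlying matrix multiply. The sketch below tiles a plain matmul over output rows, output columns, and the reduction (channel) dimension, and includes a helper that estimates the on-chip bytes one tile needs. The function names, the INT8 input width, and the 32-bit accumulator width are assumptions for illustration.

```python
def tiled_matmul(A, B, tile_m, tile_n, tile_k):
    """C = A x B computed tile by tile; A is M x K, B is K x N."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for m0 in range(0, M, tile_m):              # output tiling (rows)
        for n0 in range(0, N, tile_n):          # output tiling (columns)
            for k0 in range(0, K, tile_k):      # channel (reduction) tiling
                # Everything inside here touches only one tile of A, B, and C.
                for m in range(m0, min(m0 + tile_m, M)):
                    for n in range(n0, min(n0 + tile_n, N)):
                        acc = C[m][n]
                        for k in range(k0, min(k0 + tile_k, K)):
                            acc += A[m][k] * B[k][n]
                        C[m][n] = acc
    return C


def tile_bytes(tile_m, tile_n, tile_k, in_bytes=1, acc_bytes=4):
    """On-chip storage for one A tile, one B tile, and one accumulator tile."""
    return (tile_m * tile_k + tile_k * tile_n) * in_bytes + tile_m * tile_n * acc_bytes


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(tiled_matmul(A, B, tile_m=1, tile_n=1, tile_k=2))  # [[58, 64], [139, 154]]
print(tile_bytes(64, 64, 64))  # 24576
```

Tile sizes are typically chosen so that `tile_bytes` fits in the PE array's local SRAM, with a second set of buffers if loads are overlapped with compute.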

Performance Metrics

Key Metrics for AI Accelerators

Metric       Definition                  Target (Edge)
TOPS         Tera Operations Per Second  1-10 TOPS
TOPS/W       Energy Efficiency           >5 TOPS/W
TOPS/mm²     Area Efficiency             >1 TOPS/mm²
Utilization  % of peak TOPS achieved     >70%
Latency      Time per inference          <10ms
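
These metrics are linked by simple arithmetic. The sketch below assumes the common convention that one MAC counts as two operations (a multiply and an add). The array size, clock, power, and per-inference operation count are hypothetical values, not measurements.

```python
def peak_tops(num_macs, clock_ghz):
    """Peak throughput, counting each MAC as 2 operations per cycle."""
    return 2 * num_macs * clock_ghz / 1000.0


def tops_per_watt(tops, watts):
    """Energy efficiency."""
    return tops / watts


def inference_latency_ms(total_ops, peak, utilization):
    """Latency if the array sustains `utilization` of its peak throughput."""
    achieved_ops_per_s = peak * 1e12 * utilization
    return total_ops / achieved_ops_per_s * 1e3


# Hypothetical design point: 32x32 MAC array at 1 GHz drawing 0.4 W.
peak = peak_tops(num_macs=32 * 32, clock_ghz=1.0)
print(round(peak, 3))                        # 2.048 TOPS
print(round(tops_per_watt(peak, 0.4), 2))    # 5.12 TOPS/W
# Hypothetical model needing 8e9 operations per inference at 70% utilization.
print(round(inference_latency_ms(8e9, peak, 0.7), 2))  # 5.58 ms
```

Peak TOPS alone says little: the same array at 35% utilization would take twice as long per inference, which is why utilization appears in the table as its own target.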

Conclusion

Designing custom AI accelerators requires deep understanding of neural network workloads, memory system optimization, and efficient compute architectures. As edge AI adoption accelerates, the demand for specialized NPUs will continue to grow.

Vcores provides AI accelerator IP blocks including configurable MAC arrays, activation function units, and memory controllers optimized for AI workloads. Our solutions enable rapid development of custom NPUs for edge AI applications.

