AI Accelerator Architecture: Designing Custom NPUs for Edge AI

Vcores Engineering Team | 2025-02-25 | 20 min read | AI/ML

The Rise of Custom AI Accelerators

AI accelerators, also known as Neural Processing Units (NPUs), are specialized hardware designed to efficiently execute machine learning workloads. As AI moves from cloud to edge devices, the demand for custom AI accelerators has exploded. This guide covers the architectural principles, design considerations, and implementation challenges of building custom NPUs for edge AI applications.

Why Custom AI Accelerators?

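Performance: Often 10-1000x faster than general-purpose CPUs for inference, depending on the model and precision
Power Efficiency: Often 10-100x better TOPS/W than GPUs on the workloads the accelerator targets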
Latency: Real-time inference at the edge
Privacy: On-device processing, no cloud dependency
Cost: Optimized silicon for specific workloads

Understanding Neural Network Workloads

Before designing an accelerator, understand the computational patterns of neural networks:

Dominant Operations

Operation	Compute Pattern	Share of Inference Time (typical CNN)
Convolution (Conv2D)	Matrix multiply + accumulate	60-90%
Fully Connected	Matrix-vector multiply	5-20%
Pooling	Max/Average reduction	2-5%
Activation (ReLU)	Element-wise comparison	1-3%
Batch Normalization	Element-wise multiply-add	2-5%
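
To make the dominance of convolution concrete, the sketch below counts multiply-accumulate (MAC) operations for a single convolution layer and a single fully connected layer. The function names are illustrative, and the layer shapes are the well-known first convolution and final classifier of ResNet-50 with a 224x224 input. It counts MACs only and ignores bias, padding effects, and memory traffic.

```python
def conv2d_macs(out_h, out_w, c_in, c_out, kernel_h, kernel_w):
    """MACs for one Conv2D layer.

    Every output element (out_h * out_w * c_out of them) needs
    c_in * kernel_h * kernel_w multiply-accumulates.
    """
    return out_h * out_w * c_out * c_in * kernel_h * kernel_w


def fully_connected_macs(in_features, out_features):
    """MACs for one fully connected layer (a single matrix-vector product)."""
    return in_features * out_features


# First convolution of ResNet-50: 7x7 kernel, 3 -> 64 channels, 112x112 output.
conv1 = conv2d_macs(out_h=112, out_w=112, c_in=3, c_out=64, kernel_h=7, kernel_w=7)

# Final classifier of ResNet-50: 2048 features -> 1000 classes.
fc = fully_connected_macs(in_features=2048, out_features=1000)

print(f"conv1: {conv1:,} MACs")
print(f"fc:    {fc:,} MACs")
print(f"conv1 / fc ratio: {conv1 / fc:.1f}x")
```

Even this single early convolution needs far more MACs than the whole classifier layer, which is why accelerator datapaths are sized around convolution and matrix multiplication first.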

Memory Bandwidth Challenge

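In many AI accelerators the limiting factor is memory bandwidth rather than raw compute. Model weights alone often exceed on-chip SRAM capacity, so they must be streamed from external memory:

ResNet-50: 25.6 million parameters ≈ 102 MB at FP32
BERT-Base: 110 million parameters = 440 MB at FP32
GPT-3: 175 billion parameters = 700 GB at FP32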

Efficient accelerators maximize data reuse and minimize memory access.
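
The footprint arithmetic is simple enough to script. The sketch below computes weight storage at several precisions and a lower bound on the time needed to read every weight once from external memory. The 10 GB/s bandwidth is an assumed figure for a modest edge DRAM interface, and the helper names are illustrative.

```python
# Storage cost per parameter at each precision, in bytes.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

# Parameter counts quoted in the text.
MODEL_PARAMS = {"ResNet-50": 25.6e6, "BERT-Base": 110e6, "GPT-3": 175e9}


def weight_bytes(num_params, precision):
    """Total bytes needed to store the weights at the given precision."""
    return num_params * BYTES_PER_PARAM[precision]


def min_weight_stream_time_s(num_params, precision, bandwidth_bytes_per_s):
    """Lower bound on the time to read every weight once from external memory."""
    return weight_bytes(num_params, precision) / bandwidth_bytes_per_s


ASSUMED_DRAM_BANDWIDTH = 10e9  # 10 GB/s, an assumption for illustration

for name, params in MODEL_PARAMS.items():
    for precision in ("FP32", "INT8"):
        size_mb = weight_bytes(params, precision) / 1e6
        time_ms = 1e3 * min_weight_stream_time_s(params, precision, ASSUMED_DRAM_BANDWIDTH)
        print(f"{name:10s} {precision}: {size_mb:12,.1f} MB, >= {time_ms:10,.2f} ms per pass")
```

Under these assumptions, simply reading the ResNet-50 weights at FP32 takes about 10 ms per inference if nothing is cached on chip, which is why data reuse and reduced precision matter so much.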

AI Accelerator Architectures

1. Systolic Array Architecture

Systolic arrays, used in Google's TPU, are highly efficient for matrix operations:

Weight Stationary Systolic Array (4x4 example):

        ───→ ───→ ───→ ───→  Activations flow right
       ↓    ↓    ↓    ↓
      ┌──┐ ┌──┐ ┌──┐ ┌──┐
      │PE│→│PE│→│PE│→│PE│   W0,0  W0,1  W0,2  W0,3
      └──┘ └──┘ └──┘ └──┘
       ↓    ↓    ↓    ↓
      ┌──┐ ┌──┐ ┌──┐ ┌──┐
      │PE│→│PE│→│PE│→│PE│   W1,0  W1,1  W1,2  W1,3
      └──┘ └──┘ └──┘ └──┘
       ↓    ↓    ↓    ↓
      ┌──┐ ┌──┐ ┌──┐ ┌──┐
      │PE│→│PE│→│PE│→│PE│   W2,0  W2,1  W2,2  W2,3
      └──┘ └──┘ └──┘ └──┘
       ↓    ↓    ↓    ↓
      ┌──┐ ┌──┐ ┌──┐ ┌──┐
      │PE│→│PE│→│PE│→│PE│   W3,0  W3,1  W3,2  W3,3
      └──┘ └──┘ └──┘ └──┘
       ↓    ↓    ↓    ↓
           Partial sums flow down

Advantages:

High data reuse (each stationary weight is reused for every activation vector streamed through the array)
Regular dataflow, easy to pipeline
Efficient for large matrix multiplications
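
The dataflow in the diagram can be checked with a small cycle-level software model. The sketch below is a simplified behavioral model, not RTL: each PE holds one weight, activations enter from the left with a one-cycle skew per row and move right, and partial sums move down. The function name and register layout are assumptions made for illustration.

```python
def systolic_matmul(X, W):
    """Compute Y = X @ W on a weight-stationary systolic array model.

    X is a list of T activation vectors of length K.
    W is a K x N weight matrix; PE (r, c) permanently holds W[r][c].
    Returns (Y, cycles) where Y is T x N.
    """
    T, K, N = len(X), len(W), len(W[0])
    act = [[0] * N for _ in range(K)]   # activation register at each PE output
    psum = [[0] * N for _ in range(K)]  # partial-sum register at each PE output
    Y = [[0] * N for _ in range(T)]
    total_cycles = T + K + N - 2        # pipeline fill + stream + drain

    for cycle in range(total_cycles):
        new_act = [[0] * N for _ in range(K)]
        new_psum = [[0] * N for _ in range(K)]
        for r in range(K):
            for c in range(N):
                if c == 0:
                    # Row r is fed element r of vector t, skewed by r cycles.
                    t = cycle - r
                    a_in = X[t][r] if 0 <= t < T else 0
                else:
                    a_in = act[r][c - 1]              # from the left neighbour
                p_in = psum[r - 1][c] if r > 0 else 0  # from the PE above
                new_act[r][c] = a_in
                new_psum[r][c] = p_in + a_in * W[r][c]
        act, psum = new_act, new_psum

        # The bottom row emits finished dot products, skewed by column.
        for c in range(N):
            t = cycle - (K - 1) - c
            if 0 <= t < T:
                Y[t][c] = psum[K - 1][c]
    return Y, total_cycles


X = [[1, 2], [3, 4], [5, 6]]   # three activation vectors, K = 2
W = [[1, 0, 2], [0, 1, 3]]     # K = 2 rows, N = 3 columns
Y, cycles = systolic_matmul(X, W)
print(Y, cycles)
```

In this model the array finishes T vectors in T + K + N - 2 cycles, so the fill and drain overhead is amortized as more vectors are streamed through the same stationary weights.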

2. Dataflow Architecture

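Dataflow architectures choose which operand stays resident in each PE, so the mapping can adapt to different layer shapes:

Weight Stationary: Weights stay in the PE, activations flow through
Output Stationary: Partial sums accumulate in the PE until the output is complete
Row Stationary: Keeps rows of data in the PE to reuse weights, activations, and partial sums together (used in Eyeriss)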
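
The difference between the first two dataflows comes down to loop order. The sketch below models a single PE with one local register and counts how often each operand crosses the PE boundary for Y = X x W. It is a deliberately simplified model with illustrative names; real arrays spread these loops across many PEs and buffer levels.

```python
def weight_stationary(X, W):
    """Hold one weight in the PE and stream all activations past it."""
    T, K, N = len(X), len(W), len(W[0])
    Y = [[0] * N for _ in range(T)]
    counts = {"weight_loads": 0, "psum_traffic": 0}
    for k in range(K):
        for n in range(N):
            w = W[k][n]                     # loaded once, then reused T times
            counts["weight_loads"] += 1
            for t in range(T):
                Y[t][n] += X[t][k] * w      # partial sum lives outside the PE
                counts["psum_traffic"] += 1
    return Y, counts


def output_stationary(X, W):
    """Hold one partial sum in the PE and stream weights and activations in."""
    T, K, N = len(X), len(W), len(W[0])
    Y = [[0] * N for _ in range(T)]
    counts = {"weight_loads": 0, "psum_traffic": 0}
    for t in range(T):
        for n in range(N):
            acc = 0                         # accumulates locally over K steps
            for k in range(K):
                acc += X[t][k] * W[k][n]    # a weight is fetched every step
                counts["weight_loads"] += 1
            Y[t][n] = acc                   # written out once when complete
            counts["psum_traffic"] += 1
    return Y, counts


X = [[1, 2], [3, 4], [5, 6]]
W = [[1, 0, 2], [0, 1, 3]]
y_ws, ws_counts = weight_stationary(X, W)
y_os, os_counts = output_stationary(X, W)
print("weight stationary:", ws_counts)
print("output stationary:", os_counts)
```

Both orderings produce the same result; they only move the traffic. Weight stationary minimizes weight fetches at the cost of partial-sum traffic, and output stationary does the opposite, which is why the best choice depends on the layer shape.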

3. Near-Memory Computing

Place compute close to memory to reduce data movement:

Processing-in-Memory (PIM)
3D stacked memory with logic layers
Analog computing in memory arrays

Quantization for Edge Deployment

Reduced precision is essential for edge AI accelerators:

Precision	Bits	Memory Reduction vs. FP32	Typical Accuracy Loss
FP32	32	Baseline	0%
FP16/BF16	16	2x	<0.1%
INT8	8	4x	<1%
INT4	4	8x	1-3%
Binary/Ternary	1-2	16-32x	5-15%

Quantization Techniques

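Post-Training Quantization (PTQ): Quantize an already trained model, typically using a small calibration set and no retraining
Quantization-Aware Training (QAT): Simulate quantization during training so the weights adapt to the reduced precision
Mixed Precision: Use different precisions for different layers, keeping sensitive layers at higher precision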
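
As a minimal illustration of post-training quantization, the sketch below applies symmetric per-tensor quantization: one scale factor maps the largest magnitude in the tensor to the largest representable integer. This is only one of several common schemes (asymmetric and per-channel variants exist), and the function names are illustrative.

```python
def quantize_symmetric(values, num_bits=8):
    """Symmetric per-tensor quantization to signed integers.

    Returns (quantized_values, scale) such that value ~= quantized * scale.
    """
    qmax = 2 ** (num_bits - 1) - 1                  # 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid division by zero
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map integers back to approximate real values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q_weights, q_scale = quantize_symmetric(weights, num_bits=8)
restored = dequantize(q_weights, q_scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", q_weights)
print("scale:", q_scale)
print("max abs error:", max_error)
```

The rounding error per value is bounded by half the scale, so tensors with a few large outliers lose the most precision. That sensitivity is one reason calibration and per-channel scales are often used in practice.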

Processing Element (PE) Design

The PE is the fundamental compute unit in AI accelerators:

Basic MAC Unit

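// INT8 MAC unit with a 32-bit accumulator.
// The worst-case product magnitude is 2^14, so the 32-bit accumulator
// can absorb at least 2^16 accumulations before it can overflow.
module mac_unit (
  input  logic               clk,
  input  logic               rst_n,       // Active-low asynchronous reset
  input  logic               enable,      // Accumulate on this cycle
  input  logic signed [7:0]  weight,      // INT8 weight
  input  logic signed [7:0]  activation,  // INT8 activation
  input  logic               clear_acc,   // Synchronous clear, priority over enable
  output logic signed [31:0] accumulator
);

  logic signed [15:0] product;

  // Signed 8x8 -> 16-bit multiplication
  assign product = weight * activation;

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n)
      accumulator <= '0;
    else if (clear_acc)
      accumulator <= '0;
    else if (enable)
      accumulator <= accumulator + product;  // product is sign-extended to 32 bits
  end

endmodule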
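
When verifying a MAC unit like this, a bit-accurate software reference model is useful for generating expected values. The sketch below is one possible Python model: it interprets the 8-bit inputs as two's complement, gives clear_acc priority over enable, and wraps the accumulator at 32 bits. The class and helper names are illustrative, and the model reports the accumulator as a signed integer.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


class MacModel:
    """Reference model of an INT8 MAC with a 32-bit wrapping accumulator."""

    def __init__(self):
        self.acc = 0  # state after reset

    def step(self, weight, activation, enable=True, clear_acc=False):
        """Advance one clock cycle and return the new accumulator value."""
        if clear_acc:
            self.acc = 0
        elif enable:
            product = to_signed(weight, 8) * to_signed(activation, 8)
            self.acc = to_signed(self.acc + product, 32)
        return self.acc


mac = MacModel()
first = mac.step(0xFF, 0x02)       # 0xFF is -1 as INT8
second = mac.step(0x80, 0x80)      # 0x80 is -128 as INT8
held = mac.step(0x10, 0x10, enable=False)
cleared = mac.step(0x10, 0x10, clear_acc=True)
print(first, second, held, cleared)
```

Driving the RTL and this model with the same random stimulus and comparing accumulator values each cycle is a simple way to catch sign-extension mistakes.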

PE Features for Efficiency

Local SRAM: Weight and activation buffers
Activation Functions: ReLU, Sigmoid, Tanh in hardware
Pooling Support: Max/average pooling logic
Flexible Precision: Support INT8/INT4/Binary modes

Memory Hierarchy Design

The memory hierarchy is critical for achieving high PE utilization. Each level closer to the compute trades capacity for bandwidth:

Typical Memory Hierarchy

┌─────────────────────────────────────────┐
│           External DRAM (GB)            │  ~10 GB/s
│         Weights, Activations            │
└─────────────────┬───────────────────────┘
                  │
┌─────────────────▼───────────────────────┐
│         Global Buffer (MB)              │  ~100 GB/s
│      Shared across all PEs              │
└─────────────────┬───────────────────────┘
                  │
┌─────────────────▼───────────────────────┐
│         PE Array Local SRAM (KB)        │  ~1 TB/s
│      Weight buffer, Activation buffer   │
└─────────────────┬───────────────────────┘
                  │
┌─────────────────▼───────────────────────┐
│         Register File (Bytes)           │  ~10 TB/s
│      Current operands                   │
└─────────────────────────────────────────┘
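
A roofline-style estimate shows why the hierarchy matters: attainable throughput is the smaller of peak compute and memory bandwidth multiplied by the operations performed per byte fetched. The sketch below uses the roughly 10 GB/s DRAM figure from the diagram and assumes a 4 TOPS peak, which is a hypothetical value chosen for illustration, as are the function names.

```python
def attainable_tops(peak_tops, bandwidth_gb_s, ops_per_byte):
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity)."""
    memory_bound_tops = bandwidth_gb_s * 1e9 * ops_per_byte / 1e12
    return min(peak_tops, memory_bound_tops)


def required_ops_per_byte(peak_tops, bandwidth_gb_s):
    """Arithmetic intensity needed to keep the compute units fully busy."""
    return peak_tops * 1e12 / (bandwidth_gb_s * 1e9)


PEAK_TOPS = 4.0        # assumed peak throughput (hypothetical)
DRAM_GB_S = 10.0       # external DRAM bandwidth from the diagram

needed = required_ops_per_byte(PEAK_TOPS, DRAM_GB_S)
print(f"ops per DRAM byte needed for full utilization: {needed:.0f}")

for intensity in (10, 50, 400, 1000):
    tops = attainable_tops(PEAK_TOPS, DRAM_GB_S, intensity)
    print(f"{intensity:5d} ops/byte -> {tops:.2f} TOPS ({100 * tops / PEAK_TOPS:.0f}% of peak)")
```

Under these assumptions every byte fetched from DRAM must be reused for about 400 operations to reach peak, which only the on-chip buffers and register files can provide.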

Tiling Strategy

Large layers must be tiled to fit in on-chip memory:

Output Tiling: Compute partial output feature maps
Input Tiling: Process input in spatial chunks
Channel Tiling: Process subset of channels at a time
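
Choosing a tile size is largely a buffer-budget calculation. The sketch below estimates the on-chip bytes needed for one convolution tile and searches for the largest square output tile that fits. It assumes INT8 activations and weights, 32-bit partial sums, no double buffering, and that all weights for the tile's channels are resident at once. The function names and the 64 KiB budget are illustrative.

```python
def conv_tile_bytes(out_h, out_w, c_in, c_out, kernel, stride=1,
                    act_bytes=1, weight_bytes=1, psum_bytes=4):
    """Bytes of on-chip storage needed for one convolution tile."""
    # The input tile includes the halo required by the kernel window.
    in_h = (out_h - 1) * stride + kernel
    in_w = (out_w - 1) * stride + kernel
    return {
        "input": in_h * in_w * c_in * act_bytes,
        "weights": kernel * kernel * c_in * c_out * weight_bytes,
        "psums": out_h * out_w * c_out * psum_bytes,
    }


def largest_square_tile(buffer_bytes, c_in, c_out, kernel, max_size):
    """Largest square output tile whose working set fits in the buffer."""
    for size in range(max_size, 0, -1):
        footprint = conv_tile_bytes(size, size, c_in, c_out, kernel)
        if sum(footprint.values()) <= buffer_bytes:
            return size
    return None  # even a 1x1 tile does not fit; channel tiling is required


tile = conv_tile_bytes(out_h=16, out_w=16, c_in=32, c_out=32, kernel=3)
print(tile, "total:", sum(tile.values()))

best = largest_square_tile(64 * 1024, c_in=32, c_out=32, kernel=3, max_size=56)
print("largest square tile in 64 KiB:", best)
```

When the search returns None, spatial tiling alone is not enough and the channel dimensions have to be split as well, which corresponds to the channel tiling option above.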

Performance Metrics

Key Metrics for AI Accelerators

Metric	Definition	Target (Edge)
TOPS	Tera Operations Per Second	1-10 TOPS
TOPS/W	Energy Efficiency	>5 TOPS/W
TOPS/mm²	Area Efficiency	>1 TOPS/mm²
Utilization	% of peak TOPS achieved	>70%
Latency	Time per inference	<10ms
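
These metrics are linked by simple arithmetic, sketched below. One MAC is counted as two operations, which is the usual convention behind TOPS figures. The array size, clock, power, utilization, and the 4 GMAC model size (roughly the scale of ResNet-50) are all assumed values for illustration, not measurements.

```python
def peak_tops(num_macs, clock_hz):
    """Peak throughput, counting each MAC as two operations per cycle."""
    return 2 * num_macs * clock_hz / 1e12


def inference_latency_ms(model_macs, peak, utilization):
    """Compute-bound latency for one inference at the given utilization."""
    ops = 2 * model_macs
    return 1e3 * ops / (peak * 1e12 * utilization)


def tops_per_watt(peak, power_w):
    """Energy efficiency at peak throughput."""
    return peak / power_w


NUM_MACS = 1024      # assumed 32x32 MAC array
CLOCK_HZ = 1e9       # assumed 1 GHz clock
POWER_W = 0.4        # assumed power at peak (hypothetical)
MODEL_MACS = 4e9     # assumed model size, roughly ResNet-50 scale
UTILIZATION = 0.7    # matches the >70% target in the table

peak = peak_tops(NUM_MACS, CLOCK_HZ)
latency = inference_latency_ms(MODEL_MACS, peak, UTILIZATION)
efficiency = tops_per_watt(peak, POWER_W)

print(f"peak:       {peak:.3f} TOPS")
print(f"latency:    {latency:.2f} ms")
print(f"efficiency: {efficiency:.2f} TOPS/W")
```

Note that latency depends on achieved throughput (peak multiplied by utilization), so a design with a lower headline TOPS figure but better utilization can still deliver faster inference.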

Conclusion

Designing custom AI accelerators requires a deep understanding of neural network workloads, memory system optimization, and efficient compute architectures. As edge AI adoption accelerates, the demand for specialized NPUs will continue to grow.

Vcores provides AI accelerator IP blocks including configurable MAC arrays, activation function units, and memory controllers optimized for AI workloads. Our solutions enable rapid development of custom NPUs for edge AI applications.

