As artificial intelligence (AI) workloads become increasingly complex, companies are searching for more efficient and scalable hardware architectures to handle deep learning tasks. FuriosaAI, a rising player in AI chip development, has introduced Tensor Contraction Processing (TCP) as a fundamental departure from traditional systolic arrays. But is TCP truly a groundbreaking innovation, or is it just a refined version of established architectures?
To answer this, FuriosaAI engineers Younggeun Choi and Junyoung Park recently addressed the differences between TCP and traditional systolic arrays at the Hot Chips Conference, the AI Hardware Summit, and the PyTorch Conference. Below is a breakdown of their insights and how TCP could redefine AI acceleration.
What is a Systolic Array, and Where Does It Fall Short?
A systolic array is a grid of processing elements (PEs) that move data through a structured, wave-like sequence—hence the term “systolic.” It is widely used for matrix multiplication, a core operation in deep learning, and has powered many AI accelerators over the years.
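To make the data flow concrete, here is a minimal, cycle-level simulation of an output-stationary systolic array in Python and NumPy. It is a teaching sketch rather than any vendor's design: row i of A enters from the left delayed by i cycles, column j of B enters from the top delayed by j cycles, and each processing element multiply-accumulates whatever passes through it.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-level sketch of an output-stationary systolic array.

    Each PE holds one accumulator. A-values flow rightward, B-values
    flow downward, and inputs are skewed so that A[i, k] and B[k, j]
    meet at PE (i, j) on cycle i + j + k.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    acc = np.zeros((M, N))    # per-PE accumulators (outputs stay put)
    a_reg = np.zeros((M, N))  # A value currently held by each PE
    b_reg = np.zeros((M, N))  # B value currently held by each PE
    for t in range(M + N + K - 2):
        a_reg = np.roll(a_reg, 1, axis=1)  # A shifts one PE rightward
        b_reg = np.roll(b_reg, 1, axis=0)  # B shifts one PE downward
        for i in range(M):                 # inject skewed A at the left edge
            k = t - i
            a_reg[i, 0] = A[i, k] if 0 <= k < K else 0.0
        for j in range(N):                 # inject skewed B at the top edge
            k = t - j
            b_reg[0, j] = B[k, j] if 0 <= k < K else 0.0
        acc += a_reg * b_reg               # every PE multiply-accumulates
    return acc

A = np.random.rand(4, 6)
B = np.random.rand(6, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

The skewed injection is exactly the "wave" the name refers to: partial products sweep diagonally across the grid, one cycle at a time.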
When conditions are optimal, systolic arrays provide high computational efficiency and energy savings, as data flows predictably across the array, keeping all elements busy. However, rigid structural constraints pose challenges:
- Fixed-size grids mean that if a workload doesn’t perfectly match the array’s dimensions, utilization efficiency drops.
- The data flow pattern is predefined, which limits adaptability when handling varying tensor shapes—a common issue in AI inference.
- Large arrays waste resources when processing small matrices, while small arrays lack efficiency for larger models. (The quick calculation below makes the utilization problem concrete.)
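A back-of-the-envelope calculation shows how sharply utilization falls on a fixed grid; the 128x128 array below is a hypothetical size chosen only for illustration.

```python
import math

def utilization(m, n, rows=128, cols=128):
    """Fraction of PEs doing useful work when an m x n output is tiled
    onto a fixed rows x cols array (steady state; fill/drain ignored)."""
    tiles = math.ceil(m / rows) * math.ceil(n / cols)
    return (m * n) / (tiles * rows * cols)

print(utilization(128, 4096))  # 1.0    -- shapes divide evenly
print(utilization(130, 4096))  # ~0.51  -- 2 extra rows halve efficiency
print(utilization(1, 4096))    # ~0.008 -- batch-1 decode wastes the grid
```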
This is where TCP architecture aims to improve upon systolic arrays.
What Makes TCP Different?
TCP and systolic arrays share the core goal of parallelizing computations. However, FuriosaAI’s TCP architecture introduces key innovations that increase flexibility and efficiency:
- Dynamic Compute Unit Configuration
- Unlike the fixed grid of a systolic array, TCP is built from smaller compute units (“slices”) that can be reconfigured dynamically to match different tensor dimensions.
- This allows TCP to maintain high utilization even for workloads with variable tensor sizes and batch sizes.
- More Flexible Data Movement
- Systolic arrays move data in a single, predefined direction across a fixed structure, often resulting in wasted computation cycles.
- TCP introduces a fetch network that broadcasts data to multiple slices simultaneously, increasing data reuse.
- Instead of treating computation as a strictly spatial operation, as systolic arrays do, TCP incorporates temporal pipelining to optimize performance across time.
- Tensor Contraction as a Primitive Operation
- Traditional neural processing units (NPUs) are built around matrix multiplication, which forces software to lower higher-dimensional tensor operations into 2D matrices.
- TCP directly processes tensors while maintaining their original structure, avoiding inefficient conversions.
- This makes optimizing new AI models significantly easier, reducing the engineering effort needed to deploy models like Llama 2 or Llama 3. (The einsum sketch after this list shows the kind of conversion being avoided.)
- Higher Power Efficiency
- Memory access is a major energy cost in AI processing—moving data between off-chip DRAM and on-chip processing elements consumes up to 10,000 times more energy than the computations themselves.
- TCP maximizes data reuse within on-chip buffers, significantly reducing expensive memory transfers.
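To ground the tensor-contraction point above, here is a small NumPy sketch. It is only an analogy for the programming-model gap, not TCP's actual interface, and the shapes are invented: the matmul path has to collapse batch and sequence into one axis and split heads back out afterward, while the einsum names every axis and performs the contraction on the tensor as-is.

```python
import numpy as np

x = np.random.rand(8, 128, 512)   # (batch, seq, d_model) activations
w = np.random.rand(512, 16, 64)   # (d_model, heads, d_head) projection

# Matrix-multiply view: collapse to 2D, multiply, then reshape back.
out_mm = (x.reshape(8 * 128, 512) @ w.reshape(512, 16 * 64)).reshape(8, 128, 16, 64)

# Tensor-contraction view: one operation, every axis kept by name.
out_tc = np.einsum("bsd,dhe->bshe", x, w)

assert np.allclose(out_mm, out_tc)
```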
How Does TCP Handle Dynamic AI Workloads Better?
One of the biggest challenges in AI inference is dealing with varying batch sizes and tensor shapes. Traditional systolic arrays struggle with this because:
- They require a static workload structure, meaning that when batch sizes change, efficiency drops.
- Utilization is only optimal when a workload perfectly fills the array, which is rare in real-world AI inference.
TCP overcomes these issues by dynamically adjusting its compute units based on tensor shape. For example:
- If a large model is running, TCP can allocate more compute slices to handle it efficiently.
- For smaller workloads, TCP breaks down into multiple independent processing units, avoiding wasted resources.
- Unlike systolic arrays, which require predefined tensor partitioning, TCP adapts to changing tensor dimensions in real time, as the toy scheduler sketched below illustrates.
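To make the contrast concrete, here is a toy allocation policy in Python. Everything in it is hypothetical: the 64-slice pool, the batch-times-tokens work estimate, and the proportional split are invented for illustration and are not FuriosaAI's scheduler. The point is only that a sliced design can serve several differently shaped requests at once, where a monolithic grid would run them serially at whatever utilization their shapes allow.

```python
def partition_slices(requests, total_slices=64):
    """Toy policy: split a pool of compute slices across concurrent
    requests in proportion to estimated work (batch * tokens), instead
    of mapping each request onto one rigid grid. Hypothetical numbers."""
    work = [batch * tokens for batch, tokens in requests]
    total = sum(work)
    alloc = [max(1, round(total_slices * w / total)) for w in work]
    while sum(alloc) > total_slices:          # trim rounding overshoot
        alloc[alloc.index(max(alloc))] -= 1
    return alloc

# A large prefill, a medium request, and a batch-1 decode step coexist.
print(partition_slices([(8, 2048), (4, 512), (1, 1)]))  # [56, 7, 1]
```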
This level of flexibility makes TCP an ideal architecture for large AI/ML models used by cloud AI providers and enterprises that require high-performance inference.
How Does TCP Improve AI Inference Compared to GPUs?
Currently, GPUs dominate AI acceleration, but they carry fundamental inefficiencies compared to purpose-built accelerators such as FuriosaAI's TCP-based RNGD chip:
1. Power Consumption
- High-performance GPUs draw hundreds of watts: Nvidia's H100 is rated at up to 700W, while FuriosaAI's RNGD chip operates at just 150W, a reduction of nearly 5x in power usage.
2. Data Processing Efficiency
- GPUs process tensors as 2D matrices, which requires flattening multi-dimensional data. This conversion adds overhead and discards structural information that could otherwise guide data reuse and scheduling.
- TCP maintains tensor structures, eliminating these transformations and simplifying model optimization; the im2col sketch after this list shows the kind of lowering being avoided.
3. Model Deployment & Customization
- Optimizing AI models for GPUs requires extensive kernel-level engineering to recover efficiency lost in tensor flattening.
- TCP removes this complexity by processing tensors natively, making it easier to deploy and fine-tune models.
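To ground the data-processing point above (point 2), convolution is the textbook case of lowering a tensor operation to a 2D matmul. The im2col sketch below is a generic illustration of that lowering, not a description of any particular GPU kernel: it copies every input pixel kernel-height times kernel-width times before any arithmetic happens, whereas the closing einsum expresses the same result as a direct contraction over the original axes.

```python
import numpy as np

def conv2d_im2col(x, w):
    """Convolution lowered to 2D: rearrange the input into a matrix
    (im2col), run one big matmul, reshape back. The rearrangement
    duplicates each input pixel KH * KW times, which is pure data
    movement that a tensor-native contraction never performs."""
    N, C, H, W = x.shape
    O, _, KH, KW = w.shape
    OH, OW = H - KH + 1, W - KW + 1
    cols = np.empty((N * OH * OW, C * KH * KW))
    row = 0
    for n in range(N):
        for i in range(OH):
            for j in range(OW):
                cols[row] = x[n, :, i:i + KH, j:j + KW].ravel()
                row += 1
    out = cols @ w.reshape(O, -1).T          # the single 2D matmul
    return out.reshape(N, OH, OW, O).transpose(0, 3, 1, 2)

x = np.random.rand(2, 3, 8, 8)   # (batch, channels, height, width)
w = np.random.rand(4, 3, 3, 3)   # (out_ch, in_ch, kh, kw)

# The same result written directly as a contraction over C, KH, KW.
windows = np.lib.stride_tricks.sliding_window_view(x, (3, 3), axis=(2, 3))
direct = np.einsum("nchwij,ocij->nohw", windows, w)
assert np.allclose(conv2d_im2col(x, w), direct)
```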
Potential Trade-offs: Where TCP Might Fall Short
While TCP architecture offers superior flexibility, there are a few scenarios where a traditional systolic array might still be preferable:
- If a model’s workload perfectly fits a fixed-size systolic array, TCP’s reconfigurable units may introduce minor overhead compared to a highly optimized fixed grid.
- TCP’s adaptability adds hardware complexity, which could increase chip costs compared to more standardized AI accelerators.
- For ultra-large, uniform batch processing, some traditional architectures might still deliver higher peak throughput.
However, for most AI inference tasks—where batch sizes are dynamic and tensor dimensions vary—TCP’s advantages far outweigh its trade-offs.
The Future of AI Acceleration: How TCP Fits Industry Trends
The AI hardware industry is shifting toward custom chips as companies seek greater efficiency and independence from Nvidia’s GPUs. TCP reflects broader industry trends, including:
- The Decline of General-Purpose AI Chips
- While GPUs kickstarted the AI revolution, companies now seek specialized chips that maximize efficiency for deep learning workloads.
- The Rise of Custom AI Silicon
- Tech giants like Google, Meta, and Amazon are developing custom AI accelerators to reduce costs and improve performance.
- Meta’s reported talks to acquire FuriosaAI suggest a shift toward in-house AI hardware development.
- AI Hardware Moving Beyond Systolic Arrays
- Systolic arrays have powered AI acceleration for over a decade, but their rigid structures limit efficiency for modern AI tasks.
- TCP represents the next evolution, optimizing both performance and power consumption.
Conclusion: Is TCP the Future of AI Hardware?
FuriosaAI’s TCP architecture offers a major step forward in AI hardware design by overcoming the limitations of traditional systolic arrays and GPUs.
- For AI/ML providers, TCP provides higher efficiency, power savings, and easier model deployment.
- For enterprises, TCP reduces operational costs by lowering power consumption and simplifying AI model tuning.
- For AI chip design, TCP represents a shift toward flexible, tensor-native architectures, paving the way for more specialized deep learning accelerators.
With industry players like Meta exploring AI chip acquisitions, TCP’s innovative design could play a critical role in the next generation of AI hardware. Will it replace GPUs entirely? Probably not. But as AI workloads become more demanding, TCP’s efficiency and adaptability position it as a key player in AI acceleration.