Introduction
DeepSeek has introduced an ultra-fast system for long-context model training and inference. This breakthrough, known as NSA (Native Sparse Attention), changes the way AI models handle long texts, making long-context language model training cheaper and more efficient while maintaining accuracy.
NSA outperforms full attention:
The images in this article came from arxiv.org
The Problem with the Traditional Method
Traditional language models use what is known as full attention: every token in a text interacts with every other token. This ensures nothing is missed, but it becomes costly as the number of tokens increases. The number of interactions grows with the square of the number of tokens, so a text with 10,000 tokens produces 100 million interactions.
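To see the quadratic blow-up concretely, here is a minimal sketch (not DeepSeek's code) of naive full attention in PyTorch, along with the interaction counts:

```python
# Minimal sketch (not DeepSeek's code): naive full attention builds an
# n-by-n score matrix, so compute and memory grow with the square of n.
import torch

def full_attention(q, k, v):
    # q, k, v: (seq_len, dim) -- every query scores every key: O(n^2)
    scores = q @ k.T / k.shape[-1] ** 0.5   # (n, n) interaction matrix
    return torch.softmax(scores, dim=-1) @ v

for n in (1_000, 10_000):
    print(f"{n:>6} tokens -> {n * n:>13,} pairwise interactions")
# 10,000 tokens -> 100,000,000 interactions, as noted above.
```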
When a chatbot needs to process lengthy legal documents or hold extended conversations, full attention causes high delays and costs. It is inefficient, and its computing demands grow very quickly.
How NSA Works
Rather than forcing every token to interact with every other, NSA focuses only on the most important parts of the text, reducing the number of tokens it attends to without losing the main ideas.
NSA uses a hierarchical strategy that organises the text into layers. At the highest level, NSA captures the overall structure and then zooms in to pick out the details.
It has three main parts (a simplified sketch of how they combine follows the list):
- Compression: The model groups nearby words into blocks, much like summarising each paragraph into a headline. A small neural network then converts each block into a single token that captures the main idea.
- Selection: Once compressed into tokens, NSA decides which tokens are worth paying attention to. It assigns an “importance score” to each block by using intermediate information from the compression step.
- Sliding Window: Even with a great summary of the whole document, the model still needs to pay attention to the most recent or nearby words to get a full understanding. This mechanism ensures the immediate context isn't lost. It takes a fixed number of the latest tokens and processes them separately to capture local nuances.
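The sketch below shows how these three branches might fit together for a single attention head. It is illustrative only: mean-pooling stands in for the paper's learned compression network, a plain average stands in for its learned gate, and the block size, top-k, and window values are arbitrary.

```python
# Simplified single-head sketch of NSA's three branches (illustrative only).
import torch
import torch.nn.functional as F

def nsa_sketch(q, k, v, block=64, top_k=4, window=256):
    n, d = k.shape
    nb = n // block

    # 1) Compression: pool each block of keys/values into one token
    #    (the paper uses a small learned network; mean-pooling stands in).
    kb = k[: nb * block].view(nb, block, d).mean(1)
    vb = v[: nb * block].view(nb, block, d).mean(1)
    comp = F.scaled_dot_product_attention(q[None], kb[None], vb[None])[0]

    # 2) Selection: score blocks using the compressed keys, keep the top-k,
    #    then attend over the raw tokens inside the chosen blocks only.
    scores = q[-1] @ kb.T                     # importance score per block
    idx = scores.topk(min(top_k, nb)).indices
    sel_tok = torch.cat(
        [torch.arange(i * block, (i + 1) * block) for i in idx.tolist()])
    sel = F.scaled_dot_product_attention(
        q[None], k[sel_tok][None], v[sel_tok][None])[0]

    # 3) Sliding window: always attend to the most recent tokens.
    win = F.scaled_dot_product_attention(
        q[None], k[-window:][None], v[-window:][None])[0]

    # The real model combines the branches with a learned gate; a plain
    # average stands in for it here.
    return (comp + sel + win) / 3

q = k = v = torch.randn(4096, 64)
out = nsa_sketch(q, k, v)   # (4096, 64), far fewer interactions than 4096^2
```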
Speed and Efficiency
For deeper technical details, see the research paper.
NSA is optimised for today’s GPUs, especially Nvidia’s H800 series. Using custom kernels built with the Triton framework, NSA loads groups of tokens into very fast on-chip memory (SRAM) and processes them in parallel. By reducing repeated memory access, NSA cuts down on wasted time and speeds up performance.
For instance, in tests on an 8-GPU system, NSA’s optimised kernels delivered up to nine times faster forward passes and six times faster backward passes when processing sequences of 64,000 tokens. During decoding (when the model generates text), NSA only needs to load a small fraction of tokens instead of the full set. This results in speedups of more than 11 times compared to traditional methods.
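A hedged sketch of why decoding gets so much faster: at each generation step the model reads only the selected blocks plus a recent window from the key-value cache, not the whole cache. The block and window sizes here are illustrative, not the paper's settings.

```python
# Sketch: sparse decoding reads a small fraction of the KV cache per step.
import torch

def decode_step_kv(kv_cache, selected_blocks, block=64, window=512):
    # kv_cache: (seq_len, dim); selected_blocks: indices of important blocks
    ids = torch.cat(
        [torch.arange(b * block, (b + 1) * block) for b in selected_blocks])
    recent = torch.arange(max(0, kv_cache.shape[0] - window), kv_cache.shape[0])
    keep = torch.unique(torch.cat([ids, recent]))
    return kv_cache[keep]                    # the only memory actually read

cache = torch.randn(64_000, 128)
small = decode_step_kv(cache, selected_blocks=[3, 117, 500])
print(f"reads {small.shape[0]:,} of {cache.shape[0]:,} tokens "
      f"({small.shape[0] / cache.shape[0]:.1%})")
# reads 704 of 64,000 tokens (1.1%) -- this is where the 11x speedup comes from.
```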
Real-World Testing
In testing, NSA was evaluated on a transformer model with 27 billion parameters, of which only 3 billion were active at any one time. This model was spread across 30 layers and used a method called Mixture-of-Experts (which divides the workload among several specialised parts) to handle the data.
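For readers unfamiliar with Mixture-of-Experts, here is a generic, minimal routing sketch (not DeepSeek's architecture; the dimensions, expert count, and top-k are invented for illustration). Each token is routed to just a few experts, so only a fraction of the total parameters is active per token, which is how a 27B-parameter model can run with only 3B active.

```python
# Generic Mixture-of-Experts sketch (illustrative, not DeepSeek's design).
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)   # scores experts per token
        self.experts = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                         # x: (tokens, dim)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):            # only top-k experts run
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinyMoE()
y = moe(torch.randn(10, 64))   # each token used only 2 of the 8 experts
```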
The training was carried out on a dataset of 270 billion tokens. The model was initially trained on sequences of 8,000 tokens and then fine-tuned on texts stretching up to 32,000 tokens.
When tested on various benchmarks such as knowledge tests (like MMLU), math problem sets (like GSM8K), reading comprehension tests (like DROP), and even coding challenges (MBPP and HumanEval), NSA consistently matched or outperformed the full attention approach.
One particularly impressive test was the “needle-in-a-haystack” challenge, where NSA had to retrieve a small piece of information from a 64,000-token document. NSA achieved perfect accuracy, demonstrating its ability to handle very long contexts with ease. Furthermore, when fine-tuned for step-by-step reasoning (known as chain-of-thought), the NSA version of the model scored significantly higher on a difficult math exam (AIME 24) than the full attention model.
NSA also processes faster than full attention:
| Metric | NSA | Full Attention |
|---|---|---|
| Tokens processed/second | 4,900 | 420 |
| Response time (10k tokens) | 2.1 sec | 23.8 sec |
Efficiency Gains and Impact
- 9× faster training (forward passes)
- 6× faster learning (backward passes)
- 11.6× faster text generation (decoding)
- Able to handle 3× more conversations
Impact: A 64k-token sequence that once took 10 hours with full attention now finishes in under 55 minutes, saving $2,200 per day!
NSA is cheaper, faster, and more efficient than full attention!
NSA Versus Other Methods
| | Native Sparse Attention | Full Attention |
|---|---|---|
| Computation | Hierarchical approach that cuts down interactions | Every token interacts with every other token |
| Inference speedup | Up to 11.6× faster decoding; 9× forward, 6× backward speed at 64k tokens | Baseline speed; no inherent speedup |
| Memory access efficiency | Loads only selected token blocks; group-wise and block-based, reducing redundant memory transfers | Must access the full key-value cache for each query |
| Efficiency | Optimised custom Triton kernels on modern GPUs (e.g., Nvidia H800) provide high throughput | Standard implementation; slower with long sequences |
| Costs | Lower computational and energy costs; reduced need for expensive hardware | High hardware and energy costs; more expensive to scale |
| Performance on benchmarks | Matches or outperforms full attention on long-context tasks and reasoning challenges | Strong accuracy but suffers on long-context and reasoning tasks |
Conclusion
AI keeps improving, and NSA pushes language modelling toward greater accessibility. In the future, training a powerful, advanced model at low cost, with a smaller carbon footprint, may be possible for everyone, not just giant tech companies with deep pockets.
Read More: How DeepSeek Copy ChatGPT