DeepSeek’s New AI Trick Could Speed Up Responses by 85 Percent


DeepSeek has drawn attention with a claim that could matter far beyond its own model lineup. The company says its new DSpark framework can speed up AI responses by as much as 85 percent without relying on flagship chips.

That matters because inference speed and hardware access have become two of the biggest pressure points in the AI industry. For Chinese companies in particular, access to top-tier Nvidia AI chips remains difficult under U.S. export controls, making efficiency gains especially valuable.

How DSpark is designed to work

DSpark is a speculative decoding framework built for DeepSeek’s V4 family of models. Instead of having the main model generate every token one at a time, the system uses a lighter draft model to propose the next several tokens first.

The main model then checks those proposals in batches. When the draft tokens match what the main model would have produced, the system accepts them together and moves forward more quickly; when they do not, generation falls back to the main model's standard path.
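The draft-then-verify loop can be illustrated with a minimal Python sketch. This is not DeepSeek's code: the function names, the toy "models" (plain functions that map a token prefix to the next token), and the greedy acceptance rule are all assumptions chosen to show the general technique. A real system would verify all draft positions in one batched forward pass rather than a Python loop.

```python
from typing import Callable, List, Tuple

# Assumption: a "model" is any function mapping a token prefix to its next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n: int) -> List[int]:
    """Baseline: the main model produces one token per sequential step."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out[len(prompt):]


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    n: int,
    draft_len: int = 3,
) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (new tokens, verification rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n:
        # 1. The cheap draft model proposes draft_len tokens in a row.
        proposal: List[int] = []
        for _ in range(draft_len):
            proposal.append(draft(out + proposal))
        # 2. The main model checks every proposed position (one round).
        rounds += 1
        accepted: List[int] = []
        for i, guess in enumerate(proposal):
            correct = target(out + proposal[:i])
            accepted.append(correct)
            if guess != correct:
                # 3. First mismatch: keep the main model's token, drop the rest.
                break
        else:
            # Every guess matched, so the same check also yields one extra token.
            accepted.append(target(out + proposal))
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n], rounds


# Hypothetical toy models: the target counts upward mod 10,
# and the draft agrees except right after a 4.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    return 9 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


tokens, rounds = speculative_decode(toy_target, toy_draft, [0], 12)
print(tokens == greedy_decode(toy_target, [0], 12), rounds)
```

In this toy run the output is identical to what the main model would produce alone, but it takes 4 verification rounds instead of 12 sequential steps. The saving depends entirely on how often the draft guesses correctly.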

DeepSeek says this approach works because many tokens are relatively easy to predict. The company also says the entire process stays on GPU, which avoids the extra latency that can come from moving work to the CPU.
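How much "easy to predict" tokens help can be estimated with a common back-of-the-envelope model from the speculative decoding literature. The sketch below assumes each drafted token is accepted independently with the same probability, which is a simplification; the acceptance rates used are illustrative and are not figures reported by DeepSeek.

```python
def expected_tokens_per_round(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification round.

    Assumes each drafted token is accepted independently with probability
    accept_rate, and that every round emits the accepted prefix plus one
    token from the main model (a correction or a bonus token).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


# Illustrative acceptance rates only.
for rate in (0.5, 0.8, 0.95):
    print(rate, round(expected_tokens_per_round(rate, draft_len=4), 2))
```

Under these assumptions, a draft that is right 80 percent of the time yields a little over three tokens per main-model check with four drafted tokens, while a draft that is never right yields one token per check, the same as ordinary decoding.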

The framework also uses semi-autoregressive generation. That means it can produce small chunks of tokens at once rather than always generating output one token at a time.
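A rough sketch of the chunked idea, in Python, is shown here. The article does not say where in DSpark the chunked generation happens or how large the chunks are, so the `next_chunk` callback, the chunk size, and the toy model are all hypothetical; the point is only that emitting several tokens per step reduces the number of sequential steps.

```python
from typing import Callable, Iterator, List

# Assumption: a chunked model returns up to `size` next tokens for a prefix.
NextChunk = Callable[[List[int], int], List[int]]


def generate_in_chunks(
    next_chunk: NextChunk,
    prompt: List[int],
    max_new_tokens: int,
    chunk_size: int,
) -> Iterator[List[int]]:
    """Yield output in small groups of tokens instead of one token per step."""
    out = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        size = min(chunk_size, max_new_tokens - produced)
        chunk = next_chunk(out, size)
        if not chunk:
            break  # the model produced nothing more, so stop
        out.extend(chunk)
        produced += len(chunk)
        yield chunk


# Hypothetical toy model: continues counting upward from the last token.
def toy_next_chunk(prefix: List[int], size: int) -> List[int]:
    return [prefix[-1] + i for i in range(1, size + 1)]


chunks = list(generate_in_chunks(toy_next_chunk, [0], max_new_tokens=10, chunk_size=4))
print(len(chunks), chunks)
```

Here ten tokens arrive in three steps rather than ten. In practice the trade-off is that tokens inside a chunk are predicted with less context from one another, which is one reason a stronger model is still used to check the result.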

DSpark Feature	What It Does	Why It Matters
Speculative decoding	A lighter draft model proposes tokens first	Reduces the number of sequential generation steps the main model must run
GPU-only processing	Keeps the full workflow on GPU	Helps limit added latency from CPU transfers
Semi-autoregressive generation	Generates small token groups at a time	Can make output appear faster to users

The efficiency claim DeepSeek is putting forward

To illustrate the impact, DeepSeek says a single GPU that previously handled 100 user requests could handle about 185 with DSpark. That is not presented as a model intelligence upgrade, but as a serving efficiency improvement.
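It helps to spell out what the 100-to-185 figure implies, because a throughput gain and a reduction in time per request are not the same percentage. The short Python calculation below uses only the two numbers DeepSeek cites; the assumption that average GPU time per request is the inverse of throughput on a fully loaded GPU is a simplification added here.

```python
def throughput_gain(before: float, after: float) -> float:
    """Fractional increase in requests served by the same hardware."""
    return after / before - 1.0


def time_per_request_reduction(before: float, after: float) -> float:
    """Fractional drop in average GPU time per request.

    Assumes the GPU is fully loaded in both cases, so time per request
    is simply the inverse of throughput.
    """
    return 1.0 - before / after


gain = throughput_gain(100, 185)
reduction = time_per_request_reduction(100, 185)
print(f"throughput gain: {gain:.0%}")                 # 85%
print(f"time per request reduction: {reduction:.0%}") # 46%
```

In other words, serving 85 percent more requests on the same GPU corresponds to roughly 46 percent less GPU time per request under this simplification, not 85 percent less.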

The distinction is important. DeepSeek says DSpark is meant to make inference faster and more efficient, not to make the model smarter or more capable in a general sense.

That framing places the framework squarely in the current AI arms race over cost and throughput. Data centers now need huge numbers of advanced GPUs to run large-scale models, while demand for AI services continues to climb.

At the same time, token costs remain under scrutiny for many companies. DeepSeek points to cases where firms such as Uber and Walmart have limited employee AI token usage because costs keep rising.

Open release and wider testing

DeepSeek says the DSpark research has been released publicly on GitHub and Hugging Face. The project was developed in collaboration with Peking University, suggesting that the framework is meant to be examined and used more broadly.

The company also says DSpark has already been tested on other open-source models, including Gemma from Google DeepMind and Qwen from Alibaba. That indicates the approach may have value beyond DeepSeek’s own ecosystem.

A broader rollout would matter if the gains prove consistent across different model families. In that case, DSpark could become a general efficiency method rather than just an internal optimization tool.

The launch follows DeepSeek’s earlier V4 Preview release in April, which was positioned as a cost-efficient option for handling inputs with a 1-million-token context. DeepSeek said V4-Pro was built for higher performance, while V4-Flash was designed to be faster and cheaper.

Competition is shifting toward speed

DeepSeek is not alone in chasing faster inference. Earlier this month, Xiaomi’s AI team said its MiMo-V2.5-Pro-UltraSpeed model had reached output speeds above 1,000 tokens per second, which it described as among the fastest in the industry.

That competition shows how speed has become a major battleground alongside capability. For many businesses, a model that is faster and cheaper can be just as valuable as one that is more powerful, especially when compute budgets keep expanding.

DeepSeek’s DSpark puts that idea into practice by focusing on throughput rather than raw intelligence. If the company’s claims hold up in real-world use, the framework could help providers serve more requests without a major increase in infrastructure spending.