Huawei Releases New Large Model: Pangu Ultra-MoE-718B-V1.1
Huawei has released its latest sparse-expert large model, Pangu Ultra-MoE-718B-V1.1.
The model has 718B total parameters with approximately 39B activated per token, an ultra-large-scale Mixture-of-Experts (MoE) architecture that combines high capacity with efficient inference.
🚀 Key Features
🔢 718B Parameters, 39B Activated Parameters
Ultra-MoE-718B-V1.1 adopts a sparse-expert (MoE) architecture in which only a subset of experts is invoked for each token during inference, so its cost is far lower than a dense model of equivalent scale while strong expressive capability is retained.
🔧 Supports Atlas 800T A2 Inference (Custom-Optimized vLLM)
In official demonstrations, the model can run inference on Atlas 800T A2 (64GB of memory per card).
Using a deeply customized vLLM build (optimized for MoE and high-degree parallelism), the model can run on multi-card clusters.
Because of the massive weight and KV-cache memory requirements, inference typically needs at least 32 cards in parallel; a deployment sketch follows below.
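To illustrate what such a multi-card deployment might look like, here is a minimal sketch using the standard vLLM Python API. The checkpoint path, parallelism sizes, and expert-parallel flag are assumptions; Huawei's customized Ascend build may expose different options.

```python
# Minimal multi-card MoE inference sketch with the standard vLLM Python API.
# Model path, parallel sizes, and flag availability are assumptions; the
# Ascend-customized vLLM build used in the official demos may differ.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/pangu-ultra-moe-718b-v1.1",  # hypothetical local checkpoint path
    tensor_parallel_size=8,       # cards per node (assumed)
    pipeline_parallel_size=4,     # nodes, 8 x 4 = 32 cards total (assumed)
    enable_expert_parallel=True,  # expert-parallel toggle (assumed; depends on vLLM version)
    dtype="bfloat16",
)

outputs = llm.generate(
    ["Prove that the sum of two even integers is even."],
    SamplingParams(temperature=0.6, max_tokens=512),
)
print(outputs[0].outputs[0].text)
```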
📈 Strong Mathematical and Logical Capabilities
Ultra-MoE-718B-V1.1 performs excellently on multiple mathematical benchmarks, especially:
- AIME25: 77.50% (close to Gemini 2.5 Flash at 78.3%)
This indicates high-level capabilities in mathematical reasoning, logical deduction, and rigorous problem-solving.
⚠️ Discussion on Some Benchmark Results
Some of the officially reported code-related benchmarks (such as LiveCodeBench) are controversial and do not fully reflect real-world performance.
For example, GPT-OSS-120B, which scores highly on the leaderboard:
- Its actual code quality is unstable
- It only has a 4K context, not enough to hold the first chapter of "Harry Potter and the Philosopher's Stone" (20K+ tokens)
- Real-world testing does not match its leaderboard scores
The reliability of these benchmarks should therefore be treated with caution; this caveat does not affect Ultra-MoE-718B-V1.1's own mathematical and reasoning results. One quick way to sanity-check context-length claims like the one above is sketched below.
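A simple way to verify whether a text fits a given context window is to count its tokens directly. This is a minimal sketch using the tiktoken library; the choice of the cl100k_base encoding and the local file path are assumptions, and exact counts vary by tokenizer.

```python
# Minimal sketch: count tokens in a text file to check whether it fits a given
# context window. Encoding choice and file path are assumptions; token counts
# differ between tokenizers.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

with open("chapter1.txt", "r", encoding="utf-8") as f:  # hypothetical local copy of the chapter
    text = f.read()

n_tokens = len(enc.encode(text))
context_window = 4096  # the context size claimed in the discussion above

print(f"{n_tokens} tokens; fits in {context_window}-token context: {n_tokens <= context_window}")
```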
🏗️ Model Architecture Highlights
Sparse Expert (MoE) Structure
Top-k routing selects the most suitable combination of experts for each token (see the routing sketch after this list).
Efficient Parallelism Strategy (Expert Parallelism)
Experts are sharded across devices, enabling distributed parallelism on large-scale clusters.
Customized vLLM Inference Framework
Improves inference throughput, reduces latency, and enhances expert scheduling efficiency.
39B Activated Parameters
Retains very strong effective capacity even under MoE sparsification.
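To make the routing idea concrete, here is a small, generic top-k MoE routing sketch in PyTorch. The hidden size, expert count, and k value are illustrative assumptions, not Pangu Ultra-MoE-718B-V1.1's actual configuration.

```python
# Generic top-k MoE routing sketch (illustrative only; dimensions, expert count,
# and k are assumptions and not Pangu's real configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, hidden: int = 256, n_experts: int = 16, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(hidden, n_experts)  # scores each token against every expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (tokens, hidden)
        scores = self.router(x)                            # (tokens, n_experts)
        top_vals, top_idx = scores.topk(self.k, dim=-1)    # keep only k experts per token
        weights = F.softmax(top_vals, dim=-1)               # normalize the selected scores
        out = torch.zeros_like(x)
        for slot in range(self.k):                           # only the selected experts run
            for e in top_idx[:, slot].unique().tolist():
                mask = top_idx[:, slot] == e                 # tokens routed to expert e in this slot
                out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

tokens = torch.randn(16, 256)
print(TopKMoE()(tokens).shape)  # torch.Size([16, 256])
```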
🧩 Application Directions
Mathematical reasoning, logical reasoning tasks
High-difficulty Q&A
Long-text understanding
Multi-turn dialogue
Research summaries, structured content processing
Code generation (should be verified against real-world performance; see the client sketch below)
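As an illustration of how one of these tasks could be exercised against a deployed endpoint, here is a minimal multi-turn chat sketch using the OpenAI-compatible API that vLLM can expose; the endpoint URL and served model name are assumptions.

```python
# Minimal multi-turn chat sketch against an OpenAI-compatible endpoint such as
# one exposed by vLLM. The base_url and model name are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

messages = [
    {"role": "user", "content": "Solve: if 3x + 7 = 22, what is x?"},
]
reply = client.chat.completions.create(
    model="pangu-ultra-moe-718b-v1.1",  # hypothetical served model name
    messages=messages,
)
messages.append({"role": "assistant", "content": reply.choices[0].message.content})

# Follow-up turn reuses the conversation history for multi-turn dialogue.
messages.append({"role": "user", "content": "Now explain each step briefly."})
reply = client.chat.completions.create(model="pangu-ultra-moe-718b-v1.1", messages=messages)
print(reply.choices[0].message.content)
```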
📝 Summary
Pangu Ultra-MoE-718B-V1.1 is one of the largest MoE models currently available, with features including:
718B total parameters
39B activated parameters
Supports large-scale multi-card inference
Strong mathematical capabilities
Deep architectural and engineering optimization
It represents an important advance for the MoE approach in engineering capability, model scale, and inference performance.