Huawei Releases New Large Model: Pangu Ultra-MoE-718B-V1.1
Huawei has released its latest sparse-expert large model, Pangu Ultra-MoE-718B-V1.1.
The model has 718B total parameters with approximately 39B activated per token, an ultra-large-scale Mixture-of-Experts (MoE) architecture that combines high capacity with efficient inference.
🚀 Key Features
🔢 718B Parameters, 39B Activated Parameters
Ultra-MoE-718B-V1.1 adopts a sparse-expert (MoE) architecture in which only a subset of experts is invoked for each token during inference, so its cost is far lower than a dense model of equivalent scale while strong expressive capability is retained.
🔧 Supports Atlas 800T A2 Inference (Custom-Optimized vLLM)
In official demonstrations, the model can run inference on Atlas 800T A2 (64GB of memory per card).
Using a deeply customized vLLM build (optimized for MoE and high-degree parallelism), the model can run on multi-card clusters.
Because of the massive weight and KV-cache memory requirements, inference typically needs at least 32 cards in parallel; a deployment sketch follows below.
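To illustrate what such a multi-card deployment might look like, here is a minimal sketch using the standard vLLM Python API. The checkpoint path, parallelism sizes, and expert-parallel flag are assumptions; Huawei's customized Ascend build may expose different options.

```python
# Minimal multi-card MoE inference sketch with the standard vLLM Python API.
# Model path, parallel sizes, and flag availability are assumptions; the
# Ascend-customized vLLM build used in the official demos may differ.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/pangu-ultra-moe-718b-v1.1",  # hypothetical local checkpoint path
    tensor_parallel_size=8,       # cards per node (assumed)
    pipeline_parallel_size=4,     # nodes, 8 x 4 = 32 cards total (assumed)
    enable_expert_parallel=True,  # expert-parallel toggle (assumed; depends on vLLM version)
    dtype="bfloat16",
)

outputs = llm.generate(
    ["Prove that the sum of two even integers is even."],
    SamplingParams(temperature=0.6, max_tokens=512),
)
print(outputs[0].outputs[0].text)
```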
📈 Strong Mathematical and Logical Capabilities
Ultra-MoE-718B-V1.1 performs excellently on multiple mathematical benchmarks, especially:
- AIME25: 77.50% (close to Gemini 2.5 Flash at 78.3%)
This indicates high-level capabilities in mathematical reasoning, logical deduction, and rigorous problem-solving.
⚠️ Discussion on Some Benchmark Results
Some of the officially reported code-related benchmarks (such as LiveCodeBench) are controversial and do not fully reflect real-world performance.
For example, GPT-OSS-120B, which scores highly on the leaderboard:
- Its actual code quality is unstable
- It only has a 4K context, not enough to hold the first chapter of "Harry Potter and the Philosopher's Stone" (20K+ tokens)
- Real-world testing does not match its leaderboard scores
The reliability of these benchmarks should therefore be treated with caution; this caveat does not affect Ultra-MoE-718B-V1.1's own mathematical and reasoning results. One quick way to sanity-check context-length claims like the one above is sketched below.
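A simple way to verify whether a text fits a given context window is to count its tokens directly. This is a minimal sketch using the tiktoken library; the choice of the cl100k_base encoding and the local file path are assumptions, and exact counts vary by tokenizer.

```python
# Minimal sketch: count tokens in a text file to check whether it fits a given
# context window. Encoding choice and file path are assumptions; token counts
# differ between tokenizers.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

with open("chapter1.txt", "r", encoding="utf-8") as f:  # hypothetical local copy of the chapter
    text = f.read()

n_tokens = len(enc.encode(text))
context_window = 4096  # the context size claimed in the discussion above

print(f"{n_tokens} tokens; fits in {context_window}-token context: {n_tokens <= context_window}")
```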
🏗️ Model Architecture Highlights
Sparse Expert (MoE) Structure
Top-k routing selects the most suitable combination of experts for each token (see the routing sketch after this list).
Efficient Parallelism Strategy (Expert Parallelism)
Experts are sharded across devices, enabling distributed parallelism on large-scale clusters.
Customized vLLM Inference Framework
Improves inference throughput, reduces latency, and enhances expert scheduling efficiency.
39B Activated Parameters
Retains very strong effective capacity even under MoE sparsification.
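To make the routing idea concrete, here is a small, generic top-k MoE routing sketch in PyTorch. The hidden size, expert count, and k value are illustrative assumptions, not Pangu Ultra-MoE-718B-V1.1's actual configuration.

```python
# Generic top-k MoE routing sketch (illustrative only; dimensions, expert count,
# and k are assumptions and not Pangu's real configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, hidden: int = 256, n_experts: int = 16, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(hidden, n_experts)  # scores each token against every expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (tokens, hidden)
        scores = self.router(x)                            # (tokens, n_experts)
        top_vals, top_idx = scores.topk(self.k, dim=-1)    # keep only k experts per token
        weights = F.softmax(top_vals, dim=-1)               # normalize the selected scores
        out = torch.zeros_like(x)
        for slot in range(self.k):                           # only the selected experts run
            for e in top_idx[:, slot].unique().tolist():
                mask = top_idx[:, slot] == e                 # tokens routed to expert e in this slot
                out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

tokens = torch.randn(16, 256)
print(TopKMoE()(tokens).shape)  # torch.Size([16, 256])
```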
🧩 Application Directions
Mathematical reasoning, logical reasoning tasks
High-difficulty Q&A
Long-text understanding
Multi-turn dialogue
Research summaries, structured content processing
Code generation (should be verified against real-world performance; see the client sketch below)
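As an illustration of how one of these tasks could be exercised against a deployed endpoint, here is a minimal multi-turn chat sketch using the OpenAI-compatible API that vLLM can expose; the endpoint URL and served model name are assumptions.

```python
# Minimal multi-turn chat sketch against an OpenAI-compatible endpoint such as
# one exposed by vLLM. The base_url and model name are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

messages = [
    {"role": "user", "content": "Solve: if 3x + 7 = 22, what is x?"},
]
reply = client.chat.completions.create(
    model="pangu-ultra-moe-718b-v1.1",  # hypothetical served model name
    messages=messages,
)
messages.append({"role": "assistant", "content": reply.choices[0].message.content})

# Follow-up turn reuses the conversation history for multi-turn dialogue.
messages.append({"role": "user", "content": "Now explain each step briefly."})
reply = client.chat.completions.create(model="pangu-ultra-moe-718b-v1.1", messages=messages)
print(reply.choices[0].message.content)
```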
📝 Summary
Pangu Ultra-MoE-718B-V1.1 is one of the largest MoE models currently available, with features including:
718B total parameters
39B activated parameters
Supports large-scale multi-card inference
Strong mathematical capabilities
Deep architectural and engineering optimization
It represents an important advance for the MoE approach in engineering capability, model scale, and inference performance.