Huawei Releases New Large Model: Pangu Ultra-MoE-718B-V1.1

DreamActor Team · 2025-11-18 · 5 min read

Huawei has released its latest sparse-expert large language model, Pangu Ultra-MoE-718B-V1.1.

The model has 718B total parameters and activates roughly 39B per token. It is an ultra-large-scale Mixture-of-Experts (MoE) architecture that pairs high capacity with high inference efficiency.


🚀 Key Features

🔢 718B Parameters, 39B Activated Parameters

Ultra-MoE-718B-V1.1 uses a sparse Mixture-of-Experts (MoE) architecture: only a subset of experts is invoked for each token during inference, so per-token compute costs are far lower than those of a dense model of equivalent scale while expressive capacity remains strong.
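
As a rough back-of-envelope illustration (not an official cost model), the activation ratio follows directly from the two figures quoted in this post:

```python
# Back-of-envelope illustration (not an official cost model): per-token compute
# for an MoE model scales with activated parameters, not total parameters.
total_params_b = 718      # total parameters, in billions
active_params_b = 39      # parameters activated per token, in billions

activation_ratio = active_params_b / total_params_b
print(f"Activated fraction per token: {activation_ratio:.1%}")   # ~5.4%
# Rough takeaway: per-token FLOPs are closer to those of a ~39B dense model
# than to a 718B dense model, while total capacity stays at 718B.
```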


🔧 Supports Atlas 800T A2 Inference (Custom-Optimized vLLM)

In official demonstrations, the model completes inference on Atlas 800T A2 (64GB VRAM).

It runs on multi-card clusters through a deeply customized vLLM build optimized for MoE and high parallelism.

Because of the model's large weight and KV-cache memory footprint, inference typically requires at least 32 cards in parallel.
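
To give a sense of what such a deployment looks like, the sketch below uses the stock vLLM Python API with tensor parallelism across 32 devices. Huawei's customized build and its Ascend-specific options are not documented in this post, and the model identifier is a placeholder, so treat this as the shape of the call rather than an actual launch recipe.

```python
# Minimal sketch using the stock vLLM Python API. Huawei's customized build for
# Atlas 800T A2 (Ascend NPUs) likely exposes different or additional options;
# the model name below is a placeholder, not a confirmed repository path.
from vllm import LLM, SamplingParams

llm = LLM(
    model="pangu-ultra-moe-718b-v1.1",   # placeholder model path/name
    tensor_parallel_size=32,             # the post cites >= 32 cards for weights + KV cache
    # Expert-parallel and Ascend-specific flags in the customized build are not
    # documented in this post, so they are omitted from this sketch.
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Prove that the sum of two even integers is even."], params)
print(outputs[0].outputs[0].text)
```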


📈 Strong Mathematical and Logical Capabilities

Ultra-MoE-718B-V1.1 performs strongly across multiple mathematical benchmarks, most notably:

  • AIME25: 77.50%, close to Gemini 2.5 Flash at 78.3%

This indicates high-level capabilities in mathematical reasoning, logical deduction, and rigorous problem-solving.


⚠️ Discussion on Some Benchmark Results

Some of the officially reported code benchmarks (such as LiveCodeBench) are disputed and may not fully reflect real-world performance.

For example, GPT-OSS-120B scores highly on the leaderboard, yet:

  • Its actual code quality is inconsistent

  • It has only a 4K context window, too small to hold the first chapter of "Harry Potter and the Philosopher's Stone" (20K+)

  • Its real-world test results do not match its leaderboard scores

These benchmark results should therefore be treated with caution; this caveat does not call into question Ultra-MoE-718B-V1.1's own mathematical and reasoning performance.


🏗️ Model Architecture Highlights

  • Sparse Expert (MoE) Structure

    Top-k routing selects the most suitable combination of experts for each token (see the routing sketch after this list).

  • Efficient Parallelism Strategy (Expert Parallelism)

    Experts are sharded and run in parallel across large multi-card clusters.

  • Customized vLLM Inference Framework

    Improves inference throughput, reduces latency, and enhances expert scheduling efficiency.

  • 39B Activated Parameters

    Even under MoE sparsification, the model retains very strong effective capacity per token.
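
For readers unfamiliar with top-k routing, here is a minimal PyTorch sketch of a gating network that picks k experts per token and produces mixing weights. The class name, hidden size, expert count, and top-k value are all illustrative assumptions; this is not Pangu's actual gating implementation or configuration.

```python
# Minimal sketch of top-k expert routing in a generic MoE layer.
# All dimensions and counts below are illustrative, not Pangu's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    def __init__(self, hidden_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Gating network scores each token against every expert.
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: [num_tokens, hidden_dim]
        logits = self.gate(x)                                 # [tokens, experts]
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                  # normalize over the chosen experts
        return weights, indices                               # mixing weights and expert indices per token

# Usage: route 4 tokens of width 1024 across 64 experts, activating 2 per token.
router = TopKRouter(hidden_dim=1024, num_experts=64, top_k=2)
tokens = torch.randn(4, 1024)
mix_weights, expert_ids = router(tokens)
print(expert_ids)      # which experts each token is sent to
print(mix_weights)     # how each token's expert outputs are weighted (sum to 1)
```

Only the selected experts' feed-forward blocks are executed for each token, which is what keeps activated compute near 39B parameters despite the 718B total.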


🧩 Application Directions

  • Mathematical reasoning, logical reasoning tasks

  • High-difficulty Q&A

  • Long-text understanding

  • Multi-turn dialogue

  • Research summaries, structured content processing

  • Code generation (should be verified against real-world performance)


📝 Summary

Pangu Ultra-MoE-718B-V1.1 is one of the largest MoE models currently available, with features including:

  • 718B total parameters

  • 39B activated parameters

  • Supports large-scale multi-card inference

  • Strong mathematical capabilities

  • Deep architectural and engineering optimization

It represents an important step forward for the MoE approach in engineering capability, model scale, and inference performance.