Architectural Innovation

Through several underlying technical breakthroughs, MiMo-V2-Flash achieves inference efficiency that surpasses comparable closed-source models while keeping costs low.

Hybrid Attention

6x KV Cache Reduction

Alternates sliding-window and global attention layers in a 5:1 ratio, cutting the KV cache by roughly 6x and mitigating the quadratic cost of long-context attention.
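To make the layer pattern concrete, here is a minimal PyTorch sketch of a 5:1 sliding-window/global schedule and the corresponding causal masks. The layer count, window size, and function names are illustrative assumptions, not MiMo-V2-Flash's actual configuration.

import torch

# Hypothetical schedule: 5 sliding-window layers followed by 1 global layer
# (the ratio comes from the description above; everything else is assumed).
def layer_types(num_layers: int, ratio: int = 5) -> list[str]:
    """Label each layer; every (ratio + 1)-th layer attends globally."""
    return ["global" if (i + 1) % (ratio + 1) == 0 else "sliding"
            for i in range(num_layers)]

def attention_mask(seq_len: int, kind: str, window: int = 4096) -> torch.Tensor:
    """Boolean causal mask; True means the key position may be attended to."""
    q = torch.arange(seq_len).unsqueeze(1)  # query positions (rows)
    k = torch.arange(seq_len).unsqueeze(0)  # key positions (columns)
    causal = k <= q
    if kind == "global":
        return causal                        # full causal attention
    return causal & (q - k < window)         # causal + sliding window

print(layer_types(12))                       # 10 sliding layers, 2 global layers
print(attention_mask(8, "sliding", window=3).int())

Because 5 out of every 6 layers keep only a window-sized slice of keys and values, the per-token KV cache for very long sequences shrinks toward the roughly 6x reduction quoted above.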

Multi-Token Prediction (MTP)

2.0-2.6x Speedup

A built-in lightweight prediction head drafts multiple tokens in a single forward pass, which the main model then verifies in parallel. No extra I/O, with kernels optimized for H200 GPUs.
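As a rough illustration of where the 2.0-2.6x speedup comes from, the sketch below shows a draft-then-verify decoding loop with toy stand-in functions. mtp_draft, main_model_targets, and the acceptance rule are illustrative assumptions, not the MiMo-V2-Flash kernels or its actual acceptance criterion.

import random

random.seed(0)
VOCAB = list(range(100))  # toy vocabulary

def mtp_draft(context: list[int], k: int = 4) -> list[int]:
    """Stand-in for the lightweight MTP head: k draft tokens from one cheap pass."""
    return [random.choice(VOCAB) for _ in range(k)]

def main_model_targets(context: list[int], k: int) -> list[int]:
    """Stand-in for ONE expensive main-model pass that scores all k drafted
    positions in parallel and returns its own token at each one."""
    return [random.choice(VOCAB) for _ in range(k)]

def generate(prompt: list[int], new_tokens: int = 32, k: int = 4) -> list[int]:
    out = list(prompt)
    while len(out) < len(prompt) + new_tokens:
        drafts = mtp_draft(out, k)
        targets = main_model_targets(out + drafts, k)  # one verification pass
        for draft, target in zip(drafts, targets):
            if draft == target:
                out.append(draft)        # accepted draft token, no extra pass
            else:
                out.append(target)       # first mismatch: keep the real token
                break
    return out[: len(prompt) + new_tokens]

# Each loop iteration costs one main-model pass but can emit up to k tokens.
# With a random toy drafter almost nothing is accepted; the real MTP head's
# high acceptance rate is what yields the reported 2.0-2.6x speedup.
print(len(generate([1, 2, 3])))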

Post-Training Distillation

1/50 Training Compute

Multi-teacher online policy distillation matches teacher performance with roughly 1/50 of the compute and supports a "Step-by-Step" thinking mode.
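The sketch below shows one plausible shape for such an objective: the student's next-token distribution on its own rollouts is pulled toward a weighted mixture of teacher distributions. The reverse-KL choice, mixture weights, and temperature are illustrative assumptions, not the recipe from the tech report.

import torch
import torch.nn.functional as F

def multi_teacher_distill_loss(
    student_logits: torch.Tensor,        # [batch, seq, vocab] on student rollouts
    teacher_logits: list[torch.Tensor],  # one [batch, seq, vocab] tensor per teacher
    teacher_weights: list[float],        # assumed mixing weights, summing to 1
    temperature: float = 1.0,
) -> torch.Tensor:
    """KL(student || teacher mixture), averaged over token positions.

    "Online" here means the scored positions come from sequences the student
    itself generated, rather than from a fixed offline corpus.
    """
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)
    mixture = sum(w * F.softmax(t / temperature, dim=-1)
                  for w, t in zip(teacher_weights, teacher_logits))
    teacher_logp = mixture.clamp_min(1e-9).log()
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    return kl.mean()

# Usage with dummy tensors (shapes and vocab size are arbitrary here):
student = torch.randn(2, 8, 1000, requires_grad=True)
teachers = [torch.randn(2, 8, 1000) for _ in range(2)]
loss = multi_teacher_distill_loss(student, teachers, teacher_weights=[0.5, 0.5])
loss.backward()
print(loss.item())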

Performance Benchmarks

Competitive with GPT-5 (High) and Claude Sonnet 4.5 across coding, reasoning, and agentic tasks

*Source: Official Tech Report 2025.12
Benchmark                     MiMo-V2-Flash   Claude Sonnet 4.5   DeepSeek-V3.2   GPT-5 (High)
SWE-Bench Verified (Coding)   73.4%           77.2%               73.1%           74.9%
AIME 2025 (Math)              94.1%           87.0%               93.1%           94.6%
GPQA-Diamond (Reasoning)      83.7%           83.4%               82.4%           85.7%
LongBench V2 (Context)        60.6            61.8                58.4            -

Tech Specs

  • Released: Dec 16, 2025
  • License: MIT (Open Weights)
  • Use Cases: Reasoning, Coding, Agentic Tasks
  • Framework: SGLang (Day-0 Support)
  • Inference Cost: $0.10 / 1M input tokens (Best Value)

Quick Start

We recommend using SGLang for inference to get full MTP and SWA (sliding-window attention) acceleration.

# Install SGLang
pip install sglang

# Launch Server
python -m sglang.launch_server --model-path XiaomiMiMo/MiMo-V2-Flash --port 30000
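Once the server is running, it can be queried over SGLang's OpenAI-compatible API, for example with the openai Python client. The api_key placeholder and sampling parameters below are illustrative; the model name is assumed to match the --model-path used at launch.

from openai import OpenAI

# Point the client at the local SGLang server started above.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="XiaomiMiMo/MiMo-V2-Flash",
    messages=[{"role": "user", "content": "Write a Python function that checks whether a number is prime."}],
    temperature=0.6,
    max_tokens=512,
)
print(response.choices[0].message.content)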

Try on MiMo AI Studio