Huawei's openPangu 2.0 Is Now Open-Source — 505B MoE, 512K Context, Full Ascend Stack
On June 30, 2026, Huawei delivered on its HDC 2026 promise: openPangu-2.0-Flash weights, base inference code, and training/inference operators went live on GitCode Ascend Tribe. This is the first frontier-scale open-source LLM trained entirely without NVIDIA hardware — and one of the few MoE models planning a full-stack open-source release.
If you need sovereign AI on Ascend, 512K context for long documents, or a NVIDIA-free deployment path, openPangu 2.0 is the headline release of H1 2026. This guide covers: the HDC-to-H2 2026 timeline and all 7 open components; Pro vs Flash specs against DeepSeek, Qwen, and Kimi; mHC, Muon, ModAttn, and DSA+SWA architecture; a six-step ModelArts API and GitCode self-host playbook; hardware thresholds; and a selection matrix for production decisions. Independent third-party benchmarks are not yet public — see the disclaimer at the end.
01 What did openPangu 2.0 open-source? Timeline and 7 components
At HDC 2026 in Dongguan on June 12, Richard Yu's keynote officially launched openPangu 2.0. One week later, Flash weights and the inference stack landed in the open community — Huawei's most significant open-source upgrade since the first Pangu release in 2021.
- Pain point 1: weights-only releases. You can run inference but cannot reproduce training — blocking academic research and domain pre-training.
- Pain point 2: frontier models bind to NVIDIA. Teams blocked from A100/H100 procurement have almost no frontier options.
- Pain point 3: 128K context ceilings. Contracts, codebases, and long conversation histories need larger windows.
- Pain point 4: MoE train/inference drift. Distribution mismatch between training and serving is a well-known MoE production risk.
Release timeline
| Date | Event |
|---|---|
| 2026-06-12 | HDC 2026 keynote: Richard Yu officially launches openPangu 2.0 |
| 2026-06-30 | Flash weights, base inference code, and train/infer operators published on GitCode |
| 2026-07 (planned) | Pro model weights and inference code go live |
| H2 2026 (planned) | Pre-training code, post-training code, and additional training operators roll out |
7 open-source components
- Model architecture (structure definition)
- Model weights (Flash live 6/30; Pro planned for July)
- Technical report (released with weights)
- Inference code (base inference + train/infer operators)
- Pre-training code (H2 2026)
- Post-training code (SFT/RLHF support, H2 2026)
- Training operators (Ascend-optimized custom kernels, H2 2026)
The first four items are standard industry practice. The last three — pre-training, post-training, and custom operators at this scale — are extremely rare for frontier MoE models and define what Huawei calls full-stack open source.
02 openPangu 2.0 Pro vs Flash: specs and how they compare to DeepSeek, Qwen, Kimi
Core parameters — dual variants
| Metric | openPangu 2.0 Pro | openPangu 2.0 Flash |
|---|---|---|
| Total parameters | 505B | 92B |
| Active parameters | 18B | 6B |
| Sparsity ratio | ~28:1 | ~15:1 |
| Context window | 512K | 512K |
| Availability | July 2026 (planned) | Live since 2026-06-30 |
Flash: 92B total, 6B active — inference cost stays low. At ~15:1 sparsity it runs close to a 6B dense model while drawing on a 92B knowledge pool. Single Ascend 910B card is enough; community tests suggest ~96GB unified memory systems may also work.
Pro: 505B total, 18B active — built for heavy long-document workloads. 512K context handles roughly the text volume of eight full-length novels in one pass.
Frontier open-model comparison
| Model | Total params | Active params | Context | Training hardware | Open depth |
|---|---|---|---|---|---|
| openPangu 2.0 Pro | 505B | 18B | 512K | Ascend NPU | Full stack (7 components) |
| openPangu 2.0 Flash | 92B | 6B | 512K | Ascend NPU | Full stack (7 components) |
| DeepSeek V4 Pro | 1.6T | ~200B | 128K | NVIDIA | Weights + inference |
| Qwen 3.7 Max | ~400B+ | varies | 128K | NVIDIA | Weights + inference + partial training |
| Kimi K2.7 | 1T | 32B | 256K | NVIDIA | Weights + inference |
| Llama 4 405B | 405B | — | 128K | NVIDIA | Weights + inference |
Capability matrix
| Capability | openPangu 2.0 Pro | DeepSeek V4 Pro | Qwen 3.7 Max | Kimi K2.7 |
|---|---|---|---|---|
| Code generation | Moderate | Best-in-class | Strong | Strong |
| Complex reasoning | Moderate | Best-in-class | Best-in-class | Strong |
| Tool use / Agent | Strong | Strong | Strong | Best-in-class |
| Ultra-long context | Best-in-class | Moderate | Moderate | Strong |
| Inference efficiency | Best-in-class | Below average | Below average | Strong |
| Sovereign / NVIDIA-free | Best-in-class | Low | Low | Low |
| Full-stack open source | Best-in-class | Partial | Partial | Partial |
03 How is openPangu 2.0 built? mHC, Muon, ModAttn, DSA+SWA, and the Ascend stack
openPangu 2.0 uses a MoE architecture and is the first frontier LLM trained at full scale without any NVIDIA hardware — entirely on Huawei Ascend 910B NPUs, with zero A100 or H100 involvement.
- mHC (Multi-Head Combinatorial) routing: improves expert routing efficiency and reduces load imbalance.
- Muon optimizer: Microsoft's second-order momentum scheme for large-scale training stability.
- ModAttn (Modular Attention): modular attention blocks tuned for 512K context.
- DSA+SWA ultra-sparse attention (Flash only): pushes sparsity further and cuts inference compute demand.
Hardware tuning and training breakthroughs
- Inference throughput: Ascend-native design delivers 2x single-card throughput vs mainstream open models on the same hardware.
- On-device variant: 30B embedded model — 50% faster inference, 20% less memory, runs offline on Kirin-powered phones.
- Latency: 1.2x lower latency than comparable open models.
- Hyper-node training: +30% efficiency on Ascend hyper-node clusters.
- Long-sequence training: +50% throughput at 512K sequence length.
- Train/inference consistency: distribution alignment >99% — a critical MoE production metric.
- Quantized variant: Flash-Int8 with W4A8 support cuts memory use by 40%.
Developer ecosystem
- Software stack: CANN (Huawei's CUDA-class runtime) +
torch_npu(PyTorch adapter). - Framework compatibility: standard PyTorch code; add
import torch_nputo switch to Ascend backend. - Deployment paths: Huawei Cloud ModelArts (managed API); GitCode Ascend Tribe (self-host); HarmonyOS native on-device integration.
04 How to deploy openPangu 2.0: ModelArts API and GitCode self-host in six steps
Option 1: Huawei Cloud ModelArts API (fastest)
- Create a Huawei Cloud account at huaweicloud.com.
- Open ModelArts: Console → ModelArts → AI Gallery.
- Search and subscribe: find "openPangu 2.0" and subscribe to Flash or Pro.
- Collect credentials: copy the API endpoint and auth token after subscription.
- Send a request: use standard Chat Completions JSON format.
- Validate output: confirm responses before wiring into your production Agent pipeline.
curl -X POST "https://modelarts.${REGION}.myhuaweicloud.com/v1/infers/openpangu-2-flash/chat/completions" \
-H "Content-Type: application/json" \
-H "X-Auth-Token: ${TOKEN}" \
-d '{
"model": "openpangu-2.0-flash",
"messages": [
{"role": "user", "content": "Hello, introduce yourself"}
],
"max_tokens": 1024,
"temperature": 0.7
}'
Option 2: GitCode download and self-host
Repository hub: gitcode.com/org/ascend-tribe. Key repos include openPangu-2.0-Flash, openPangu-2.0-Flash-Int8, openPangu-2.0-Infer, and openPangu-2.0-Op.
python inference.py \
--model_path ./openPangu-Flash \
--device npu:0 \
--context_length 512000 \
--precision bf16
python distributed_inference.py \
--model_path ./openPangu-Pro \
--num_devices 8 \
--context_length 512000
python finetune.py \
--model_path ./openPangu-Pro \
--data_path ./domain_data \
--output_dir ./fine_tuned_model \
--method lora \
--lora_rank 16
Option 3: PyTorch + torch_npu
import torch
import torch_npu
model = load_openpangu("./openPangu-Flash")
model = model.to("npu:0")
output = model.generate(
input_ids.to("npu:0"),
max_new_tokens=512,
temperature=0.7
)
05 How much hardware does openPangu 2.0 need? Specs and deployment thresholds
| Variant | Recommended hardware | Minimum config | Notes |
|---|---|---|---|
| Flash (6B active) | Single Ascend 910B | ~96GB unified memory | Community tests on large-memory systems |
| Flash-Int8 | Single Ascend Atlas A2 | ~48GB VRAM | W4A8 quantization; <10% accuracy loss |
| Pro (18B active) | 4+ Ascend 910B cards | Multi-card cluster | Verify after July weight release |
- Total / active params (Pro / Flash): 505B / 92B total; 18B / 6B active; sparsity ~28:1 / ~15:1.
- Context window: 512K tokens on both variants — among the longest in the open-model tier.
- Ascend single-card throughput: 2x mainstream open models on the same NPU hardware.
- Train/inference alignment: >99%, well above typical MoE drift levels.
- Flash-Int8 quantization: 40% memory reduction with <10% accuracy loss.
- Embedded on-device: 30B model — 50% faster inference, 20% less memory.
06 Who should use openPangu 2.0? Selection matrix and strategic significance
Scenario selection matrix
| Scenario | Recommendation | Why |
|---|---|---|
| Code generation / complex reasoning | DeepSeek V4 Pro | ~200B active params, leading benchmarks |
| Agent / multi-tool orchestration | Kimi K2.7 | Most mature MCP ecosystem |
| Ultra-long documents (>256K tokens) | openPangu 2.0 Pro | 512K context is the clear choice |
| Sovereign AI / compliance | openPangu 2.0 | Only frontier model trained entirely without NVIDIA |
| Ascend / Huawei Cloud deployment | openPangu 2.0 | Native optimization, 2x throughput |
| On-device / mobile | openPangu Embedded | 30B on-device, offline on Kirin chips |
| Low-cost local inference | openPangu 2.0 Flash | 6B active, runs on ~96GB systems |
Strategic significance
- Geopolitics and supply chain: under U.S. advanced-chip export restrictions, openPangu 2.0 proves frontier-scale training is achievable without NVIDIA.
- Full-stack open source: researchers can reproduce training end-to-end; enterprises can run domain pre-training on Ascend; lowers the barrier to sovereign AI adoption.
- HarmonyOS Agent foundation: HarmonyOS 7 enters the Agent era with openPangu 2.0 as the native AI engine; HarmonyOS Agent Framework 2.0 reports >90% success on complex tasks.
At HDC 2026, Richard Yu stated: "In my dictionary for the rest of my life, there is no second place — only first. We will go from number one in China to number one in the world."
07 openPangu 2.0 roadmap, openPangu License, and benchmark disclaimer
Open-source roadmap
- 2026-06-30: Flash weights + inference code + train/infer operators (live)
- 2026-07: Pro weights + inference code (planned)
- H2 2026: pre-training code, post-training code, additional operators, data tooling
Track progress at GitCode Ascend Tribe, HDC 2026 official announcements, and Huawei Cloud ModelArts.
openPangu License highlights
- Commercial use permitted
- Royalty-free
- Non-exclusive
- Subject to usage terms in each GitCode repository
Disclaimer: capability assessments in this article are architecture-based estimates. We will update this page when independent third-party benchmark results are published. Published: July 1, 2026.
08 Conclusion: where openPangu 2.0 wins — and JEXCLOUD for Agent development
openPangu 2.0 is not the strongest all-around open model today — DeepSeek V4 Pro still leads on code and complex reasoning. But it is nearly unmatched on five dimensions that matter for sovereign AI deployments:
- 512K ultra-long context — top tier among open models
- NVIDIA-free / sovereign training — the only frontier model trained entirely without NVIDIA hardware
- Ascend-native optimization — 2x throughput vs other models on the same NPU stack
- Full-stack open source — pre-training and post-training code included, rare at this scale
- On-device readiness — runs locally on Kirin-powered phones
If you work on Ascend or Huawei Cloud, process documents beyond 256K tokens, or need a compliance-friendly sovereign AI stack, openPangu 2.0 currently has no direct competitor. Flash weights are downloadable now.
Most production teams split workloads: Ascend cloud inference for the model, and a local Mac environment for Agent orchestration, HarmonyOS/iOS client integration, and CI pipelines. Shared GPU cloud instances often suffer from bandwidth jitter, oversubscription breaking long-lived connections, and multi-tenant contention for unified memory. Local Mac hardware avoids those issues but carries procurement cost and 24/7 maintenance overhead.
For stable OpenClaw, Hermes Agent, or HarmonyOS/iOS integration pipelines, JEXCLOUD multi-region bare-metal Mac nodes are the better fit: dedicated Apple Silicon, no virtualization overhead, elastic monthly scaling, and ~120-second provisioning. See node configs and pricing on the JEXCLOUD pricing page.