2026 antirez ds4 local DeepSeek V4: 96GB threshold, Metal performance, and high-memory cloud Mac rental decision guide
Redis creator Salvatore Sanfilippo (antirez) recently open-sourced ds4 (DwarfStar 4), a pure-C inference engine built exclusively for DeepSeek V4 Flash. It is the first stack to make a 284B-parameter MoE model practically runnable on consumer-grade Apple Silicon Macs. The project crossed 10,000 GitHub stars within days, and community attention rivals the early Redis launch.
By the end of this article you should be able to answer three questions. First, how ds4 differs in essence from general stacks such as llama.cpp and Ollama. Second, how much unified memory Flash and PRO paths require, and what order of magnitude the official benchmarks sit at. Third, when buying a Mac Studio runs into tens of thousands of dollars, how on-demand high-memory bare-metal Mac rental through JEXCLOUD multi-region nodes becomes the more realistic entry point, plus a six-step rollout checklist.
01 What ds4 is: single-model focus and why it exploded in 2026
Most local inference tools take the generalist route. llama.cpp loads hundreds of architectures, Ollama wraps it with a friendly CLI, and MLX targets models converted for the Apple ecosystem. ds4 goes the opposite way: one mainline model, DeepSeek V4 Flash. The README states it is "intentionally narrow." It is not a universal GGUF loader and does not wrap other runtimes. Instead it ships a self-contained Metal and CUDA graph executor together with DS4-specific loading, prompt rendering, tool calling, KV state (memory and disk), a ds4-server API, and a built-in coding agent.
In public interviews antirez said he spent about one intensive week validating whether a local model could replace daily Claude and GPT calls. That narrative is the engine behind ds4's momentum: the bottleneck is not the inference abstraction layer, but whether you have a frontier-class open-weight model that fits in a large-memory machine. DeepSeek V4 Flash, a 284B total / roughly 13B active MoE, combined with ds4's asymmetric 2/8-bit quantization and disk-backed KV, turns "offline coding agent on a Mac" from a demo into something you can use every day.
- Clear target hardware: Metal is the primary macOS backend from day one, aimed at MacBook Pro and Mac Studio machines with 96GB or more unified memory. Linux CUDA support is advancing in parallel, including DGX Spark class workstations.
- Fast community validation: Third-party reviews on 128GB MacBooks ran 18 real tasks covering long-context coding, tool calling, and agent loops. The takeaway is that a specialized engine plus dedicated GGUF weights is the first combination to pull a massive MoE down to acceptable latency.
- Complements cloud APIs: ds4 fits fixed-model, privacy-sensitive, offline workflows. When you need full-precision quality or a shared team endpoint, cloud APIs still win. The choice should not be treated as all-or-nothing.
One-line summary: ds4 does one thing, making DeepSeek V4 Flash usable on a Mac, and gives up generality to do it. The heat comes from technical feasibility combined with antirez's personal credibility.
02 ds4 technical highlights and a general local inference decision matrix
Before you invest in ds4, separate two goals: "I want to swap models for fun" versus "I want DeepSeek V4 Flash as daily productivity." The matrix below compares three common paths so you and your team can align expectations.
| Dimension | ds4 (DwarfStar 4) | llama.cpp / Ollama / MLX | Cloud API (Claude / GPT etc.) |
|---|---|---|---|
| Model scope | DeepSeek V4 Flash only (plus evolving PRO path in the repo) | Many architectures and quantizations, weekly model drops | Vendor full catalog, closed or hosted open models |
| Hardware focus | 96GB+ unified-memory Mac; CUDA workstations with large VRAM | Depends on model; small models run on 16GB machines | No local hardware; pay per token |
| Differentiators | Disk KV persistence, million-token context design, native tool calling, ds4-server OpenAI / Anthropic compatible | Rich plugin ecosystem, many community quant schemes | Full quality, multimodal, enterprise SLA |
| Privacy and offline | Weights and inference stay on your machine or dedicated instance | Same, but large models still need enough RAM | Data passes through third parties; network required |
| Typical pain points | High entry cost (memory, download, compile); single model | Very large MoE models often fail or crawl | Long-term token cost, compliance, rate limits |
A few technical points are worth remembering on their own because they drive the "why Mac" conversation:
- Metal graph executor: Operator fusion tuned for DeepSeek V4 Flash, not generic graph traversal. Official benchmarks on an M3 Ultra with 512GB reach hundreds of tokens per second on long-prompt prefill (see section 05; data from the antirez/ds4 README).
- Asymmetric quantization: More aggressive 2-bit on routed experts, higher precision elsewhere, so Flash runs on 128GB-class machines. The README also documents a q4 path on a 512GB Mac Studio.
- Disk KV cache: Session KV can persist to disk. Combined with fast macOS SSDs, context survives restarts and repeated prefill drops. That matters most for repo-scale agent tasks.
- Built-in coding agent: The CLI and `ds4-server` are tested against Cursor, opencode, and similar toolchains, reducing the glue code needed to wire a local model into your IDE.
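To make the asymmetric quantization point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the 2/8-bit split mentioned earlier (2-bit routed experts, 8-bit everything else) and treats the share of parameters living in routed experts as a hypothetical input, since the article does not state it. It ignores quantization metadata, KV cache, and activations, so read the result as a lower bound on weight size, not as a ds4 measurement.

```python
def quantized_weight_gb(total_params: float, expert_fraction: float,
                        expert_bits: int = 2, other_bits: int = 8) -> float:
    """Estimate weight size in decimal GB for a two-tier quantization scheme.

    expert_fraction is the (assumed) share of parameters stored at expert_bits;
    the remainder is stored at other_bits.
    """
    if not 0.0 <= expert_fraction <= 1.0:
        raise ValueError("expert_fraction must be between 0 and 1")
    expert_bytes = total_params * expert_fraction * expert_bits / 8
    other_bytes = total_params * (1.0 - expert_fraction) * other_bits / 8
    return (expert_bytes + other_bytes) / 1e9


if __name__ == "__main__":
    TOTAL_PARAMS = 284e9  # 284B total parameters, as stated in the article
    # Hypothetical expert shares; the real split is not given in the article.
    for fraction in (0.90, 0.95, 0.98):
        size_gb = quantized_weight_gb(TOTAL_PARAMS, fraction)
        print(f"expert share {fraction:.0%}: ~{size_gb:.1f} GB of weights")
```

Running the sweep shows how sensitive the footprint is to the expert share: the closer the model is to "almost all parameters in 2-bit experts," the more headroom a 96GB or 128GB machine keeps for KV cache and the OS.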
Why Mac for consumer-grade scenarios? Apple Silicon's unified memory architecture (UMA) lets CPU and GPU share one large memory pool with bandwidth that is hard to match at similar price points. ds4's Metal backend and disk KV design both assume large memory and a fast SSD together. Typical single-GPU cloud instances cap VRAM around 80GB, which often cannot hold a full q2-quantized 284B-class weight set. Even when weights fit, bandwidth and MoE routing patterns can make generation speed unacceptable. Community tests on an RTX PRO 6000 with 96GB (roughly 43 tok/s on short generation) show CUDA is viable, but for most developers a 128GB Mac with Metal remains the best-documented primary target.
03 Local DeepSeek V4 deployment: 96GB floor and hardware purchase matrix
No matter how attractive ds4 looks on paper, memory capacity is the first filter. The matrix below combines repo guidance with community deployment experience. Purchase figures are order-of-magnitude 2026 market estimates for budget planning only; confirm actual pricing with your vendor.
| Model / quantization | Minimum unified memory | Typical machine | Purchase cost band (reference) |
|---|---|---|---|
| V4 Flash (q2) | 96 GB | MacBook Pro M3/M4/M5 Max | from ~$4,200 |
| V4 Flash (q4) | 256 GB | Mac Studio Ultra | from ~$8,400 |
| V4 PRO (q2) | 512 GB | Mac Studio M3 Ultra max config | from ~$15,400 |
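The memory floors in the table can be turned into a quick preflight check. The sketch below parses the output of `sysctl hw.memsize` (which reports bytes on macOS) and lists which paths clear the floor. The path labels (`flash-q2` and so on) are illustrative names, not ds4 identifiers, and the check treats the table's "GB" as GiB, which is an assumption about how the machine sizes are marketed.

```python
import re

GIB = 1024 ** 3

# Minimum unified memory per path, copied from the table above.
# The keys are hypothetical labels chosen for this example.
MIN_MEMORY_GB = {
    "flash-q2": 96,
    "flash-q4": 256,
    "pro-q2": 512,
}


def parse_memsize(sysctl_output: str) -> int:
    """Extract the byte count from output such as 'hw.memsize: 137438953472'."""
    match = re.search(r"(\d+)\s*$", sysctl_output.strip())
    if match is None:
        raise ValueError(f"unrecognized sysctl output: {sysctl_output!r}")
    return int(match.group(1))


def runnable_paths(mem_bytes: int) -> list[str]:
    """Return the paths whose minimum memory the machine meets."""
    mem_gb = mem_bytes / GIB
    return [name for name, need in MIN_MEMORY_GB.items() if mem_gb >= need]


if __name__ == "__main__":
    sample = "hw.memsize: 137438953472"  # what a 128GB machine reports
    print(runnable_paths(parse_memsize(sample)))
```

Meeting the floor only means the weights can load. It says nothing about headroom for disk KV, logs, or other processes, so treat a machine that barely qualifies as a trial box and not a daily driver.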
Three recurring pain points follow from those numbers:
- CAPEX too high upfront: Individual developers and teams under five people rarely get budget approval for an Ultra just to trial a frontier local model.
- Uncertain utilization: Inference load is often spiky (release weeks and research sprints are intense; the rest of the quarter sits idle). Owned hardware depreciates fast.
- Environment setup cost: Even after you buy the machine, you still compile ds4, pull hundreds of gigabytes of GGUF weights, and debug Metal plus `ds4-server`. Time cost can rival hardware cost.
When the goal shifts from "own a Mac" to "run a ds4 agent within a defined week," on-demand rental of 128GB or 512GB bare-metal Macs turns a capital decision into operating expense, and you can resize nodes per task. For daily, weekly, and monthly lease combinations, see the in-site project-based cloud Mac cost matrix article. This guide focuses on high-memory inference scenarios.
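The CAPEX-versus-OPEX trade-off reduces to simple arithmetic once you have a quote. The sketch below is a simplified model: the purchase figure comes from the matrix above, while the daily rental rate is a placeholder you must replace with a real quote. It deliberately ignores resale value, electricity, and setup time.

```python
import math


def break_even_days(purchase_cost: float, daily_rental_rate: float) -> int:
    """Days of rental after which cumulative rent reaches the purchase price."""
    if daily_rental_rate <= 0:
        raise ValueError("daily_rental_rate must be positive")
    return math.ceil(purchase_cost / daily_rental_rate)


def cheaper_option(purchase_cost: float, daily_rental_rate: float,
                   expected_days_used: int) -> str:
    """Compare total rent over the expected usage window with buying outright."""
    rent_total = daily_rental_rate * expected_days_used
    return "rent" if rent_total < purchase_cost else "buy"


if __name__ == "__main__":
    PURCHASE_COST = 8400.0        # Flash q4 band from the matrix above
    HYPOTHETICAL_DAILY_RATE = 40.0  # placeholder, not a real price
    print(break_even_days(PURCHASE_COST, HYPOTHETICAL_DAILY_RATE))
    print(cheaper_option(PURCHASE_COST, HYPOTHETICAL_DAILY_RATE, 30))
```

The useful input is `expected_days_used`: spiky workloads with a few intense weeks per quarter sit far below the break-even point, while an always-on agent host crosses it quickly.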
04 Running ds4 on a high-memory cloud Mac: six-step rollout checklist
The flow below assumes you have provisioned a JEXCLOUD bare-metal Mac through the order page (128GB minimum recommended) and logged in via SSH or VNC. If you already own a physical machine with 96GB or more, the same steps apply; just skip the provisioning step.
- Task and quantization selection: Confirm whether you are targeting Flash q2 (128GB is more comfortable) or q4 / PRO. Align internally on "offline agent" versus "CLI trial only" so you do not discover mid-lease that memory is insufficient and have to resize.
- Provision and accept the node: Pick a high-memory SKU in the console (for example M4 Max 128GB or Studio-class 512GB), inject SSH keys, then run `sysctl hw.memsize` and `system_profiler SPDisplaysDataType` to verify memory and Metal availability.
- Fetch ds4 and dependencies: Clone with `git clone https://github.com/antirez/ds4.git`, then run `make` on macOS to produce the Metal build. The README warns that CPU-only paths hit VM-related issues on some macOS versions. Production inference must use Metal or CUDA backends.
- Prepare model weights: Download the matching q2 or q4 GGUF per repo docs (tens to hundreds of gigabytes), verify checksums, and place weights on local SSD with enough free space for disk KV and logs.
- Start services and smoke test: Run `./ds4 -p "Hello" --metal` for a short prompt smoke test. Then start `./ds4-server` and hit it with curl using the OpenAI-compatible completion format. Record whether prefill and generation speeds land in the same band as the README benchmarks.
- Wire IDE and agent toolchain: Point Cursor or similar clients at the instance-local or SSH-tunneled `ds4-server` URL, configure API keys if enabled, and run a real repo-scale refactor or test-generation task. Confirm tool calling and long-context KV reuse behave as expected before extending the lease.
    # Memory and Metal preflight
    sysctl hw.memsize
    ./ds4 -p "Summarize KV cache design in one sentence." --metal

    # Start OpenAI-compatible local service (port per repo default)
    ./ds4-server --metal
    curl -s http://127.0.0.1:PORT/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"deepseek-v4-flash","messages":[{"role":"user","content":"ping"}]}'
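If you want the curl call above as a repeatable check, a small Python client using only the standard library is enough. This is a sketch under assumptions: it sends the same request body as the curl example, expects an OpenAI-style response shape (`choices[0].message.content`), and treats the `usage` field as optional because the article does not confirm that `ds4-server` returns it. The server URL, including the port, must be passed in; take the port from the repo docs.

```python
import json
import sys
import time
import urllib.request


def build_chat_request(base_url: str, prompt: str,
                       model: str = "deepseek-v4-flash") -> urllib.request.Request:
    """Build a POST request matching the curl example's JSON body."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(base_url: str, prompt: str = "ping", timeout: float = 600.0) -> dict:
    """Send one prompt and report the reply plus wall-clock latency."""
    request = build_chat_request(base_url, prompt)
    started = time.monotonic()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    elapsed = time.monotonic() - started
    reply = body["choices"][0]["message"]["content"]
    # Assumption: `usage` may be missing, so fall back to None.
    completion_tokens = body.get("usage", {}).get("completion_tokens")
    return {"reply": reply, "seconds": elapsed, "completion_tokens": completion_tokens}


if __name__ == "__main__":
    if len(sys.argv) == 2:
        print(smoke_test(sys.argv[1]))
    else:
        print("usage: python smoke_test.py http://127.0.0.1:<port>")
```

Wall-clock seconds include prefill, generation, and HTTP overhead, so compare them with the README benchmarks only as a sanity band, not as a like-for-like measurement.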
05 Citable technical data: official benchmarks and model specs (with sources)
When you write an internal evaluation or ask leadership for budget, you can cite the data points below directly. All figures come from the public benchmark table in the antirez/ds4 repository; test conditions follow the README.
- Model spec: DeepSeek V4 Flash is a 284B total-parameter MoE with roughly 13B active parameters. ds4 hard-codes quantization and graph fusion for that checkpoint. Do not assume other GGUF files will work unchanged.
- MacBook Pro M3 Max (128 GB), q2, short prompt: prefill about 58.52 t/s, generation about 26.68 t/s.
- MacBook Pro M3 Max (128 GB), q2, long prompt (~11.7k tokens): prefill about 250.11 t/s, generation about 21.47 t/s.
- Mac Studio M3 Ultra (512 GB), q2, long prompt: prefill about 468.03 t/s, generation about 27.39 t/s. q4 long prompt: prefill about 448.82 t/s, generation about 26.62 t/s.
- DGX Spark GB10 (128 GB), CUDA, q2, long prompt: prefill about 343.81 t/s, generation about 13.75 t/s, showing non-Mac paths work but generation is more bandwidth-bound.
Community numbers on newer hardware such as M5 Max (prefill around 463 t/s) are useful trend signals. For external materials, anchor on the repo table and footnote test date plus quantization version.
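For an internal evaluation it often helps to translate tokens per second into wait time. The sketch below uses the long-prompt q2 figures quoted above and a simple linear model: prefill time plus generation time. The 500-token output length is a hypothetical workload, and applying the ~11.7k-token prompt length to every machine is an assumption. Prefill throughput also varies with prompt length (compare the short-prompt and long-prompt rows), so this is an estimate for prompts of similar size only.

```python
def estimated_seconds(prompt_tokens: int, output_tokens: int,
                      prefill_tps: float, generation_tps: float) -> float:
    """Linear estimate: time to ingest the prompt plus time to emit the output."""
    if prefill_tps <= 0 or generation_tps <= 0:
        raise ValueError("throughput values must be positive")
    return prompt_tokens / prefill_tps + output_tokens / generation_tps


# (prefill t/s, generation t/s) for q2 long-prompt runs, from the figures above.
BENCHMARKS = {
    "MacBook Pro M3 Max 128GB": (250.11, 21.47),
    "Mac Studio M3 Ultra 512GB": (468.03, 27.39),
    "DGX Spark GB10 128GB": (343.81, 13.75),
}

if __name__ == "__main__":
    PROMPT_TOKENS = 11_700   # the ~11.7k-token long prompt
    OUTPUT_TOKENS = 500      # hypothetical response length
    for machine, (prefill, generation) in BENCHMARKS.items():
        seconds = estimated_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, prefill, generation)
        print(f"{machine}: ~{seconds:.0f} s end to end")
```

With these inputs the generation term dominates on the DGX Spark row, which matches the note above that its generation is more bandwidth-bound.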
06 Rent vs buy: when JEXCLOUD high-memory bare metal is the right ds4 landing zone
antirez proved with ds4 that consumer-grade high-memory Macs can already host DeepSeek V4-class local inference in production terms. The real blocker is usually hardware CAPEX and environment setup time, not whether the C code compiles.
Buying a maxed Mac Studio still fits an "always-on, single-machine dedicated" core R&D role. For most teams, the three common substitutes hit hard limits quickly. First, forcing ds4 onto a generic 16GB cloud host fails before inference starts because q2 weights will not load. Second, a home Mac mini on shared broadband gets throttled: large model downloads and long-running inference sessions compete for limited upload bandwidth and suffer from noisy-neighbor traffic. Third, relying only on public cloud APIs turns long agent runs into token bills and data-residency compliance limits that you do not see until month-end.
The steadier production path is this: provision 128GB or 512GB instances on JEXCLOUD multi-region bare-metal Mac on demand, with compile toolchain and storage headroom ready, run ds4 inference, then release or downsize when the sprint ends. You get dedicated Apple Silicon with no oversubscribed virtualization, and inference data stays on your instance instead of passing through a third-party API. One shared high-memory node for evaluation and agent pilots beats everyone buying an Ultra. See node specs, regions, and pricing on the JEXCLOUD pricing page. For deployment and SSH questions, use the help center.