Cloud Mac ds4 2026.05.26

2026 antirez ds4 local DeepSeek V4: 96GB threshold, Metal performance, and high-memory cloud Mac rental decision guide

JEXCLOUD Engineering team

· May 26, 2026 · About 12 minutes to read

Redis creator Salvatore Sanfilippo (antirez) recently open-sourced ds4 (DwarfStar 4), a pure-C inference engine built exclusively for DeepSeek V4 Flash. It is the first stack to make a 284B-parameter MoE model practically runnable on consumer-grade Apple Silicon Macs. The project crossed 10,000 GitHub stars within days, and community attention rivals the early Redis launch.

By the end of this article you should be able to answer three questions. First, how ds4 fundamentally differs from general stacks such as llama.cpp and Ollama. Second, how much unified memory the Flash and PRO paths require, and what range the official benchmarks fall in. Third, when buying a high-memory Mac Studio costs roughly $8,400 to $15,400 or more, how on-demand high-memory bare-metal Mac rental through JEXCLOUD multi-region nodes becomes the more realistic entry point, plus a six-step rollout checklist.

01 What ds4 is: single-model focus and why it exploded in 2026

Most local inference tools take the generalist route. llama.cpp loads hundreds of architectures, Ollama wraps it with a friendly CLI, and MLX runs models converted to its own format on Apple Silicon. ds4 goes the opposite way: one mainline model, DeepSeek V4 Flash. The README states it is "intentionally narrow." It is not a universal GGUF loader and does not wrap other runtimes. Instead it ships a self-contained Metal and CUDA graph executor together with DS4-specific loading, prompt rendering, tool calling, KV state (memory and disk), a ds4-server API, and a built-in coding agent.

In public interviews antirez said he spent about one intensive week validating whether a local model could replace daily Claude and GPT calls. That narrative is the engine behind ds4's momentum: the bottleneck is not the inference abstraction layer, but whether you have a frontier-class open-weight model that fits in a large-memory machine. DeepSeek V4 Flash, a 284B total / roughly 13B active MoE, combined with ds4's asymmetric 2/8-bit quantization and disk-backed KV, turns "offline coding agent on a Mac" from a demo into something you can use every day.

Clear target hardware: Metal is the primary macOS backend from day one, aimed at MacBook Pro and Mac Studio machines with 96GB or more unified memory. Linux CUDA support is advancing in parallel, including DGX Spark class workstations.
Fast community validation: Third-party reviews on 128GB MacBooks ran 18 real tasks covering long-context coding, tool calling, and agent loops. The takeaway is that a specialized engine plus dedicated GGUF weights is the first combination to pull a massive MoE down to acceptable latency.
Complements cloud APIs: ds4 fits fixed-model, privacy-sensitive, offline workflows. When you need full-precision quality or a shared team endpoint, cloud APIs still win. The choice should not be treated as all-or-nothing.

One-line summary: ds4 gives up generality to do one thing well, which is making DeepSeek V4 Flash usable on a Mac. The attention comes from technical feasibility combined with antirez's personal credibility.

02 ds4 technical highlights and a general local inference decision matrix

Before you invest in ds4, separate two goals: "I want to swap models for fun" versus "I want DeepSeek V4 Flash as daily productivity." The matrix below compares three common paths so you and your team can align expectations.

ds4 vs general local inference vs cloud API (2026 selection guide)
Dimension	ds4 (DwarfStar 4)	llama.cpp / Ollama / MLX	Cloud API (Claude / GPT etc.)
Model scope	DeepSeek V4 Flash only (plus evolving PRO path in the repo)	Many architectures and quantizations, weekly model drops	Vendor full catalog, closed or hosted open models
Hardware focus	96GB+ unified-memory Mac; CUDA workstations with large VRAM	Depends on model; small models run on 16GB machines	No local hardware; pay per token
Differentiators	Disk KV persistence, million-token context design, native tool calling, `ds4-server` OpenAI / Anthropic compatible	Rich plugin ecosystem, many community quant schemes	Full quality, multimodal, enterprise SLA
Privacy and offline	Weights and inference stay on your machine or dedicated instance	Same, but large models still need enough RAM	Data passes through third parties; network required
Typical pain points	High entry cost (memory, download, compile); single model	Very large MoE models often fail or crawl	Long-term token cost, compliance, rate limits

A few technical points are worth remembering on their own because they drive the "why Mac" conversation:

Metal graph executor: Operator fusion tuned for DeepSeek V4 Flash, not generic graph traversal. Official benchmarks on an M3 Ultra with 512GB reach hundreds of tokens per second on long-prompt prefill (see section 05; data from the antirez/ds4 README).
Asymmetric quantization: More aggressive 2-bit on routed experts, higher precision elsewhere, so Flash runs on 128GB-class machines. The README also documents a q4 path on a 512GB Mac Studio.
Disk KV cache: Session KV can persist to disk. Combined with fast macOS SSDs, context survives restarts and far less of the prompt has to be prefilled again. That matters most for repo-scale agent tasks.
Built-in coding agent: CLI and ds4-server are tested against Cursor, opencode, and similar toolchains, reducing glue code to wire a local model into your IDE.

Why Mac for consumer-grade scenarios? Apple Silicon's unified memory architecture (UMA) lets CPU and GPU share one large memory pool with bandwidth that is hard to match at similar price points. ds4's Metal backend and disk KV design both assume large memory and a fast SSD at the same time. Typical cloud GPU instances cap VRAM at around 80GB per GPU, which often cannot hold a full q2-quantized 284B-class weight set. Even when weights fit, bandwidth and MoE routing patterns can make generation speed unacceptable. Community tests on an RTX PRO 6000 with 96GB (roughly 43 tok/s on short generation) show CUDA is viable, but for most developers a 128GB Mac running Metal remains the best-documented primary target.
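
The 80GB argument above can be checked with back-of-the-envelope arithmetic. The sketch below computes only a lower bound, assuming every one of the 284B parameters is stored at the same bit width. The real q2 file is larger because, as noted earlier, ds4 keeps higher precision outside the routed experts, and KV cache and runtime buffers come on top. The function name `weight_gb` is made up for this illustration.

```shell
#!/bin/sh
# Lower-bound weight footprint in decimal GB:
#   parameters (in billions) * bits per weight / 8 bits per byte.
# Ignores higher-precision tensors, KV cache, and runtime buffers.
weight_gb() {
    params_billion=$1
    bits=$2
    awk -v p="$params_billion" -v b="$bits" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weight_gb 284 2   # every weight at 2 bits: prints 71.0
weight_gb 284 4   # every weight at 4 bits: prints 142.0
```

Even the 2-bit lower bound of 71 GB leaves under 10 GB of an 80GB card for everything else, which is consistent with the 96 GB floor used in the rest of this guide.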

03 Local DeepSeek V4 deployment: 96GB floor and hardware purchase matrix

No matter how attractive ds4 looks on paper, memory capacity is the first filter. The matrix below combines repo guidance with community deployment experience. Purchase figures are order-of-magnitude 2026 market estimates for budget planning only; confirm actual pricing with your vendor.

DeepSeek V4 + ds4 typical hardware floor and purchase cost bands
Model / quantization	Minimum unified memory	Typical machine	Purchase cost band (reference)
V4 Flash (q2)	96 GB	MacBook Pro M3/M4/M5 Max	from ~$4,200
V4 Flash (q4)	256 GB	Mac Studio Ultra	from ~$8,400
V4 PRO (q2)	512 GB	Mac Studio M3 Ultra max config	from ~$15,400

Three recurring pain points follow from those numbers:

CAPEX too high upfront: Individual developers and teams under five people rarely get budget approval for an Ultra just to trial a frontier local model.
Uncertain utilization: Inference load is often spiky (release weeks and research sprints are intense; the rest of the quarter sits idle). Owned hardware depreciates fast.
Environment setup cost: Even after you buy the machine, you still compile ds4, pull hundreds of gigabytes of GGUF weights, and debug Metal plus ds4-server. Time cost can rival hardware cost.

When the goal shifts from "own a Mac" to "run a ds4 agent within a defined week," on-demand rental of 128GB or 512GB bare-metal Macs turns a capital decision into an operating expense, and you can resize nodes per task. For daily, weekly, and monthly lease combinations, see our project-based cloud Mac cost matrix article. This guide focuses on high-memory inference scenarios.
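
To put a number on the CAPEX versus OPEX trade-off, a simple break-even estimate is enough for a first pass. The sketch below divides a purchase price by a daily rental rate and rounds up. The purchase figure is the Mac Studio Ultra band from the table above; the daily rate of 40 is a made-up placeholder, not a JEXCLOUD price, and the helper name `break_even_days` is hypothetical. Resale value, power, and setup time are ignored.

```shell
#!/bin/sh
# Days of rental after which cumulative rent reaches the purchase price.
# Integer arithmetic, rounded up to the next whole day.
break_even_days() {
    purchase_usd=$1
    daily_rate_usd=$2
    echo $(( (purchase_usd + daily_rate_usd - 1) / daily_rate_usd ))
}

# 8400 comes from the hardware table; 40 is a placeholder daily rate.
break_even_days 8400 40   # prints 210
```

If your expected usage is well below the break-even figure you get with real pricing, renting is the cheaper way to run the trial.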

04 Running ds4 on a high-memory cloud Mac: six-step rollout checklist

The flow below assumes you have provisioned a JEXCLOUD bare-metal Mac through the order page (128GB minimum recommended) and logged in via SSH or VNC. If you already own a 96GB+ physical machine, the same steps apply; skip the rental step only.

Task and quantization selection: Confirm whether you target Flash q2 (128GB is more comfortable) or q4 / PRO. Align internally on "offline agent" versus "CLI trial only" so you do not discover mid-lease that memory is insufficient and have to resize.
Provision and accept the node: Pick a high-memory SKU in the console (for example M4 Max 128GB or Studio-class 512GB), inject SSH keys, then run `sysctl hw.memsize` and `system_profiler SPDisplaysDataType` to verify memory and Metal availability.
Fetch ds4 and dependencies: Clone with `git clone https://github.com/antirez/ds4.git`, then run `make` on macOS to produce the Metal build. The README warns that CPU-only paths hit VM-related issues on some macOS versions. Production inference must use the Metal or CUDA backend.
Prepare model weights: Download the matching q2 or q4 GGUF per repo docs (tens to hundreds of gigabytes), verify checksums, and place weights on local SSD with enough free space for disk KV and logs.
Start services and smoke test: Run `./ds4 -p "Hello" --metal` for a short prompt smoke test. Then start `./ds4-server` and send it a request with `curl` in the OpenAI-compatible chat completions format. Record whether prefill and generation speeds land in the same band as the README benchmarks.
Wire IDE and agent toolchain: Point Cursor or similar clients at the instance-local or SSH-tunneled ds4-server URL, configure API keys if enabled, and run a real repo-scale refactor or test-generation task. Confirm tool calling and long-context KV reuse behave as expected before extending the lease.
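
The checksum verification in the weight preparation step can be scripted so that a corrupted multi-gigabyte download fails fast. This is a minimal sketch, not something taken from the ds4 repo: the helper names `sha256_of` and `verify_sha256` are made up, and the file name and digest in the usage line are placeholders. It relies only on `shasum` (shipped with macOS) or `sha256sum` (common on Linux).

```shell
#!/bin/sh
# Print the SHA-256 hex digest of a file using whichever tool is installed.
sha256_of() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | awk '{ print $1 }'
    else
        shasum -a 256 "$1" | awk '{ print $1 }'
    fi
}

# Return 0 only when the file's digest equals the expected hex string.
verify_sha256() {
    [ "$(sha256_of "$1")" = "$2" ]
}

# Usage (placeholders, not real ds4 artifacts):
# verify_sha256 weights.gguf "<expected-sha256>" || echo "checksum mismatch" >&2
```

Take the expected digest from wherever the repo docs publish it; this sketch does not assume a particular location.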

ds4-smoke.sh

# Memory and Metal preflight
sysctl hw.memsize
./ds4 -p "Summarize KV cache design in one sentence." --metal

# Start the OpenAI-compatible local service (port per repo default).
# If the server stays in the foreground, send the request from a second shell.
./ds4-server --metal

# Replace PORT with the port the server listens on.
curl -s http://127.0.0.1:PORT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"deepseek-v4-flash","messages":[{"role":"user","content":"ping"}]}'
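
The raw byte count printed by `sysctl hw.memsize` is easy to misread, so it can help to map it straight onto the hardware table from section 03. The sketch below does that with the 96 / 256 / 512 GB thresholds from this article. The function name `ds4_tier_for_bytes` is hypothetical, and the mapping reports only the largest row a machine satisfies.

```shell
#!/bin/sh
# Map installed memory in bytes (as printed by `sysctl -n hw.memsize`)
# to the largest row of this article's hardware table that it satisfies.
ds4_tier_for_bytes() {
    gib=$(( $1 / 1073741824 ))
    if   [ "$gib" -ge 512 ]; then echo "V4 PRO (q2)"
    elif [ "$gib" -ge 256 ]; then echo "V4 Flash (q4)"
    elif [ "$gib" -ge 96 ];  then echo "V4 Flash (q2)"
    else echo "below 96 GB floor"
    fi
}

# On macOS, classify the real machine; on other systems this is skipped.
if [ "$(uname -s)" = "Darwin" ]; then
    ds4_tier_for_bytes "$(sysctl -n hw.memsize)"
fi
```

A 128GB node reports 137438953472 bytes and lands in the "V4 Flash (q2)" row.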

05 Citable technical data: official benchmarks and model specs (with sources)

When you write an internal evaluation or ask leadership for budget, you can cite the data points below directly. All figures come from the public benchmark table in the antirez/ds4 repository; test conditions follow the README.

Model spec: DeepSeek V4 Flash is a 284B total-parameter MoE with roughly 13B active parameters. ds4 hard-codes quantization and graph fusion for that checkpoint. Do not assume other GGUF files will work unchanged.
MacBook Pro M3 Max (128 GB), q2, short prompt: prefill about 58.52 t/s, generation about 26.68 t/s.
MacBook Pro M3 Max (128 GB), q2, long prompt (~11.7k tokens): prefill about 250.11 t/s, generation about 21.47 t/s.
Mac Studio M3 Ultra (512 GB), q2, long prompt: prefill about 468.03 t/s, generation about 27.39 t/s. q4 long prompt: prefill about 448.82 t/s, generation about 26.62 t/s.
DGX Spark GB10 (128 GB), CUDA, q2, long prompt: prefill about 343.81 t/s, generation about 13.75 t/s, showing non-Mac paths work but generation is more bandwidth-bound.

Community numbers on newer hardware such as M5 Max (prefill around 463 t/s) are useful trend signals. For external materials, anchor on the repo table and footnote test date plus quantization version.
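
When you record your own smoke-test speeds, a tolerance check against the figures above is less error-prone than comparing by eye. The sketch below tests whether a measured tokens-per-second value falls within a percentage of a reference. The reference 26.68 is the M3 Max q2 short-prompt generation figure cited above; the measurement 24.9 and the 15 percent tolerance are arbitrary illustrations, and `within_band` is a made-up helper name.

```shell
#!/bin/sh
# Succeed (exit 0) when measured t/s is within tol_pct percent of the reference.
# Arguments: measured reference tol_pct
within_band() {
    awk -v m="$1" -v r="$2" -v t="$3" 'BEGIN {
        d = m - r
        if (d < 0) d = -d
        exit (d <= r * t / 100) ? 0 : 1
    }'
}

# 26.68 is the cited reference; 24.9 and 15 are illustrative values only.
if within_band 24.9 26.68 15; then
    echo "in band"
else
    echo "out of band"
fi
```

Pick the reference row that matches your own hardware, quantization, and prompt length, since the cited figures differ noticeably between short and long prompts.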

06 Rent vs buy: when JEXCLOUD high-memory bare metal is the right ds4 landing zone

antirez proved with ds4 that consumer-grade high-memory Macs can already run DeepSeek V4-class local inference well enough for daily production use. The real blocker is usually hardware CAPEX and environment setup time, not whether the C code compiles.

Buying a maxed Mac Studio still fits an "always-on, single-machine dedicated" core R&D role. For most teams, three substitutes expose hard limits quickly. First, forcing ds4 onto a generic 16GB cloud host fails before inference starts because q2 weights will not load. Second, a home Mac mini on shared broadband struggles, because large model downloads and long-running inference sessions compete for bandwidth with everything else on the line. Third, relying only on public cloud APIs turns long agent runs into token bills and data-residency compliance ceilings you do not see until month-end.

The steadier production path is this: provision 128GB or 512GB instances on JEXCLOUD multi-region bare-metal Mac on demand, with compile toolchain and storage headroom ready, run ds4 inference, then release or downsize when the sprint ends. You get dedicated Apple Silicon without virtualization overselling, and inference data stays on your instance instead of a third-party API. One shared high-memory node for evaluation and agent pilots beats everyone buying an Ultra. See node specs, regions, and pricing on the JEXCLOUD pricing page. For deployment and SSH questions, use the help center.

Tags: ds4 DeepSeek V4 Metal Cloud Mac High-memory rental