Salvatore Sanfilippo, better known as antirez and the creator of Redis, has once again struck a nerve in the technical community. His new project, ds4.c, is not another generic runtime for language models or just another layer on top of existing tools. It is a native inference engine with a very specific goal: to run DeepSeek V4 Flash on Apple Silicon through Metal, with an aggressive bet on quantization, long context, and the SSD as an active part of the cache.

The reaction has been immediate because the message connects with a growing obsession among developers, researchers and companies: running increasingly capable models locally, without always depending on closed APIs, remote clusters or variable inference costs. The promise is not small. DeepSeek V4 Flash has a 1-million-token context window and, according to Reuters, is part of DeepSeek’s new V4 family, presented as a series also adapted to Huawei Ascend chips in the context of China’s race to reduce technological dependence on foreign suppliers.

A narrow engine, not a universal runtime

The first key to understanding ds4.c is that it does not try to compete with llama.cpp in breadth. Antirez defines it as a small engine built specifically for DeepSeek V4 Flash. It is not a generic GGUF loader, it does not aim to run any model, and it is not presented as a framework. Its main path is a custom Metal graph executor, with model loading, prompt rendering, KV state and server API glue designed around DS4.

That decision matters. Much of the local inference ecosystem has advanced by seeking compatibility with many models. Antirez takes the opposite route: choose one model and polish the end-to-end experience until it is useful for coding agents. In his own documentation, he describes it as a combination of an inference engine with an HTTP API, a GGUF file prepared for that engine, and testing with real agent implementations. The idea is not just that it “runs”, but that it can be used in serious workflows.

The project explicitly acknowledges its debt to llama.cpp and GGML. It does not link against GGML, but it uses its ecosystem, quantization formats, kernels, tests and accumulated engineering knowledge as a reference. It also keeps or adapts some pieces under the MIT license, such as GGUF quantization layouts, CPU quant/dot logic and certain Metal kernels.

The most striking part is quantization. ds4.c only works with the GGUF files published for this project, not with arbitrary files. For machines with 128 GB of RAM, there is a q2 path; for systems with 256 GB or more, there is a q4 path. The 2-bit quantization is not applied uniformly: the routed MoE experts are compressed, while other critical components are kept at higher precision to try to preserve quality.
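
As an illustration of that mixed-precision idea, the sketch below picks a bit width per tensor based on its name. The name patterns, the exact widths and the function itself are assumptions made for the sake of the example, not ds4.c's actual selection logic:

```c
#include <stdio.h>
#include <string.h>

/* Illustrative sketch: choose a quantization width per tensor by
 * name. Routed MoE expert weights get the aggressive 2-bit format,
 * while other components keep higher precision. The name patterns
 * and widths are assumptions, not ds4.c's real selection rules. */
static int bits_for_tensor(const char *name) {
    if (strstr(name, "ffn_exp")) return 2;  /* routed MoE experts */
    if (strstr(name, "attn"))    return 8;  /* attention weights  */
    if (strstr(name, "embed"))   return 8;  /* token embeddings   */
    return 4;                               /* shared/dense parts */
}

int main(void) {
    const char *tensors[] = {
        "blk.12.ffn_exp.weight", "blk.12.attn_q.weight",
        "token_embed.weight",    "blk.12.ffn_shared.weight",
    };
    for (int i = 0; i < 4; i++)
        printf("%-26s -> %d-bit\n", tensors[i], bits_for_tensor(tensors[i]));
    return 0;
}
```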

The SSD enters the KV cache conversation

The other major idea is treating the KV cache as something that does not always need to live in RAM. In long-context models, the memory consumed by the key-value cache becomes one of the practical limits. ds4.c argues that with compressed caches like those in DeepSeek V4 and fast SSDs in modern Macs, it makes sense to persist part of that state on disk.
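
A minimal sketch of that persistence idea follows: a computed KV prefix is written to a file with a small header, and read back when the session resumes. The header layout and file handling here are invented for illustration; they are not ds4.c's on-disk format:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

/* Sketch of spilling a computed KV prefix to disk so a session can
 * be resumed after a restart. Header layout and file naming are
 * hypothetical, not the project's real format. */
typedef struct {
    uint32_t n_tokens;   /* tokens covered by this cached prefix */
    uint64_t n_bytes;    /* size of the compressed KV state      */
} kv_header;

static int kv_save(const char *path, const kv_header *h, const void *kv) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    int ok = fwrite(h, sizeof *h, 1, f) == 1 &&
             fwrite(kv, 1, h->n_bytes, f) == h->n_bytes;
    fclose(f);
    return ok ? 0 : -1;
}

static void *kv_load(const char *path, kv_header *h) {
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    void *kv = NULL;
    if (fread(h, sizeof *h, 1, f) == 1 && (kv = malloc(h->n_bytes)))
        if (fread(kv, 1, h->n_bytes, f) != h->n_bytes) { free(kv); kv = NULL; }
    fclose(f);
    return kv;   /* caller frees; NULL on any failure */
}

int main(void) {
    char dummy[16] = "kv-state-bytes";   /* stand-in for real KV data */
    kv_header h = { .n_tokens = 4, .n_bytes = sizeof dummy };
    if (kv_save("session.kv", &h, dummy) == 0) {
        kv_header h2;
        void *back = kv_load("session.kv", &h2);
        if (back) { printf("restored %u tokens\n", h2.n_tokens); free(back); }
    }
    return 0;
}
```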

This is not a minor optimization. Coding agents often resend long histories, system prompts, instructions, tools and project context. If every request forces the system to reprocess tens of thousands of tokens from scratch, local inference becomes slow and awkward. The ds4.c server compares the input tokens with cached prefixes and can reuse already computed state, both in memory and from disk, to continue sessions or recover them after restarts.
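
The core of that reuse logic is a longest-common-prefix check over token IDs, along these lines (a hypothetical sketch of the technique, not the project's code):

```c
#include <stdint.h>
#include <stdio.h>

/* Count how many leading tokens of a new request match a cached
 * sequence. Everything up to that point keeps its already computed
 * KV state; only the tail needs to be prefilled again. */
static size_t common_prefix(const int32_t *cached, size_t n_cached,
                            const int32_t *incoming, size_t n_in) {
    size_t n = n_cached < n_in ? n_cached : n_in;
    size_t i = 0;
    while (i < n && cached[i] == incoming[i]) i++;
    return i;   /* tokens whose KV state can be reused */
}

int main(void) {
    int32_t cached[]   = { 7, 42, 42, 9, 3 };
    int32_t incoming[] = { 7, 42, 42, 11 };
    printf("%zu tokens reusable\n", common_prefix(cached, 5, incoming, 4));
    return 0;
}
```

An agent that resends a 100,000-token history with one new user turn appended matches almost the entire prefix, so only the new tokens pay the prefill cost.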

The documentation does warn that there are practical limits. Although the model supports 1 million tokens, on a machine with 128 GB of RAM and 2-bit quantization it does not always make sense to configure the maximum context size. Antirez recommends windows of 100,000 to 300,000 tokens on that type of system, because the model already consumes a huge amount of memory and a 1-million-token configuration can add tens of gigabytes more.
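
The arithmetic behind that warning is easy to approximate. Taking DeepSeek V3's published MLA cache dimensions as a stand-in (V4 Flash's real figures may differ), the per-token cost of the compressed KV cache works out roughly like this:

```c
#include <stdio.h>

/* Back-of-envelope KV memory estimate for an MLA-style cache.
 * The dimensions below are DeepSeek V3's published values used as
 * a stand-in; DeepSeek V4 Flash's actual figures may differ. */
int main(void) {
    const double layers    = 61;        /* transformer layers       */
    const double kv_latent = 512 + 64;  /* compressed latent + RoPE */
    const double bytes_el  = 2;         /* 16-bit cache entries     */
    const double per_token = layers * kv_latent * bytes_el;

    double ctx[] = { 100e3, 300e3, 1e6 };
    for (int i = 0; i < 3; i++)
        printf("%8.0f tokens -> %6.1f GB of KV cache\n",
               ctx[i], ctx[i] * per_token / 1e9);
    return 0;
}
```

Under those assumed numbers, the jump from a 300,000-token window to the full million adds roughly 50 GB, which is exactly the kind of overhead the recommendation is trying to avoid on a 128 GB machine.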

In terms of performance, the published numbers are interesting, but not miraculous. On a MacBook Pro M3 Max with 128 GB, the q2 quantization reaches 58.52 tokens per second of prefill and 26.68 tokens per second of generation with a short prompt; with an 11,709-token prompt, prefill rises to 250.11 tokens per second while generation falls to 21.47. On a Mac Studio M3 Ultra with 512 GB the numbers improve: 36.86 tokens per second of q2 generation with a short prompt, and 35.50 for q4.

These are usable figures for local work, especially in agent mode, but they do not turn a laptop into a GPU cluster. The advance is that a large, specialized model can run reasonably well on high-end personal hardware. That is already a lot, but it should not be confused with data-center-speed frontier inference.

What changes for coding agents

Integration with agents is one of the most interesting parts of the project. ds4-server exposes OpenAI- and Anthropic-compatible endpoints, including /v1/chat/completions, /v1/completions and /v1/messages. This makes it possible to connect it to programming clients that already speak those protocols, such as Claude Code-style workflows, OpenCode or Pi. It also supports SSE streaming, tools and function calls, with conversion into DeepSeek’s DSML format.
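
In practice, that means OpenAI-style client code works unmodified. A minimal libcurl request against a locally running ds4-server could look like the following; the port, model name and accepted fields are assumptions to be checked against the project's README:

```c
#include <curl/curl.h>
#include <stdio.h>

/* Minimal OpenAI-style chat request against a local ds4-server.
 * The port, model name and exact accepted fields are assumptions;
 * check the project's documentation for the real defaults. */
int main(void) {
    const char *body =
        "{\"model\": \"deepseek-v4-flash\","
        " \"messages\": [{\"role\": \"user\","
        " \"content\": \"Explain this repo's build system.\"}]}";

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    struct curl_slist *hdrs =
        curl_slist_append(NULL, "Content-Type: application/json");
    curl_easy_setopt(curl, CURLOPT_URL,
                     "http://127.0.0.1:8080/v1/chat/completions");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);

    /* Response JSON goes to stdout via libcurl's default handler. */
    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK)
        fprintf(stderr, "request failed: %s\n", curl_easy_strerror(res));

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}
```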

This is where ds4.c becomes relevant beyond technical curiosity. Local AI is not just about answering questions from a terminal. It is about whether a model can read a repository, maintain context, use tools, edit code, run tests, ask for more information and avoid getting lost in long sessions. Antirez claims that the 2-bit quantizations “work well” under coding agents and that the model calls tools reliably, although that claim comes from the project’s author and needs independent validation through benchmarks and real-world cases.

There are also important caveats. The server is Metal-only. There is no CUDA support yet. Inference is serialized through a single Metal worker, meaning there is no batching of multiple independent requests; in practice, concurrent requests wait their turn. In addition, the CPU path is not a production target, and the README itself warns of an issue in current macOS versions that can crash the kernel when running it.
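
The practical effect of that single-worker design is easy to picture: every request funnels through one execution slot, so a second concurrent request's latency includes the first one's full generation time. A generic sketch of the pattern, not the project's actual code:

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Generic single-worker serialization: all requests funnel through
 * one slot, the way ds4-server serializes work onto a single Metal
 * worker. Concurrent callers simply wait their turn. */
static pthread_mutex_t worker = PTHREAD_MUTEX_INITIALIZER;

static void *client(void *arg) {
    int id = *(int *)arg;
    pthread_mutex_lock(&worker);     /* one request at a time       */
    printf("request %d running on the single worker\n", id);
    usleep(100 * 1000);              /* stand-in for inference work */
    pthread_mutex_unlock(&worker);
    return NULL;
}

int main(void) {
    pthread_t t[3];
    int ids[3] = { 1, 2, 3 };
    for (int i = 0; i < 3; i++)
        pthread_create(&t[i], NULL, client, &ids[i]);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```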

The 2-bit quantization is another reason for caution. The approach is clever because it does not compress every part of the model equally. But it is still a very aggressive quantization. It will be necessary to measure how much quality is lost compared with the full model, especially in long tasks, reasoning, tool calling, coding and information retrieval. The community has already learned that “it works” and “it preserves the original model’s behavior” do not always mean the same thing.

What matters, even with those reservations, is the direction of travel. Local inference is moving from hobbyist experimentation towards a practical alternative in specific scenarios: privacy, agent testing, offline development, cost control, research and heavy context usage. It does not replace the cloud for training, multi-user deployments, high availability or enterprise workloads, but it does begin to weaken the idea that every capable model must necessarily live behind a remote API.

ds4.c does not break local AI on its own. It is alpha-quality code, narrow, dependent on one specific model and mainly designed for Apple Silicon. But it does show something important: when an open model with long context is combined with specific engineering, careful quantization and high-memory personal hardware, the result can get much closer to a “frontier AI on your own machine” experience than seemed reasonable not long ago.

There is also a cultural reading. While the big labs compete with massive investments in data centers, chips and cloud deals, part of the progress still comes from hackers able to look at a problem from a different angle. Antirez has not created a universal substitute for the cloud. He has shown that, for some models and some use cases, the local edge still has a lot to say.

Frequently asked questions

What is ds4.c?
It is a local inference engine created by antirez to run DeepSeek V4 Flash on Apple Silicon through Metal. It is not a generic runtime or a universal GGUF loader.

Can it run DeepSeek V4 Flash on a MacBook?
According to the project documentation, the q2 quantization is intended for machines with 128 GB of RAM, such as some high-end MacBook Pro models. The q4 path requires 256 GB or more.

Does ds4.c use CUDA or work on NVIDIA GPUs?
Not for now. The project is Metal-only and focused on Apple Silicon. CUDA support is not available in this version.

Does 2-bit quantization preserve the quality of the original model?
That cannot be assumed. The quantization is designed to preserve critical components, but it is still very aggressive compression and may involve quality loss compared with the full model.
