For years, the rule seemed clear: if you wanted to run large language models locally, you needed a lot of VRAM. There was not much room for debate. A consumer GPU could be useful for small or medium-sized models, especially with quantization, but once you started talking about models with tens of billions of parameters, the natural next step was a professional GPU, multiple cards, or cloud infrastructure.
However, new projects are emerging that approach the problem from a different angle. Instead of trying to fit the entire model into GPU memory, they split it, load it layer by layer, and use NVMe storage as an active part of the inference process. It is not magic, it does not remove physical limitations, and it does not turn a consumer card into a data center GPU, but it does change the kind of experiments that can be run from a regular workstation.
One of the most interesting projects in this area is oLLM, a lightweight Python library built on top of Hugging Face Transformers and PyTorch. Its goal is clear: to enable local inference with large models and long contexts on consumer GPUs, without necessarily relying on quantization, distillation, or pruning. The repository mentions examples such as qwen3-next-80B, gpt-oss-20B, Gemma 3 12B, and Llama 3.1 8B Instruct, running with contexts of up to 100k tokens on NVIDIA GPUs with 8 GB of VRAM.
The idea may sound counterintuitive, but for system administrators and developers it makes a lot of sense: if you cannot easily increase VRAM, you can redesign the data flow so the GPU only loads what it needs at each step.
The Real Problem: Weights, KV Cache, and Context
When people talk about running an LLM locally, the issue is often simplified to “the model takes up X GB.” But that is only part of the problem. Several memory blocks are involved during inference:
- Model weights.
- The KV cache generated during attention.
- Intermediate tensors.
- Input context.
- Memory reserved by the framework and the CUDA/ROCm/MPS runtime.
That is why a model that apparently “weighs” 16 GB may require much more memory at runtime, especially with long contexts. The KV cache grows rapidly as the context window increases, and this is where many local configurations start to fail.
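To get a feel for the scale, a quick back-of-the-envelope estimate helps. For a model with grouped-query attention such as Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128), the fp16 KV cache alone for a 100k-token context is on the order of 12 GiB, before counting weights or activations:
# Rough KV cache estimate for Llama 3.1 8B (32 layers, 8 KV heads, head dim 128).
# fp16/bf16 stores 2 bytes per value; the leading factor 2 accounts for K and V.
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
context_tokens = 100_000
kv_cache_bytes = 2 * layers * kv_heads * head_dim * context_tokens * dtype_bytes
print(f"{kv_cache_bytes / 1024**3:.1f} GiB")  # ~12.2 GiB just for the KV cache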
oLLM addresses this problem by moving part of that pressure outside VRAM. Its main mechanisms are:
- Loading weights from SSD to GPU layer by layer.
- Offloading KV cache to SSD.
- Optional layer offloading to CPU.
- Use of FlashAttention-2 with online softmax to avoid materializing the full attention matrix.
- Chunked MLP to reduce peak memory usage in intermediate projections.
Put simply: the GPU no longer has to swallow the entire model at once. It runs one part, frees memory, loads the next one, and continues.
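The flow can be pictured with a minimal PyTorch sketch. This is a conceptual illustration, not oLLM's actual code: it assumes each transformer layer has been saved whole with torch.save to its own file on the NVMe drive and takes the hidden states as its only input.
import torch

def run_layer_by_layer(layer_files, hidden_states, device="cuda"):
    # Execute a stack of layers while keeping only one of them in VRAM at a time.
    hidden_states = hidden_states.to(device)
    for path in layer_files:                      # one checkpoint file per layer
        layer = torch.load(path, map_location="cpu",
                           weights_only=False)    # read the whole layer module from NVMe
        layer = layer.to(device).eval()           # copy just this layer to the GPU
        with torch.no_grad():
            hidden_states = layer(hidden_states)  # run it on the current activations
        del layer                                 # drop the reference...
        torch.cuda.empty_cache()                  # ...and release the VRAM it used
    return hidden_states
Whatever the exact implementation details, the memory pattern is the point: peak VRAM is set by one layer plus the activations, not by the whole model.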
This Is Not Quantization: It Is Another Way to Move the Model
The difference from quantization is important. Quantization reduces model size by lowering the precision of the weights, for example to 8-bit, 4-bit, or less. It is a very useful technique and, in practice, the one that has popularized local AI on desktop machines. Tools such as llama.cpp, Ollama, LM Studio, and text-generation-webui have brought this approach to a wide audience.
But quantization always means accepting a trade-off. In many cases the quality loss is small, but it exists, and how noticeable it is depends on the model itself, the quantization method, the task, and the type of prompt.
oLLM proposes something different: keeping fp16/bf16 precision and using NVMe storage to offload weights and cache. This is not about maximizing speed, but about allowing a much larger model to run on a machine with limited VRAM.
For a developer, this opens an interesting possibility: testing the behavior of large models without aggressively degrading them. For a system administrator, it enables advanced local inference labs without immediately provisioning a professional GPU with 48, 80, or 96 GB of memory.
What Can Be Achieved in Practice
According to the data published by the project, oLLM can run qwen3-next-80B with around 160 GB of bf16 weights and a 50k-token context on an 8 GB GPU, using around 7.5 GB of VRAM and approximately 180 GB of SSD.
The cases cited by the project include:
| Model | Weights | Context | Estimated VRAM without offload | VRAM with oLLM | Approx. SSD usage |
|---|---|---|---|---|---|
| qwen3-next-80B | 160 GB bf16 | 50k | ~190 GB | ~7.5 GB | 180 GB |
| gpt-oss-20B | 13 GB packed bf16 | 10k | ~40 GB | ~7.3 GB | 15 GB |
| gemma3-12B | 25 GB bf16 | 50k | ~45 GB | ~6.7 GB | 43 GB |
| llama3-1B-chat | 2 GB bf16 | 100k | ~16 GB | ~5 GB | 15 GB |
| llama3-3B-chat | 7 GB bf16 | 100k | ~42 GB | ~5.3 GB | 42 GB |
| llama3-8B-chat | 16 GB bf16 | 100k | ~71 GB | ~6.6 GB | 69 GB |
The most striking figure is qwen3-next-80B on an 8 GB GPU, but it also best illustrates the trade-off: the reported performance is around 1 token every 2 seconds. That is not suitable for a smooth conversational experience, but it may be enough for offline workloads.
And that is the key point. oLLM does not compete with a data center GPU in throughput. It competes with the impossibility of running certain models locally at all.
Use Cases for Sysadmins and Developers
For a publication aimed at system administrators and developers, the interest is not in a spectacular demo, but in understanding where this technique fits.
One clear use case is large-scale document analysis. Contracts, regulations, technical manuals, compliance reports, legacy documentation, or RFCs can be processed with much larger context windows without depending on external APIs.
Another scenario is log and security report analysis. An administrator could use a large model to review threat reports, log dumps, traces, or internal documentation without sending sensitive data outside their environment.
It can also make sense for testing models before deploying them on larger infrastructure. A development team can validate prompts, behavior, compatibility with PEFT adapters, or context capacity before deciding whether it is worth moving the model to a GPU with more VRAM or to a production environment.
In addition, oLLM supports multimodal cases such as Gemma 3 12B with image and text or Voxtral Small 24B with audio and text, making it relevant for local AI labs beyond the simple chatbot scenario.
AirLLM: The Direct Precedent
oLLM is not the first project to explore this route. AirLLM had already popularized the idea of running 70B models on 4 GB GPUs through layer-by-layer inference, without quantization, distillation, or pruning. It also claims to run Llama 3.1 405B with 8 GB of VRAM, again with the expected performance limitations.
The philosophy is similar: split the model, load layers when needed, and reduce VRAM usage. AirLLM later added optional 4-bit or 8-bit compression to speed up inference, because in this approach the bottleneck is often disk loading.
The most interesting difference is that oLLM appears to be especially focused on long contexts and on more direct integration with recent models through Hugging Face Transformers and PyTorch, including AutoInference for Llama 3 and Gemma 3 models, PEFT adapter support, and KV cache offloading to SSD.
The Hidden Cost: The SSD Takes the Hit
The less flashy but most important part for system administrators is the impact on storage. In these approaches, the SSD stops being a simple model repository and becomes an active component during inference.
This has several practical implications.
The first is space. Downloading the model is not enough. Some projects split the model by layers, generate caches, and may temporarily duplicate data. When working with 80B or 405B models, storage requirements can grow very quickly.
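A simple precaution before launching a long run is to check that the target volume actually has room for weights plus caches. Here is a minimal sketch using only the standard library; the path and the 200 GB threshold are illustrative values, not oLLM requirements:
import shutil

def check_free_space(path="/mnt/nvme_models", required_gb=200):
    # Fail early if the volume holding weights and KV cache is too small.
    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < required_gb:
        raise RuntimeError(f"Only {free_gb:.0f} GB free on {path}; "
                           f"at least {required_gb} GB are needed for weights and cache")
    return free_gb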
The second is real NVMe latency and bandwidth. Not all SSDs are equal. A PCIe 4.0 or 5.0 NVMe drive with a good controller, DRAM, and proper cooling may behave reasonably well. A cheap SSD, DRAM-less unit, nearly full drive, or thermally throttled device can completely ruin the experience.
The third is wear. Constant offloading of weights and KV cache involves reads and, depending on the configuration, significant writes as well. For occasional workloads this should not be dramatic, but for intensive use it is worth monitoring:
- TBW consumed.
- NVMe temperature.
- Thermal throttling.
- SMART errors.
- Free space.
- IOPS and latency under load.
For serious testing, it makes sense to use a dedicated NVMe drive for models and caches, rather than the system’s main disk. It is also advisable to avoid low-end QLC drives if the workload is expected to be intensive.
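For long offload sessions it can also help to capture SMART data at regular intervals. A rough sketch that simply shells out to nvme-cli; it assumes the tool is installed, the caches live on /dev/nvme0, and the script runs with enough privileges:
import subprocess, time

def log_nvme_health(device="/dev/nvme0", interval_s=300):
    # Periodically dump the SMART log (temperature, percentage used, data units written).
    while True:
        report = subprocess.run(["nvme", "smart-log", device],
                                capture_output=True, text=True, check=True).stdout
        print(report)
        time.sleep(interval_s)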
Practical Recommendations Before Testing It
For a local lab, a reasonable setup should include an NVIDIA GPU with at least 8 GB of VRAM, a fast NVMe SSD with plenty of free space, enough system RAM, and an isolated Python environment using venv or conda.
On NVIDIA, oLLM can benefit from optional dependencies such as kvikio and flash-attn, although the project states they are no longer mandatory. This matters because it reduces hardware restrictions and makes installation easier on more machines.
A basic installation starts with a Python environment:
python3 -m venv ollm_env
source ollm_env/bin/activate
pip install --no-build-isolation ollm
Or from source:
git clone https://github.com/Mega4alik/ollm.git
cd ollm
pip install --no-build-isolation -e .
For NVIDIA with CUDA, kvikio can be added:
pip install kvikio-cu12
And to avoid PyTorch memory fragmentation issues, the project itself suggests running examples with:
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python example.py
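The same allocator setting can be applied from Python instead of the shell, as long as it is in place before PyTorch initializes CUDA; this is a small sketch, not taken from the project's own examples:
import os
# Must be set before the CUDA caching allocator is initialized,
# so do it before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after the variable is set, on purpose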
In development environments, it is also advisable to separate directories:
- models/ for weights.
- kv_cache/ for cache.
- A dedicated NVMe volume if possible.
- Monitoring with nvme smart-log, iostat, iotop, nvidia-smi, and system metrics.
Not for Interactive Production, but Useful for Advanced Labs
The correct reading of oLLM is not “large GPUs are no longer needed.” That conclusion would be exaggerated. For production, concurrency, low latency, interactive agents, or multi-user APIs, VRAM, memory bandwidth, and compute capacity are still essential.
The correct reading is different: some offline workloads can benefit from large models even if they run slowly. If a process takes half an hour but allows a full report to be analyzed with a model that previously could not even be loaded, the compromise may be acceptable.
This fits well with tasks such as:
- Internal document analysis.
- Large log review.
- Private AI labs.
- Model testing before deployment.
- Comparing behavior between quantized and non-quantized models.
- Local processing of sensitive data.
- Experimenting with very long contexts.
For developers, it is a way to get closer to large models without always depending on external APIs. For system administrators, it is another example of how inference architecture is starting to look increasingly like a classic systems problem: memory hierarchy, latency, cache, throughput, storage, and resource planning.
The Shift in Mindset
For a long time, the debate around local AI has focused on “which model fits in my GPU?” Projects such as oLLM and AirLLM change the question: “which model can I run if I redesign how memory moves?”
It is not a universal solution. It is not fast. It is not free for the SSD. But it is a very powerful tool for certain technical profiles.
The key is understanding that VRAM is no longer the only critical resource. CPU, RAM, the PCIe bus, SSD, file system, runtime, and layer organization all matter. For those coming from the systems world, this feels familiar: when everything does not fit in fast memory, you build a hierarchy and accept the cost of moving between levels.
In that sense, oLLM is not just a curiosity for running huge models on modest hardware. It is a signal of where local inference may evolve: less obsessed with loading everything into the GPU, and more focused on intelligently orchestrating the available resources.
Frequently Asked Questions
Is oLLM useful for running a fast local chatbot?
Not as its main use case. For a smooth local chat experience, smaller quantized models are usually a better option. oLLM makes more sense for offline inference, long contexts, and testing large models without reducing precision.
What advantage does it have over quantizing the model?
Its main advantage is that it can keep fp16/bf16 precision and avoid part of the degradation associated with some quantization methods. The downside is that inference will be much slower and highly dependent on the SSD.
What type of SSD should be used?
Ideally, a fast NVMe drive with a good controller, proper cooling, and reasonable endurance. For intensive use, it is better to dedicate a specific SSD to models and caches, and avoid low-end or nearly full drives.
Can it be used in production?
For interactive production or multiple users, it does not seem like the right option. For batch processes, offline analysis, internal labs, or model evaluation, it can be very useful.
Do AirLLM and oLLM do the same thing?
They share a similar philosophy: reducing VRAM usage through layer-by-layer loading and offloading. AirLLM was one of the projects that popularized this approach. oLLM adds a clear focus on long contexts, SSD cache, recent models, and usage with Hugging Face Transformers/PyTorch.
