TL;DR: Quantization shrinks AI models by reducing the precision of their internal numbers — like compressing a RAW photo to JPEG. The file gets much smaller, it runs on cheaper hardware, and the quality loss is barely noticeable. This is how Ollama runs a 70-billion-parameter model on your MacBook. You don't need to understand the math. You need to know which quantization level to pick and what you're trading off.
Why Vibe Coders Need to Know This
Every AI model you use through Claude, ChatGPT, or Cursor runs on massive GPU clusters in data centers. That works great — until you need to work offline, process sensitive code you don't want leaving your machine, or just stop paying per-token for tasks you do a hundred times a day.
Local AI is the answer. Tools like Ollama let you run AI models directly on your laptop. But there's a problem: these models are enormous. Llama 3 70B, one of the best open-source models, needs about 140GB of memory at its native 16-bit precision. Your MacBook probably has 16–32GB.
Quantization solves this. It's the technology that makes local AI possible on consumer hardware. When you type ollama run llama3, you're running a quantized model. When you see "Q4_K_M" or "Q8_0" in a model name on HuggingFace, that's telling you how aggressively the model has been compressed.
You don't need to become a quantization expert. But understanding the basics helps you pick the right model for your hardware and understand why some local models feel "smarter" than others.
The Photo Analogy: RAW vs JPEG
Here's the simplest way to think about it:
- A full-precision AI model is like a RAW photo file. Massive. Every pixel stored at maximum detail. 50MB per image. Beautiful — but you can't email it, it takes forever to load, and your phone's storage fills up instantly.
- A quantized AI model is like a JPEG. The file is 5x–10x smaller. You lose some detail if you zoom way in, but at normal viewing distance? Looks identical. Good enough for everything except professional printing.
Quantization does the same thing to AI models. It reduces the precision of the billions of numbers inside the model, making the file dramatically smaller and faster to run. The model loses a tiny bit of "intelligence" — but for most tasks, including coding, you can't tell the difference.
The Numbers: What FP32, FP16, INT8, INT4 Actually Mean
You'll see these abbreviations everywhere in local AI. Here's what they mean in practical terms — no math required:
| Format | Bits Per Number | Model Size (70B params) | Quality | Your Hardware |
|---|---|---|---|---|
| FP32 (Full) | 32 bits | ~280 GB | 100% (reference) | Data center only |
| FP16 (Half) | 16 bits | ~140 GB | ~99.9% | High-end GPU (A100, H100) |
| INT8 / Q8 | 8 bits | ~70 GB | ~99% | 64GB+ Mac or dual GPU |
| INT4 / Q4 | 4 bits | ~35–40 GB | ~96–97% | 32GB Mac or single GPU |
| INT2 / Q2 | 2 bits | ~17–20 GB | ~90% | 16GB laptop (experimental) |
Every AI model is just billions of numbers (called parameters or weights). At full precision (FP32), each number takes 32 bits of storage. Quantization says: "What if we stored each number with only 4 bits instead of 32?" The model gets 8x smaller. Some precision is lost — a number that was 0.7823451 might become 0.78 — but the model still works remarkably well.
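The arithmetic is simple enough to check yourself. A quick sketch (weights only; real GGUF files add a small amount of metadata and overhead):

```bash
# size in GB ≈ parameters (billions) × bits per weight / 8 bits per byte
for bits in 32 16 8 4 2; do
  echo "70B at ${bits}-bit: ~$((70 * bits / 8)) GB"
done
# 280, 140, 70, 35, and 17 GB, matching the table above
```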
Q4_K_M (4-bit, K-quant, medium quality) is the go-to for most people. It's what Ollama uses by default for most models. It runs on 16–32GB hardware, the quality loss is barely measurable, and it's fast. If you have more RAM, step up to Q5_K_M or Q6_K. If you're RAM-constrained, Q3_K_M still works but you'll notice the model is less precise in its responses.
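You don't have to take the default, either: Ollama publishes multiple quantization levels as tags on the same model. The exact tag names below are illustrative and vary by model, so check ollama.com/library for what's actually available:

```bash
# Pull a specific quantization level by naming the tag
ollama pull llama3:8b-instruct-q4_K_M   # the usual sweet spot
ollama pull llama3:8b-instruct-q8_0     # higher quality, roughly twice the size
```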
Decoding Model Names: What "Q4_K_M" Actually Tells You
When you browse models on HuggingFace or Ollama, you'll see names like:
```
llama-3-70b-instruct-Q4_K_M.gguf
codellama-34b-Q5_K_S.gguf
mistral-7b-Q8_0.gguf
```
Here's how to read them:
- Q4, Q5, Q6, Q8 — The number of bits. Lower = smaller file, slightly less quality
- K — Uses "K-quant" method (better quality than older methods at the same size)
- S, M, L — Small, Medium, Large. This refers to how many layers get the higher-precision treatment. M (medium) is the default sweet spot
- _0 — The original, simpler quantization method (no K-quant). You'll see it on Q8_0 and Q4_0
- .gguf — The file format. GGUF (GPT-Generated Unified Format) is the standard for local AI models
So Q4_K_M means: "4-bit quantization, using the K-quant method, medium quality tier." That's the one you want 90% of the time.
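If you ever want to sanity-check filenames in a download script, the naming is regular enough to pick apart with a quick pattern match. A toy sketch in bash:

```bash
# Extract the quantization suffix from a GGUF filename
f="llama-3-70b-instruct-Q4_K_M.gguf"
if [[ $f =~ (Q[0-9]+(_K)?(_[SML0])?) ]]; then
  echo "Quantization: ${BASH_REMATCH[1]}"   # prints: Quantization: Q4_K_M
fi
```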
How Ollama Makes This Invisible
Here's the beautiful part: if you're using Ollama, you probably don't need to think about any of this. When you run:
```bash
ollama run llama3
```
Ollama downloads a pre-quantized build of the model. The default tag for most models is a 4-bit version (Q4), which is why it fits on consumer hardware. You just type a prompt and get responses.
But understanding quantization helps when you want to:
- Try a bigger model: "Can I run the 70B model instead of 8B?" — depends on your RAM and which quantization level you choose
- Improve quality: If responses feel off, try a higher quantization level (Q5 instead of Q4) or a larger model at lower quantization
- Download from HuggingFace: When a model isn't on Ollama yet, you'll pick the GGUF file yourself
- Compare models: A 13B model at Q8 might be better than a 70B model at Q2 — size isn't everything
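For sizing questions like these, two Ollama commands do most of the work (ollama ps requires a reasonably recent Ollama version):

```bash
ollama list   # every model you've downloaded, with its size on disk
ollama ps     # models currently loaded, and how much memory each is using
```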
Practical Guide: Picking the Right Model for Your Machine
| Your RAM | Best Model Size | Quantization | What You Can Run |
|---|---|---|---|
| 8 GB | 3B–7B | Q4_K_M | Phi-3, Gemma 2B, small Llamas. Good for quick questions, basic code gen |
| 16 GB | 7B–13B | Q4_K_M | Llama 3 8B, CodeLlama 13B, Mistral 7B. Solid coding assistant |
| 32 GB | 13B–34B | Q4_K_M to Q5_K_M | CodeLlama 34B, Llama 3 70B at Q2/Q3. Very capable |
| 64 GB | 70B | Q4_K_M to Q6_K | Llama 3 70B at good quality. Near-frontier intelligence locally |
| 128 GB+ | 70B+ | Q8_0 or FP16 | Full quality large models. Research-grade local setup |
The rule: The model file needs to fit in RAM with room to spare. If a Q4_K_M GGUF file is 35GB, you need at least 40–45GB of available RAM to run it comfortably (the model needs extra memory for processing).
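Here's that rule as a quick pre-flight check, sketched for macOS (hw.memsize is a macOS sysctl; on Linux you'd read /proc/meminfo instead). The 8GB headroom figure is a rough assumption, not a hard rule:

```bash
#!/usr/bin/env bash
# Usage: ./fits.sh 35   (the GGUF file size in GB)
model_gb=${1:?usage: fits.sh <model size in GB>}
total_gb=$(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 ))
if (( total_gb >= model_gb + 8 )); then
  echo "Should fit: ${total_gb}GB RAM for a ${model_gb}GB model with 8GB headroom"
else
  echo "Too tight: ${total_gb}GB RAM for a ${model_gb}GB model"
fi
```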
On Apple Silicon Macs, the GPU and CPU share the same memory (unified memory architecture). This is why Macs are disproportionately good at running local AI — a 64GB M2 Max can run models that would require an expensive dedicated GPU on Windows/Linux. See our guide to running large AI models on your Mac for the full setup.
Google TurboQuant: The Future of Compression
In March 2026, Google Research published TurboQuant — a new quantization technique that pushes compression further than anyone thought possible. The headline result: models compressed to 2-bit precision (Q2) that perform nearly as well as standard 4-bit models.
Why this matters for vibe coders:
- Today: Running a 70B model needs 32–64GB RAM at Q4
- With TurboQuant-level compression: That same model could fit in 16–20GB — meaning a base MacBook Air could run frontier-level models
- The trajectory: Every year, the same intelligence runs on cheaper hardware. What needed a $3,000 Mac last year might need a $1,200 Mac next year
TurboQuant isn't a tool you can download today — it's a research technique that model creators will incorporate into future releases. But it signals where local AI is heading: smaller, faster, and eventually running on your phone.
GGUF: The Format That Made Local AI Happen
Nearly every model you'll run locally ships in the GGUF file format. It was created by the llama.cpp project (the open-source engine that Ollama, LM Studio, and most local AI tools are built on) and became the de facto standard.
Think of GGUF like MP4 for video — it's just the container format everyone agreed on. When you download a model for local use, it's a single .gguf file. That file contains the model weights (the billions of numbers), the quantization method used, and metadata about how to run it.
You don't need to understand GGUF internals. You just need to know: if it's a .gguf file, it works with Ollama, LM Studio, llama.cpp, and most other local AI tools. If it's in a different format (like PyTorch .pt files or SafeTensors .safetensors), you'll need to convert it first — or find someone who already did on HuggingFace.
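If you do need to convert, llama.cpp ships the tooling. A rough sketch of the flow (script and binary names have moved around between llama.cpp versions, so treat this as a map and check the repo's README for the current ones):

```bash
git clone https://github.com/ggerganov/llama.cpp
pip install -r llama.cpp/requirements.txt

# 1. Convert HuggingFace weights (SafeTensors/PyTorch) to a 16-bit GGUF
python llama.cpp/convert_hf_to_gguf.py ./my-hf-model --outfile my-model-f16.gguf

# 2. Quantize it down (build llama.cpp first; see its README)
./llama.cpp/llama-quantize my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M
```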
When to Run Local vs When to Use the Cloud
| Scenario | Use Local (Quantized) | Use Cloud (Claude/ChatGPT) |
|---|---|---|
| Complex app architecture | | ✅ Needs frontier intelligence |
| Sensitive proprietary code | ✅ Never leaves your machine | |
| On an airplane | ✅ No internet needed | |
| Batch processing 500 files | ✅ No per-token cost | |
| Debugging a tricky error | | ✅ Needs best reasoning |
| Quick code snippets | ✅ Instant, no latency | |
| Learning a new framework | | ✅ Needs broad knowledge |
| Experimenting with prompts | ✅ No API costs | |
Most vibe coders use a hybrid approach: cloud APIs (through OpenRouter or directly) for their primary work, and local models for privacy-sensitive, offline, or high-volume tasks.
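The hybrid setup is less plumbing than it sounds, because Ollama exposes an OpenAI-compatible API on localhost. Many tools can switch between cloud and local with nothing more than a base-URL change:

```bash
# Ollama listens on localhost:11434 and speaks the OpenAI chat format
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Review this function for bugs"}]
  }'
```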
What AI Gets Wrong About Quantization
AI assistants often recommend the largest model available without considering your hardware. "Use Llama 3 70B for best results!" — sure, if you have 64GB of RAM. On a 16GB machine, a well-quantized 8B model will run circles around a 70B model that's constantly swapping to disk. A model that fits in memory and runs fast is always better than one that technically loads but takes 30 seconds per response.
AI will say "quantization barely affects quality" as a blanket statement. That's true for Q8 and Q6. It's mostly true for Q4_K_M. It's not true for Q2 or Q3 — at those levels, you'll notice the model makes more mistakes, loses context more easily, and generates less coherent code. The difference between Q4_K_M and Q2_K is like the difference between JPEG at 80% quality and JPEG at 20% quality.
Quantization shrinks the model weights, but the context window (how much text the model can "see" at once) also uses RAM. A 70B model at Q4 might fit in 40GB of RAM — but if you need a 32K context window, add another 10–15GB. AI rarely mentions this. If your model runs fine with short prompts but crashes with long files, you're hitting the context memory limit, not a model size problem. Reduce context length or use a smaller model.
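To see why long context costs so much, here's the back-of-envelope KV-cache math for a 70B Llama-style model. The dimensions below (80 layers, 8 KV heads, head size 128, 16-bit cache) match the published Llama 3 70B shape, but treat the result as a rough estimate; actual usage varies by runtime:

```bash
# KV cache per token = 2 (K and V) × layers × kv_heads × head_dim × 2 bytes (FP16)
layers=80; kv_heads=8; head_dim=128; fp16_bytes=2; ctx=32768
per_token=$((2 * layers * kv_heads * head_dim * fp16_bytes))
echo "Per token: ${per_token} bytes"                            # 327680
echo "At ${ctx} tokens: $((per_token * ctx / 1073741824)) GiB"  # 10 GiB
```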
The local AI landscape changes monthly. AI might recommend Llama 2 quantized models when Llama 3 has been out for months and is dramatically better at the same size. Always check the Ollama library or HuggingFace trending models for the latest options. The best 7B model today is significantly better than the best 13B model from six months ago — newer architectures matter more than raw parameter count.
Getting Started: Your First Quantized Model in 60 Seconds
If you haven't run a local model yet, here's the fastest path:
```bash
# Install Ollama (the script covers Linux; on macOS, download the app
# from ollama.com or use: brew install ollama)
curl -fsSL https://ollama.com/install.sh | sh

# Run your first model (auto-downloads a quantized version)
ollama run llama3

# That's it. Start typing prompts.
```
Ollama handles everything: downloading a pre-quantized build, loading it into memory, and serving the model. You'll be chatting with a local AI in under a minute.
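A couple of useful variations once it's running:

```bash
# Pass a prompt directly instead of opening the interactive chat
ollama run llama3 "Explain quantization in one sentence"

# See what you actually downloaded, including its quantization level
ollama show llama3
```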
If you want more control (specific quantization levels, models not on Ollama):
```bash
# Download a specific GGUF from HuggingFace
# Example: CodeLlama 34B at Q4_K_M
# 1. Find the model on huggingface.co (search for "GGUF")
# 2. Download the .gguf file
# 3. Create an Ollama Modelfile and build a local model from it:
echo 'FROM ./codellama-34b-instruct-Q4_K_M.gguf' > Modelfile
ollama create my-codellama -f Modelfile
ollama run my-codellama
```
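A Modelfile can carry more than the FROM line. A minimal sketch with two common settings (see Ollama's Modelfile documentation for the full syntax):

```bash
cat > Modelfile <<'EOF'
FROM ./codellama-34b-instruct-Q4_K_M.gguf
# Lower temperature = more deterministic output, good for code
PARAMETER temperature 0.2
# Context window size in tokens (costs RAM, as noted above)
PARAMETER num_ctx 8192
EOF
ollama create my-codellama -f Modelfile
```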
"I have a MacBook with [X]GB of RAM. I want to run a local AI model for [coding/writing/general use]. What's the best model and quantization level? I'm using Ollama."
Frequently Asked Questions
Does quantization make the model dumber?
Slightly, but usually not enough to notice. Q8 loses almost nothing. Q4_K_M loses about 1–3% on benchmarks, roughly like using a model one size smaller. For coding, Q4 models still generate working code, catch bugs, and follow instructions. You'd need side-by-side comparisons to spot the difference. The tradeoff is worth it when the alternative is not running the model at all.
What should I run on my hardware?
With 16GB RAM: Q4_K_M on a 7B–8B model. With 32GB: Q4_K_M on a 13B–34B model (or Q5_K_M on 13B). With 64GB+: Q4_K_M on 70B, frontier-level intelligence running on your laptop. Always leave 5–10GB of headroom for your OS, browser, and IDE.
What is GGUF?
GGUF is the standard file format for local AI models, created by the llama.cpp project. Think of it like MP3 for music: it's the container format everyone agreed on. It works with Ollama, LM Studio, llama.cpp, and most local AI tools. If a model file ends in .gguf, you can run it locally.
Should I use local models or cloud AI?
Use both. Cloud APIs for complex architecture, debugging, and learning new things. Local models for sensitive code, offline work, batch processing, and cost-free experimentation. Most vibe coders end up with a hybrid setup: cloud for primary work, local for specific use cases.
What is Google TurboQuant?
A March 2026 research technique from Google that achieves extreme compression (2-bit and below) with less quality loss than previous methods. It's not downloadable yet; it's a research paper that model creators will incorporate into future releases. It signals that local AI will keep getting smaller and faster, eventually running frontier models on budget hardware.