TL;DR: Ollama is a free app that downloads AI models — like Llama 3, Mistral, or Gemma — and runs them on your own Mac, Windows, or Linux machine. You get a ChatGPT-style AI that works with no internet, no API keys, and no per-message cost. Quality is behind the top cloud models, but good enough for a lot of real work, and it's completely private.

The Problem Ollama Solves

Picture this: you're building a side project that uses AI. You paste some code into Claude, get a great answer, paste more code, get another answer. At the end of the month, you check your API bill: $47. You used it more than you thought. Next month you try to be conservative and the project slows down.

Or maybe the problem is different. You work with client data — code, contracts, notes — and you're not sure you should be sending it to a server you don't control. The terms of service say your data isn't used for training, but you're not completely comfortable with it.

Or you're on a plane, a job site, or somewhere with bad internet. Your AI-assisted workflow stops completely.

Ollama fixes all three of these at once. You download it, you pull a model, and from that point on you have a fully functional AI that lives on your machine. It doesn't phone home. It doesn't charge you. It works in the middle of a mountain.

What Ollama Actually Is

Think of Ollama like an app store manager for AI models, but the apps run entirely on your computer instead of in someone else's server room.

When you use ChatGPT, your message travels over the internet to OpenAI's computers, their AI processes it, and the answer comes back. Their hardware, their servers, their electricity bill — you're renting all of that per message.

Ollama works differently. It downloads the AI model — the actual software that does the thinking — and installs it on your machine. When you ask a question, everything happens right there on your CPU or GPU. No network hop. No waiting on someone else's server. No cost per answer.

The AI models Ollama runs are open-source models. These are models that companies and researchers have made publicly available — Meta's Llama series, Mistral AI's models, Google's Gemma, Microsoft's Phi series, and many others. They're free to download and run. Ollama handles all the technical plumbing to make them actually work on your hardware.

This is what the community of people building with local AI models for coding has rallied around. Ollama became the default tool because it turned a setup that used to require a Linux server and a PhD into something anyone can do in fifteen minutes.

Getting Started: What It Actually Looks Like

Here's the full journey from nothing to a working local AI — no steps skipped.

Step 1: Install Ollama

Go to ollama.com, download the installer for your operating system, and run it. On a Mac it's a standard .dmg install — drag it to Applications. On Windows it's a standard .exe installer. On Linux, there's a one-line install script. The whole thing takes under two minutes.

After installation, Ollama runs quietly in the background as a system service. You'll see a small icon in your menu bar (Mac) or system tray (Windows). It's ready to receive commands.
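A quick way to confirm everything is in place: the CLI should report a version, and the background service answers on its default port, 11434 (this is a sanity check; exact output varies by version).

```shell
# Confirm the Ollama CLI is installed
ollama --version

# The background service listens on port 11434; this should print "Ollama is running"
curl http://localhost:11434
```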

Step 2: Pull a Model

Open a terminal (Command Prompt or PowerShell on Windows, Terminal on Mac). Type this:

# Download Llama 3.2 (3 billion parameters; fast, runs on anything)
ollama pull llama3.2

Ollama downloads the model file. Llama 3.2 3B is about 2GB. A larger model like Llama 3.1 8B is about 5GB. The download happens once; after that, the model stays on your machine until you delete it.

Think of it like downloading an app. The app file sits on your computer and runs locally. The AI model is the same idea, just a different kind of file.
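And like apps, models can be listed and uninstalled. Two commands handle the housekeeping:

```shell
# See every model you've pulled, with its size on disk
ollama list

# Remove a model you no longer need to free up space
ollama rm llama3.2
```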

Step 3: Talk to It

Once the model is downloaded, you can start chatting immediately:

# Start an interactive chat session
ollama run llama3.2

Your terminal turns into a chat interface. Type a message, press Enter, and the model responds — right there in your terminal, no internet connection involved. Type /bye to exit.

That's it. You now have a working local AI. No account, no API key, no credit card, no subscription.
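The chat session isn't the only way in. `ollama run` also accepts a one-shot prompt and reads piped input, which makes it scriptable from day one (the `script.py` filename below is just a placeholder; piping behavior can vary slightly between versions):

```shell
# One-shot: print an answer and exit, no interactive session
ollama run llama3.2 "Summarize the difference between TCP and UDP in two sentences"

# Pipe a file in as context for the prompt
cat script.py | ollama run llama3.2 "Explain what this code does"
```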

Which Models Can You Run?

Ollama's model library at ollama.com/library has hundreds of options. Here are the ones that matter most for everyday coding and writing work:

Models Worth Knowing About

llama3.2        — Meta's fast general model (1B/3B). Good for quick tasks.
llama3.1        — Bigger Llama variant (8B/70B). More capable, needs more RAM.
mistral         — Mistral AI's workhorse (7B). Strong at code and reasoning.
codellama       — Meta's coding-focused model. Built for writing and explaining code.
deepseek-coder  — Strong coding model from DeepSeek. Competitive with paid tools.
gemma2          — Google's Gemma 2 (2B/9B/27B). Good balance of size and quality.
phi3            — Microsoft's small but capable Phi-3. Runs well on weak hardware.
qwen2.5-coder   — Alibaba's coding-focused model. Surprisingly good for code tasks.

The number after the model name (7B, 13B, 70B) refers to how many billion parameters the model has. More parameters generally means smarter, but also means larger file size and more RAM required. A 7B model needs about 8GB of RAM to run comfortably. A 70B model needs 40–48GB, which means you'd need a high-end Mac with 48GB unified memory or a dedicated GPU workstation.

For most people on a standard laptop: start with llama3.2 or mistral. Both are capable enough for real work without needing beastly hardware.
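Those RAM figures follow a rough rule of thumb: at the 4-bit quantization most Ollama models ship with, each billion parameters costs about half a gigabyte of memory for the weights, plus runtime overhead. Here's a back-of-envelope sketch (my own approximation, not an official formula — real usage varies with quantization level and context length):

```python
def approx_ram_gb(params_billion: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Rough weight-memory estimate: params * (bits / 8) bytes, plus ~20% runtime overhead."""
    return params_billion * bits / 8 * overhead

# Ballpark estimates for common model sizes
for size in (3, 7, 13, 70):
    print(f"{size}B model: ~{approx_ram_gb(size):.1f} GB")
```

A 7B model lands around 4GB of weights, which is why 8GB of system RAM is the comfortable floor once the OS takes its share; a 70B model lands around 42GB, which matches the 40–48GB figure above.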

Ollama vs. Cloud APIs: The Honest Comparison

This is the comparison people ask about constantly, especially after the Reddit thread on r/ChatGPTCoding comparing Ollama Cloud Max setups against paying for Claude Max. Here's the real picture:

Cost — Ollama: free after download, zero per-query cost. Cloud: pay per token, or $20–$200/month subscriptions.
Model quality — Ollama: good for most tasks, behind GPT-4/Claude Sonnet on hard problems. Cloud: best-in-class for complex reasoning, large context, nuanced output.
Speed — Ollama: depends on your hardware; Apple Silicon is fast, older laptops are slow. Cloud: fast and consistent on server-grade GPUs.
Privacy — Ollama: complete; nothing leaves your machine, ever. Cloud: data sent to the provider's servers, with policy-dependent handling.
Internet required — Ollama: only for the initial download; fully offline after that. Cloud: always; every query goes over the network.
Model selection — Ollama: hundreds of open-source models, with new ones added constantly. Cloud: limited to that provider's models; Claude = Claude, GPT = GPT.
Setup effort — Ollama: 15 minutes to first working model. Cloud: minutes (API key) to instant (web interface).
Context window — Ollama: smaller, typically 4K–128K tokens depending on the model. Cloud: larger; Claude supports up to 200K tokens.

The Reddit thread that blew up compared "Ollama Cloud Max" setups — people running powerful models locally via Ollama — against paying $100/month for Claude Max. The honest community verdict: if you have a beefy Mac with 48GB RAM, running Llama 3.1 70B locally gets close to Claude Sonnet quality for zero ongoing cost. If you're on a standard laptop, Claude Max is still worth it for hard problems. Most developers end up using both.

Using Ollama With Your Coding Tools

The terminal chat interface is just the start. The real power is connecting Ollama to the tools you already use.

Open WebUI: A Proper Chat Interface

Open WebUI is a free web app that gives you a ChatGPT-style interface for your local Ollama models. It runs in your browser, looks polished, supports conversation history, and lets you switch between models easily. You install it via Docker (one command), point it at your running Ollama server, and suddenly your local AI feels like a proper product rather than a terminal experiment.
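That one Docker command looks roughly like this (flags and image tag current at the time of writing; check Open WebUI's own docs for the latest version):

```shell
# Run Open WebUI on http://localhost:3000, connected to the host's Ollama server
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```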

Continue.dev: AI Inside VS Code

Continue is a free VS Code extension that adds an AI assistant sidebar to your editor. By default it connects to Claude or GPT-4, but you can tell it to use your local Ollama server instead. Switch one setting and your code completions, explanations, and chat all run locally. No API costs, no internet dependency while you code.
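In Continue's config file, that "one setting" is a model entry with the Ollama provider. A sketch of what it looked like in the JSON config format (newer Continue versions use a YAML config, so treat this as illustrative and check its docs for the current schema):

```json
{
  "models": [
    {
      "title": "Llama 3.1 8B (local)",
      "provider": "ollama",
      "model": "llama3.1:8b"
    }
  ]
}
```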

The Local API

Ollama runs a local web server on port 11434. Any app that can talk to an API can talk to Ollama. The format is deliberately similar to OpenAI's API — which means tools built for ChatGPT can often be redirected to Ollama with one config change.

# Ask Ollama a question via its local API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain what a REST API is, as if to a plumber",
  "stream": false
}'

This is how you'd plug Ollama into your own scripts or projects. For tools built against OpenAI's API format, point them at http://localhost:11434/v1 and they work. No API key required because you're talking to your own machine.
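From a script, the same call is a few lines. Here's a minimal sketch using only the Python standard library, assuming Ollama is running on its default port and llama3.2 is pulled (the `ask_ollama` and `build_request` helper names are mine, not part of Ollama):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(prompt: str, model: str = "llama3.2") -> bytes:
    """Encode the JSON body for a non-streaming generate call."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ask_ollama(prompt: str, model: str = "llama3.2") -> str:
    """POST a prompt to the local Ollama server and return the generated text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(ask_ollama("Explain what a REST API is, in one paragraph."))
```

Swap in any model you've pulled; nothing else changes.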

If you're building with Python and want to understand what tokens and context windows mean for your local model, this breakdown of AI tokens and context limits explains the mechanics without jargon.

What Your Computer Actually Needs

This is the practical question most people have before they try. The short answer: most computers made in the last five years can run something useful. The question is how big a model and how fast.

Hardware Reality Check

MacBook Air M2 (8GB RAM)
  → Can run: llama3.2 3B, phi3, gemma2 2B
  → Sweet spot: Small, fast models for quick tasks
  → Feel: Responsive. 10–20 tokens/second.

MacBook Pro M3 (16–24GB RAM)
  → Can run: mistral 7B, llama3.1 8B, codellama
  → Sweet spot: Most everyday tasks comfortably
  → Feel: Fast. 20–40 tokens/second.

Mac Studio / MacBook Pro M3 Max (48–96GB RAM)
  → Can run: llama3.1 70B, mixtral 8x7B
  → Sweet spot: Serious local AI work, near-cloud quality
  → Feel: Surprisingly fast even for large models.

Windows/Linux with Nvidia GPU (8GB+ VRAM)
  → Can run: Models sized to fit in GPU VRAM
  → Sweet spot: Fast inference with the right CUDA setup
  → Feel: Faster than CPU-only once properly configured.

Older laptop (8GB RAM, no GPU)
  → Can run: Small models (3B–7B) on CPU only
  → Sweet spot: Patient use, simple tasks
  → Feel: Slow. 2–5 tokens/second. Usable but not snappy.

Apple Silicon is the reason Ollama took off so dramatically. The M1/M2/M3 chip family uses "unified memory" — the CPU and GPU share the same RAM pool. That means a 16GB MacBook Pro can run a 13B model where a 16GB Windows laptop with a discrete GPU can barely run a 7B model (because the GPU has its own 4–8GB VRAM that the model has to fit into). This is part of why on-device AI has become practical so quickly.

Using Ollama for Coding: What Actually Works

The question that matters for vibe coders is: can local models actually help me build things? The honest answer by model type:

Code Explanation and Understanding

Strong. Paste a function you don't understand and ask Llama 3.1 8B to explain it — it handles this well. Not quite as thorough as Claude Sonnet, but for most snippets, the explanation is accurate and readable.

Writing Boilerplate and Simple Functions

Good. Asking Mistral 7B to "write a Python function that reads a CSV and returns rows where the 'status' column is 'active'" works reliably. It'll get the structure right and usually get the logic right.
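For scale, the function that request describes is genuinely simple, which is part of why a 7B model handles it reliably. Something like this (my own version of the task, not captured model output):

```python
import csv

def active_rows(path: str) -> list[dict]:
    """Return the rows of a CSV file whose 'status' column is 'active'."""
    with open(path, newline="") as f:
        return [row for row in csv.DictReader(f) if row.get("status") == "active"]
```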

Debugging

Decent. Paste the error message and the surrounding code — local models can often spot the issue. They struggle more with multi-file context and subtle architectural problems where Claude or GPT-4 might catch something they miss.

Large Codebase Understanding

Weaker. The context window on most local models is smaller than cloud APIs, and even when the window is large, smaller models get confused with very long inputs. For "help me understand this 10,000-line codebase" — cloud models are still ahead. For "help me understand this 100-line function" — local is fine.

Dedicated Coding Models

The best option for pure coding work is to use a coding-specialized model. deepseek-coder-v2 and qwen2.5-coder are notably better at code than general-purpose models of similar size. If coding is your primary use case, pull one of these rather than a general model.

For a broader look at how local models fit into a coding workflow, this guide to local AI models for coding covers the full picture — not just Ollama but the whole landscape of options.

Privacy: The Real Reason Many Builders Switch

Here's a scenario that happens more than people talk about: you're building a tool for a client. The client's code, their business logic, their database schemas are all sitting in your prompts. You're using Claude or ChatGPT because they're good. But every time you paste that client code into a chat window, it's leaving your machine.

Most API providers have clear policies: they don't train on your data, they process it securely. But "processed on their server" is different from "never left your machine." For client work, for proprietary code, for anything under an NDA, Ollama changes the calculus completely.

When you use Ollama, your prompt is processed by software running on your CPU or GPU. It generates a response using model weights stored on your hard drive. The whole thing happens inside your own hardware. There is no network traffic to inspect. The client code stays on the client machine.

This is also why Ollama is increasingly interesting for teams at larger companies where sending proprietary code to external AI services requires legal review. Running Ollama on a local server inside a company's network keeps the AI entirely within their infrastructure.

What AI Gets Wrong About Ollama

Ask ChatGPT or Claude to explain Ollama and you'll often get technically correct but practically misleading answers. Here's where the AI summaries miss the point:

"Ollama requires technical expertise to set up"

This was true back when running local models meant compiling llama.cpp by hand. It hasn't been true since late 2023. The current Ollama installer is a standard double-click install on Mac and Windows. The first model pull is one terminal command. If you can install Discord, you can install Ollama. The "technical expertise required" framing scares off exactly the people who would benefit most.

"Local models are significantly worse than cloud models"

Significantly worse on the hardest benchmarks, yes. Noticeably worse for everyday coding tasks, no. Llama 3.1 8B handles the kind of questions most developers ask AI assistants — "explain this error," "write this function," "refactor this code" — well enough that the gap doesn't matter in practice for most sessions. The gap matters most for complex multi-step reasoning, nuanced writing, and tasks requiring very large context.

"You need a powerful GPU"

You need a GPU for fast inference with large models. For small models (3B–7B parameters), Apple Silicon Macs run them fast enough on the CPU/unified memory alone. Many people run Ollama daily on a standard M2 MacBook Air with zero frustration. The GPU requirement is real for 70B models, not for the models most people actually use.

"Ollama is just for experiments, not real work"

Software developers actively use Ollama-powered setups as their primary AI assistant — not as a backup or experiment. The models are capable enough for a high percentage of everyday development tasks. The people saying "not for real work" are often comparing it to GPT-4 on benchmark leaderboards, not to what you actually ask AI assistants to do on a Tuesday afternoon.

How Ollama Relates to Running AI on Other Devices

Ollama is the dominant tool for running AI on laptops and desktops. The broader trend it's part of is on-device AI — processing AI workloads on the device in your hands rather than in a distant server room. This is happening across form factors:

On phones, dedicated AI chips from Apple (Neural Engine) and Qualcomm (Hexagon NPU) allow increasingly capable models to run on-device. Running LLMs on your phone is now possible with tools like llama.cpp-based apps — the same underlying technology that Ollama uses under the hood. The models are smaller (1B–3B parameters to fit in phone memory) but the privacy and offline benefits are the same.

Ollama itself uses llama.cpp under the hood — a highly optimized C++ library for running transformer models efficiently on consumer hardware. You never have to interact with llama.cpp directly; Ollama wraps it in a clean interface. But it's worth knowing that the performance improvements that have made local models fast have come largely from this open-source project.

Frequently Asked Questions

Is Ollama free?

Completely free. Ollama itself is open source with no license cost. The models you download are open-source and also free. You pay nothing ongoing. Your only "cost" is the hardware you already own and the electricity it uses — which for a laptop is negligible.

What computers can run Ollama?

Mac (Apple Silicon M1/M2/M3/M4 runs best, Intel Macs work too), Windows 10/11, and Linux. Apple Silicon Macs are the most popular choice because their unified memory lets you run larger models without a dedicated GPU. Any machine made in the last few years can run something — even if older hardware runs smaller models more slowly.

Does Ollama require internet?

Only for the initial installation and model downloads. After that, it works completely offline. This is one of its biggest selling points — your AI assistant works on planes, job sites, anywhere without connectivity. Nothing in your prompts ever touches the internet once the model is downloaded.

Is Ollama as good as ChatGPT or Claude?

For most everyday tasks — explaining code, writing short functions, summarizing text, drafting messages — local models via Ollama are close enough that the difference won't slow you down. For complex multi-step reasoning, very long documents, and nuanced creative work, the top cloud models are still better. Most serious builders use both: Ollama for quick local work, cloud for the hard stuff.

Can I use Ollama with VS Code or Cursor?

Yes. The Continue.dev extension for VS Code connects to your local Ollama server. Some other tools support Ollama directly in their settings. You point the tool at http://localhost:11434 instead of an external API, and local models handle your code completions and chat. No API key needed because you're talking to your own machine.

What is the best Ollama model for coding?

For coding specifically: deepseek-coder-v2 and qwen2.5-coder are currently among the best. For a general model that also handles code well: llama3.1 (8B for most laptops, 70B if you have a Mac with 48GB+ RAM). For something small and fast on any machine: phi3 or llama3.2 3B.

Ollama Cloud Max vs Claude Max — which should I pay for?

They're different things. "Ollama Cloud Max" usually means running powerful local models via Ollama on high-end hardware — no monthly fee, but requires a beefy machine. Claude Max is a $100–$200/month subscription for unlimited access to Anthropic's Claude models. If you have a Mac with 48GB+ RAM, serious local models can rival mid-tier Claude performance for zero ongoing cost. If you're on standard hardware, Claude Max delivers more raw capability per dollar. Many developers use Ollama for private or quick tasks and Claude Max for the hardest problems.

How much RAM do I need for Ollama?

A rough guide: 8GB RAM can run 3B–7B models (useful, quick). 16GB RAM can run 7B–13B models (solid for most tasks). 32GB can run 13B–34B models (noticeably more capable). 48GB+ can run 70B models (near-cloud quality). On Apple Silicon, all system RAM counts. On Windows/Linux with a GPU, model size is limited by GPU VRAM, not total RAM.

What to Learn Next

Ollama is a great entry point into a whole world of local and on-device AI. These articles fill in the surrounding picture.