TL;DR: DSPy is a Python framework from Stanford that replaces hand-written LLM prompts with code. You write a Signature (what goes in, what comes out) and a Module (the pipeline logic), then run an optimizer that automatically finds the best prompt wording for your chosen model. When you switch models, DSPy re-optimizes instead of you rewriting everything manually. It's overkill for simple chatbots, but genuinely useful when you're building multi-step pipelines that need to stay reliable.
The Problem DSPy Solves
Let's say you're building a tool that reads customer support tickets and automatically categorizes them — billing, bug report, feature request, compliment. You write a prompt:
Your First Prompt (Works Great in Testing)
You are a customer support classifier. Read the ticket below and classify it
into one of: billing, bug_report, feature_request, compliment.
Ticket: {ticket_text}
Respond with only the category label.
Works great. Then six months later, something changes — maybe you swap GPT-3.5 for a newer model, or you add a new category, or the ticket language shifts because you got customers in a new region. Suddenly your 94% accuracy drops to 71%. You spend a weekend manually tweaking the prompt back into shape.
This is prompt drift: the gap between a prompt that worked once and a prompt that keeps working as your pipeline evolves. Every serious LLM application runs into it eventually.
DSPy attacks this at the root. Instead of storing prompt text as a string that breaks when anything changes, you store the intent of each step as Python code. Then you let DSPy's optimizer generate the best prompt for whatever model you're running, against whatever examples you have right now.
Where DSPy Came From
DSPy was created at Stanford's NLP Group, led by Omar Khattab. Its predecessor, DSP (Demonstrate-Search-Predict), landed in late 2022, and the framework was renamed DSPy in 2023. The core idea came from a frustration that the ML community already had decades of experience with automatic hyperparameter optimization — you don't tune neural network learning rates by hand, you run a search — but prompt engineering was still entirely manual.
The name stands for Declarative Self-improving Language Programs (in Python). "Declarative" means you say what you want, not how to ask for it. "Self-improving" means DSPy can rewrite its own prompts using a scoring function you define. This was genuinely novel when it shipped.
A Hacker News post titled "If DSPy is so great, why isn't anyone using it?" hit 160 points in early 2026 and started a real conversation. The honest answer from the community: it is being used — just mostly by teams building production LLM pipelines, not by casual builders. The setup cost is real, but so is the payoff.
The Three Building Blocks
Everything in DSPy is built from three concepts. You don't need to memorize the internals — just understand what each piece does.
1. Signatures — What Goes In and What Comes Out
A Signature is like a function type signature, but for an LLM call. You declare input fields and output fields. DSPy reads the field names, types, and the class docstring to understand the task — and then generates the actual prompt text automatically.
import dspy

# Instead of writing a prompt string, you write a Signature class
class ClassifyTicket(dspy.Signature):
    """Classify a customer support ticket into the correct category."""
    ticket_text: str = dspy.InputField(desc="the raw support ticket text")
    category: str = dspy.OutputField(desc="one of: billing, bug_report, feature_request, compliment")

# DSPy will turn this into a real prompt behind the scenes.
# You never write the prompt text — you just define the task.
Notice what's missing: there's no "You are a helpful assistant" boilerplate. No "respond with only the label" instruction. No few-shot examples. DSPy fills all of that in based on your Signature, your docstring, and any training examples you provide.
2. Modules — The Pipeline Steps
A Module wraps a Signature into an executable step. The simplest Module is dspy.Predict — it just calls the LLM once with your Signature. More complex Modules like dspy.ChainOfThought add automatic reasoning steps before the final answer.
# The simplest Module: just call the LLM with the Signature
classify = dspy.Predict(ClassifyTicket)
# Use it
result = classify(ticket_text="I was charged twice this month, please help")
print(result.category) # → "billing"
# ChainOfThought makes the model think step by step before answering
# — often more accurate for ambiguous inputs
classify_cot = dspy.ChainOfThought(ClassifyTicket)
result = classify_cot(ticket_text="Your app crashed and now I lost my whole project!")
print(result.category) # → "bug_report" (with reasoning trace available)
You can chain Modules together into bigger pipelines by writing a custom Module class with a forward method — similar to how you'd define a neural network layer in PyTorch.
class SupportPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        # Two steps: classify the ticket, then draft a response
        self.classify = dspy.ChainOfThought(ClassifyTicket)
        self.draft_reply = dspy.Predict(DraftReply)  # another Signature you define

    def forward(self, ticket_text):
        classification = self.classify(ticket_text=ticket_text)
        reply = self.draft_reply(
            ticket_text=ticket_text,
            category=classification.category,
        )
        return reply
3. Optimizers — The Magic Part
This is what makes DSPy different from any other LLM library. An Optimizer (called a teleprompter in older DSPy docs) takes your pipeline and a set of labeled examples, then runs a search to find the best prompt wording — including the best few-shot examples to include.
# You need: (1) your pipeline, (2) a metric function, (3) example data
pipeline = SupportPipeline()

# Define what "correct" means for your task
def accuracy_metric(example, prediction, trace=None):
    return example.category == prediction.category

# Load a few labeled examples
trainset = [
    dspy.Example(
        ticket_text="I was charged twice this month",
        category="billing",
    ).with_inputs("ticket_text"),
    dspy.Example(
        ticket_text="The export button does nothing when I click it",
        category="bug_report",
    ).with_inputs("ticket_text"),
    # ... more examples
]

# Run the optimizer — this calls the LLM many times internally
from dspy.teleprompt import BootstrapFewShot
optimizer = BootstrapFewShot(metric=accuracy_metric)
optimized_pipeline = optimizer.compile(pipeline, trainset=trainset)

# Now optimized_pipeline has better prompts than you would have written by hand
optimized_pipeline.save("classifier_v1.json")
The optimizer tries different prompt phrasings, different few-shot examples, and different orderings — then keeps what scores highest against your metric. This is the step that takes time (it can run dozens or hundreds of LLM calls), but you only run it when you update the pipeline — not on every inference request.
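What the optimizer does can be pictured in plain Python. The toy sketch below is only an illustration of the search idea, with a word-overlap stub standing in for the LLM (real DSPy optimizers are far more sophisticated): try candidate demo subsets, score each with the metric, keep the winner.

```python
import itertools

def toy_model(demos, ticket):
    # Stub standing in for an LLM: picks the category of the demo
    # that shares the most words with the incoming ticket.
    def overlap(a, b):
        return len(set(a.lower().split()) & set(b.lower().split()))
    best = max(demos, key=lambda d: overlap(d["ticket"], ticket))
    return best["category"]

def accuracy(demos, devset):
    # Fraction of dev examples the stub classifies correctly
    # when "prompted" with this particular demo set.
    hits = sum(toy_model(demos, ex["ticket"]) == ex["category"] for ex in devset)
    return hits / len(devset)

def search_best_demos(trainset, devset, k=2):
    # The core optimizer idea: enumerate candidate few-shot demo
    # subsets, score each against the metric, keep the best one.
    best_score, best_demos = -1.0, None
    for subset in itertools.combinations(trainset, k):
        score = accuracy(list(subset), devset)
        if score > best_score:
            best_score, best_demos = score, list(subset)
    return best_demos, best_score
```

A real optimizer like BootstrapFewShot also generates candidate demonstrations by running the pipeline itself, rather than only selecting from the training set, but the score-and-keep loop is the same shape.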
DSPy vs. Writing Prompts by Hand
Both approaches work. The question is which one holds up as your project grows.
Manual Prompt Engineering
Pros:
✓ Zero setup — just write a string and call the API
✓ Easy to read and debug for simple cases
✓ No dependencies beyond the model's SDK
Cons:
✗ Prompts break when you switch models
✗ Prompts drift as your data and task evolve
✗ Multi-step pipelines become prompt spaghetti
✗ No automatic improvement from labeled examples
DSPy
Pros:
✓ Prompts are auto-generated and auto-optimized
✓ Model-agnostic — switch from GPT-4o to Claude without rewriting
✓ Structured, testable pipeline code
✓ Scales to complex multi-step LLM workflows
Cons:
✗ Requires Python and a few hours to learn the concepts
✗ Optimizer is slow and uses extra LLM credits
✗ Overkill for single-step tasks
✗ Smaller community than LangChain
The inflection point is roughly three or more chained LLM steps. Below that, you're probably fine with manual prompts. Above that, DSPy starts paying back its setup cost quickly.
Real Use Cases Where DSPy Shines
DSPy is purpose-built for a specific type of application. These are the scenarios where it earns its complexity.
RAG Systems (Retrieval-Augmented Generation)
RAG is the pattern where you search a knowledge base and feed the results to an LLM to answer questions. It's the foundation of most "chat with your docs" tools. RAG has at least three steps: retrieve relevant chunks, rank them, generate an answer. That's already three places where prompt drift can hurt you.
DSPy has first-class support for RAG through its dspy.Retrieve module and the RAG example in its documentation. The optimizer can tune the retrieval query generation and the answer synthesis prompt simultaneously — something that's painful to do by hand.
Multi-Hop Question Answering
Some questions require chaining multiple lookups: "Who founded the company that made the API used in that codebase?" Each hop is an LLM call, and manual prompts for multi-hop chains get fragile fast. DSPy's documentation walks through a multi-hop RAG program built from these same Modules, so the optimizer can tune each hop's prompt together.
Automated Data Labeling and Classification
If you're processing large batches of documents — emails, tickets, reviews, transcripts — and need structured output (categories, sentiment, extracted fields), DSPy's Signatures are a clean way to define those schemas. The optimizer ensures consistent output format far better than "please respond with valid JSON" instructions.
LLM-as-Judge Pipelines
Many production AI systems use a second LLM call to evaluate the first one's output — checking for safety, relevance, or correctness. DSPy can optimize both the generation step and the evaluation step together, which humans doing this manually almost never do.
Setting Up DSPy: A First Look
Here's what getting started actually looks like. This is the minimum viable DSPy program — no optimizer, just using it as a structured prompt layer.
# Install DSPy
# pip install dspy   (older releases were published as dspy-ai)
import dspy
# Configure your model (DSPy works with most major providers)
# Using Claude:
lm = dspy.LM("anthropic/claude-sonnet-4-6", api_key="sk-ant-...")
# Using GPT-4o:
# lm = dspy.LM("openai/gpt-4o", api_key="sk-...")
# Using a local Ollama model:
# lm = dspy.LM("ollama_chat/llama3", api_base="http://localhost:11434")
dspy.configure(lm=lm)
# Define a Signature
class ExtractKeyFacts(dspy.Signature):
    """Extract the three most important facts from a news article."""
    article: str = dspy.InputField()
    key_facts: list[str] = dspy.OutputField(desc="list of exactly 3 key facts")
# Create a Module and call it
extract = dspy.Predict(ExtractKeyFacts)
article_text = """
Scientists announced Tuesday that a new deep-sea species of fish has been
discovered off the coast of New Zealand. The fish, named Pachycara melanostomias,
lives at depths below 2,000 meters and appears to feed primarily on invertebrates.
The discovery was made by a research team from the University of Auckland during
a six-week expedition funded by the New Zealand government.
"""
result = extract(article=article_text)
print(result.key_facts)
# → ['New deep-sea fish species discovered off New Zealand coast',
# 'Species named Pachycara melanostomias lives below 2,000 meters',
# 'Discovery made by University of Auckland team during six-week expedition']
That's it for basic usage. No prompt string. No "please respond in JSON format" gymnastics. DSPy handles the output formatting because you declared key_facts: list[str] as the output type.
The Honest Trade-offs
The Hacker News thread asking why nobody uses DSPy was honest about the friction, and it's worth being honest with you too.
The Optimizer Is Expensive to Run
When you compile a pipeline with even a basic optimizer, DSPy calls the LLM dozens of times per training example — trying different prompt variants and scoring each one. A training set of 50 examples might cost $5–20 in API credits and take 15–30 minutes. For a small team experimenting, this adds up quickly.
The counter-argument: you only pay the compilation cost once. After that, the optimized pipeline runs at normal inference cost. And $20 in API credits is cheap compared to the engineering hours you'd spend manually tweaking prompts to achieve the same accuracy improvement.
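One way to guarantee you pay the compilation cost only once is to cache the compiled program on disk, using the save/load methods shown earlier. The helper below is a hypothetical convenience pattern (not part of DSPy), written duck-typed so anything with save and load methods works:

```python
import os

def load_or_compile(pipeline, path, compile_fn):
    # Reuse a previously compiled pipeline when a saved copy exists;
    # otherwise run the (expensive) optimizer and cache the result.
    # `pipeline` needs a .load(path) method and `compile_fn` must
    # return an object with .save(path); DSPy modules provide both.
    if os.path.exists(path):
        pipeline.load(path)   # cheap: restore optimized prompts from disk
        return pipeline
    compiled = compile_fn()   # expensive: runs the optimizer's LLM calls
    compiled.save(path)
    return compiled

# Hypothetical usage with the earlier pipeline and optimizer:
# pipeline = load_or_compile(
#     SupportPipeline(),
#     "classifier_v1.json",
#     lambda: optimizer.compile(SupportPipeline(), trainset=trainset),
# )
```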
The Learning Curve Is Real
DSPy's mental model is genuinely different from "write a prompt and call an API." You have to internalize Signatures, Modules, and Optimizers before the pieces click together. Budget two to four hours for the concepts to make sense. The official DSPy documentation has improved a lot in 2025–2026, and the Discord community is active.
It's Not for Everything
If your entire LLM use case is "user types a message, Claude responds," DSPy adds complexity with no real payoff. The sweet spot is pipelines: multi-step workflows where the output of one LLM call feeds into the next, and where you have (or can create) labeled examples to score against.
The Ecosystem Is Smaller
LangChain has more tutorials, more community examples, and more third-party integrations. If you're stuck on a DSPy problem at midnight, you'll find less help on Stack Overflow than you would with LangChain. This gap is closing — DSPy's GitHub star count crossed 20k in 2025 — but it's real today.
DSPy vs. LangChain: What's the Actual Difference?
They're solving different problems, and the confusion is understandable because both are "LLM pipeline frameworks."
LangChain is a toolkit. It gives you pre-built components for connecting LLMs to vector databases, document loaders, memory stores, and tool-calling systems. You write the prompt strings yourself. LangChain's job is wiring everything together.
DSPy is an optimizer. It doesn't do much for retrieval or tool integration — you bring your own. Its job is making sure each LLM call in your pipeline is asking for what it needs in the best possible way, automatically.
Many production teams use both: LangChain (or LlamaIndex) for orchestration and data retrieval, DSPy for the prompting layer that runs on top. You define your Signatures in DSPy, but the data that flows through them might be fetched by LangChain retrievers.
Should You Use DSPy?
Here's a simple decision tree for vibe coders:
Use DSPy if:
✓ You're building a pipeline with 3+ chained LLM calls
✓ You have (or can create) 20+ labeled input/output examples
✓ Prompt drift is already a real problem for you
✓ You need to switch models without rewriting everything
✓ You're building something that needs to stay reliable long-term
✓ You're comfortable with Python classes and basic OOP
Skip DSPy for now if:
✗ You're building a simple chatbot or single-step generation task
✗ You're still figuring out what your pipeline even needs to do
✗ You don't have labeled examples to optimize against
✗ You want something up in an afternoon — start with manual prompts
✗ You're not yet comfortable with Python
The pattern that works well: build the first version with hand-crafted prompts, ship it, gather real inputs and outputs, label them, then migrate the messy parts to DSPy once you know what "good" actually looks like for your use case. DSPy is a tool for refining something that already works — not for figuring out what to build.
Frequently Asked Questions
What does DSPy stand for?
DSPy stands for Declarative Self-improving Language Programs (in Python). "Declarative" means you describe what you want each step to produce, not how to word the request. "Self-improving" means DSPy can rewrite its own prompts when you give it a way to score whether the outputs are good.
Is DSPy hard to learn?
It's harder than writing a prompt string, easier than most ML frameworks. The concepts — Signatures, Modules, Optimizers — are consistent once you internalize them. Budget a couple of hours with the official tutorials. The hardest part for most people is understanding when to use it, which is why this article exists.
Why isn't everyone using DSPy if it's so good?
The setup cost is real. Optimization runs are slow and use extra API credits. For simple use cases, it's overkill. But the Hacker News thread that asked this question actually revealed the opposite of what the title implied: DSPy is used heavily in enterprise NLP teams, academic research, and production RAG systems. It's just not as visible as LangChain's community content.
Does DSPy work with Claude and GPT-4?
Yes. DSPy supports Claude 3 and Claude 4 (via Anthropic's API), all GPT-4 variants (via OpenAI), Gemini (via Google), Mistral, and local models via Ollama. You configure the model once at the top of your script — dspy.configure(lm=dspy.LM("anthropic/claude-sonnet-4-6")) — and the rest of your pipeline code stays model-agnostic. Switching models means changing one line.
What is a DSPy Signature?
A Signature is a Python class that declares the inputs and outputs of one LLM step, without specifying the prompt text. You write the field names, types, and a one-line docstring. DSPy reads all of that and generates an appropriate prompt automatically — including few-shot examples if the optimizer has run. Think of it as a typed function declaration for an LLM call.
What's the difference between DSPy and LangChain?
LangChain is a toolkit for wiring LLMs to external data and tools — you still write prompts by hand. DSPy is an optimizer that writes the prompts for you. They're often used together: LangChain or LlamaIndex for retrieval and orchestration, DSPy as the prompting layer on top.
Should I use DSPy for my side project?
Probably not at the start. Build the first version with manual prompts, ship it, and collect real examples. Once you have 20–50 labeled input/output pairs and you're spending real time tweaking prompts, that's when migrating to DSPy makes sense. It's a tool for maturing a pipeline, not for prototyping one.
How do I install DSPy?
Install via pip: pip install dspy (older releases were published as dspy-ai). Then import dspy and configure your language model with dspy.configure(lm=dspy.LM("anthropic/claude-sonnet-4-6", api_key="your-key")). The official DSPy documentation at dspy.ai has working quickstart examples for all major model providers.
What to Learn Next
DSPy lives at the intersection of Python, LLMs, and pipeline thinking. These articles fill in the surrounding context.