mlops v1.0.0

dspy

Stanford NLP's DSPy: build AI systems with declarative programming and automatic prompt optimization

Orchestra Research

MIT

DSPy: Declarative Language Model Programming

When to Use This Skill

Use DSPy when you need to:

Build complex AI systems with multiple components and workflows
Program LMs declaratively instead of manual prompt engineering
Optimize prompts automatically using data-driven methods
Create modular AI pipelines that are maintainable and portable
Improve model outputs systematically with optimizers
Build RAG systems, agents, or classifiers with better reliability

GitHub Stars: 22,000+ | Created By: Stanford NLP

Installation

# Stable release
pip install dspy

# Latest development version
pip install git+https://github.com/stanfordnlp/dspy.git

# With specific LM providers (quoted so the brackets survive shells such as zsh)
pip install "dspy[openai]"        # OpenAI
pip install "dspy[anthropic]"     # Anthropic Claude
pip install "dspy[all]"           # All providers

Quick Start

Basic Example: Question Answering

import dspy

# Configure your language model
lm = dspy.LM("anthropic/claude-sonnet-4-5-20250929")
dspy.settings.configure(lm=lm)

# Define a signature (input → output)
class QA(dspy.Signature):
    """Answer questions with short factual answers."""
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

# Create a module
qa = dspy.Predict(QA)

# Use it
response = qa(question="What is the capital of France?")
print(response.answer)  # e.g. "Paris"

Chain of Thought Reasoning

import dspy

lm = dspy.LM("anthropic/claude-sonnet-4-5-20250929")
dspy.settings.configure(lm=lm)

# Use ChainOfThought for better reasoning
class MathProblem(dspy.Signature):
    """Solve math word problems."""
    problem = dspy.InputField()
    answer = dspy.OutputField(desc="numerical answer")

# ChainOfThought generates reasoning steps automatically
cot = dspy.ChainOfThought(MathProblem)
response = cot(problem="If John has 5 apples and gives 2 to Mary, how many does he have?")
print(response.reasoning)  # Shows reasoning steps (the field was named `rationale` before DSPy 2.5)
print(response.answer)     # e.g. "3"

Core Concepts

1. Signatures

Signatures define the structure of your AI task (inputs → outputs):

# Inline signature (simple)
qa = dspy.Predict("question -> answer")

# Class signature (detailed)
class Summarize(dspy.Signature):
    """Summarize text into key points."""
    text = dspy.InputField()
    summary = dspy.OutputField(desc="bullet points, 3-5 items")

summarizer = dspy.ChainOfThought(Summarize)

When to use each:

Inline: Quick prototyping, simple tasks
Class: Complex tasks, type hints, better documentation
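
To make the inline form concrete, the sketch below shows how a string such as "context, question -> answer" can be read as lists of input and output field names. `parse_inline_signature` is a hypothetical helper written for illustration; it is not part of DSPy, and it only handles simple type suffixes such as `: float` (a type containing a comma would confuse it).

```python
def parse_inline_signature(signature: str) -> tuple[list[str], list[str]]:
    """Split 'a, b -> c' into (['a', 'b'], ['c']). Illustrative only."""
    if signature.count("->") != 1:
        raise ValueError("expected exactly one '->' in the signature")
    left, right = signature.split("->")

    def names(part: str) -> list[str]:
        # Drop an optional ': type' suffix and surrounding whitespace.
        fields = [item.split(":")[0].strip() for item in part.split(",")]
        return [field for field in fields if field]

    inputs, outputs = names(left), names(right)
    if not inputs or not outputs:
        raise ValueError("a signature needs at least one input and one output")
    return inputs, outputs


inputs, outputs = parse_inline_signature("context, question -> answer")
print(inputs, outputs)
```

Once a signature needs descriptions, docstrings, or richer types, the class form is the clearer way to write it.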

2. Modules

Modules are reusable components that transform inputs to outputs:

dspy.Predict

Basic prediction module:

predictor = dspy.Predict("context, question -> answer")
result = predictor(context="Paris is the capital of France",
                   question="What is the capital?")

dspy.ChainOfThought

Generates reasoning steps before answering:

cot = dspy.ChainOfThought("question -> answer")
result = cot(question="Why is the sky blue?")
print(result.reasoning)  # Reasoning steps (named `rationale` before DSPy 2.5)
print(result.answer)     # Final answer

dspy.ReAct

Agent-like reasoning with tools:

import dspy

class SearchQA(dspy.Signature):
    """Answer questions using search."""
    question = dspy.InputField()
    answer = dspy.OutputField()

def search_tool(query: str) -> str:
    """Search Wikipedia."""
    # Your search implementation goes here; it should return the results as text
    raise NotImplementedError("plug in your search backend")

react = dspy.ReAct(SearchQA, tools=[search_tool])
result = react(question="When was Python created?")
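
The act-and-observe loop behind an agent like this can be sketched without an LM. In the sketch below, `run_agent`, `scripted_policy`, and `fake_search` are hypothetical stand-ins, and the canned search result is illustrative data rather than output from a real tool. DSPy's ReAct module is more involved; this only shows the control flow.

```python
def run_agent(policy, tools, question, max_steps=5):
    """Alternate between choosing an action and recording the tool's observation."""
    observations = []
    for _ in range(max_steps):
        action, argument = policy(question, observations)
        if action == "finish":
            return argument
        if action not in tools:
            observations.append(f"unknown tool: {action}")
            continue
        observations.append(tools[action](argument))
    return None  # step budget exhausted without a final answer


def fake_search(query):
    # Canned result standing in for a real search backend.
    return "1991"


def scripted_policy(question, observations):
    # Stand-in for the LM: search once, then finish with what was found.
    if not observations:
        return "search", question
    return "finish", observations[-1]


answer = run_agent(scripted_policy, {"search": fake_search}, "When was Python created?")
print(answer)  # 1991
```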

dspy.ProgramOfThought

Generates and executes code for reasoning:

pot = dspy.ProgramOfThought("question -> answer")
result = pot(question="What is 15% of 240?")
# Generates code such as: answer = 240 * 0.15

3. Optimizers

Optimizers improve your modules automatically using training data:

BootstrapFewShot

Learns from examples:

from dspy.teleprompt import BootstrapFewShot

# Training data
trainset = [
    dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
    dspy.Example(question="What is 3+5?", answer="8").with_inputs("question"),
]

# Define metric
def validate_answer(example, pred, trace=None):
    return example.answer == pred.answer

# Optimize
optimizer = BootstrapFewShot(metric=validate_answer, max_bootstrapped_demos=3)
optimized_qa = optimizer.compile(qa, trainset=trainset)
# optimized_qa now carries few-shot demonstrations that passed the metric
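
The idea behind bootstrapping demonstrations can be sketched in plain Python: run the program on each training example and keep only the runs that the metric accepts. This is a simplified illustration, not DSPy's implementation; `bootstrap_demos`, `toy_program`, and `exact` are hypothetical names, and the toy program stands in for an LM-backed module.

```python
def bootstrap_demos(program, trainset, metric, max_demos=3):
    """Collect examples whose predictions pass the metric, up to max_demos."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        prediction = program(example["question"])
        if metric(example, prediction):
            demos.append({"question": example["question"], "answer": prediction})
    return demos


def toy_program(question: str) -> str:
    # Stand-in for an LM call: evaluates questions of the form "What is a+b?".
    left, right = question.removeprefix("What is ").rstrip("?").split("+")
    return str(int(left) + int(right))


def exact(example, prediction):
    return example["answer"] == prediction


toy_trainset = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "What is 3+5?", "answer": "8"},
    {"question": "What is 1+1?", "answer": "3"},  # deliberately mislabeled
]
demos = bootstrap_demos(toy_program, toy_trainset, exact)
print(len(demos))  # 2
```

The mislabeled example is filtered out because the prediction does not match its label, which is why the metric matters as much as the training data.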

MIPRO (Multiprompt Instruction PRoposal Optimizer)

Iteratively improves prompts:

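from dspy.teleprompt import MIPROv2

# Recent DSPy releases ship this optimizer as MIPROv2.
# auto=None lets num_candidates and num_trials be set by hand.
optimizer = MIPROv2(
    metric=validate_answer,
    auto=None,
    num_candidates=10,
    init_temperature=1.0
)

optimized_cot = optimizer.compile(
    cot,
    trainset=trainset,
    num_trials=100
)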

BootstrapFinetune

Creates datasets for model fine-tuning:

from dspy.teleprompt import BootstrapFinetune
optimizer = BootstrapFinetune(metric=validate_answer)
optimized_module = optimizer.compile(qa, trainset=trainset)
# Exports training data for fine-tuning

4. Building Complex Systems

Multi-Stage Pipeline

import dspy

class MultiHopQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)
        self.generate_query = dspy.ChainOfThought("question -> search_query")
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        # Stage 1: Generate search query
        search_query = self.generate_query(question=question).search_query
        # Stage 2: Retrieve context
        passages = self.retrieve(search_query).passages
        context = "\n".join(passages)
        # Stage 3: Generate answer
        answer = self.generate_answer(context=context, question=question).answer
        return dspy.Prediction(answer=answer, context=context)

# Use the pipeline
qa_system = MultiHopQA()
result = qa_system(question="Who wrote the book that inspired the movie Blade Runner?")

RAG System with Optimization

import dspy
from dspy.retrieve.chromadb_rm import ChromadbRM
from dspy.teleprompt import BootstrapFewShot

# Configure retriever
retriever = ChromadbRM(
    collection_name="documents",
    persist_directory="./chroma_db"
)
# Register the retriever so that dspy.Retrieve uses it
dspy.settings.configure(rm=retriever)

class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)

# Create and optimize
rag = RAG()

# Optimize with training data (validate_answer and trainset as defined under Optimizers)
optimizer = BootstrapFewShot(metric=validate_answer)
optimized_rag = optimizer.compile(rag, trainset=trainset)

LM Provider Configuration

Anthropic Claude

import dspy

lm = dspy.LM(
    "anthropic/claude-sonnet-4-5-20250929",
    api_key="your-api-key",  # Or set ANTHROPIC_API_KEY env var
    max_tokens=1000,
    temperature=0.7
)
dspy.settings.configure(lm=lm)

OpenAI

lm = dspy.LM(
    "openai/gpt-4",
    api_key="your-api-key",  # Or set OPENAI_API_KEY env var
    max_tokens=1000
)
dspy.settings.configure(lm=lm)

Local Models (Ollama)

lm = dspy.LM(
    "ollama_chat/llama3.1",
    api_base="http://localhost:11434",
    api_key=""  # a local Ollama server needs no key
)
dspy.settings.configure(lm=lm)

Multiple Models

# Different models for different tasks
cheap_lm = dspy.LM("openai/gpt-3.5-turbo")
strong_lm = dspy.LM("anthropic/claude-sonnet-4-5-20250929")

# Use cheap model for retrieval, strong model for reasoning
with dspy.settings.context(lm=cheap_lm):
    context = retriever(question)

with dspy.settings.context(lm=strong_lm):
    answer = generator(context=context, question=question)

Common Patterns

Pattern 1: Structured Output

import dspy
from pydantic import BaseModel, Field

class PersonInfo(BaseModel):
    name: str = Field(description="Full name")
    age: int = Field(description="Age in years")
    occupation: str = Field(description="Current job")

class ExtractPerson(dspy.Signature):
    """Extract person information from text."""
    text = dspy.InputField()
    person: PersonInfo = dspy.OutputField()

# dspy.Predict handles typed fields in recent releases (older ones used dspy.TypedPredictor)
extractor = dspy.Predict(ExtractPerson)
result = extractor(text="John Doe is a 35-year-old software engineer.")
print(result.person.name)  # e.g. "John Doe"
print(result.person.age)   # e.g. 35
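
What structured output buys you is a checked object instead of free text. The standard-library sketch below shows that validation step on a hand-written JSON string; `PersonRecord` and `parse_person` are hypothetical names, and the JSON is sample input rather than real model output. With pydantic, as in the pattern above, the model class performs these checks for you.

```python
import json
from dataclasses import dataclass


@dataclass
class PersonRecord:
    name: str
    age: int
    occupation: str


def parse_person(raw: str) -> PersonRecord:
    """Parse a JSON object and check that each field is present and well typed."""
    data = json.loads(raw)
    expected = {"name": str, "age": int, "occupation": str}
    for field, field_type in expected.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], field_type):
            raise ValueError(f"{field} should be {field_type.__name__}")
    return PersonRecord(**{field: data[field] for field in expected})


raw_output = '{"name": "John Doe", "age": 35, "occupation": "software engineer"}'
person = parse_person(raw_output)
print(person.name, person.age)  # John Doe 35
```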

Pattern 2: Assertion-Driven Optimization

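import dspy
from dspy.primitives.assertions import assert_transform_module, backtrack_handler

def is_number(value):
    """Return True if value can be converted to float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

class MathQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.solve = dspy.ChainOfThought("problem -> solution: float")

    def forward(self, problem):
        solution = self.solve(problem=problem).solution
        # Assert solution is numeric (a bare float() call would raise instead of failing the check)
        dspy.Assert(is_number(solution), "Solution must be a number")
        return dspy.Prediction(solution=solution)

# Wrap the module so that a failed assertion triggers backtracking and a retry.
# dspy.Assert is a DSPy 2.x API; later releases replace it with dspy.Refine and dspy.BestOfN.
math_qa = assert_transform_module(MathQA(), backtrack_handler)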
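
The retry behavior that assertions provide can be pictured as a loop that feeds the failure message back into the next attempt. The sketch below is a plain-Python illustration under that assumption; `retry_with_feedback`, `must_be_number`, and `toy_generate` are hypothetical names, and the generator is a scripted stand-in for an LM call.

```python
def retry_with_feedback(generate, validate, max_attempts=3):
    """Call generate(feedback) until validate(output) reports no error."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        output = generate(feedback)
        error = validate(output)
        if error is None:
            return output, attempt
        # Pass the failure message into the next attempt.
        feedback = f"Previous output {output!r} was rejected: {error}"
    raise ValueError(f"no valid output after {max_attempts} attempts")


def must_be_number(output):
    try:
        float(output)
        return None
    except ValueError:
        return "Solution must be a number"


def toy_generate(feedback):
    # Answers in words first, then numerically once it has seen feedback.
    return "thirty-six" if feedback is None else "36"


solution, attempts = retry_with_feedback(toy_generate, must_be_number)
print(solution, attempts)  # 36 2
```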

Pattern 3: Self-Consistency

import dspy
from collections import Counter

class ConsistentQA(dspy.Module):
    def __init__(self, num_samples=5):
        super().__init__()
        self.qa = dspy.ChainOfThought("question -> answer")
        self.num_samples = num_samples

    def forward(self, question):
        # Generate multiple answers
        # (with caching on and a low temperature, repeated calls may return identical answers)
        answers = []
        for _ in range(self.num_samples):
            result = self.qa(question=question)
            answers.append(result.answer)

        # Return most common answer
        most_common = Counter(answers).most_common(1)[0][0]
        return dspy.Prediction(answer=most_common)
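
Raw answers often differ only in case, whitespace, or a trailing period, which splits the vote. The sketch below shows one possible normalization before counting and also reports the share of samples that agreed; `majority_vote` is a hypothetical helper and the sample answers are made up.

```python
from collections import Counter


def majority_vote(answers):
    """Return (winning answer, share of votes) after light normalization."""
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [answer.strip().lower().rstrip(".") for answer in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)


answer, agreement = majority_vote(["Paris", "paris.", " Paris ", "Lyon", "Paris"])
print(answer, agreement)  # paris 0.8
```

A low agreement share is a cheap signal that the question may deserve more samples or a stronger model.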

Pattern 4: Retrieval with Reranking

class RerankedRAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=10)
        self.rerank = dspy.Predict("question, passage -> relevance_score: float")
        self.answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        # Retrieve candidates
        passages = self.retrieve(question).passages

        # Rerank passages
        scored = []
        for passage in passages:
            score = float(self.rerank(question=question, passage=passage).relevance_score)
            scored.append((score, passage))

        # Take top 3
        top_passages = [p for _, p in sorted(scored, reverse=True)[:3]]
        context = "\n\n".join(top_passages)

        # Generate answer
        return self.answer(context=context, question=question)
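
Model-produced scores are not guaranteed to parse as numbers, so the selection step benefits from a fallback. The sketch below isolates that step in plain Python; `parse_score` and `top_k` are hypothetical helpers, and the scores are made-up sample data. Sorting on the score alone keeps tied passages in their retrieval order.

```python
def parse_score(raw, default=0.0):
    """Convert a model-produced score to float, falling back to a default."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return default


def top_k(scored_passages, k=3):
    """Pick the k highest-scoring passages; ties keep their input order."""
    ranked = sorted(scored_passages, key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in ranked[:k]]


raw_scores = [("0.2", "A"), ("0.9", "B"), ("n/a", "C"), ("0.7", "D"), ("0.9", "E")]
scored = [(parse_score(score), passage) for score, passage in raw_scores]
print(top_k(scored))  # ['B', 'E', 'D']
```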

Evaluation and Metrics

Custom Metrics

def exact_match(example, pred, trace=None):
    """Exact match metric."""
    return example.answer.lower() == pred.answer.lower()

def f1_score(example, pred, trace=None):
    """F1 score for text overlap."""
    pred_tokens = set(pred.answer.lower().split())
    gold_tokens = set(example.answer.lower().split())
    # Guard against empty answers to avoid dividing by zero
    if not pred_tokens or not gold_tokens:
        return 0.0
    precision = len(pred_tokens & gold_tokens) / len(pred_tokens)
    recall = len(pred_tokens & gold_tokens) / len(gold_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)
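
A worked example helps check the arithmetic. The variant below works on plain strings and counts repeated tokens with `Counter`, which differs from the set-based metric above; `token_f1` is a hypothetical name used only for this illustration. For the prediction "the capital is Paris" against the gold answer "Paris", precision is 1/4 and recall is 1/1, so F1 = 2 × 0.25 × 1 / 1.25 = 0.4.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 that counts repeated tokens (multiset overlap)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(round(token_f1("the capital is Paris", "Paris"), 2))  # 0.4
```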

Evaluation

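from dspy.evaluate import Evaluate

# Create evaluator (testset is a list of dspy.Example objects held out from training)
evaluator = Evaluate(
    devset=testset,
    metric=exact_match,
    num_threads=4,
    display_progress=True
)

# Evaluate model
score = evaluator(qa_system)
print(f"Accuracy: {score}")

# Compare optimized vs unoptimized
# (assumes a numeric score on a 0-100 scale, as in DSPy 2.x; some newer releases
# return a result object that holds the number in its .score attribute)
score_before = evaluator(qa)
score_after = evaluator(optimized_qa)
print(f"Improvement: {score_after - score_before:.2f} percentage points")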
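
Conceptually, an evaluator averages a metric over a held-out set. The sketch below shows that computation and a before/after comparison in plain Python; `evaluate`, `baseline`, and `improved` are hypothetical stand-ins for DSPy's evaluator and for two versions of a program, and the score here is a fraction between 0 and 1.

```python
def evaluate(program, devset, metric):
    """Return the mean metric score over devset as a fraction in [0, 1]."""
    if not devset:
        raise ValueError("devset is empty")
    scores = [float(metric(example, program(example["question"]))) for example in devset]
    return sum(scores) / len(scores)


def matches(example, prediction):
    return example["answer"] == prediction


def baseline(question):
    return "8"  # always gives the same answer


def improved(question):
    return str(sum(int(term) for term in question.split("+")))


toy_devset = [
    {"question": "2+2", "answer": "4"},
    {"question": "3+5", "answer": "8"},
    {"question": "7+1", "answer": "8"},
    {"question": "9+9", "answer": "18"},
]

before = evaluate(baseline, toy_devset, matches)
after = evaluate(improved, toy_devset, matches)
print(f"{before:.0%} -> {after:.0%}")  # 50% -> 100%
```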

Best Practices

1. Start Simple, Iterate

# Start with Predict
qa = dspy.Predict("question -> answer")

# Add reasoning if needed
qa = dspy.ChainOfThought("question -> answer")

# Add optimization when you have data
optimized_qa = optimizer.compile(qa, trainset=data)

2. Use Descriptive Signatures

# ❌ Bad: Vague
class Task(dspy.Signature):
    input = dspy.InputField()
    output = dspy.OutputField()

# ✅ Good: Descriptive
class SummarizeArticle(dspy.Signature):
    """Summarize news articles into 3-5 key points."""
    article = dspy.InputField(desc="full article text")
    summary = dspy.OutputField(desc="bullet points, 3-5 items")

3. Optimize with Representative Data

# Create diverse training examples
trainset = [
    dspy.Example(question="factual", answer="...").with_inputs("question"),
    dspy.Example(question="reasoning", answer="...").with_inputs("question"),
    dspy.Example(question="calculation", answer="...").with_inputs("question"),
]

# Use validation set for metric
def metric(example, pred, trace=None):
    return example.answer in pred.answer
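
Holding out a validation set is easiest when the split is reproducible. The sketch below shows one way to do it with a fixed seed; `split_examples` is a hypothetical helper, and the 80/20 ratio is an assumption rather than a DSPy recommendation.

```python
import random


def split_examples(examples, val_fraction=0.2, seed=0):
    """Shuffle a copy of examples and split it into (train, validation)."""
    if not 0 < val_fraction < 1:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


examples = [f"example-{i}" for i in range(10)]
train, val = split_examples(examples)
print(len(train), len(val))  # 8 2
```

Keeping the seed fixed means the same examples land in the validation set on every run, so scores stay comparable across experiments.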

4. Save and Load Optimized Models

# Save
optimized_qa.save("models/qa_v1.json")

# Load
loaded_qa = dspy.ChainOfThought("question -> answer")
loaded_qa.load("models/qa_v1.json")

5. Monitor and Debug

# Configure the LM as usual
dspy.settings.configure(lm=lm)

# Run prediction
result = qa(question="...")

# Print the prompt and response of the most recent LM call
lm.inspect_history(n=1)

Comparison to Other Approaches

| Feature | Manual Prompting | LangChain | DSPy |
|---------|-----------------|-----------|------|
| Prompt Engineering | Manual | Manual | Automatic |
| Optimization | Trial & error | None | Data-driven |
| Modularity | Low | Medium | High |
| Type Safety | No | Limited | Yes (Signatures) |
| Portability | Low | Medium | High |
| Learning Curve | Low | Medium | Medium-High |

When to choose DSPy:

You have training data or can generate it
You need systematic prompt improvement
You're building complex multi-stage systems
You want to optimize across different LMs

When to choose alternatives:

Quick prototypes (manual prompting)
Simple chains with existing tools (LangChain)
Custom optimization logic needed

Resources

Documentation: https://dspy.ai
GitHub: https://github.com/stanfordnlp/dspy (22k+ stars)
Discord: https://discord.gg/XCGy2WDCQB
Twitter: @DSPyOSS
Paper: "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines"

Related Skills

modal-serverless-gpu

Serverless GPU cloud: on-demand GPUs for ML workloads, model API deployment, autoscaling

evaluating-llms-harness

Evaluate LLMs on 60+ academic benchmarks, including MMLU, HumanEval, GSM8K, and TruthfulQA

weights-and-biases

Track ML experiments with W&B: automatic logging, real-time visualization, hyperparameter sweeps, model registry

huggingface-hub

Hugging Face Hub CLI (hf): search, download, and upload models and datasets, and manage Spaces
