Back to mlops
mlops v1.0.0 6.3 min read 593 lines

weights-and-biases

W&B로 ML 실험 추적 — 자동 로깅, 실시간 시각화, 하이퍼파라미터 스윕, 모델 레지스트리

Orchestra Research
MIT

Weights & Biases: ML Experiment Tracking & MLOps

When to Use This Skill

Use Weights & Biases (W&B) when you need to:

  • Track ML experiments with automatic metric logging
  • Visualize training in real-time dashboards
  • Compare runs across hyperparameters and configurations
  • Optimize hyperparameters with automated sweeps
  • Manage model registry with versioning and lineage
  • Collaborate on ML projects with team workspaces
  • Track artifacts (datasets, models, code) with lineage

Users: 200,000+ ML practitioners | GitHub Stars: 10.5k+ | Integrations: 100+

Installation

# Install W&B
pip install wandb

Login (creates API key)


wandb login

Or set API key programmatically


export WANDB_API_KEY=your_api_key_here

Quick Start

Basic Experiment Tracking

import wandb

Initialize a run


run = wandb.init(
project="my-project",
config={
"learning_rate": 0.001,
"epochs": 10,
"batch_size": 32,
"architecture": "ResNet50"
}
)

Training loop


for epoch in range(run.config.epochs):
# Your training code
train_loss = train_epoch()
val_loss = validate()

# Log metrics
wandb.log({
"epoch": epoch,
"train/loss": train_loss,
"val/loss": val_loss,
"train/accuracy": train_acc,
"val/accuracy": val_acc
})

Finish the run


wandb.finish()

With PyTorch

import torch
import wandb

Initialize


wandb.init(project="pytorch-demo", config={
"lr": 0.001,
"epochs": 10
})

Access config


config = wandb.config

Training loop


for epoch in range(config.epochs):
for batch_idx, (data, target) in enumerate(train_loader):
# Forward pass
output = model(data)
loss = criterion(output, target)

# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Log every 100 batches
if batch_idx % 100 == 0:
wandb.log({
"loss": loss.item(),
"epoch": epoch,
"batch": batch_idx
})

Save model


torch.save(model.state_dict(), "model.pth")
wandb.save("model.pth") # Upload to W&B

wandb.finish()

Core Concepts

1. Projects and Runs

Project: Collection of related experiments
Run: Single execution of your training script

# Create/use project
run = wandb.init(
project="image-classification",
name="resnet50-experiment-1", # Optional run name
tags=["baseline", "resnet"], # Organize with tags
notes="First baseline run" # Add notes
)

Each run has unique ID


print(f"Run ID: {run.id}")
print(f"Run URL: {run.url}")

2. Configuration Tracking

Track hyperparameters automatically:

config = {
# Model architecture
"model": "ResNet50",
"pretrained": True,

# Training params
"learning_rate": 0.001,
"batch_size": 32,
"epochs": 50,
"optimizer": "Adam",

# Data params
"dataset": "ImageNet",
"augmentation": "standard"
}

wandb.init(project="my-project", config=config)

Access config during training


lr = wandb.config.learning_rate
batch_size = wandb.config.batch_size

3. Metric Logging

# Log scalars
wandb.log({"loss": 0.5, "accuracy": 0.92})

Log multiple metrics


wandb.log({
"train/loss": train_loss,
"train/accuracy": train_acc,
"val/loss": val_loss,
"val/accuracy": val_acc,
"learning_rate": current_lr,
"epoch": epoch
})

Log with custom x-axis


wandb.log({"loss": loss}, step=global_step)

Log media (images, audio, video)


wandb.log({"examples": [wandb.Image(img) for img in images]})

Log histograms


wandb.log({"gradients": wandb.Histogram(gradients)})

Log tables


table = wandb.Table(columns=["id", "prediction", "ground_truth"])
wandb.log({"predictions": table})

4. Model Checkpointing

import torch
import wandb

Save model checkpoint


checkpoint = {
'epoch': epoch,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'loss': loss,
}

torch.save(checkpoint, 'checkpoint.pth')

Upload to W&B


wandb.save('checkpoint.pth')

Or use Artifacts (recommended)


artifact = wandb.Artifact('model', type='model')
artifact.add_file('checkpoint.pth')
wandb.log_artifact(artifact)

Hyperparameter Sweeps

Automatically search for optimal hyperparameters.

Define Sweep Configuration

sweep_config = {
'method': 'bayes', # or 'grid', 'random'
'metric': {
'name': 'val/accuracy',
'goal': 'maximize'
},
'parameters': {
'learning_rate': {
'distribution': 'log_uniform',
'min': 1e-5,
'max': 1e-1
},
'batch_size': {
'values': [16, 32, 64, 128]
},
'optimizer': {
'values': ['adam', 'sgd', 'rmsprop']
},
'dropout': {
'distribution': 'uniform',
'min': 0.1,
'max': 0.5
}
}
}

Initialize sweep


sweep_id = wandb.sweep(sweep_config, project="my-project")

Define Training Function

def train():
# Initialize run
run = wandb.init()

# Access sweep parameters
lr = wandb.config.learning_rate
batch_size = wandb.config.batch_size
optimizer_name = wandb.config.optimizer

# Build model with sweep config
model = build_model(wandb.config)
optimizer = get_optimizer(optimizer_name, lr)

# Training loop
for epoch in range(NUM_EPOCHS):
train_loss = train_epoch(model, optimizer, batch_size)
val_acc = validate(model)

# Log metrics
wandb.log({
"train/loss": train_loss,
"val/accuracy": val_acc
})

Run sweep


wandb.agent(sweep_id, function=train, count=50) # Run 50 trials

Sweep Strategies

# Grid search - exhaustive
sweep_config = {
'method': 'grid',
'parameters': {
'lr': {'values': [0.001, 0.01, 0.1]},
'batch_size': {'values': [16, 32, 64]}
}
}

Random search


sweep_config = {
'method': 'random',
'parameters': {
'lr': {'distribution': 'uniform', 'min': 0.0001, 'max': 0.1},
'dropout': {'distribution': 'uniform', 'min': 0.1, 'max': 0.5}
}
}

Bayesian optimization (recommended)


sweep_config = {
'method': 'bayes',
'metric': {'name': 'val/loss', 'goal': 'minimize'},
'parameters': {
'lr': {'distribution': 'log_uniform', 'min': 1e-5, 'max': 1e-1}
}
}

Artifacts

Track datasets, models, and other files with lineage.

Log Artifacts

# Create artifact
artifact = wandb.Artifact(
name='training-dataset',
type='dataset',
description='ImageNet training split',
metadata={'size': '1.2M images', 'split': 'train'}
)

Add files


artifact.add_file('data/train.csv')
artifact.add_dir('data/images/')

Log artifact


wandb.log_artifact(artifact)

Use Artifacts

# Download and use artifact
run = wandb.init(project="my-project")

Download artifact


artifact = run.use_artifact('training-dataset:latest')
artifact_dir = artifact.download()

Use the data


data = load_data(f"{artifact_dir}/train.csv")

Model Registry

# Log model as artifact
model_artifact = wandb.Artifact(
name='resnet50-model',
type='model',
metadata={'architecture': 'ResNet50', 'accuracy': 0.95}
)

model_artifact.add_file('model.pth')
wandb.log_artifact(model_artifact, aliases=['best', 'production'])

Link to model registry


run.link_artifact(model_artifact, 'model-registry/production-models')

Integration Examples

HuggingFace Transformers

from transformers import Trainer, TrainingArguments
import wandb

Initialize W&B


wandb.init(project="hf-transformers")

Training arguments with W&B


training_args = TrainingArguments(
output_dir="./results",
report_to="wandb", # Enable W&B logging
run_name="bert-finetuning",
logging_steps=100,
save_steps=500
)

Trainer automatically logs to W&B


trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset
)

trainer.train()

PyTorch Lightning

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger
import wandb

Create W&B logger


wandb_logger = WandbLogger(
project="lightning-demo",
log_model=True # Log model checkpoints
)

Use with Trainer


trainer = Trainer(
logger=wandb_logger,
max_epochs=10
)

trainer.fit(model, datamodule=dm)

Keras/TensorFlow

import wandb
from wandb.keras import WandbCallback

Initialize


wandb.init(project="keras-demo")

Add callback


model.fit(
x_train, y_train,
validation_data=(x_val, y_val),
epochs=10,
callbacks=[WandbCallback()] # Auto-logs metrics
)

Visualization & Analysis

Custom Charts

# Log custom visualizations
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x, y)
wandb.log({"custom_plot": wandb.Image(fig)})

Log confusion matrix


wandb.log({"conf_mat": wandb.plot.confusion_matrix(
probs=None,
y_true=ground_truth,
preds=predictions,
class_names=class_names
)})

Reports

Create shareable reports in W&B UI:

  • Combine runs, charts, and text
  • Markdown support
  • Embeddable visualizations
  • Team collaboration

Best Practices

1. Organize with Tags and Groups

wandb.init(
project="my-project",
tags=["baseline", "resnet50", "imagenet"],
group="resnet-experiments", # Group related runs
job_type="train" # Type of job
)

2. Log Everything Relevant

# Log system metrics
wandb.log({
"gpu/util": gpu_utilization,
"gpu/memory": gpu_memory_used,
"cpu/util": cpu_utilization
})

Log code version


wandb.log({"git_commit": git_commit_hash})

Log data splits


wandb.log({
"data/train_size": len(train_dataset),
"data/val_size": len(val_dataset)
})

3. Use Descriptive Names

# ✅ Good: Descriptive run names
wandb.init(
project="nlp-classification",
name="bert-base-lr0.001-bs32-epoch10"
)

❌ Bad: Generic names


wandb.init(project="nlp", name="run1")

4. Save Important Artifacts

# Save final model
artifact = wandb.Artifact('final-model', type='model')
artifact.add_file('model.pth')
wandb.log_artifact(artifact)

Save predictions for analysis


predictions_table = wandb.Table(
columns=["id", "input", "prediction", "ground_truth"],
data=predictions_data
)
wandb.log({"predictions": predictions_table})

5. Use Offline Mode for Unstable Connections

import os

Enable offline mode


os.environ["WANDB_MODE"] = "offline"

wandb.init(project="my-project")

... your code ...

Sync later


wandb sync


Team Collaboration

Share Runs

# Runs are automatically shareable via URL
run = wandb.init(project="team-project")
print(f"Share this URL: {run.url}")

Team Projects

  • Create team account at wandb.ai
  • Add team members
  • Set project visibility (private/public)
  • Use team-level artifacts and model registry

Pricing

  • Free: Unlimited public projects, 100GB storage
  • Academic: Free for students/researchers
  • Teams: $50/seat/month, private projects, unlimited storage
  • Enterprise: Custom pricing, on-prem options

Resources

  • Documentation: https://docs.wandb.ai
  • GitHub: https://github.com/wandb/wandb (10.5k+ stars)
  • Examples: https://github.com/wandb/examples
  • Community: https://wandb.ai/community
  • Discord: https://wandb.me/discord

See Also

  • references/sweeps.md - Comprehensive hyperparameter optimization guide
  • references/artifacts.md - Data and model versioning patterns
  • references/integrations.md - Framework-specific examples

Related Skills / 관련 스킬

mlops v1.0.0

modal-serverless-gpu

서버리스 GPU 클라우드 — ML 워크로드 온디맨드 GPU, 모델 API 배포, 자동 스케일링

mlops v1.0.0

evaluating-llms-harness

60개 이상 학술 벤치마크로 LLM 평가 — MMLU, HumanEval, GSM8K, TruthfulQA 등

mlops v1.0.0

huggingface-hub

Hugging Face Hub CLI (hf) — 모델/데이터셋 검색, 다운로드, 업로드, Space 관리

mlops v1.0.0

gguf-quantization

GGUF 포맷 및 llama.cpp 양자화 — CPU/GPU 효율적 추론, 2~8비트 유연한 양자화