mlops v1.0.0

stable-diffusion-image-generation

Text-to-image generation with Stable Diffusion using HuggingFace Diffusers

Orchestra Research

MIT

Stable Diffusion Image Generation

Comprehensive guide to generating images with Stable Diffusion using the HuggingFace Diffusers library.

When to use Stable Diffusion

Use Stable Diffusion when:

Generating images from text descriptions
Performing image-to-image translation (style transfer, enhancement)
Inpainting (filling in masked regions)
Outpainting (extending images beyond boundaries)
Creating variations of existing images
Building custom image generation workflows

Key features:

Text-to-Image: Generate images from natural language prompts
Image-to-Image: Transform existing images with text guidance
Inpainting: Fill masked regions with context-aware content
ControlNet: Add spatial conditioning (edges, poses, depth)
LoRA Support: Efficient fine-tuning and style adaptation
Multiple Models: SD 1.5, SDXL, SD 3.0, Flux support

Use alternatives instead:

DALL-E 3: For API-based generation without GPU
Midjourney: For artistic, stylized outputs
Imagen: For Google Cloud integration
Leonardo.ai: For web-based creative workflows

Quick start

Installation

pip install diffusers transformers accelerate torch
pip install xformers  # Optional: memory-efficient attention

Basic text-to-image

from diffusers import DiffusionPipeline
import torch

# Load pipeline (auto-detects model type)
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)
pipe.to("cuda")

# Generate image
image = pipe(
    "A serene mountain landscape at sunset, highly detailed",
    num_inference_steps=50,
    guidance_scale=7.5
).images[0]
image.save("output.png")

Using SDXL (higher quality)

from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16"
)

# Enable memory optimization. Model CPU offload manages device placement
# itself, so do not also call pipe.to("cuda").
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="A futuristic city with flying cars, cinematic lighting",
    height=1024,
    width=1024,
    num_inference_steps=30
).images[0]

Architecture overview

Three-pillar design

Diffusers is built around three core components:

Pipeline (orchestration)
├── Model (neural networks)
│   ├── UNet / Transformer (noise prediction)
│   ├── VAE (latent encoding/decoding)
│   └── Text Encoder (CLIP/T5)
└── Scheduler (denoising algorithm)

Pipeline inference flow

Text Prompt → Text Encoder → Text Embeddings
                                    ↓
Random Noise → [Denoising Loop] ← Scheduler
                      ↓
               Predicted Noise
                      ↓
              VAE Decoder → Final Image
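
The "Random Noise" in the flow above lives in latent space, not pixel space. A minimal sketch of the tensor shape the denoising loop starts from, assuming a VAE that downsamples by a factor of 8 and uses 4 latent channels (typical of SD 1.x and SDXL; other model families may differ). The function name `latent_shape` is illustrative and not part of Diffusers.

```python
def latent_shape(height, width, batch_size=1, latent_channels=4, vae_scale_factor=8):
    """Return (batch, channels, h, w) of the initial noise tensor.

    latent_channels and vae_scale_factor are assumptions that match
    SD 1.x / SDXL style VAEs; check the model config for other families.
    """
    if height % vae_scale_factor or width % vae_scale_factor:
        raise ValueError(
            f"height and width must be multiples of {vae_scale_factor}"
        )
    return (
        batch_size,
        latent_channels,
        height // vae_scale_factor,
        width // vae_scale_factor,
    )


print(latent_shape(512, 512))    # (1, 4, 64, 64)
print(latent_shape(1024, 1024))  # (1, 4, 128, 128)
```

Because the denoiser works on a grid 8 times smaller per side, a 1024x1024 image costs roughly four times the latent area of a 512x512 one.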

Core concepts

Pipelines

Pipelines orchestrate complete workflows:

| Pipeline | Purpose |
|----------|---------|
| StableDiffusionPipeline | Text-to-image (SD 1.x/2.x) |
| StableDiffusionXLPipeline | Text-to-image (SDXL) |
| StableDiffusion3Pipeline | Text-to-image (SD 3.0) |
| FluxPipeline | Text-to-image (Flux models) |
| StableDiffusionImg2ImgPipeline | Image-to-image |
| StableDiffusionInpaintPipeline | Inpainting |

Schedulers

Schedulers control the denoising process:

| Scheduler | Steps | Quality | Use Case |
|-----------|-------|---------|----------|
| EulerDiscreteScheduler | 20-50 | Good | Default choice |
| EulerAncestralDiscreteScheduler | 20-50 | Good | More variation |
| DPMSolverMultistepScheduler | 15-25 | Excellent | Fast, high quality |
| DDIMScheduler | 50-100 | Good | Deterministic |
| LCMScheduler | 4-8 | Good | Very fast |
| UniPCMultistepScheduler | 15-25 | Excellent | Fast convergence |

Swapping schedulers

from diffusers import DPMSolverMultistepScheduler

# Swap for faster generation (pipe is a pipeline loaded as in the quick start)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config
)

# Now generate with fewer steps
prompt = "A serene mountain landscape at sunset, highly detailed"
image = pipe(prompt, num_inference_steps=20).images[0]
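
To make "fewer steps" concrete, the sketch below shows one simple way a scheduler can pick a subset of its training timesteps for inference. It assumes a 1000-step training schedule and evenly strided ("leading") spacing; real schedulers compute this in their own `set_timesteps` logic and may space steps differently. The function name `leading_timesteps` is hypothetical.

```python
def leading_timesteps(num_inference_steps, num_train_timesteps=1000, steps_offset=0):
    """Evenly strided timesteps, returned from noisiest to cleanest.

    Simplified illustration only: the 1000-step default and the
    strided spacing are assumptions, not any scheduler's exact behavior.
    """
    step_ratio = num_train_timesteps // num_inference_steps
    ascending = [i * step_ratio + steps_offset for i in range(num_inference_steps)]
    return ascending[::-1]


print(leading_timesteps(20)[:3])  # [950, 900, 850]
print(leading_timesteps(50)[:3])  # [980, 960, 940]
```

With 20 steps the solver takes larger jumps through the noise schedule, which is why higher-order solvers tend to be paired with lower step counts.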

Generation parameters

Key parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| prompt | Required | Text description of desired image |
| negative_prompt | None | What to avoid in the image |
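| num_inference_steps | 50 | Denoising steps (more steps usually improve detail, with diminishing returns and longer runtime) |
| guidance_scale | 7.5 | Prompt adherence (7-12 typical; higher values follow the prompt more closely but can reduce variety) |
| height, width | 512/1024 | Output dimensions in pixels (multiples of 8; 512 for SD 1.5, 1024 for SDXL) |
| generator | None | Torch generator for reproducibility |
| num_images_per_prompt | 1 | Number of images generated per prompt |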
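
The effect of `guidance_scale` is easier to see as arithmetic. A minimal sketch of classifier-free guidance on plain Python lists, assuming the model has produced one noise prediction without the prompt and one with it; `apply_guidance` is an illustrative name, and the real pipelines do this on tensors inside the denoising loop.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Combine unconditional and text-conditioned noise predictions.

    Moves the prediction away from the unconditional result, in the
    direction of the text-conditioned one, by a factor of guidance_scale.
    """
    return [
        u + guidance_scale * (t - u)
        for u, t in zip(noise_uncond, noise_text)
    ]


uncond = [0.0, 1.0]
text = [1.0, 1.0]
print(apply_guidance(uncond, text, 7.5))  # [7.5, 1.0]
print(apply_guidance(uncond, text, 1.0))  # [1.0, 1.0]
```

A scale of 1.0 returns the text-conditioned prediction unchanged, so guidance has no extra effect there. When a `negative_prompt` is given, its prediction typically takes the place of the unconditional one.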

Reproducible generation

import torch

generator = torch.Generator(device="cuda").manual_seed(42)

image = pipe(
    prompt="A cat wearing a top hat",
    generator=generator,
    num_inference_steps=50
).images[0]

Negative prompts

image = pipe(
    prompt="Professional photo of a dog in a garden",
    negative_prompt="blurry, low quality, distorted, ugly, bad anatomy",
    guidance_scale=7.5
).images[0]

Image-to-image

Transform existing images with text guidance:

from diffusers import AutoPipelineForImage2Image
from PIL import Image
import torch

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("input.jpg").resize((512, 512))

image = pipe(
    prompt="A watercolor painting of the scene",
    image=init_image,
    strength=0.75,  # How much to transform (0-1)
    num_inference_steps=50
).images[0]
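
The `strength` value also determines how many of the requested denoising steps actually run, since image-to-image starts from a partially noised copy of the input rather than from pure noise. A minimal sketch of that relationship, assuming the common rule of running `int(num_inference_steps * strength)` steps; `img2img_steps` is an illustrative helper, not a Diffusers function.

```python
def img2img_steps(num_inference_steps, strength):
    """Return (steps_run, steps_skipped) for an image-to-image call.

    Assumes the pipeline skips the earliest (noisiest) part of the
    schedule in proportion to 1 - strength.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps_run = min(int(num_inference_steps * strength), num_inference_steps)
    return steps_run, num_inference_steps - steps_run


print(img2img_steps(50, 0.75))  # (37, 13)
print(img2img_steps(50, 1.0))   # (50, 0)
```

Under this assumption, the example above with `strength=0.75` and 50 steps performs 37 denoising steps, and lower strength values both preserve more of the input and finish faster.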

Inpainting

Fill masked regions:

from diffusers import AutoPipelineForInpainting
from PIL import Image
import torch

pipe = AutoPipelineForInpainting.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.jpg")
mask = Image.open("mask.png")  # White = inpaint region

result = pipe(
    prompt="A red car parked on the street",
    image=image,
    mask_image=mask,
    num_inference_steps=50
).images[0]

ControlNet

Add spatial conditioning for precise control:

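import cv2  # Assumption: opencv-python is installed (pip install opencv-python)
import numpy as np
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from PIL import Image

def get_canny_image(image, low=100, high=200):
    """Illustrative helper (not part of Diffusers): 3-channel Canny edge map."""
    edges = cv2.Canny(np.array(image.convert("RGB")), low, high)
    return Image.fromarray(np.stack([edges] * 3, axis=-1))

# Load ControlNet for edge conditioning
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny",
    torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Use Canny edge image as control (input.jpg is a placeholder path)
input_image = Image.open("input.jpg")
control_image = get_canny_image(input_image)

image = pipe(
    prompt="A beautiful house in the style of Van Gogh",
    image=control_image,
    num_inference_steps=30
).images[0]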

Available ControlNets

| ControlNet | Input Type | Use Case |
|------------|------------|----------|
| canny | Edge maps | Preserve structure |
| openpose | Pose skeletons | Human poses |
| depth | Depth maps | 3D-aware generation |
| normal | Normal maps | Surface details |
| mlsd | Line segments | Architectural lines |
| scribble | Rough sketches | Sketch-to-image |

LoRA adapters

Load fine-tuned style adapters:

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Load LoRA weights
pipe.load_lora_weights("path/to/lora", weight_name="style.safetensors")

# Generate with LoRA style
image = pipe("A portrait in the trained style").images[0]

# Adjust LoRA strength by fusing the adapter into the base weights at a given scale
pipe.fuse_lora(lora_scale=0.8)

# Unload LoRA (unfuse first so the base weights are restored)
pipe.unfuse_lora()
pipe.unload_lora_weights()

Multiple LoRAs

# Load multiple LoRAs
pipe.load_lora_weights("lora1", adapter_name="style")
pipe.load_lora_weights("lora2", adapter_name="character")

# Set weights for each
pipe.set_adapters(["style", "character"], adapter_weights=[0.7, 0.5])

image = pipe("A portrait").images[0]
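
Conceptually, each LoRA contributes a low-rank update to a base weight matrix, and the adapter weights scale those updates before they are summed. A minimal sketch in pure Python, assuming each adapter is a pair of small matrices `B` and `A` whose product has the same shape as the base weight; it omits the alpha/rank scaling real LoRA checkpoints carry, and `merge_adapters` is an illustrative name rather than a Diffusers API.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_adapters(base, adapters):
    """Return base + sum(weight * (B @ A)) over all adapters.

    adapters is a list of (weight, B, A) tuples; the shapes and values
    used below are toy numbers for illustration.
    """
    merged = [row[:] for row in base]
    for weight, b, a in adapters:
        delta = matmul(b, a)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += weight * value
    return merged


base = [[1, 0], [0, 1]]
style = (0.5, [[1], [0]], [[2, 0]])       # rank-1 update touching row 0
character = (0.25, [[0], [1]], [[0, 4]])  # rank-1 update touching row 1
print(merge_adapters(base, [style, character]))  # [[2.0, 0.0], [0.0, 2.0]]
```

Lowering an adapter's weight shrinks only that adapter's contribution, which is why the two weights can be tuned independently.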

Memory optimization

Enable CPU offloading

# Model CPU offload - moves whole models to CPU when not in use
pipe.enable_model_cpu_offload()

# Alternative: sequential CPU offload - more aggressive, slower
pipe.enable_sequential_cpu_offload()

Attention slicing

# Reduce memory by computing attention in chunks
pipe.enable_attention_slicing()

# Or pass "max" to process one slice at a time (largest memory savings, slowest)
pipe.enable_attention_slicing("max")

xFormers memory-efficient attention

# Requires xformers package
pipe.enable_xformers_memory_efficient_attention()

VAE slicing and tiling

# Slicing decodes a batch of latents one image at a time
pipe.enable_vae_slicing()
# Tiling decodes each image in overlapping tiles, which helps with large images
pipe.enable_vae_tiling()
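
The idea behind tiling can be sketched as splitting a large canvas into overlapping boxes that are processed one at a time and then blended. This is an illustration of the general technique, not the library's implementation; the tile size, the overlap, and the name `tile_boxes` are assumptions.

```python
def tile_boxes(height, width, tile_size=512, overlap=64):
    """Return (top, left, bottom, right) boxes that cover the canvas.

    Neighboring boxes share `overlap` pixels so seams can be blended.
    """
    if not 0 <= overlap < tile_size:
        raise ValueError("overlap must be non-negative and smaller than tile_size")
    stride = tile_size - overlap
    boxes = []
    for top in range(0, height, stride):
        for left in range(0, width, stride):
            boxes.append((
                top,
                left,
                min(top + tile_size, height),
                min(left + tile_size, width),
            ))
    return boxes


boxes = tile_boxes(1024, 1024)
print(len(boxes))  # 9
print(boxes[0])    # (0, 0, 512, 512)
print(boxes[-1])   # (896, 896, 1024, 1024)
```

Peak memory then scales with the tile size rather than the full image size, at the cost of extra work in the overlapping regions.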

Model variants

Loading different precisions

# FP16 (recommended for GPU)
pipe = DiffusionPipeline.from_pretrained(
    "model-id",
    torch_dtype=torch.float16,
    variant="fp16"
)
# BF16 (wider dynamic range than FP16, requires Ampere or newer GPU)
pipe = DiffusionPipeline.from_pretrained(
    "model-id",
    torch_dtype=torch.bfloat16
)
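
The memory saved by lower precision follows directly from bytes per parameter. A minimal sketch that estimates weight memory only (activations and framework overhead are not included); the parameter count used in the example is a round illustrative number, not the size of any specific model.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params, dtype):
    """Estimate memory needed to hold model weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024 ** 3


params = 2 ** 30  # illustrative parameter count (about 1.07 billion)
print(weight_memory_gib(params, "float32"))   # 4.0
print(weight_memory_gib(params, "float16"))   # 2.0
print(weight_memory_gib(params, "bfloat16"))  # 2.0
```

FP16 and BF16 take the same space; the choice between them is about numeric range and hardware support rather than memory.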

Loading specific components

from diffusers import AutoencoderKL, DiffusionPipeline
import torch

# Load custom VAE (same dtype as the rest of the pipeline)
vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse",
    torch_dtype=torch.float16
)

# Use with pipeline
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    vae=vae,
    torch_dtype=torch.float16
)

Batch generation

Generate multiple images efficiently:

# Multiple prompts
prompts = [
    "A cat playing piano",
    "A dog reading a book",
    "A bird painting a picture"
]
images = pipe(prompts, num_inference_steps=30).images

# Multiple images per prompt
images = pipe(
    "A beautiful sunset",
    num_images_per_prompt=4,
    num_inference_steps=30
).images
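
Large prompt lists can exceed GPU memory if passed in one call, so it can help to split them into smaller batches. A minimal sketch of the bookkeeping in pure Python; `chunked` and `total_images` are illustrative helpers, and the batch size of 2 is an arbitrary assumption.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def total_images(num_prompts, num_images_per_prompt=1):
    """Number of images a single pipeline call would return."""
    return num_prompts * num_images_per_prompt


prompts = ["A cat playing piano", "A dog reading a book", "A bird painting a picture"]
for batch in chunked(prompts, 2):
    # In a real script: images.extend(pipe(batch, num_inference_steps=30).images)
    print(len(batch), "prompt(s) in this batch")

print(total_images(len(prompts), num_images_per_prompt=4))  # 12
```

Memory use grows with the number of images per call, so combining a long prompt list with a high `num_images_per_prompt` multiplies the load.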

Common workflows

Workflow 1: High-quality generation

from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
import torch

# 1. Load SDXL with optimizations
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16"
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# Model CPU offload handles device placement, so pipe.to("cuda") is not needed
pipe.enable_model_cpu_offload()

# 2. Generate with quality settings
image = pipe(
    prompt="A majestic lion in the savanna, golden hour lighting, 8k, detailed fur",
    negative_prompt="blurry, low quality, cartoon, anime, sketch",
    num_inference_steps=30,
    guidance_scale=7.5,
    height=1024,
    width=1024
).images[0]

Workflow 2: Fast prototyping

from diffusers import AutoPipelineForText2Image, LCMScheduler
import torch

# Use LCM for 4-8 step generation
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16
).to("cuda")

# Load LCM LoRA for fast generation
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.fuse_lora()

# Generate in 4 steps (about a second on a fast GPU; timing depends on hardware)
image = pipe(
    "A beautiful landscape",
    num_inference_steps=4,
    guidance_scale=1.0
).images[0]

Common issues

CUDA out of memory:

# Enable memory optimizations
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()

# Or use lower precision
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)

Black/noise images:

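# Check the VAE configuration (an FP16 VAE can overflow and produce black images)

# The safety checker replaces flagged outputs with black images.
# Disable it only if that is appropriate for your use case.
pipe.safety_checker = None

# Ensure all components use a consistent dtype
pipe = pipe.to(dtype=torch.float16)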

Slow generation:

# Use faster scheduler
from diffusers import DPMSolverMultistepScheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Reduce steps
image = pipe(prompt, num_inference_steps=20).images[0]

References

Advanced Usage - Custom pipelines, fine-tuning, deployment
Troubleshooting - Common issues and solutions

Resources

Documentation: https://huggingface.co/docs/diffusers
Repository: https://github.com/huggingface/diffusers
Model Hub: https://huggingface.co/models?library=diffusers
Discord: https://discord.gg/diffusers
