
CLIP - Contrastive Language-Image Pre-Training

OpenAI's vision-language model that matches images against natural-language descriptions, enabling zero-shot classification and cross-modal retrieval.

When to use CLIP

Use when:

  • Zero-shot image classification (no training data needed)
  • Image-text similarity/matching
  • Semantic image search
  • Content moderation (detect NSFW, violence)
  • Visual question answering
  • Cross-modal retrieval (image→text, text→image)

Metrics:

  • 25,300+ GitHub stars
  • Trained on 400M image-text pairs
  • Matches ResNet-50 on ImageNet (zero-shot)
  • MIT License

Use alternatives instead:

  • BLIP-2: Better captioning
  • LLaVA: Vision-language chat
  • Segment Anything: Image segmentation

Quick start

Installation

pip install git+https://github.com/openai/CLIP.git
pip install torch torchvision ftfy regex tqdm
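
A quick sanity check after installing is to list the checkpoints the package knows how to download (output may vary by package version):

import clip

print(clip.available_models())
# e.g. ['RN50', 'RN101', ..., 'ViT-B/32', 'ViT-B/16', 'ViT-L/14']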

Zero-shot classification

import torch
import clip
from PIL import Image

# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

# Define possible labels
text = clip.tokenize(["a dog", "a cat", "a bird", "a car"]).to(device)

# Compute similarity
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Cosine similarity
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Print results
labels = ["a dog", "a cat", "a bird", "a car"]
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.2%}")

Available models

# Models (sorted by size)
models = [
    "RN50",      # ResNet-50
    "RN101",     # ResNet-101
    "ViT-B/32",  # Vision Transformer (recommended)
    "ViT-B/16",  # Better quality, slower
    "ViT-L/14",  # Best quality, slowest
]

model, preprocess = clip.load("ViT-B/32")

| Model | Parameters | Speed | Quality |
|-------|------------|-------|---------|
| RN50 | 102M | Fast | Good |
| ViT-B/32 | 151M | Medium | Better |
| ViT-L/14 | 428M | Slow | Best |
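
The checkpoints also differ in embedding width (roughly, the ViT-B models produce 512-dimensional embeddings and ViT-L/14 produces 768-dimensional ones), so embeddings from different checkpoints cannot be mixed in the same index. A quick check:

with torch.no_grad():
    dim = model.encode_text(clip.tokenize(["probe"]).to(device)).shape[-1]
print(f"Embedding dimension: {dim}")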

Image-text similarity

# Compute embeddings
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Normalize (required for cosine similarity)
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarity between the image and the first label
similarity = (image_features @ text_features.T)[0, 0].item()
print(f"Similarity: {similarity:.4f}")

Semantic image search

# Index images
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
image_embeddings = []

for img_path in image_paths:
    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(image)
        embedding /= embedding.norm(dim=-1, keepdim=True)
    image_embeddings.append(embedding)

image_embeddings = torch.cat(image_embeddings)

# Search with a text query
query = "a sunset over the ocean"
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(text_input)
    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)

# Find the most similar images
similarities = (text_embedding @ image_embeddings.T).squeeze(0)
top_k = similarities.topk(3)

for idx, score in zip(top_k.indices, top_k.values):
    print(f"{image_paths[idx]}: {score:.3f}")

Content moderation

# Define categories
categories = [
    "safe for work",
    "not safe for work",
    "violent content",
    "graphic content"
]

text = clip.tokenize(categories).to(device)

# Check image
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# Get classification
max_idx = probs.argmax().item()
max_prob = probs[0, max_idx].item()

print(f"Category: {categories[max_idx]} ({max_prob:.2%})")

Batch processing

# Process multiple images
images = [preprocess(Image.open(f"img{i}.jpg")) for i in range(10)]
images = torch.stack(images).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Batch text
texts = ["a dog", "a cat", "a bird"]
text_tokens = clip.tokenize(texts).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Similarity matrix (10 images × 3 texts)
similarities = image_features @ text_features.T
print(similarities.shape)  # (10, 3)
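
For larger collections, a DataLoader keeps preprocessing off the GPU's critical path. A minimal sketch, assuming a flat folder of JPEGs (the folder name, batch size, and worker count are illustrative):

from pathlib import Path
from torch.utils.data import DataLoader, Dataset

class ImageFolder(Dataset):
    def __init__(self, folder):
        self.paths = sorted(Path(folder).glob("*.jpg"))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return preprocess(Image.open(self.paths[idx]).convert("RGB"))

loader = DataLoader(ImageFolder("images/"), batch_size=64, num_workers=4)

all_features = []
with torch.no_grad():
    for batch in loader:
        feats = model.encode_image(batch.to(device))
        all_features.append(feats / feats.norm(dim=-1, keepdim=True))

all_features = torch.cat(all_features)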

Integration with vector databases

# Store CLIP embeddings in Chroma/FAISS
import chromadb

client = chromadb.Client()
collection = client.create_collection("image_embeddings")

# Add image embeddings
for img_path, embedding in zip(image_paths, image_embeddings):
    collection.add(
        embeddings=[embedding.cpu().numpy().tolist()],
        metadatas=[{"path": img_path}],
        ids=[img_path]
    )

# Query with text
query = "a sunset"
with torch.no_grad():
    text_embedding = model.encode_text(clip.tokenize([query]).to(device))
    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)

results = collection.query(
    query_embeddings=text_embedding.cpu().numpy().tolist(),
    n_results=5
)
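
Chroma returns parallel lists per query, so the matched paths come back through the stored metadata:

for metadata, distance in zip(results["metadatas"][0], results["distances"][0]):
    print(f"{metadata['path']}: distance {distance:.3f}")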

Best practices

  • Use ViT-B/32 for most cases - Good balance
  • Normalize embeddings - Required for cosine similarity
  • Batch processing - More efficient
  • Cache embeddings - Expensive to recompute (see the caching sketch below)
  • Use descriptive labels - Better zero-shot performance
  • GPU recommended - 10-50× faster
  • Preprocess images - Use provided preprocess function
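
A minimal caching sketch, assuming embeddings are persisted to a local .npy file alongside the indexed paths (the cache file name is illustrative):

import numpy as np

CACHE_FILE = "image_embeddings.npy"  # illustrative cache location

try:
    image_embeddings = torch.from_numpy(np.load(CACHE_FILE)).to(device)
except FileNotFoundError:
    # Recompute (expensive) and persist for next time
    embeddings = []
    for img_path in image_paths:
        img = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
        with torch.no_grad():
            emb = model.encode_image(img)
        embeddings.append(emb / emb.norm(dim=-1, keepdim=True))
    image_embeddings = torch.cat(embeddings)
    np.save(CACHE_FILE, image_embeddings.cpu().numpy())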

Performance

| Operation | CPU | GPU (V100) |
|-----------|-----|------------|
| Image encoding | ~200ms | ~20ms |
| Text encoding | ~50ms | ~5ms |
| Similarity compute | <1ms | <1ms |

Limitations

  • Not for fine-grained tasks - Best for broad categories
  • Requires descriptive text - Vague labels perform poorly
  • Trained on web data - May inherit dataset biases
  • No bounding boxes - Whole-image understanding only
  • Limited spatial understanding - Weak at object positions and counting

Resources

  • GitHub: https://github.com/openai/CLIP ⭐ 25,300+
  • Paper: https://arxiv.org/abs/2103.00020
  • Colab: https://colab.research.google.com/github/openai/clip/
  • License: MIT

Related Skills

  • modal-serverless-gpu - Serverless GPU cloud: on-demand GPUs for ML workloads, model API deployment, autoscaling
  • evaluating-llms-harness - Evaluate LLMs on 60+ academic benchmarks: MMLU, HumanEval, GSM8K, TruthfulQA, and more
  • weights-and-biases - Track ML experiments with W&B: automatic logging, real-time visualization, hyperparameter sweeps, model registry
  • huggingface-hub - Hugging Face Hub CLI (hf): search, download, and upload models and datasets, manage Spaces