Back to mlops
mlops v1.0.0 7.5 min read 567 lines

audiocraft-audio-generation

PyTorch 오디오 생성 — 텍스트→음악(MusicGen), 텍스트→사운드(AudioGen)

Orchestra Research
MIT

AudioCraft: Audio Generation

Comprehensive guide to using Meta's AudioCraft for text-to-music and text-to-audio generation with MusicGen, AudioGen, and EnCodec.

When to use AudioCraft

Use AudioCraft when:

  • Need to generate music from text descriptions
  • Creating sound effects and environmental audio
  • Building music generation applications
  • Need melody-conditioned music generation
  • Want stereo audio output
  • Require controllable music generation with style transfer

Key features:

  • MusicGen: Text-to-music generation with melody conditioning
  • AudioGen: Text-to-sound effects generation
  • EnCodec: High-fidelity neural audio codec
  • Multiple model sizes: Small (300M) to Large (3.3B)
  • Stereo support: Full stereo audio generation
  • Style conditioning: MusicGen-Style for reference-based generation

Use alternatives instead:

  • Stable Audio: For longer commercial music generation
  • Bark: For text-to-speech with music/sound effects
  • Riffusion: For spectogram-based music generation
  • OpenAI Jukebox: For raw audio generation with lyrics

Quick start

Installation

# From PyPI
pip install audiocraft

From GitHub (latest)


pip install git+https://github.com/facebookresearch/audiocraft.git

Or use HuggingFace Transformers


pip install transformers torch torchaudio

Basic text-to-music (AudioCraft)

import torchaudio
from audiocraft.models import MusicGen

Load model


model = MusicGen.get_pretrained('facebook/musicgen-small')

Set generation parameters


model.set_generation_params(
duration=8, # seconds
top_k=250,
temperature=1.0
)

Generate from text


descriptions = ["happy upbeat electronic dance music with synths"]
wav = model.generate(descriptions)

Save audio


torchaudio.save("output.wav", wav[0].cpu(), sample_rate=32000)

Using HuggingFace Transformers

from transformers import AutoProcessor, MusicgenForConditionalGeneration
import scipy

Load model and processor


processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
model.to("cuda")

Generate music


inputs = processor(
text=["80s pop track with bassy drums and synth"],
padding=True,
return_tensors="pt"
).to("cuda")

audio_values = model.generate(
**inputs,
do_sample=True,
guidance_scale=3,
max_new_tokens=256
)

Save


sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("output.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())

Text-to-sound with AudioGen

from audiocraft.models import AudioGen

Load AudioGen


model = AudioGen.get_pretrained('facebook/audiogen-medium')

model.set_generation_params(duration=5)

Generate sound effects


descriptions = ["dog barking in a park with birds chirping"]
wav = model.generate(descriptions)

torchaudio.save("sound.wav", wav[0].cpu(), sample_rate=16000)

Core concepts

Architecture overview

AudioCraft Architecture:
┌──────────────────────────────────────────────────────────────┐
│ Text Encoder (T5) │
│ │ │
│ Text Embeddings │
└────────────────────────┬─────────────────────────────────────┘

┌────────────────────────▼─────────────────────────────────────┐
│ Transformer Decoder (LM) │
│ Auto-regressively generates audio tokens │
│ Using efficient token interleaving patterns │
└────────────────────────┬─────────────────────────────────────┘

┌────────────────────────▼─────────────────────────────────────┐
│ EnCodec Audio Decoder │
│ Converts tokens back to audio waveform │
└──────────────────────────────────────────────────────────────┘

Model variants

| Model | Size | Description | Use Case |
|-------|------|-------------|----------|
| musicgen-small | 300M | Text-to-music | Quick generation |
| musicgen-medium | 1.5B | Text-to-music | Balanced |
| musicgen-large | 3.3B | Text-to-music | Best quality |
| musicgen-melody | 1.5B | Text + melody | Melody conditioning |
| musicgen-melody-large | 3.3B | Text + melody | Best melody |
| musicgen-stereo-* | Varies | Stereo output | Stereo generation |
| musicgen-style | 1.5B | Style transfer | Reference-based |
| audiogen-medium | 1.5B | Text-to-sound | Sound effects |

Generation parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| duration | 8.0 | Length in seconds (1-120) |
| top_k | 250 | Top-k sampling |
| top_p | 0.0 | Nucleus sampling (0 = disabled) |
| temperature | 1.0 | Sampling temperature |
| cfg_coef | 3.0 | Classifier-free guidance |

MusicGen usage

Text-to-music generation

from audiocraft.models import MusicGen
import torchaudio

model = MusicGen.get_pretrained('facebook/musicgen-medium')

Configure generation


model.set_generation_params(
duration=30, # Up to 30 seconds
top_k=250, # Sampling diversity
top_p=0.0, # 0 = use top_k only
temperature=1.0, # Creativity (higher = more varied)
cfg_coef=3.0 # Text adherence (higher = stricter)
)

Generate multiple samples


descriptions = [
"epic orchestral soundtrack with strings and brass",
"chill lo-fi hip hop beat with jazzy piano",
"energetic rock song with electric guitar"
]

Generate (returns [batch, channels, samples])


wav = model.generate(descriptions)

Save each


for i, audio in enumerate(wav):
torchaudio.save(f"music_{i}.wav", audio.cpu(), sample_rate=32000)

Melody-conditioned generation

from audiocraft.models import MusicGen
import torchaudio

Load melody model


model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=30)

Load melody audio


melody, sr = torchaudio.load("melody.wav")

Generate with melody conditioning


descriptions = ["acoustic guitar folk song"]
wav = model.generate_with_chroma(descriptions, melody, sr)

torchaudio.save("melody_conditioned.wav", wav[0].cpu(), sample_rate=32000)

Stereo generation

from audiocraft.models import MusicGen

Load stereo model


model = MusicGen.get_pretrained('facebook/musicgen-stereo-medium')
model.set_generation_params(duration=15)

descriptions = ["ambient electronic music with wide stereo panning"]
wav = model.generate(descriptions)

wav shape: [batch, 2, samples] for stereo


print(f"Stereo shape: {wav.shape}") # [1, 2, 480000]
torchaudio.save("stereo.wav", wav[0].cpu(), sample_rate=32000)

Audio continuation

from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-medium")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-medium")

Load audio to continue


import torchaudio
audio, sr = torchaudio.load("intro.wav")

Process with text and audio


inputs = processor(
audio=audio.squeeze().numpy(),
sampling_rate=sr,
text=["continue with a epic chorus"],
padding=True,
return_tensors="pt"
)

Generate continuation


audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=512)

MusicGen-Style usage

Style-conditioned generation

from audiocraft.models import MusicGen

Load style model


model = MusicGen.get_pretrained('facebook/musicgen-style')

Configure generation with style


model.set_generation_params(
duration=30,
cfg_coef=3.0,
cfg_coef_beta=5.0 # Style influence
)

Configure style conditioner


model.set_style_conditioner_params(
eval_q=3, # RVQ quantizers (1-6)
excerpt_length=3.0 # Style excerpt length
)

Load style reference


style_audio, sr = torchaudio.load("reference_style.wav")

Generate with text + style


descriptions = ["upbeat dance track"]
wav = model.generate_with_style(descriptions, style_audio, sr)

Style-only generation (no text)

# Generate matching style without text prompt
model.set_generation_params(
duration=30,
cfg_coef=3.0,
cfg_coef_beta=None # Disable double CFG for style-only
)

wav = model.generate_with_style([None], style_audio, sr)

AudioGen usage

Sound effect generation

from audiocraft.models import AudioGen
import torchaudio

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=10)

Generate various sounds


descriptions = [
"thunderstorm with heavy rain and lightning",
"busy city traffic with car horns",
"ocean waves crashing on rocks",
"crackling campfire in forest"
]

wav = model.generate(descriptions)

for i, audio in enumerate(wav):
torchaudio.save(f"sound_{i}.wav", audio.cpu(), sample_rate=16000)

EnCodec usage

Audio compression

from audiocraft.models import CompressionModel
import torch
import torchaudio

Load EnCodec


model = CompressionModel.get_pretrained('facebook/encodec_32khz')

Load audio


wav, sr = torchaudio.load("audio.wav")

Ensure correct sample rate


if sr != 32000:
resampler = torchaudio.transforms.Resample(sr, 32000)
wav = resampler(wav)

Encode to tokens


with torch.no_grad():
encoded = model.encode(wav.unsqueeze(0))
codes = encoded[0] # Audio codes

Decode back to audio


with torch.no_grad():
decoded = model.decode(codes)

torchaudio.save("reconstructed.wav", decoded[0].cpu(), sample_rate=32000)

Common workflows

Workflow 1: Music generation pipeline

import torch
import torchaudio
from audiocraft.models import MusicGen

class MusicGenerator:
def __init__(self, model_name="facebook/musicgen-medium"):
self.model = MusicGen.get_pretrained(model_name)
self.sample_rate = 32000

def generate(self, prompt, duration=30, temperature=1.0, cfg=3.0):
self.model.set_generation_params(
duration=duration,
top_k=250,
temperature=temperature,
cfg_coef=cfg
)

with torch.no_grad():
wav = self.model.generate([prompt])

return wav[0].cpu()

def generate_batch(self, prompts, duration=30):
self.model.set_generation_params(duration=duration)

with torch.no_grad():
wav = self.model.generate(prompts)

return wav.cpu()

def save(self, audio, path):
torchaudio.save(path, audio, sample_rate=self.sample_rate)

Usage


generator = MusicGenerator()
audio = generator.generate(
"epic cinematic orchestral music",
duration=30,
temperature=1.0
)
generator.save(audio, "epic_music.wav")

Workflow 2: Sound design batch processing

import json
from pathlib import Path
from audiocraft.models import AudioGen
import torchaudio

def batch_generate_sounds(sound_specs, output_dir):
"""
Generate multiple sounds from specifications.

Args:
sound_specs: list of {"name": str, "description": str, "duration": float}
output_dir: output directory path
"""
model = AudioGen.get_pretrained('facebook/audiogen-medium')
output_dir = Path(output_dir)
output_dir.mkdir(exist_ok=True)

results = []

for spec in sound_specs:
model.set_generation_params(duration=spec.get("duration", 5))

wav = model.generate([spec["description"]])

output_path = output_dir / f"{spec['name']}.wav"
torchaudio.save(str(output_path), wav[0].cpu(), sample_rate=16000)

results.append({
"name": spec["name"],
"path": str(output_path),
"description": spec["description"]
})

return results

Usage


sounds = [
{"name": "explosion", "description": "massive explosion with debris", "duration": 3},
{"name": "footsteps", "description": "footsteps on wooden floor", "duration": 5},
{"name": "door", "description": "wooden door creaking and closing", "duration": 2}
]

results = batch_generate_sounds(sounds, "sound_effects/")

Workflow 3: Gradio demo

import gradio as gr
import torch
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-small')

def generate_music(prompt, duration, temperature, cfg_coef):
model.set_generation_params(
duration=duration,
temperature=temperature,
cfg_coef=cfg_coef
)

with torch.no_grad():
wav = model.generate([prompt])

# Save to temp file
path = "temp_output.wav"
torchaudio.save(path, wav[0].cpu(), sample_rate=32000)
return path

demo = gr.Interface(
fn=generate_music,
inputs=[
gr.Textbox(label="Music Description", placeholder="upbeat electronic dance music"),
gr.Slider(1, 30, value=8, label="Duration (seconds)"),
gr.Slider(0.5, 2.0, value=1.0, label="Temperature"),
gr.Slider(1.0, 10.0, value=3.0, label="CFG Coefficient")
],
outputs=gr.Audio(label="Generated Music"),
title="MusicGen Demo"
)

demo.launch()

Performance optimization

Memory optimization

# Use smaller model
model = MusicGen.get_pretrained('facebook/musicgen-small')

Clear cache between generations


torch.cuda.empty_cache()

Generate shorter durations


model.set_generation_params(duration=10) # Instead of 30

Use half precision


model = model.half()

Batch processing efficiency

# Process multiple prompts at once (more efficient)
descriptions = ["prompt1", "prompt2", "prompt3", "prompt4"]
wav = model.generate(descriptions) # Single batch

Instead of


for desc in descriptions:
wav = model.generate([desc]) # Multiple batches (slower)

GPU memory requirements

| Model | FP32 VRAM | FP16 VRAM |
|-------|-----------|-----------|
| musicgen-small | ~4GB | ~2GB |
| musicgen-medium | ~8GB | ~4GB |
| musicgen-large | ~16GB | ~8GB |

Common issues

| Issue | Solution |
|-------|----------|
| CUDA OOM | Use smaller model, reduce duration |
| Poor quality | Increase cfg_coef, better prompts |
| Generation too short | Check max duration setting |
| Audio artifacts | Try different temperature |
| Stereo not working | Use stereo model variant |

References

Resources

  • GitHub: https://github.com/facebookresearch/audiocraft
  • Paper (MusicGen): https://arxiv.org/abs/2306.05284
  • Paper (AudioGen): https://arxiv.org/abs/2209.15352
  • HuggingFace: https://huggingface.co/facebook/musicgen-small
  • Demo: https://huggingface.co/spaces/facebook/MusicGen

Related Skills / 관련 스킬

mlops v1.0.0

modal-serverless-gpu

서버리스 GPU 클라우드 — ML 워크로드 온디맨드 GPU, 모델 API 배포, 자동 스케일링

mlops v1.0.0

evaluating-llms-harness

60개 이상 학술 벤치마크로 LLM 평가 — MMLU, HumanEval, GSM8K, TruthfulQA 등

mlops v1.0.0

weights-and-biases

W&B로 ML 실험 추적 — 자동 로깅, 실시간 시각화, 하이퍼파라미터 스윕, 모델 레지스트리

mlops v1.0.0

huggingface-hub

Hugging Face Hub CLI (hf) — 모델/데이터셋 검색, 다운로드, 업로드, Space 관리