2026-06-09 18:31:59 +02:00

# 03 — Wan 2.2 Video Pipeline (Image-to-Video)
## Default policy: local generation
Video generation is done locally with Wan 2.2 by default. Google Veo (via
Vertex AI / Gemini API) is NOT used unless the client has explicit budget
allocated for it. Reasons:
- Google Veo charges per second of generated video, so every request adds cost
- Local Wan 2.2 is free after one-time model download (~10GB total)
- Quality from Wan 2.2 at 832x480 is sufficient for hero reels
- No API key, no quota limits, no vendor dependency
Use Google Veo only when the client approves a paid media budget, OR when the
local workstation is unavailable and a deadline cannot wait for CPU generation time.
## Purpose
Takes FLUX-generated hero stills and animates each into a 3-5 second clip.
Clips are stitched with ffmpeg into a marketing reel for the hero section.
## Model stack
| File | Size | Notes |
|---|---|---|
| Wan2.2-TI2V-5B-Q4_K_M.gguf | 3.2GB | Text+Image to Video, 5B Q4 GGUF |
| umt5_xxl_fp8_e4m3fn_scaled.safetensors | 6.3GB | UMT5-XXL text encoder, fp8 |
| wan_2.1_vae.safetensors | 243MB | Wan VAE (compatible with 2.2) |
## Download (one-time, all public)
```bash
# Wan 2.2 model
wget "https://huggingface.co/QuantStack/Wan2.2-TI2V-5B-GGUF/resolve/main/Wan2.2-TI2V-5B-Q4_K_M.gguf" \
  -O ~/ComfyUI/models/diffusion_models/Wan2.2-TI2V-5B-Q4_K_M.gguf
# Text encoder
wget "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors" \
  -O ~/ComfyUI/models/clip/umt5_xxl_fp8_e4m3fn_scaled.safetensors
# VAE
wget "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors" \
  -O ~/ComfyUI/models/vae/wan_2.1_vae.safetensors
```
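
Because `wget -O` creates the output file even when a download fails, it can help to confirm that all three files are present and non-empty before starting ComfyUI. The following is a minimal sketch, not part of the existing tooling; the helper name `missing_models` is hypothetical, and the relative paths are copied from the commands above.

```python
from pathlib import Path

# Paths relative to ~/ComfyUI/models, taken from the wget commands above.
EXPECTED_MODELS = [
    "diffusion_models/Wan2.2-TI2V-5B-Q4_K_M.gguf",
    "clip/umt5_xxl_fp8_e4m3fn_scaled.safetensors",
    "vae/wan_2.1_vae.safetensors",
]


def missing_models(models_root):
    """Return the expected model files that are absent or empty under models_root."""
    root = Path(models_root).expanduser()
    missing = []
    for rel in EXPECTED_MODELS:
        path = root / rel
        # A zero-byte file usually means an interrupted or failed download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in missing_models("~/ComfyUI/models"):
        print(f"missing or empty: {rel}")
```

This only checks presence, not integrity; comparing the sizes against the table in the model stack section would be a reasonable next step.
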
## Critical: WanImageToVideo is a conditioning node, NOT a sampler
This is the most important thing to understand about the Wan pipeline. The node
name is misleading. `WanImageToVideo` does NOT run diffusion — it sets up the
conditioning and empty latent. A separate `KSampler` runs the actual diffusion.
Wrong mental model (what most tutorials imply):
```
LoadImage → WanImageToVideo → SaveAnimatedWEBP
```
Correct node graph:
```
UnetLoaderGGUF ────────────────────────────────────────────────────────────→ KSampler.model
CLIPLoader ──┬→ CLIPTextEncode (positive) ──→ WanImageToVideo.positive ──→ KSampler.positive
             └→ CLIPTextEncode (negative) ──→ WanImageToVideo.negative ──→ KSampler.negative
VAELoader ──┬→ WanImageToVideo.vae
            └→ VAEDecode.vae
LoadImage ──→ WanImageToVideo.start_image (optional)
WanImageToVideo.latent ──→ KSampler.latent_image
KSampler.samples ──→ VAEDecode ──→ SaveAnimatedWEBP
```
WanImageToVideo outputs three things (in order):
- output[0] = positive CONDITIONING (enhanced with image)
- output[1] = negative CONDITIONING
- output[2] = latent LATENT (sized for video: width × height × frames)
The `start_image` input (optional IMAGE) anchors the first frame. Without it,
video starts from noise. Always pass it for image-to-video.
## Workflow
Correct ComfyUI API node graph (as sent by `gen-video-wan.py`):
```
node 1:  UnetLoaderGGUF   → Wan2.2-TI2V-5B-Q4_K_M.gguf
node 2:  CLIPLoader       → umt5_xxl_fp8_e4m3fn_scaled.safetensors (type=wan)
node 3:  VAELoader        → wan_2.1_vae.safetensors
node 4:  LoadImage        → FLUX hero still (.webp)
node 5:  CLIPTextEncode   → motion prompt text (positive)
node 6:  CLIPTextEncode   → negative prompt text
node 7:  WanImageToVideo  → positive=[5,0], negative=[6,0], vae=[3,0],
                            start_image=[4,0], width=832, height=480,
                            length=25 (or 49), batch_size=1
node 8:  KSampler         → model=[1,0], positive=[7,0], negative=[7,1],
                            latent_image=[7,2], steps=20, cfg=6.0,
                            sampler_name=uni_pc, scheduler=simple, denoise=1.0
node 9:  VAEDecode        → samples=[8,0], vae=[3,0]
node 10: SaveAnimatedWEBP → images=[9,0], fps=12
```
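
For orientation, the node list above could be expressed in ComfyUI's API JSON format roughly as follows. This is a sketch under stated assumptions, not the contents of `gen-video-wan.py`: the function names `build_workflow` and `submit` are hypothetical, the input keys (`unet_name`, `clip_name`, `vae_name`, and so on) are the ones stock ComfyUI and ComfyUI-GGUF are understood to use and should be checked against the `/object_info` endpoint of the running instance, and `seed`, `filename_prefix`, `lossless`, `quality`, and `method` are placeholder values that the node list does not specify. Each link is written as `[source_node_id, output_index]`, which is how node 8 picks up outputs 0, 1, and 2 of node 7.

```python
import json
import urllib.request


def build_workflow(image_name, positive, negative, length=49, seed=0):
    """Return the ten-node graph as a dict of node id -> class_type + inputs."""
    return {
        "1": {"class_type": "UnetLoaderGGUF",
              "inputs": {"unet_name": "Wan2.2-TI2V-5B-Q4_K_M.gguf"}},
        "2": {"class_type": "CLIPLoader",
              "inputs": {"clip_name": "umt5_xxl_fp8_e4m3fn_scaled.safetensors",
                         "type": "wan"}},
        "3": {"class_type": "VAELoader",
              "inputs": {"vae_name": "wan_2.1_vae.safetensors"}},
        "4": {"class_type": "LoadImage", "inputs": {"image": image_name}},
        "5": {"class_type": "CLIPTextEncode",
              "inputs": {"text": positive, "clip": ["2", 0]}},
        "6": {"class_type": "CLIPTextEncode",
              "inputs": {"text": negative, "clip": ["2", 0]}},
        # Conditioning node: wraps both prompts and creates the empty video latent.
        "7": {"class_type": "WanImageToVideo",
              "inputs": {"positive": ["5", 0], "negative": ["6", 0], "vae": ["3", 0],
                         "start_image": ["4", 0], "width": 832, "height": 480,
                         "length": length, "batch_size": 1}},
        # The sampler runs the actual diffusion on node 7's three outputs.
        "8": {"class_type": "KSampler",
              "inputs": {"model": ["1", 0], "positive": ["7", 0], "negative": ["7", 1],
                         "latent_image": ["7", 2], "seed": seed, "steps": 20,
                         "cfg": 6.0, "sampler_name": "uni_pc", "scheduler": "simple",
                         "denoise": 1.0}},
        "9": {"class_type": "VAEDecode",
              "inputs": {"samples": ["8", 0], "vae": ["3", 0]}},
        # filename_prefix, lossless, quality and method are assumed placeholders.
        "10": {"class_type": "SaveAnimatedWEBP",
               "inputs": {"images": ["9", 0], "filename_prefix": "clip", "fps": 12,
                          "lossless": False, "quality": 80, "method": "default"}},
    }


def submit(workflow, host="127.0.0.1:8188"):
    """POST the graph to the /prompt endpoint and return the parsed JSON reply."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(f"http://{host}/prompt", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Calling `submit(build_workflow("hero-stairs.webp", "slow pan upward along staircase", "blurry", length=25))` would queue a short test clip, assuming ComfyUI is listening on its default port and the image has already been placed in ComfyUI's input folder.
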
## Settings
| Setting | Value |
|---|---|
| Resolution | 832×480 (approx. 16:9, 480p) |
| Frames | 49 (~4 seconds at 12fps) |
| Steps | 20 |
| CFG | 6.0 |
| Sampler | uni_pc |
**Frame count constraint:** `length` must have the form 4n + 1: 1, 5, 9, 13, 17, 21, 25, 29 ... (step of 4).
ComfyUI enforces this. 49 is valid (1 + 4×12). 50 is not.
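
The constraint is easy to check before a job is queued. The helpers below are an illustrative sketch (the names `is_valid_length`, `snap_length`, and `clip_seconds` are hypothetical, not part of ComfyUI or the existing scripts); the 12 fps default comes from the `SaveAnimatedWEBP` setting used in this pipeline.

```python
def is_valid_length(length):
    """True if length is 1, 5, 9, ... (length = 4k + 1 for some k >= 0)."""
    return length >= 1 and (length - 1) % 4 == 0


def snap_length(seconds, fps=12):
    """Largest valid length that does not exceed seconds * fps (minimum 1)."""
    frames = int(seconds * fps)
    return max(1, frames - (frames - 1) % 4)


def clip_seconds(length, fps=12):
    """Playback duration of a clip with the given frame count."""
    return length / fps


print(is_valid_length(49), is_valid_length(50))  # True False
print(snap_length(4))                            # 45
print(round(clip_seconds(49), 2))                # 4.08
```

Note that an exact 4-second request snaps down to 45 frames, because 48 is not of the form 4k + 1; 49 frames runs slightly over 4 seconds.
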
**CPU speed on Arising Media workstation (2GB VRAM, CPU inference):**
- ~415 seconds per diffusion step
- 20 steps × 415s = ~2.3 hours per clip
- 6 clips = ~14 hours total for a full reel
- Use 25 frames (not 49) for test runs to roughly halve generation time
- Full reel generation: start before leaving for the day, check next morning
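
The arithmetic behind these figures can be reproduced with a few lines. This is a sketch using only the numbers listed above. The `latent_frames` helper rests on two assumptions that are not measured here: that the VAE compresses time by a factor of 4 (consistent with the step-of-4 rule for `length`), and that run time scales roughly with the number of latent frames.

```python
SECONDS_PER_STEP = 415  # measured on the workstation, from the list above
STEPS = 20
CLIPS = 6


def clip_hours(steps=STEPS, seconds_per_step=SECONDS_PER_STEP):
    """Hours of diffusion time for one clip."""
    return steps * seconds_per_step / 3600


def reel_hours(clips=CLIPS):
    """Hours of diffusion time for a full reel."""
    return clips * clip_hours()


def latent_frames(length):
    """Assumed latent frame count for a clip of `length` frames (4x temporal compression)."""
    return (length - 1) // 4 + 1


print(f"{clip_hours():.1f} h per clip")  # 2.3 h per clip
print(f"{reel_hours():.1f} h per reel")  # 13.8 h per reel
# Under the assumptions above, 25 frames costs about 7/13 of a 49-frame run.
print(f"{latent_frames(25) / latent_frames(49):.2f}")  # 0.54
```
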
**CLIPVision note:** No CLIPVision models are installed at `~/ComfyUI/models/clip_vision/`.
The `clip_vision_output` input on WanImageToVideo is optional and currently unused.
Image conditioning comes from `start_image` only (VAE-encoded first frame).
This is sufficient for smooth motion — CLIPVision would add semantic image
understanding but is not required.
## Running video generation
```bash
# ComfyUI must be running, FLUX images must be converted to WebP first
cd /home/sirdrez/arisingmedia-websites/{domain}
python3 tools/gen-video-wan.py 2>&1 | tee tools/wan-gen.log
```
Output goes to `assets/videos/clips/` as `.webp` animation files.
## Stitching the reel
```bash
# Convert WebP animations to MP4 first (if needed)
for f in assets/videos/clips/*.webp; do
  ffmpeg -y -i "$f" "${f%.webp}.mp4"
done
# Build the concat list from the MP4 clips (the glob expands in sorted order)
for f in assets/videos/clips/*.mp4; do
  echo "file '$PWD/$f'"
done > tools/clip-list.txt
# Stitch without re-encoding
ffmpeg -f concat -safe 0 -i tools/clip-list.txt -c copy assets/videos/hero/hero-reel-flux.mp4
```
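
If the list is ever built from Python rather than the shell, for example inside the generation script, the same `file '...'` format can be produced as below. This is a hedged sketch: `concat_list_text` is a hypothetical helper, not an existing function in `tools/`. It also escapes single quotes in paths the way ffmpeg's concat demuxer expects, which the shell loop does not.

```python
from pathlib import Path


def concat_list_text(clip_dir, pattern="*.mp4"):
    """Return concat-demuxer list text for the clips in clip_dir, sorted by name."""
    lines = []
    for clip in sorted(Path(clip_dir).glob(pattern)):
        # Inside single quotes, a literal ' must be written as '\'' for ffmpeg.
        escaped = str(clip.resolve()).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    Path("tools/clip-list.txt").write_text(concat_list_text("assets/videos/clips"))
```
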
## Reel shot list (lahrcarpetcleaning.com)
| Clip | Source still | Motion prompt |
|---|---|---|
| clip-01 | hero-carpet-cleaning | slow dolly forward across carpet |
| clip-02 | hero-stairs | slow pan upward along staircase |
| clip-03 | hero-upholstery | gentle push in toward sofa |
| clip-04 | hero-commercial | tracking shot down lobby |
| clip-05 | hero-floors | floor-level drift forward |
| clip-06 | hero-clean-result | rack focus across carpet fibers |
6 clips × ~4s = ~24 seconds total reel.