03 — Wan 2.2 Video Pipeline (Image-to-Video)

Default policy: local generation

Video generation is done locally with Wan 2.2 by default. Google Veo (via Vertex AI / Gemini API) is NOT used unless the client has explicit budget allocated for it. Reasons:

Google Veo is billed per second of generated video, on every request
Local Wan 2.2 is free after one-time model download (~10GB total)
Quality from Wan 2.2 at 832x480 is sufficient for hero reels
No API key, no quota limits, no vendor dependency

Use Google Veo only when the client approves a paid media budget, OR when the local workstation is unavailable and the deadline cannot absorb CPU generation time.

Purpose

Takes FLUX-generated hero stills and animates each into a 3-5 second clip. Clips are stitched with ffmpeg into a marketing reel for the hero section.

Model stack

File	Size	Notes
Wan2.2-TI2V-5B-Q4_K_M.gguf	3.2GB	Text+Image to Video, 5B Q4 GGUF
umt5_xxl_fp8_e4m3fn_scaled.safetensors	6.3GB	UMT5-XXL text encoder, fp8
wan_2.1_vae.safetensors	243MB	Wan VAE (compatible with 2.2)

Download (one-time, all public)

# Wan 2.2 model
wget "https://huggingface.co/QuantStack/Wan2.2-TI2V-5B-GGUF/resolve/main/Wan2.2-TI2V-5B-Q4_K_M.gguf" \
  -O ~/ComfyUI/models/diffusion_models/Wan2.2-TI2V-5B-Q4_K_M.gguf

# Text encoder
wget "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors" \
  -O ~/ComfyUI/models/clip/umt5_xxl_fp8_e4m3fn_scaled.safetensors

# VAE
wget "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors" \
  -O ~/ComfyUI/models/vae/wan_2.1_vae.safetensors
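Before starting a multi-hour run it can help to confirm that all three files landed where the download commands put them. The following is a minimal sketch, not part of the existing tooling: the helper name `missing_models` is hypothetical, and the directory layout is taken from the `wget` targets above.

```python
from pathlib import Path

# Expected locations, copied from the download commands above.
EXPECTED_MODELS = {
    "diffusion_models": "Wan2.2-TI2V-5B-Q4_K_M.gguf",
    "clip": "umt5_xxl_fp8_e4m3fn_scaled.safetensors",
    "vae": "wan_2.1_vae.safetensors",
}


def missing_models(models_root):
    """Return relative paths of expected model files that are absent or empty."""
    root = Path(models_root).expanduser()
    missing = []
    for subdir, filename in EXPECTED_MODELS.items():
        path = root / subdir / filename
        # A zero-byte file usually means an interrupted download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(f"{subdir}/{filename}")
    return missing


if __name__ == "__main__":
    for rel in missing_models("~/ComfyUI/models"):
        print("missing:", rel)
```

An empty result means every file is present and non-empty; it does not verify checksums.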

Critical: WanImageToVideo is a conditioning node, NOT a sampler

This is the most important thing to understand about the Wan pipeline. The node name is misleading: WanImageToVideo does NOT run diffusion. It only builds the conditioning and an empty latent; a separate KSampler runs the actual diffusion.

Wrong mental model (what most tutorials imply):

LoadImage → WanImageToVideo → SaveAnimatedWEBP

Correct node graph:

UnetLoaderGGUF  ─────────────────────────────────────→ KSampler.model
CLIPLoader ──→ CLIPTextEncode (positive) ─→ WanImageToVideo.positive ──→ KSampler.positive
           └→ CLIPTextEncode (negative) ─→ WanImageToVideo.negative ──→ KSampler.negative
VAELoader ──→ WanImageToVideo.vae                  WanImageToVideo.latent ──→ KSampler.latent_image
LoadImage ──→ WanImageToVideo.start_image (optional)
                                                   KSampler.samples ──→ VAEDecode ──→ SaveAnimatedWEBP

WanImageToVideo outputs three things (in order):

output[0] = positive CONDITIONING (enhanced with image)
output[1] = negative CONDITIONING
output[2] = latent LATENT (sized for video: width × height × frames)

The start_image input (optional IMAGE) anchors the first frame. Without it, video starts from noise. Always pass it for image-to-video.

Workflow

Correct ComfyUI API node graph (as sent by gen-video-wan.py):

node 1: UnetLoaderGGUF    → Wan2.2-TI2V-5B-Q4_K_M.gguf
node 2: CLIPLoader        → umt5_xxl_fp8_e4m3fn_scaled.safetensors (type=wan)
node 3: VAELoader         → wan_2.1_vae.safetensors
node 4: LoadImage         → FLUX hero still (.webp)
node 5: CLIPTextEncode    → clip=[2,0], motion prompt text (positive)
node 6: CLIPTextEncode    → clip=[2,0], negative prompt text
node 7: WanImageToVideo   → positive=[5,0], negative=[6,0], vae=[3,0],
                            start_image=[4,0], width=832, height=480,
                            length=25 (or 49), batch_size=1
node 8: KSampler          → model=[1,0], positive=[7,0], negative=[7,1],
                            latent_image=[7,2], steps=20, cfg=6.0,
                            sampler_name=uni_pc, scheduler=simple, denoise=1.0
node 9: VAEDecode         → samples=[8,0], vae=[3,0]
node 10: SaveAnimatedWEBP → images=[9,0], fps=12
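The node list above maps onto ComfyUI's API prompt format, where each node is keyed by a string id and each link is written as `[source_node_id, output_index]`. The sketch below is an illustration only; the real contents of gen-video-wan.py are not shown in this document. The function name `build_wan_workflow`, the `seed` and `filename_prefix` values, and the `lossless`/`quality`/`method` settings are assumptions. Input names follow the stock ComfyUI and ComfyUI-GGUF node definitions as best recalled and should be checked against the server's `/object_info` output.

```python
import json


def build_wan_workflow(image_name, positive_text, negative_text, length=49, seed=0):
    """Return the ten-node graph above as a ComfyUI API-format dict."""
    return {
        "1": {"class_type": "UnetLoaderGGUF",
              "inputs": {"unet_name": "Wan2.2-TI2V-5B-Q4_K_M.gguf"}},
        "2": {"class_type": "CLIPLoader",
              "inputs": {"clip_name": "umt5_xxl_fp8_e4m3fn_scaled.safetensors",
                         "type": "wan"}},
        "3": {"class_type": "VAELoader",
              "inputs": {"vae_name": "wan_2.1_vae.safetensors"}},
        "4": {"class_type": "LoadImage", "inputs": {"image": image_name}},
        "5": {"class_type": "CLIPTextEncode",
              "inputs": {"clip": ["2", 0], "text": positive_text}},
        "6": {"class_type": "CLIPTextEncode",
              "inputs": {"clip": ["2", 0], "text": negative_text}},
        # Conditioning node: emits positive [0], negative [1], latent [2].
        "7": {"class_type": "WanImageToVideo",
              "inputs": {"positive": ["5", 0], "negative": ["6", 0],
                         "vae": ["3", 0], "start_image": ["4", 0],
                         "width": 832, "height": 480,
                         "length": length, "batch_size": 1}},
        # The sampler consumes all three WanImageToVideo outputs.
        "8": {"class_type": "KSampler",
              "inputs": {"model": ["1", 0], "positive": ["7", 0],
                         "negative": ["7", 1], "latent_image": ["7", 2],
                         "seed": seed, "steps": 20, "cfg": 6.0,
                         "sampler_name": "uni_pc", "scheduler": "simple",
                         "denoise": 1.0}},
        "9": {"class_type": "VAEDecode",
              "inputs": {"samples": ["8", 0], "vae": ["3", 0]}},
        # filename_prefix, lossless, quality, method are assumed values.
        "10": {"class_type": "SaveAnimatedWEBP",
               "inputs": {"images": ["9", 0], "filename_prefix": "clip",
                          "fps": 12, "lossless": False, "quality": 80,
                          "method": "default"}},
    }


if __name__ == "__main__":
    wf = build_wan_workflow("hero-carpet-cleaning.webp",
                            "slow dolly forward across carpet", "blurry, distorted")
    print(json.dumps(wf["8"], indent=2))
```

Note that KSampler takes its positive and negative conditioning from node 7, not directly from the CLIPTextEncode nodes; wiring `[5,0]` into the sampler would bypass the image conditioning.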

Settings

Setting	Value
Resolution	832×480 (~16:9, 480p)
Frames	49 (~4 seconds at 12fps)
Steps	20
CFG	6.0
Sampler	uni_pc

Frame count constraint: length must follow the pattern 1, 5, 9, 13, 17, 21, 25, 29 ... (step of 4). ComfyUI enforces this. 49 is valid (1 + 4×12). 50 is not.
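The constraint is equivalent to `(length - 1) % 4 == 0`. A small sketch of a pre-flight check follows; both helper names are hypothetical, and rounding down (rather than up) is an arbitrary choice made here to keep generation time from growing unexpectedly.

```python
def is_valid_length(n):
    """True if n fits the 1, 5, 9, 13, ... pattern."""
    return n >= 1 and (n - 1) % 4 == 0


def snap_length(n):
    """Round n down to the nearest valid frame count."""
    if n < 1:
        raise ValueError("length must be at least 1")
    return n - (n - 1) % 4


if __name__ == "__main__":
    print(is_valid_length(49))  # True
    print(is_valid_length(50))  # False
    print(snap_length(50))      # 49
```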

CPU speed on Arising Media workstation (2GB VRAM, CPU inference):

~415 seconds per diffusion step
20 steps × 415s = ~2.3 hours per clip
6 clips = ~14 hours total for a full reel
Use 25 frames (not 49) for test runs to roughly halve generation time
Full reel generation: start before leaving for the day, check next morning
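The arithmetic behind these figures can be reproduced with a few lines, which is handy when the step count or clip count changes. This is a sketch using the measured ~415 s/step from above as the default; the function name is hypothetical, and the estimate ignores model loading and VAE decode time.

```python
def estimate_hours(steps=20, seconds_per_step=415, clips=1):
    """Estimated wall-clock hours of diffusion time for a batch of clips."""
    return steps * seconds_per_step * clips / 3600


if __name__ == "__main__":
    print(f"{estimate_hours():.1f} h per clip")         # 2.3 h per clip
    print(f"{estimate_hours(clips=6):.1f} h per reel")  # 13.8 h per reel
```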

CLIPVision note: No CLIPVision models are installed at ~/ComfyUI/models/clip_vision/. The clip_vision_output input on WanImageToVideo is optional and currently unused. Image conditioning comes from start_image only (VAE-encoded first frame). This is sufficient for smooth motion — CLIPVision would add semantic image understanding but is not required.

Running video generation

# ComfyUI must be running, FLUX images must be converted to WebP first
cd /home/sirdrez/arisingmedia-websites/{domain}
python3 tools/gen-video-wan.py 2>&1 | tee tools/wan-gen.log

Output goes to assets/videos/clips/ as .webp animation files.
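For reference, queuing a job on a running ComfyUI instance is a single POST to its `/prompt` endpoint with the workflow wrapped in a `prompt` key. The sketch below shows that call using only the standard library. It is not a copy of gen-video-wan.py, whose internals are not shown here; the helper names and the default host `127.0.0.1:8188` are assumptions.

```python
import json
import urllib.request


def build_prompt_request(workflow, host="127.0.0.1:8188"):
    """Wrap an API-format workflow dict in a POST request for /prompt."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_prompt(workflow, host="127.0.0.1:8188"):
    """Send the workflow to ComfyUI and return its decoded JSON reply."""
    with urllib.request.urlopen(build_prompt_request(workflow, host)) as resp:
        return json.loads(resp.read())
```

`queue_prompt` returns as soon as the job is queued, not when the clip is finished, so a long CPU run still has to be monitored through the log or the output directory.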

Stitching the reel

# Convert WebP animations to MP4 first (if needed)
for f in assets/videos/clips/*.webp; do
  ffmpeg -y -i "$f" "${f%.webp}.mp4"
done

# Build the concat list from the MP4 clips (the glob expands in sorted order)
for f in assets/videos/clips/*.mp4; do
  echo "file '$PWD/$f'"
done > tools/clip-list.txt

# Stitch without re-encoding (all clips must share codec, resolution, and frame rate)
mkdir -p assets/videos/hero
ffmpeg -f concat -safe 0 -i tools/clip-list.txt -c copy assets/videos/hero/hero-reel-flux.mp4
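If the clip list is built from Python instead of the shell, the same concat-demuxer format applies: one `file '<path>'` line per clip, in playback order. A minimal sketch, assuming a hypothetical helper `write_concat_list`; it also escapes single quotes in paths, which the shell loop does not.

```python
from pathlib import Path


def write_concat_list(clips_dir, list_path):
    """Write an ffmpeg concat list of the .mp4 clips, sorted by filename."""
    clips = sorted(Path(clips_dir).resolve().glob("*.mp4"))
    lines = []
    for clip in clips:
        # Inside single quotes, a literal ' is written as '\''.
        escaped = str(clip).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    Path(list_path).write_text("\n".join(lines) + "\n")
    return clips


if __name__ == "__main__":
    found = write_concat_list("assets/videos/clips", "tools/clip-list.txt")
    print(f"{len(found)} clips listed")
```

Sorting by filename gives the right order only because the clips are named clip-01 through clip-06 with zero padding.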

Reel shot list (lahrcarpetcleaning.com)

Clip	Source still	Motion prompt
clip-01	hero-carpet-cleaning	slow dolly forward across carpet
clip-02	hero-stairs	slow pan upward along staircase
clip-03	hero-upholstery	gentle push in toward sofa
clip-04	hero-commercial	tracking shot down lobby
clip-05	hero-floors	floor-level drift forward
clip-06	hero-clean-result	rack focus across carpet fibers

6 clips × ~4s = ~24 seconds total reel.
