# 03 - Wan 2.2 Video Pipeline (Image-to-Video)

## Default policy: local generation

Video generation is done locally with Wan 2.2 by default. Google Veo (via Vertex AI / Gemini API)
is NOT used unless the client has explicit budget allocated for it.

Reasons:
- Google Veo costs money per second of video generated (billed per request)
- Local Wan 2.2 is free after one-time model download (~10GB total)
- Quality from Wan 2.2 at 832x480 is sufficient for hero reels
- No API key, no quota limits, no vendor dependency

Use Google Veo only when: client approves a paid media budget, OR the local workstation
is unavailable and a deadline cannot wait for CPU generation time.

## Purpose

Takes FLUX-generated hero stills and animates each into a 3-5 second clip.
Clips are stitched with ffmpeg into a marketing reel for the hero section.

## Model stack

| File | Size | Notes |
|---|---|---|
| Wan2.2-TI2V-5B-Q4_K_M.gguf | 3.2GB | Text+Image to Video, 5B Q4 GGUF |
| umt5_xxl_fp8_e4m3fn_scaled.safetensors | 6.3GB | UMT5-XXL text encoder, fp8 |
| wan_2.1_vae.safetensors | 243MB | Wan VAE (compatible with 2.2) |

## Download (one-time, all public)

```bash
# Wan 2.2 model
wget "https://huggingface.co/QuantStack/Wan2.2-TI2V-5B-GGUF/resolve/main/Wan2.2-TI2V-5B-Q4_K_M.gguf" \
  -O ~/ComfyUI/models/diffusion_models/Wan2.2-TI2V-5B-Q4_K_M.gguf

# Text encoder
wget "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors" \
  -O ~/ComfyUI/models/clip/umt5_xxl_fp8_e4m3fn_scaled.safetensors

# VAE
wget "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors" \
  -O ~/ComfyUI/models/vae/wan_2.1_vae.safetensors
```

## Critical: WanImageToVideo is a conditioning node, NOT a sampler

This is the most important thing to understand about the Wan pipeline. The node name is misleading.
`WanImageToVideo` does NOT run diffusion; it sets up the conditioning and empty latent.
A separate `KSampler` runs the actual diffusion.

Wrong mental model (what most tutorials imply):

```
LoadImage → WanImageToVideo → SaveAnimatedWEBP
```

Correct node graph:

```
UnetLoaderGGUF ─────────────────────────────────────→ KSampler.model
CLIPLoader ──→ CLIPTextEncode (positive) ─→ WanImageToVideo.positive ──→ KSampler.positive
           └→ CLIPTextEncode (negative) ─→ WanImageToVideo.negative ──→ KSampler.negative
VAELoader ──→ WanImageToVideo.vae
WanImageToVideo.latent ──→ KSampler.latent_image
LoadImage ──→ WanImageToVideo.start_image (optional)
KSampler.samples ──→ VAEDecode ──→ SaveAnimatedWEBP
```

WanImageToVideo outputs three things (in order):
- output[0] = positive CONDITIONING (enhanced with image)
- output[1] = negative CONDITIONING
- output[2] = latent LATENT (sized for video: width × height × frames)

The `start_image` input (optional IMAGE) anchors the first frame. Without it, video starts from noise.
Always pass it for image-to-video.

## Workflow

Correct ComfyUI API node graph (as sent by `gen-video-wan.py`):

```
node 1:  UnetLoaderGGUF   → Wan2.2-TI2V-5B-Q4_K_M.gguf
node 2:  CLIPLoader       → umt5_xxl_fp8_e4m3fn_scaled.safetensors (type=wan)
node 3:  VAELoader        → wan_2.1_vae.safetensors
node 4:  LoadImage        → FLUX hero still (.webp)
node 5:  CLIPTextEncode   → motion prompt text (positive)
node 6:  CLIPTextEncode   → negative prompt text
node 7:  WanImageToVideo  → positive=[5,0], negative=[6,0], vae=[3,0],
                            start_image=[4,0], width=832, height=480,
                            length=25 (or 49), batch_size=1
node 8:  KSampler         → model=[1,0], positive=[7,0], negative=[7,1],
                            latent_image=[7,2], steps=20, cfg=6.0,
                            sampler_name=uni_pc, scheduler=simple, denoise=1.0
node 9:  VAEDecode        → samples=[8,0], vae=[3,0]
node 10: SaveAnimatedWEBP → images=[9,0], fps=12
```

## Settings

| Setting | Value |
|---|---|
| Resolution | 832×480 (16:9 ~480p) |
| Frames | 49 (~4 seconds at 12fps) |
| Steps | 20 |
| CFG | 6.0 |
| Sampler | uni_pc |

**Frame count constraint:** `length` must follow the pattern 1, 5, 9, 13, 17, 21, 25,
29 ... (step of 4). ComfyUI enforces this. 49 is valid (1 + 4×12). 50 is not.

**CPU speed on Arising Media workstation (2GB VRAM, CPU inference):**
- ~415 seconds per diffusion step
- 20 steps × 415s = ~2.3 hours per clip
- 6 clips = ~14 hours total for a full reel
- Use 25 frames (not 49) for test runs to halve generation time
- Full reel generation: start before leaving for the day, check next morning

**CLIPVision note:** No CLIPVision models are installed at `~/ComfyUI/models/clip_vision/`.
The `clip_vision_output` input on WanImageToVideo is optional and currently unused.
Image conditioning comes from `start_image` only (VAE-encoded first frame).
This is sufficient for smooth motion. CLIPVision would add semantic image understanding
but is not required.

## Running video generation

```bash
# ComfyUI must be running, FLUX images must be converted to WebP first
cd /home/sirdrez/arisingmedia-websites/{domain}
python3 tools/gen-video-wan.py 2>&1 | tee tools/wan-gen.log
```

Output goes to `assets/videos/clips/` as `.webp` animation files.

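
The node graph in the Workflow section can be sketched as a ComfyUI API-format payload. This is a minimal illustration, not the contents of `gen-video-wan.py`: the function name `build_wan_i2v_graph` is hypothetical, the input names (`unet_name`, `clip_name`, `vae_name`, and so on) are assumed to match the installed node versions, and the `seed`, `filename_prefix`, `lossless`, `quality`, and `method` values are placeholders that this document does not specify.

```python
import json


def build_wan_i2v_graph(image_name, positive_text, negative_text,
                        length=49, seed=0):
    """Build a ComfyUI API-format graph mirroring nodes 1-10 of the workflow."""
    # Frame counts must be 1 + 4*n (1, 5, 9, ...), per the frame count constraint.
    if length < 1 or (length - 1) % 4 != 0:
        raise ValueError(f"length must be 1 + 4*n, got {length}")

    def node(class_type, **inputs):
        return {"class_type": class_type, "inputs": inputs}

    # Links are [source_node_id, output_index]; node ids are strings.
    return {
        "1": node("UnetLoaderGGUF", unet_name="Wan2.2-TI2V-5B-Q4_K_M.gguf"),
        "2": node("CLIPLoader",
                  clip_name="umt5_xxl_fp8_e4m3fn_scaled.safetensors",
                  type="wan"),
        "3": node("VAELoader", vae_name="wan_2.1_vae.safetensors"),
        # Assumption: the still is already in ComfyUI's input directory.
        "4": node("LoadImage", image=image_name),
        "5": node("CLIPTextEncode", text=positive_text, clip=["2", 0]),
        "6": node("CLIPTextEncode", text=negative_text, clip=["2", 0]),
        # Conditioning node: outputs positive [7,0], negative [7,1], latent [7,2].
        "7": node("WanImageToVideo", positive=["5", 0], negative=["6", 0],
                  vae=["3", 0], start_image=["4", 0],
                  width=832, height=480, length=length, batch_size=1),
        # The sampler is what actually runs diffusion.
        "8": node("KSampler", model=["1", 0], positive=["7", 0],
                  negative=["7", 1], latent_image=["7", 2],
                  seed=seed, steps=20, cfg=6.0, sampler_name="uni_pc",
                  scheduler="simple", denoise=1.0),
        "9": node("VAEDecode", samples=["8", 0], vae=["3", 0]),
        # Assumption: every value here except fps is a placeholder.
        "10": node("SaveAnimatedWEBP", images=["9", 0], fps=12,
                   filename_prefix="clip", lossless=False, quality=80,
                   method="default"),
    }


if __name__ == "__main__":
    # The negative prompt text is illustrative only.
    graph = build_wan_i2v_graph("hero-carpet-cleaning.webp",
                                "slow dolly forward across carpet",
                                "blurry, distorted", length=25)
    print(json.dumps({"prompt": graph}, indent=2))
```

A graph like this would typically be wrapped as `{"prompt": graph}` and POSTed to the running ComfyUI server's `/prompt` endpoint. Validating `length` before queueing is cheap insurance when each clip costs hours of CPU time.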
## Stitching the reel

```bash
# Convert webp animations to mp4 first so the clips can be stream-copied into the reel
for f in assets/videos/clips/*.webp; do
  ffmpeg -y -i "$f" "${f%.webp}.mp4"
done

# Create file list from the converted clips (globs expand in sorted order)
for f in assets/videos/clips/*.mp4; do
  echo "file '$PWD/$f'"
done > tools/clip-list.txt

# Stitch
ffmpeg -f concat -safe 0 -i tools/clip-list.txt -c copy assets/videos/hero/hero-reel-flux.mp4
```

## Reel shot list (lahrcarpetcleaning.com)

| Clip | Source still | Motion prompt |
|---|---|---|
| clip-01 | hero-carpet-cleaning | slow dolly forward across carpet |
| clip-02 | hero-stairs | slow pan upward along staircase |
| clip-03 | hero-upholstery | gentle push in toward sofa |
| clip-04 | hero-commercial | tracking shot down lobby |
| clip-05 | hero-floors | floor-level drift forward |
| clip-06 | hero-clean-result | rack focus across carpet fibers |

6 clips × ~4s = ~24 seconds total reel.
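
The frame-count rule and the duration and generation-time estimates in this document reduce to a few lines of arithmetic. This is a rough sketch for planning runs. The function names are hypothetical, and the 415 s/step default is the workstation figure quoted earlier, which is assumed to vary with frame count and hardware.

```python
def is_valid_length(length: int) -> bool:
    """True if length follows the 1, 5, 9, 13, ... pattern (1 + 4*n)."""
    return length >= 1 and (length - 1) % 4 == 0


def clip_seconds(length: int, fps: int = 12) -> float:
    """Playback duration of one clip in seconds."""
    return length / fps


def generation_hours(clips: int = 1, steps: int = 20,
                     seconds_per_step: float = 415.0) -> float:
    """Estimated CPU generation time in hours for a number of clips."""
    return clips * steps * seconds_per_step / 3600


if __name__ == "__main__":
    for length in (25, 49, 50):
        print(length, is_valid_length(length), round(clip_seconds(length), 2))
    print("one clip (h):", round(generation_hours(1), 1))
    print("six clips (h):", round(generation_hours(6), 1))
    print("reel length (s):", round(6 * clip_seconds(49), 1))
```

At 49 frames and 12 fps each clip runs just over 4 seconds, so six clips give a reel of about 24.5 seconds, consistent with the ~24 second figure above.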