recent updates

2026-06-09 18:31:59 +02:00
parent 398b94965c
commit 94f7a1f72a
42 changed files with 8686 additions and 0 deletions
@@ -0,0 +1,159 @@
+# 03 — Wan 2.2 Video Pipeline (Image-to-Video)
+
+## Default policy: local generation
+
+Video generation is done locally with Wan 2.2 by default. Google Veo (via
+Vertex AI / Gemini API) is NOT used unless the client has explicit budget
+allocated for it. Reasons:
+
+- Google Veo is billed per second of generated video, so every request costs money
+- Local Wan 2.2 is free after a one-time model download (~10GB total)
+- Quality from Wan 2.2 at 832x480 is sufficient for hero reels
+- No API key, no quota limits, no vendor dependency
+
+Use Google Veo only when: client approves a paid media budget, OR the local
+workstation is unavailable and a deadline cannot wait for CPU generation time.
+
+## Purpose
+
+Takes FLUX-generated hero stills and animates each into a 3-5 second clip.
+Clips are stitched with ffmpeg into a marketing reel for the hero section.
+
+## Model stack
+
+| File | Size | Notes |
+|---|---|---|
+| Wan2.2-TI2V-5B-Q4_K_M.gguf | 3.2GB | Text+Image to Video, 5B Q4 GGUF |
+| umt5_xxl_fp8_e4m3fn_scaled.safetensors | 6.3GB | UMT5-XXL text encoder, fp8 |
+| wan_2.1_vae.safetensors | 243MB | Wan VAE (compatible with 2.2) |
+
+## Download (one-time, all public)
+
+```bash
+# Wan 2.2 model
+wget "https://huggingface.co/QuantStack/Wan2.2-TI2V-5B-GGUF/resolve/main/Wan2.2-TI2V-5B-Q4_K_M.gguf" \
+  -O ~/ComfyUI/models/diffusion_models/Wan2.2-TI2V-5B-Q4_K_M.gguf
+
+# Text encoder
+wget "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors" \
+  -O ~/ComfyUI/models/clip/umt5_xxl_fp8_e4m3fn_scaled.safetensors
+
+# VAE
+wget "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors" \
+  -O ~/ComfyUI/models/vae/wan_2.1_vae.safetensors
+```
+
+## Critical: WanImageToVideo is a conditioning node, NOT a sampler
+
+This is the most important thing to understand about the Wan pipeline. The node
+name is misleading. `WanImageToVideo` does NOT run diffusion — it sets up the
+conditioning and empty latent. A separate `KSampler` runs the actual diffusion.
+
+Wrong mental model (what most tutorials imply):
+```
+LoadImage → WanImageToVideo → SaveAnimatedWEBP
+```
+
+Correct node graph:
+```
+UnetLoaderGGUF  ─────────────────────────────────────→ KSampler.model
+CLIPLoader ──→ CLIPTextEncode (positive) ─→ WanImageToVideo.positive ──→ KSampler.positive
+           └→ CLIPTextEncode (negative) ─→ WanImageToVideo.negative ──→ KSampler.negative
+VAELoader ──→ WanImageToVideo.vae                  WanImageToVideo.latent ──→ KSampler.latent_image
+LoadImage ──→ WanImageToVideo.start_image (optional)
+                                                   KSampler.samples ──→ VAEDecode ──→ SaveAnimatedWEBP
+```
+
+WanImageToVideo outputs three things (in order):
+- output[0] = positive CONDITIONING (enhanced with image)
+- output[1] = negative CONDITIONING
+- output[2] = latent LATENT (sized for video: width × height × frames)
+
+The `start_image` input (optional IMAGE) anchors the first frame. Without it,
+video starts from noise. Always pass it for image-to-video.
+
+## Workflow
+
+Correct ComfyUI API node graph (as sent by `gen-video-wan.py`):
+
+```
+node 1: UnetLoaderGGUF    → Wan2.2-TI2V-5B-Q4_K_M.gguf
+node 2: CLIPLoader        → umt5_xxl_fp8_e4m3fn_scaled.safetensors (type=wan)
+node 3: VAELoader         → wan_2.1_vae.safetensors
+node 4: LoadImage         → FLUX hero still (.webp)
+node 5: CLIPTextEncode    → motion prompt text (positive)
+node 6: CLIPTextEncode    → negative prompt text
+node 7: WanImageToVideo   → positive=[5,0], negative=[6,0], vae=[3,0],
+                            start_image=[4,0], width=832, height=480,
+                            length=25 (or 49), batch_size=1
+node 8: KSampler          → model=[1,0], positive=[7,0], negative=[7,1],
+                            latent_image=[7,2], steps=20, cfg=6.0,
+                            sampler_name=uni_pc, scheduler=simple, denoise=1.0
+node 9: VAEDecode         → samples=[8,0], vae=[3,0]
+node 10: SaveAnimatedWEBP → images=[9,0], fps=12
+```
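+
+The listing above maps onto ComfyUI's API prompt format, where each node is keyed by its ID and a link is written as `[source_node_id, output_index]`. The sketch below is not the contents of `gen-video-wan.py`; it is a minimal illustration. The input names (`unet_name`, `clip_name`, `vae_name`, `filename_prefix`, `lossless`, `quality`, `method`) are assumed from the stock ComfyUI and ComfyUI-GGUF node definitions, and the function name, seed, image file name, and negative prompt are hypothetical.
+
+```python
+import json
+
+
+def build_wan_workflow(image_name, positive_text, negative_text, length=25, seed=0):
+    """Return a ComfyUI API-format prompt dict for the ten-node Wan graph."""
+    return {
+        "1": {"class_type": "UnetLoaderGGUF",
+              "inputs": {"unet_name": "Wan2.2-TI2V-5B-Q4_K_M.gguf"}},
+        "2": {"class_type": "CLIPLoader",
+              "inputs": {"clip_name": "umt5_xxl_fp8_e4m3fn_scaled.safetensors",
+                         "type": "wan"}},
+        "3": {"class_type": "VAELoader",
+              "inputs": {"vae_name": "wan_2.1_vae.safetensors"}},
+        "4": {"class_type": "LoadImage", "inputs": {"image": image_name}},
+        "5": {"class_type": "CLIPTextEncode",
+              "inputs": {"text": positive_text, "clip": ["2", 0]}},
+        "6": {"class_type": "CLIPTextEncode",
+              "inputs": {"text": negative_text, "clip": ["2", 0]}},
+        # Conditioning node: emits positive, negative, and the empty video latent.
+        "7": {"class_type": "WanImageToVideo",
+              "inputs": {"positive": ["5", 0], "negative": ["6", 0], "vae": ["3", 0],
+                         "start_image": ["4", 0], "width": 832, "height": 480,
+                         "length": length, "batch_size": 1}},
+        # The sampler consumes all three outputs of node 7.
+        "8": {"class_type": "KSampler",
+              "inputs": {"model": ["1", 0], "positive": ["7", 0], "negative": ["7", 1],
+                         "latent_image": ["7", 2], "seed": seed, "steps": 20, "cfg": 6.0,
+                         "sampler_name": "uni_pc", "scheduler": "simple", "denoise": 1.0}},
+        "9": {"class_type": "VAEDecode", "inputs": {"samples": ["8", 0], "vae": ["3", 0]}},
+        "10": {"class_type": "SaveAnimatedWEBP",
+               "inputs": {"images": ["9", 0], "filename_prefix": "clip", "fps": 12,
+                          "lossless": False, "quality": 80, "method": "default"}},
+    }
+
+
+workflow = build_wan_workflow("hero-stairs.webp", "slow pan upward along staircase", "blurry")
+print(json.dumps(workflow["8"]["inputs"], indent=2))
+```
+
+Note that node 8 takes `negative` and `latent_image` from outputs 1 and 2 of node 7, not from node 6 or a separate empty-latent node.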
+
+## Settings
+
+| Setting | Value |
+|---|---|
+| Resolution | 832×480 (~16:9, 480p) |
+| Frames | 49 (~4 seconds at 12fps) |
+| Steps | 20 |
+| CFG | 6.0 |
+| Sampler | uni_pc |
+
+**Frame count constraint:** `length` must follow the pattern 1, 5, 9, 13, 17, 21, 25, 29 ... (step of 4).
+ComfyUI enforces this. 49 is valid (1 + 4×12). 50 is not.
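+
+The constraint is equivalent to `length = 4n + 1`. A small helper along these lines could check a requested frame count before it is sent, or round it down to the nearest valid value. The function names are illustrative, not part of ComfyUI or `gen-video-wan.py`.
+
+```python
+def is_valid_length(length):
+    """True when length has the form 4n + 1 (1, 5, 9, ...)."""
+    return length >= 1 and (length - 1) % 4 == 0
+
+
+def snap_length(length):
+    """Round down to the nearest valid frame count, never below 1."""
+    return max(1, length - (length - 1) % 4)
+
+
+def clip_seconds(length, fps=12):
+    """Playback duration of a clip at the given frame rate."""
+    return length / fps
+
+
+print(is_valid_length(49))  # True
+print(is_valid_length(50))  # False
+print(snap_length(50))      # 49
+print(round(clip_seconds(49), 2))
+```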
+
+**CPU speed on Arising Media workstation (2GB VRAM, CPU inference):**
+- ~415 seconds per diffusion step
+- 20 steps × 415s = ~2.3 hours per clip
+- 6 clips = ~14 hours total for a full reel
+- Use 25 frames (not 49) for test runs to roughly halve generation time
+- Full reel generation: start before leaving for the day, check next morning
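+
+The arithmetic above can be reproduced with a short estimate. This is a sketch that assumes the measured ~415 s per step holds for every step and every clip, and it ignores model loading and VAE decode time, so treat the result as a lower bound. The function name is illustrative.
+
+```python
+def estimate_hours(clips, steps=20, seconds_per_step=415):
+    """Rough sampling time in hours for a number of clips."""
+    return clips * steps * seconds_per_step / 3600
+
+
+print(round(estimate_hours(1), 1))  # one clip
+print(round(estimate_hours(6), 1))  # full six-clip reel
+```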
+
+**CLIPVision note:** No CLIPVision models are installed at `~/ComfyUI/models/clip_vision/`.
+The `clip_vision_output` input on WanImageToVideo is optional and currently unused.
+Image conditioning comes from `start_image` only (VAE-encoded first frame).
+This is sufficient for smooth motion — CLIPVision would add semantic image
+understanding but is not required.
+
+## Running video generation
+
+```bash
+# ComfyUI must be running, and FLUX images must be converted to WebP first
+cd /home/sirdrez/arisingmedia-websites/{domain}
+python3 tools/gen-video-wan.py 2>&1 | tee tools/wan-gen.log
+```
+
+Output goes to `assets/videos/clips/` as `.webp` animation files.
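+
+For orientation, queuing a workflow against a running ComfyUI instance comes down to one HTTP POST to its `/prompt` endpoint with the workflow under a `prompt` key. The sketch below uses only the standard library. It is not taken from `gen-video-wan.py`, and the address `127.0.0.1:8188` is assumed to be the ComfyUI default for this workstation.
+
+```python
+import json
+import urllib.request
+
+COMFYUI_URL = "http://127.0.0.1:8188"  # assumed default ComfyUI address
+
+
+def build_prompt_request(workflow, base_url=COMFYUI_URL):
+    """Build (but do not send) the POST /prompt request that queues a workflow."""
+    payload = json.dumps({"prompt": workflow}).encode("utf-8")
+    return urllib.request.Request(
+        f"{base_url}/prompt",
+        data=payload,
+        headers={"Content-Type": "application/json"},
+        method="POST",
+    )
+
+
+def queue_workflow(workflow, base_url=COMFYUI_URL):
+    """Send the request and return the parsed JSON response from ComfyUI."""
+    with urllib.request.urlopen(build_prompt_request(workflow, base_url)) as resp:
+        return json.loads(resp.read())
+```
+
+Splitting request construction from sending keeps the payload easy to inspect or log before a multi-hour run is queued.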
+
+## Stitching the reel
+
+```bash
+# Convert each WebP animation to MP4 with identical encoder settings,
+# so the concat step below can stream-copy without re-encoding.
+# If your ffmpeg build cannot decode animated WebP, convert the clips
+# with another tool first.
+for f in assets/videos/clips/*.webp; do
+  ffmpeg -y -i "$f" -pix_fmt yuv420p "${f%.webp}.mp4"
+done
+
+# Build the concat list from the MP4 files (the glob expands in sorted order)
+for f in assets/videos/clips/*.mp4; do
+  echo "file '$PWD/$f'"
+done > tools/clip-list.txt
+
+# Stitch
+mkdir -p assets/videos/hero
+ffmpeg -y -f concat -safe 0 -i tools/clip-list.txt -c copy assets/videos/hero/hero-reel-flux.mp4
+```
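+
+If clip paths may contain spaces or apostrophes, the list file is safer to generate with explicit quoting. The following is a minimal sketch of that step, not part of the existing tooling. The function name is illustrative, and the escaping follows ffmpeg's single-quote rule (close the quote, emit `\'`, reopen).
+
+```python
+from pathlib import Path
+
+
+def concat_list(clip_dir, suffix=".mp4"):
+    """Return concat-demuxer lines for every clip in clip_dir, sorted by name."""
+    lines = []
+    for clip in sorted(Path(clip_dir).glob(f"*{suffix}")):
+        # Escape embedded single quotes so ffmpeg parses the path as one token.
+        escaped = str(clip.resolve()).replace("'", "'\\''")
+        lines.append(f"file '{escaped}'\n")
+    return "".join(lines)
+
+
+if __name__ == "__main__":
+    Path("tools").mkdir(exist_ok=True)
+    Path("tools/clip-list.txt").write_text(concat_list("assets/videos/clips"))
+```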
+
+## Reel shot list (lahrcarpetcleaning.com)
+
+| Clip | Source still | Motion prompt |
+|---|---|---|
+| clip-01 | hero-carpet-cleaning | slow dolly forward across carpet |
+| clip-02 | hero-stairs | slow pan upward along staircase |
+| clip-03 | hero-upholstery | gentle push in toward sofa |
+| clip-04 | hero-commercial | tracking shot down lobby |
+| clip-05 | hero-floors | floor-level drift forward |
+| clip-06 | hero-clean-result | rack focus across carpet fibers |
+
+6 clips × ~4s = ~24 seconds total reel.