# 03 — Divi Content Extraction

Parse raw Divi page content from `pages.json` into clean, structured sections
(one JSON file per page, with an HTML snippet per module) ready to map into AM templates.

## Divi 4 vs Divi 5 — critical difference

### Divi 4 (shortcode-based)

Content is stored as shortcodes in `wp_posts.post_content`:

```text
[et_pb_section fb_built="1" admin_label="Hero" _builder_version="4.27.4"
  background_color="#0f5f53" ...]
  [et_pb_row ...]
    [et_pb_column type="4_4" ...]
      [et_pb_text ...]<h1>Move With Intention</h1>[/et_pb_text]
      [et_pb_button button_url="/contact" button_text="Book a Class" /]
    [/et_pb_column]
  [/et_pb_row]
[/et_pb_section]
```

Use `extract_divi4.py` → parses shortcode tree into section/row/module JSON.
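
To illustrate what "parses shortcode tree" means, here is a minimal sketch of a stack-based shortcode parser. It is not the code in `extract_divi4.py`; the function name `parse_shortcodes` and the node shape (`tag`, `attrs`, `children`, `html`) are assumptions made for this example, and the regex assumes attribute values never contain `]`.

```python
import re

# Matches opening, self-closing and closing et_pb_* shortcode tags.
TAG_RE = re.compile(r'\[(/?)(et_pb_\w+)([^\]]*?)(/?)\]')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_shortcodes(content):
    """Return the top-level shortcode nodes found in `content` (hypothetical helper)."""
    root = {"tag": "root", "attrs": {}, "children": [], "html": ""}
    stack = [root]
    pos = 0
    for m in TAG_RE.finditer(content):
        # Text between two tags belongs to the innermost open shortcode.
        stack[-1]["html"] += content[pos:m.start()].strip()
        pos = m.end()
        closing, tag, raw_attrs, self_closing = m.groups()
        if closing:
            # Pop back to the matching opener; ignore stray closing tags.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i]["tag"] == tag:
                    del stack[i:]
                    break
            continue
        node = {
            "tag": tag,
            "attrs": dict(ATTR_RE.findall(raw_attrs)),
            "children": [],
            "html": "",
        }
        stack[-1]["children"].append(node)
        if not self_closing:
            stack.append(node)
    return root["children"]
```

Each section node then holds its rows in `children`, each row its columns, and each column its modules, which mirrors the section/row/module nesting of the shortcodes.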

### Divi 5 (block-based)

Content is stored as Gutenberg-style block comments:

```html
<!-- wp:divi/section {"id":"section-abc123","attrs":{"backgroundColor":{"value":"#0f5f53"}}} -->
<div class="et_pb_section ...">
  <!-- wp:divi/row ... -->
    <!-- wp:divi/column ... -->
      <!-- wp:divi/text ... -->
        <div class="et_pb_text_inner"><h1>Move With Intention</h1></div>
      <!-- /wp:divi/text -->
    <!-- /wp:divi/column -->
  <!-- /wp:divi/row -->
</div>
<!-- /wp:divi/section -->
```

Use `extract_divi5.py` → strips block wrapper, extracts inner HTML per module.
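
As a rough picture of the block-comment parsing, the sketch below walks the `<!-- wp:divi/... -->` delimiters and builds a nested tree. It is an illustration only, not the contents of `extract_divi5.py`; `parse_blocks` and the node keys are hypothetical names, and the regex assumes the attribute JSON never contains a literal `-->`.

```python
import json
import re

# Opening, closing and void block delimiters, with optional JSON attributes.
BLOCK_RE = re.compile(
    r'<!--\s*(/?)wp:divi/([a-z0-9_-]+)\s*(\{.*?\})?\s*(/?)-->', re.S)


def parse_blocks(content):
    """Return the top-level Divi blocks found in `content` (hypothetical helper)."""
    root = {"name": "root", "attrs": {}, "children": [], "html": ""}
    stack = [root]
    pos = 0
    for m in BLOCK_RE.finditer(content):
        # Markup between two delimiters belongs to the innermost open block.
        stack[-1]["html"] += content[pos:m.start()].strip()
        pos = m.end()
        closing, name, raw_json, void = m.groups()
        if closing:
            if len(stack) > 1 and stack[-1]["name"] == name:
                stack.pop()
            continue
        node = {
            "name": name,
            "attrs": json.loads(raw_json) if raw_json else {},
            "children": [],
            "html": "",
        }
        stack[-1]["children"].append(node)
        if not void:
            stack.append(node)
    return root["children"]
```

Container blocks (section, row, column) end up with only wrapper `<div>` markup in `html`; the content worth keeping sits in the `html` of leaf modules such as `text`.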

## Divi 5 extraction script

```bash
python3 /home/sirdrez/arisingmedia-websites/.am-webdesign-sops/wp-divi-pipeline/scripts/extract_divi5.py \
  {domain}/.planning/data/pages.json \
  {domain}/.planning/data/content/
```

Produces one JSON file per page: `content/{slug}.json`

```json
{
  "slug": "about",
  "title": "About VibrantYou Yoga",
  "seo_title": "About VibrantYou Yoga | ...",
  "seo_description": "...",
  "sections": [
    {
      "type": "hero",
      "background_color": "#0f5f53",
      "modules": [
        { "module": "text",   "html": "<h1>Move With Intention</h1>" },
        { "module": "button", "text": "Book a Class", "url": "/contact/" }
      ]
    },
    {
      "type": "standard",
      "modules": [
        { "module": "text", "html": "<h2>Our Story</h2><p>...</p>" },
        { "module": "image", "src": "/assets/images/studio.webp", "alt": "..." }
      ]
    }
  ]
}
```

## ACF fields take priority

If a page has ACF fields (in `pages.json[].acf`), use those over block content.
ACF fields are typically cleaner, pre-authored copy without Divi wrapper noise.

Convention for VYY-specific ACF keys:
- `vyy_hero_headline` → `<h1>` in hero section
- `vyy_hero_subhead`  → `<p class="hero-lead">` in hero
- `vyy_hero_cta_text` → primary CTA button label
- `vyy_hero_cta_url`  → primary CTA button href

Always check `acf` keys before parsing `content_raw`.
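
The priority rule can be sketched as a small lookup that prefers a non-empty ACF value and only then falls back to parsed content. The function name `hero_from_page` and the shape of the `parsed_hero` fallback dict are assumptions for this example; only the four `vyy_hero_*` keys come from the convention above.

```python
# Maps the documented ACF keys to (hypothetical) hero field names.
ACF_HERO_KEYS = {
    "vyy_hero_headline": "headline",
    "vyy_hero_subhead": "subhead",
    "vyy_hero_cta_text": "cta_text",
    "vyy_hero_cta_url": "cta_url",
}


def hero_from_page(page, parsed_hero=None):
    """Build hero fields, preferring ACF values over parsed block content."""
    acf = page.get("acf") or {}
    fallback = parsed_hero or {}
    hero = {}
    for acf_key, field in ACF_HERO_KEYS.items():
        value = acf.get(acf_key)
        if isinstance(value, str) and value.strip():
            hero[field] = value.strip()
        else:
            # Empty or missing ACF value: use whatever the block parser found.
            hero[field] = fallback.get(field)
    return hero
```

Treating blank strings as missing matters here, because an ACF field that was created but never filled in would otherwise override real block content with an empty value.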

## Stripping Divi class/attribute noise

After extraction, run every HTML snippet through the `clean_divi_html()`
function from `divi_to_html.py`:

```python
from divi_to_html import clean_divi_html, rewrite_internal_links

cleaned = clean_divi_html(raw_html)
cleaned = rewrite_internal_links(cleaned, staging_hosts=("vibrantyou.yoga",))
```

This removes:
- `<!-- wp:divi/... -->` block comments
- `data-et-*`, `data-builder-*` attributes
- `et_pb_*`, `divi-builder-*`, `d5_*` class tokens
- Empty `class=""` attributes
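
For orientation, the four removals listed above could look roughly like this with regular expressions. This is a hedged sketch, not the implementation of `clean_divi_html()` in `divi_to_html.py`; the name `strip_divi_noise` is hypothetical, and a real HTML parser would be more robust against unusual quoting.

```python
import re

COMMENT_RE = re.compile(r'<!--\s*/?wp:divi/.*?-->', re.S)
DATA_ATTR_RE = re.compile(r'\s+data-(?:et|builder)-[\w-]*(?:="[^"]*")?')
CLASS_RE = re.compile(r'(?<![\w-])class="([^"]*)"')
NOISE_TOKEN_RE = re.compile(r'^(?:et_pb_|divi-builder-|d5_)')


def strip_divi_noise(html):
    """Remove Divi block comments, data attributes and class tokens (sketch)."""
    html = COMMENT_RE.sub("", html)
    html = DATA_ATTR_RE.sub("", html)

    def keep_clean_tokens(match):
        # Keep only class tokens that do not carry a Divi prefix.
        tokens = [t for t in match.group(1).split()
                  if not NOISE_TOKEN_RE.match(t)]
        return 'class="%s"' % " ".join(tokens)

    html = CLASS_RE.sub(keep_clean_tokens, html)
    # Drop class attributes that are now empty.
    html = re.sub(r'\s+class=""', "", html)
    return html.strip()
```

The order matters: class tokens are filtered first, and only afterwards are the resulting empty `class=""` attributes removed.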

## What to extract per module type

| Divi module | Extract | Map to AM element |
|-------------|---------|-------------------|
| `divi/text` | inner HTML | `<section>`, `<p>`, headings as-is |
| `divi/button` | `text`, `url` | `<a class="btn-primary">` |
| `divi/image` | `src`, `alt`, `title` | `<img>` → rewrite to WebP path |
| `divi/blurb` | icon, title, body | `.am-card` component |
| `divi/testimonial` | quote, author, company | `.am-testimonial` component |
| `divi/video` | `src`, poster | `<video>` or YouTube embed |
| `divi/contact_form` | field list | → replace with AM form, see `08` |
| `divi/accordion` | Q+A pairs | `<details><summary>` |
| `divi/fullwidth_header` | title, subhead, CTA | hero section |
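
A few rows of the table can be expressed as a dispatch function over the extracted module dicts. This is a sketch under assumptions: the `text`, `button` and `image` shapes follow the `content/{slug}.json` example, while the accordion shape (`items` with `question` and `answer`) and the name `render_module` are invented for illustration.

```python
from html import escape


def render_module(mod):
    """Render one extracted module dict as AM markup (partial sketch)."""
    kind = mod.get("module")
    if kind == "text":
        # Inner HTML passes through as-is (already cleaned upstream).
        return mod["html"]
    if kind == "button":
        return '<a class="btn-primary" href="{}">{}</a>'.format(
            escape(mod["url"]), escape(mod["text"]))
    if kind == "image":
        return '<img src="{}" alt="{}">'.format(
            escape(mod["src"]), escape(mod.get("alt", "")))
    if kind == "accordion":
        # Hypothetical shape: {"items": [{"question": ..., "answer": ...}]}
        return "\n".join(
            "<details><summary>{}</summary>{}</details>".format(
                escape(item["question"]), item["answer"])
            for item in mod["items"])
    # Fail loudly so an unmapped module is never dropped silently.
    raise ValueError("no AM mapping for module %r" % kind)
```

Raising on unknown module types is a deliberate choice in this sketch: a missing mapping shows up during the build instead of as content quietly missing from the page.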

## Section background colors → AM section modifiers

Divi 5 stores `backgroundColor` in the block `attrs` JSON.
Map to AM CSS modifier classes:

| Divi background | AM class modifier |
|----------------|------------------|
| `#0f5f53` (dark teal) | `.section--dark` |
| `#1a8a7a` (mid teal)  | `.section--brand` |
| `#f5f5f5` / `#fafafa` | `.section--light` |
| `#ffffff` / none       | `.section--white` |
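
The mapping table translates directly into a lookup. The sketch below normalizes case and three-digit shorthand before looking up the color; the function name `section_modifier` is an assumption, and returning `None` for an unmapped color is a choice made here so that unknown backgrounds get a manual decision instead of a silent default.

```python
SECTION_MODIFIERS = {
    "#0f5f53": "section--dark",
    "#1a8a7a": "section--brand",
    "#f5f5f5": "section--light",
    "#fafafa": "section--light",
    "#ffffff": "section--white",
}


def section_modifier(background):
    """Return the AM modifier class for a Divi background color, or None."""
    if not background:
        # No background set maps to the white section variant.
        return "section--white"
    color = background.strip().lower()
    if len(color) == 4 and color.startswith("#"):
        # Expand shorthand such as #fff to #ffffff.
        color = "#" + "".join(ch * 2 for ch in color[1:])
    return SECTION_MODIFIERS.get(color)
```

For a Divi 5 section, the input would be the `value` under `attrs.backgroundColor` in the block JSON shown earlier.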

## Content quality pass (required before HTML build)

After extraction, work through every page's content:

1. **Cut bloated copy** — WordPress sites often have 3x more text than needed.
   Target 30-50% reduction. One clear idea per paragraph.
2. **Remove stale metrics** — "Over 500 students" only stays if it's verifiable.
   Otherwise remove or mark `DRAFT NEEDED`.
3. **Remove plugin artifacts** — Gravity Forms shortcodes `[gravityforms id="1"]`,
   Events Manager tags, Divi shortcode residue that survived extraction.
4. **Improve CTAs** — Replace generic "Learn More" with action-specific text:
   "Book a Free Class", "View the Schedule", "Start Your Practice".
5. **Flag images** — Note every `<img>` that needs a real photo vs stock.
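
Items 3 to 5 can be partly pre-screened by a script before the manual review. The following is a heuristic sketch that reads one `content/{slug}.json` structure; `lint_page` is a hypothetical helper, the shortcode regex will also flag harmless bracketed words, and the generic CTA list beyond "Learn More" is an assumption.

```python
import re

# Anything that looks like a leftover [shortcode ...] or [/shortcode].
SHORTCODE_RE = re.compile(r'\[/?[A-Za-z][\w-]*(?:\s[^\]]*)?\]')
# "learn more" comes from the checklist; the other entries are assumptions.
GENERIC_CTAS = {"learn more", "read more", "click here"}


def lint_page(page):
    """Return (location, issue, detail) tuples for manual review."""
    findings = []
    for s_idx, section in enumerate(page.get("sections", [])):
        for m_idx, mod in enumerate(section.get("modules", [])):
            where = "section %d, module %d" % (s_idx, m_idx)
            for code in SHORTCODE_RE.findall(mod.get("html", "")):
                findings.append((where, "shortcode residue", code))
            kind = mod.get("module")
            if kind == "button":
                label = mod.get("text", "")
                if label.strip().lower() in GENERIC_CTAS:
                    findings.append((where, "generic CTA", label))
            if kind == "image":
                # Every image is listed; photo vs stock is a human call.
                findings.append((where, "image to review", mod.get("src", "")))
    return findings
```

The script only surfaces candidates. Cutting copy and checking metrics (items 1 and 2) still need an editor reading the page.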

## Next step

Proceed to `04-design-system-extraction.md` to convert Divi theme settings
into AM CSS custom properties, then `05-content-migration.md` to build the
HTML templates.