# 03 — Divi Content Extraction
Parse raw Divi page content from `pages.json` into clean, structured HTML
sections ready to map into AM templates.
## Divi 4 vs Divi 5 — critical difference
### Divi 4 (shortcode-based)
Content is stored as shortcodes in `wp_posts.post_content`:
```
[et_pb_section fb_built="1" admin_label="Hero" _builder_version="4.27.4"
background_color="#0f5f53" ...]
[et_pb_row ...]
[et_pb_column type="4_4" ...]
[et_pb_text ...]<h1>Move With Intention</h1>[/et_pb_text]
[et_pb_button button_url="/contact" button_text="Book a Class" /]
[/et_pb_column]
[/et_pb_row]
[/et_pb_section]
```
Use `extract_divi4.py` → parses shortcode tree into section/row/module JSON.
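
To make the shortcode-tree idea concrete, here is a minimal sketch of how nested `et_pb_*` shortcodes could be parsed. It is not the code in `extract_divi4.py`; the function name `parse_divi4` and the node shape (`tag`, `attrs`, `children`, `html`) are assumptions made for illustration, and the regex assumes attribute values never contain `]`.

```python
import re

# Opening, self-closing, or closing et_pb_* shortcode.
SHORTCODE_RE = re.compile(r'\[(/?)(et_pb_\w+)([^\]]*?)(/?)\]', re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_divi4(content: str) -> list:
    """Parse Divi 4 shortcodes into a nested list of node dicts (hypothetical shape)."""
    root = {"tag": "root", "attrs": {}, "children": [], "html": ""}
    stack = [root]
    pos = 0
    for m in SHORTCODE_RE.finditer(content):
        # Text between two shortcodes belongs to the innermost open node.
        stack[-1]["html"] += content[pos:m.start()].strip()
        pos = m.end()
        closing, tag, raw_attrs, self_closing = m.groups()
        if closing:
            # Close the current node only if it matches; ignore stray closers.
            if len(stack) > 1 and stack[-1]["tag"] == tag:
                stack.pop()
            continue
        node = {
            "tag": tag,
            "attrs": dict(ATTR_RE.findall(raw_attrs)),
            "children": [],
            "html": "",
        }
        stack[-1]["children"].append(node)
        if not self_closing:
            stack.append(node)
    return root["children"]
```

Each returned node carries its own attributes and inner HTML, so the section/row/module JSON can be produced by walking `children` recursively.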
### Divi 5 (block-based)
Content is stored as Gutenberg-style block comments:
```html
<!-- wp:divi/section {"id":"section-abc123","attrs":{"backgroundColor":{"value":"#0f5f53"}}} -->
<div class="et_pb_section ...">
<!-- wp:divi/row ... -->
<!-- wp:divi/column ... -->
<!-- wp:divi/text ... -->
<div class="et_pb_text_inner"><h1>Move With Intention</h1></div>
<!-- /wp:divi/text -->
<!-- /wp:divi/column -->
<!-- /wp:divi/row -->
</div>
<!-- /wp:divi/section -->
```
Use `extract_divi5.py` → strips block wrapper, extracts inner HTML per module.
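
The block-comment approach can be sketched as follows. This is an illustration under the assumption that the serialized format matches the sample above (`<!-- wp:divi/NAME {json} -->` ... `<!-- /wp:divi/NAME -->`); it is not the code in `extract_divi5.py`, and `extract_divi5_modules` and `CONTAINERS` are hypothetical names.

```python
import json
import re

# Opening, self-closing, or closing divi block comment, with optional JSON attrs.
BLOCK_RE = re.compile(
    r'<!--\s*(/?)wp:divi/([a-z0-9_-]+)\s*(\{.*?\})?\s*(/?)-->', re.DOTALL
)

# Structural wrappers whose inner HTML is not emitted as a module (assumption).
CONTAINERS = {"section", "row", "column"}


def extract_divi5_modules(content: str) -> list:
    """Return leaf modules with their decoded attrs and inner HTML."""
    modules = []
    stack = []  # open blocks as (name, attrs, index where inner content starts)
    for m in BLOCK_RE.finditer(content):
        closing, name, raw_attrs, self_closing = m.groups()
        if closing:
            if stack and stack[-1][0] == name:
                _, attrs, start = stack.pop()
                if name not in CONTAINERS:
                    inner = content[start:m.start()].strip()
                    modules.append({"module": name, "attrs": attrs, "html": inner})
            continue
        attrs = json.loads(raw_attrs) if raw_attrs else {}
        if self_closing:
            modules.append({"module": name, "attrs": attrs, "html": ""})
        else:
            stack.append((name, attrs, m.end()))
    return modules
```

The inner HTML still contains Divi wrapper markup such as `et_pb_text_inner`, which is why a separate cleaning step is needed afterwards.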
## Divi 5 extraction script
```bash
python3 /home/sirdrez/arisingmedia-websites/.am-webdesign-sops/wp-divi-pipeline/scripts/extract_divi5.py \
{domain}/.planning/data/pages.json \
{domain}/.planning/data/content/
```
Produces one JSON file per page: `content/{slug}.json`
```json
{
"slug": "about",
"title": "About VibrantYou Yoga",
"seo_title": "About VibrantYou Yoga | ...",
"seo_description": "...",
"sections": [
{
"type": "hero",
"background_color": "#0f5f53",
"modules": [
{ "module": "text", "html": "<h1>Move With Intention</h1>" },
{ "module": "button", "text": "Book a Class", "url": "/contact/" }
]
},
{
"type": "standard",
"modules": [
{ "module": "text", "html": "<h2>Our Story</h2><p>...</p>" },
{ "module": "image", "src": "/assets/images/studio.webp", "alt": "..." }
]
}
]
}
```
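
Before handing these files to the next step, a quick shape check can catch truncated or malformed output. The sketch below assumes that `slug`, `title`, and `sections` are required and that every section has a `type` and every module a `module` name; those requirements are inferred from the sample above, not from a documented schema, and `validate_page` is a hypothetical helper.

```python
# Keys assumed to be mandatory, based on the sample JSON only.
REQUIRED_PAGE_KEYS = ("slug", "title", "sections")


def validate_page(page: dict) -> list:
    """Return a list of human-readable problems; an empty list means the shape looks fine."""
    problems = [f"missing key: {key}" for key in REQUIRED_PAGE_KEYS if key not in page]
    for i, section in enumerate(page.get("sections", [])):
        if "type" not in section:
            problems.append(f"section {i}: missing type")
        for j, module in enumerate(section.get("modules", [])):
            if "module" not in module:
                problems.append(f"section {i} module {j}: missing module name")
    return problems
```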
## ACF fields take priority
If a page has ACF fields (in `pages.json[].acf`), use those over block content.
ACF fields are typically cleaner, pre-authored copy without Divi wrapper noise.
Convention for VYY-specific ACF keys:
- `vyy_hero_headline` → `<h1>` in hero section
- `vyy_hero_subhead` → `<p class="hero-lead">` in hero section
- `vyy_hero_cta_text` → primary CTA button label
- `vyy_hero_cta_url` → primary CTA button href
Always check `acf` keys before parsing `content_raw`.
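
The priority rule can be sketched as a small merge step. The ACF key names come from the list above; the output field names (`headline`, `subhead`, `cta_text`, `cta_url`) and the helper `resolve_hero` are assumptions for illustration. The sketch also assumes an empty ACF payload may arrive as `[]` rather than `{}`.

```python
# ACF key -> hero field name (field names are hypothetical).
ACF_HERO_KEYS = {
    "vyy_hero_headline": "headline",
    "vyy_hero_subhead": "subhead",
    "vyy_hero_cta_text": "cta_text",
    "vyy_hero_cta_url": "cta_url",
}


def resolve_hero(page: dict, parsed_hero: dict) -> dict:
    """Start from parsed block content and let non-empty ACF values override it."""
    acf = page.get("acf")
    if not isinstance(acf, dict):
        acf = {}  # covers a missing key, None, or an empty list
    hero = dict(parsed_hero)
    for acf_key, field in ACF_HERO_KEYS.items():
        value = acf.get(acf_key)
        if isinstance(value, str) and value.strip():
            hero[field] = value.strip()
    return hero
```

Blank or whitespace-only ACF values are ignored, so the parsed block content remains the fallback for any field the client never filled in.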
## Stripping Divi class/attribute noise
After extraction, run every HTML snippet through the `clean_divi_html()`
function from `divi_to_html.py`:
```python
from divi_to_html import clean_divi_html, rewrite_internal_links
cleaned = clean_divi_html(raw_html)
cleaned = rewrite_internal_links(cleaned, staging_hosts=("vibrantyou.yoga",))
```
This removes:
- `<!-- wp:divi/... -->` block comments
- `data-et-*`, `data-builder-*` attributes
- `et_pb_*`, `divi-builder-*`, `d5_*` class tokens
- Empty `class=""` attributes
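
For reference, the four removals listed above could be approximated with regular expressions as below. This is a rough sketch, not the implementation of `clean_divi_html()`; the name `strip_divi_noise` is hypothetical, and regex-based HTML rewriting is a heuristic that assumes double-quoted attributes.

```python
import re

BLOCK_COMMENT_RE = re.compile(r'<!--\s*/?wp:divi/.*?-->', re.DOTALL)
DATA_ATTR_RE = re.compile(r'\s+data-(?:et|builder)-[\w-]*(?:="[^"]*")?')
CLASS_ATTR_RE = re.compile(r'\s+class="([^"]*)"')
NOISE_CLASS_RE = re.compile(r'^(?:et_pb_|divi-builder-|d5_)')


def _filter_classes(match):
    """Keep only non-Divi class tokens; drop the attribute if none remain."""
    kept = [c for c in match.group(1).split() if not NOISE_CLASS_RE.match(c)]
    if not kept:
        return ""
    return ' class="' + " ".join(kept) + '"'


def strip_divi_noise(html: str) -> str:
    html = BLOCK_COMMENT_RE.sub("", html)  # block comments
    html = DATA_ATTR_RE.sub("", html)      # data-et-*, data-builder-*
    html = CLASS_ATTR_RE.sub(_filter_classes, html)  # Divi tokens and empty class=""
    return html.strip()
```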
## What to extract per section type
| Divi module | Extract | Map to AM element |
|-------------|---------|-------------------|
| `divi/text` | inner HTML | `<section>`, `<p>`, headings as-is |
| `divi/button` | `text`, `url` | `<a class="btn-primary">` |
| `divi/image` | `src`, `alt`, `title` | `<img>` → rewrite to WebP path |
| `divi/blurb` | icon, title, body | `.am-card` component |
| `divi/testimonial` | quote, author, company | `.am-testimonial` component |
| `divi/video` | `src`, poster | `<video>` or YouTube embed |
| `divi/contact_form` | field list | → replace with AM form, see `08` |
| `divi/accordion` | Q+A pairs | `<details><summary>` |
| `divi/fullwidth_header` | title, subhead, CTA | hero section |
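
Two rows of the table, sketched as renderers: `divi/button` to `<a class="btn-primary">` and `divi/accordion` to `<details><summary>`. The function names and argument shapes are assumptions; the answer HTML is assumed to be already cleaned, so only plain-text values are escaped.

```python
from html import escape


def render_button(text: str, url: str) -> str:
    """Map a button module to the AM primary button link."""
    return f'<a class="btn-primary" href="{escape(url, quote=True)}">{escape(text)}</a>'


def render_accordion(items: list) -> str:
    """Map (question, answer_html) pairs to native disclosure elements."""
    parts = []
    for question, answer_html in items:
        parts.append(
            f"<details><summary>{escape(question)}</summary>{answer_html}</details>"
        )
    return "\n".join(parts)
```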
## Section background colors → AM section modifiers
Divi 5 stores `backgroundColor` in the block `attrs` JSON.
Map to AM CSS modifier classes:
| Divi background | AM class modifier |
|----------------|------------------|
| `#0f5f53` (dark teal) | `.section--dark` |
| `#1a8a7a` (mid teal) | `.section--brand` |
| `#f5f5f5` / `#fafafa` | `.section--light` |
| `#ffffff` / none | `.section--white` |
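
The mapping table translates directly into a lookup. In this sketch the attribute path `attrs.backgroundColor.value` follows the Divi 5 sample shown earlier; returning `None` for an unmapped color (so it can be reviewed by hand) is an assumption, as is the helper name `section_modifier`.

```python
from typing import Optional

BACKGROUND_MODIFIERS = {
    "#0f5f53": "section--dark",
    "#1a8a7a": "section--brand",
    "#f5f5f5": "section--light",
    "#fafafa": "section--light",
    "#ffffff": "section--white",
}


def section_modifier(block_attrs: dict) -> Optional[str]:
    """Return the AM modifier class for a section block, or None if the color is unmapped."""
    background = (block_attrs.get("attrs") or {}).get("backgroundColor") or {}
    value = background.get("value")
    if not value:
        return "section--white"  # no background set
    key = value.strip().lower()
    if len(key) == 4 and key.startswith("#"):
        # Expand shorthand such as #fff to #ffffff.
        key = "#" + "".join(ch * 2 for ch in key[1:])
    return BACKGROUND_MODIFIERS.get(key)
```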
## Content quality pass (required before HTML build)
After extraction, review every page's content for:
1. **Cut bloated copy** — WordPress sites often have 3x more text than needed.
Target 30-50% reduction. One clear idea per paragraph.
2. **Remove stale metrics** — "Over 500 students" only stays if it's verifiable.
Otherwise remove or mark `DRAFT NEEDED`.
3. **Remove plugin artifacts** — Gravity Forms shortcodes `[gravityforms id="1"]`,
Events Manager tags, Divi shortcode residue that survived extraction.
4. **Improve CTAs** — Replace generic "Learn More" with action-specific text:
"Book a Free Class", "View the Schedule", "Start Your Practice".
5. **Flag images** — Note every `<img>` that needs a real photo vs stock.
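
Parts of this pass can be pre-screened automatically before the manual review. The sketch below flags leftover shortcodes, generic CTA labels, and images to check, using the module shape from the extraction output. The helper `quality_flags` is hypothetical, the shortcode pattern is a heuristic that can also match bracketed words in ordinary copy, and copy length and stale metrics still need a human read.

```python
import re

# Anything that looks like [name ...] or [/name]; a heuristic, not a full parser.
SHORTCODE_RESIDUE_RE = re.compile(r'\[/?[a-z][a-z0-9_]*(?:\s[^\]]*)?\]', re.IGNORECASE)

# Generic labels to replace; extend as needed.
GENERIC_CTAS = {"learn more"}


def quality_flags(section: dict) -> list:
    """Return review notes for one extracted section."""
    flags = []
    for module in section.get("modules", []):
        kind = module.get("module")
        for residue in SHORTCODE_RESIDUE_RE.findall(module.get("html", "")):
            flags.append(f"shortcode residue: {residue}")
        if kind == "button" and module.get("text", "").strip().lower() in GENERIC_CTAS:
            flags.append(f"generic CTA: {module['text']}")
        if kind == "image":
            flags.append(f"check image: {module.get('src', '')}")
    return flags
```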
## Next step
Proceed to `04-design-system-extraction.md` to convert Divi theme settings
into AM CSS custom properties, then `05-content-migration.md` to build the
HTML templates.