# 03 — Divi Content Extraction
Parse raw Divi page content from `pages.json` into clean, structured HTML
sections ready to map into AM templates.
## Divi 4 vs Divi 5 — critical difference
### Divi 4 (shortcode-based)
Content is stored as shortcodes in `wp_posts.post_content`:
```
[et_pb_section fb_built="1" admin_label="Hero" _builder_version="4.27.4"
background_color="#0f5f53" ...]
[et_pb_row ...]
[et_pb_column type="4_4" ...]
[et_pb_text ...]
Move With Intention
[/et_pb_text]
[et_pb_button button_url="/contact" button_text="Book a Class" /]
[/et_pb_column]
[/et_pb_row]
[/et_pb_section]
```
Use `extract_divi4.py` → parses shortcode tree into section/row/module JSON.
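As a rough illustration of what "parsing the shortcode tree" involves, here is a minimal sketch. It is not the code in `extract_divi4.py`; the function name `parse_shortcodes` and the node shape are hypothetical. It assumes attribute values are double-quoted and never contain `]`.

```python
import re

# Matches an opening, self-closing, or closing et_pb_* shortcode tag.
TAG_RE = re.compile(r"\[(/?)(et_pb_\w+)([^\]]*?)(/?)\]")
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_shortcodes(content):
    """Build a nested list of {tag, attrs, children, text} nodes (illustrative only)."""
    root = {"tag": "root", "attrs": {}, "children": [], "text": ""}
    stack = [root]
    pos = 0
    for m in TAG_RE.finditer(content):
        # Text between tags belongs to the innermost open shortcode.
        stack[-1]["text"] += content[pos:m.start()].strip()
        pos = m.end()
        closing, tag, raw_attrs, self_closing = m.groups()
        if closing:
            # Pop only when the closer matches the innermost opener; ignore stray closers.
            if len(stack) > 1 and stack[-1]["tag"] == tag:
                stack.pop()
            continue
        node = {
            "tag": tag,
            "attrs": dict(ATTR_RE.findall(raw_attrs)),
            "children": [],
            "text": "",
        }
        stack[-1]["children"].append(node)
        if not self_closing:
            stack.append(node)
    return root["children"]
```

Each `et_pb_section` node then carries its rows in `children`, and leaf modules such as `et_pb_text` carry their copy in `text`.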
### Divi 5 (block-based)
Content is stored as Gutenberg-style block comments:
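```html
<!-- wp:divi/section {...} -->
<!-- wp:divi/text {...} -->
...
<!-- /wp:divi/text -->
<!-- /wp:divi/section -->
```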
Use `extract_divi5.py` → strips block wrapper, extracts inner HTML per module.
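The sketch below shows one way to strip block wrappers and collect inner HTML per module. It is not the code in `extract_divi5.py`. It assumes Divi 5 content follows the standard WordPress block delimiter grammar (`<!-- wp:name {json} -->`, `<!-- /wp:name -->`, and the void form `<!-- wp:name {json} /-->`) with module names such as `divi/text`; the function name `extract_modules` and the output shape are hypothetical.

```python
import json
import re

# One WordPress block delimiter: opener, closer, or void (self-closing) block.
BLOCK_RE = re.compile(
    r"<!--\s+(/)?wp:([a-z0-9-]+/[a-z0-9-]+)\s+(\{.*?\}\s+)?(/)?-->",
    re.DOTALL,
)


def extract_modules(content):
    """Return [{module, attrs, html}] for every leaf block, in document order."""
    modules = []
    stack = []  # currently open blocks, innermost last
    for m in BLOCK_RE.finditer(content):
        closing, name, raw_attrs, void = m.groups()
        if closing:
            if stack and stack[-1]["name"] == name:
                block = stack.pop()
                # Only leaf blocks hold module HTML; containers are skipped.
                if not block["has_children"]:
                    inner = content[block["start"]:m.start()].strip()
                    modules.append(
                        {"module": name, "attrs": block["attrs"], "html": inner}
                    )
            continue
        attrs = json.loads(raw_attrs) if raw_attrs else {}
        if stack:
            stack[-1]["has_children"] = True
        if void:
            modules.append({"module": name, "attrs": attrs, "html": ""})
        else:
            stack.append(
                {"name": name, "attrs": attrs, "start": m.end(), "has_children": False}
            )
    return modules
```

Container blocks (sections, rows, columns) are dropped here for brevity; a fuller version would keep them to group modules into sections.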
## Divi 5 extraction script
```bash
python3 /home/sirdrez/arisingmedia-websites/.am-webdesign-sops/wp-divi-pipeline/scripts/extract_divi5.py \
{domain}/.planning/data/pages.json \
{domain}/.planning/data/content/
```
Produces one JSON file per page: `content/{slug}.json`
```json
{
"slug": "about",
"title": "About VibrantYou Yoga",
"seo_title": "About VibrantYou Yoga | ...",
"seo_description": "...",
"sections": [
{
"type": "hero",
"background_color": "#0f5f53",
"modules": [
{ "module": "text", "html": "<h1>Move With Intention</h1>" },
{ "module": "button", "text": "Book a Class", "url": "/contact/" }
]
},
{
"type": "standard",
"modules": [
{ "module": "text", "html": "<h2>Our Story</h2><p>...</p>" },
{ "module": "image", "src": "/assets/images/studio.webp", "alt": "..." }
]
}
]
}
```
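Before mapping into templates, it can help to sanity-check each generated file against the shape shown above. The following is a minimal sketch; the helper names `check_page` and `check_directory`, and the choice of which keys count as required, are assumptions rather than part of the pipeline.

```python
import json
from pathlib import Path

# Assumed minimum set of top-level keys; adjust to what the templates need.
REQUIRED_KEYS = ("slug", "title", "sections")


def check_page(page):
    """Return a list of problems found in one parsed content/{slug}.json dict."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in page]
    for i, section in enumerate(page.get("sections", [])):
        if "type" not in section:
            problems.append(f"section {i}: missing type")
        for j, module in enumerate(section.get("modules", [])):
            if "module" not in module:
                problems.append(f"section {i} module {j}: missing module name")
    return problems


def check_directory(content_dir):
    """Map each JSON file name in content_dir to its list of problems."""
    return {
        path.name: check_page(json.loads(path.read_text(encoding="utf-8")))
        for path in sorted(Path(content_dir).glob("*.json"))
    }
```

An empty list for every file means the extraction output at least has the expected skeleton.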
## ACF fields take priority
If a page has ACF fields (in `pages.json[].acf`), use those over block content.
ACF fields are typically cleaner, pre-authored copy without Divi wrapper noise.
Convention for VYY-specific ACF keys:
- `vyy_hero_headline` → `<h1>` in hero section
- `vyy_hero_subhead` → `<p>` in hero
- `vyy_hero_cta_text` → primary CTA button label
- `vyy_hero_cta_url` → primary CTA button href
Always check `acf` keys before parsing `content_raw`.
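The priority rule can be sketched as a small helper that returns hero data from ACF when present and `None` otherwise, so the caller knows to fall back to parsing `content_raw`. The function name `hero_from_acf` and the returned dict shape are hypothetical; only the four `vyy_hero_*` keys come from the convention above.

```python
def hero_from_acf(page):
    """Build hero data from ACF fields, or return None if no headline is set."""
    # Treat a missing or empty acf value (e.g. [] or False) as "no fields".
    acf = page.get("acf") or {}
    headline = acf.get("vyy_hero_headline")
    if not headline:
        return None
    hero = {"headline": headline, "subhead": acf.get("vyy_hero_subhead") or ""}
    cta_text = acf.get("vyy_hero_cta_text")
    cta_url = acf.get("vyy_hero_cta_url")
    # Only emit a CTA when both the label and the href are available.
    if cta_text and cta_url:
        hero["cta"] = {"text": cta_text, "url": cta_url}
    return hero
```

A caller would use it as `hero = hero_from_acf(page)` and parse `page["content_raw"]` only when the result is `None`.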
## Stripping Divi class/attribute noise
After extraction, run every HTML snippet through the `clean_divi_html()`
function from `divi_to_html.py`:
```python
from divi_to_html import clean_divi_html, rewrite_internal_links
cleaned = clean_divi_html(raw_html)
cleaned = rewrite_internal_links(cleaned, staging_hosts=("vibrantyou.yoga",))
```
This removes:
- `<!-- wp:divi/* -->` block comments
- `data-et-*`, `data-builder-*` attributes
- `et_pb_*`, `divi-builder-*`, `d5_*` class tokens
- Empty `class=""` attributes
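For readers without access to `divi_to_html.py`, the removals listed above can be approximated with a few regular expressions. This is a sketch under assumptions, not the actual `clean_divi_html()` implementation: the name `strip_divi_noise` is hypothetical, attribute values are assumed to be double-quoted, and an HTML parser would be more robust on irregular markup.

```python
import re

BLOCK_COMMENT_RE = re.compile(r"<!--\s*/?wp:.*?-->", re.DOTALL)
DATA_ATTR_RE = re.compile(r'\s+data-(?:et|builder)-[\w-]*="[^"]*"')
CLASS_ATTR_RE = re.compile(r'class="([^"]*)"')
NOISE_TOKEN_RE = re.compile(r"^(?:et_pb_|divi-builder-|d5_)")


def strip_divi_noise(html):
    """Remove block comments, Divi data attributes, and Divi class tokens."""
    html = BLOCK_COMMENT_RE.sub("", html)
    html = DATA_ATTR_RE.sub("", html)

    def keep_clean_tokens(match):
        # Keep only class tokens that do not start with a Divi prefix.
        tokens = [t for t in match.group(1).split() if not NOISE_TOKEN_RE.match(t)]
        return 'class="%s"' % " ".join(tokens)

    html = CLASS_ATTR_RE.sub(keep_clean_tokens, html)
    # Drop class attributes that ended up empty, with the space before them.
    return re.sub(r'\s*class=""', "", html)
```

Non-Divi classes survive, so any hand-authored styling hooks in the source markup are preserved.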
## What to extract per section type
| Divi module | Extract | Map to AM element |
|-------------|---------|-------------------|
| `divi/text` | inner HTML | `<p>`, `<ul>`, headings as-is |
| `divi/button` | `text`, `url` | `<a>` |
| `divi/image` | `src`, `alt`, `title` | `<img>` → rewrite to WebP path |
| `divi/blurb` | icon, title, body | `.am-card` component |
| `divi/testimonial` | quote, author, company | `.am-testimonial` component |
| `divi/video` | `src`, poster | `