2026-06-09 18:31:59 +02:00

03 — Divi Content Extraction

Parse raw Divi page content from pages.json into clean, structured HTML sections ready to map into AM templates.

Divi 4 vs Divi 5 — critical difference

Divi 4 (shortcode-based)

Content is stored as shortcodes in wp_posts.post_content:

[et_pb_section fb_built="1" admin_label="Hero" _builder_version="4.27.4"
  background_color="#0f5f53" ...]
  [et_pb_row ...]
    [et_pb_column type="4_4" ...]
      [et_pb_text ...]<h1>Move With Intention</h1>[/et_pb_text]
      [et_pb_button button_url="/contact" button_text="Book a Class" /]
    [/et_pb_column]
  [/et_pb_row]
[/et_pb_section]

Use extract_divi4.py → parses shortcode tree into section/row/module JSON.
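As an illustration of what "parsing the shortcode tree" involves, here is a minimal sketch. It is not the code of extract_divi4.py; the function name `parse_shortcodes` and the node layout (`tag`, `attrs`, `children`, `html`) are assumptions made for this example, and it only handles `et_pb_*` shortcodes with double-quoted attributes.

```python
import re

# Opening, closing, or self-closing et_pb_* shortcode. Quoted attribute
# values are consumed whole so a "]" or "/" inside them cannot end the tag.
TAG_RE = re.compile(r'\[(/?)(et_pb_\w+)((?:[^\]"]|"[^"]*")*?)(/?)\]')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_shortcodes(content):
    """Return a list of top-level nodes; each node nests its children."""
    root = {"tag": "root", "attrs": {}, "children": [], "html": ""}
    stack = [root]
    pos = 0
    for match in TAG_RE.finditer(content):
        # Text between tags belongs to the innermost open shortcode.
        text = content[pos:match.start()].strip()
        if text:
            stack[-1]["html"] += text
        pos = match.end()

        closing, tag, raw_attrs, self_closing = match.groups()
        if closing:
            # Pop only when the closing tag matches the open one.
            if len(stack) > 1 and stack[-1]["tag"] == tag:
                stack.pop()
            continue

        node = {
            "tag": tag,
            "attrs": dict(ATTR_RE.findall(raw_attrs)),
            "children": [],
            "html": "",
        }
        stack[-1]["children"].append(node)
        if not self_closing:
            stack.append(node)
    return root["children"]
```

Run against the Hero example above, this would yield one `et_pb_section` node whose column holds an `et_pb_text` node (with the `<h1>` in `html`) and an `et_pb_button` node (with `button_url` and `button_text` in `attrs`).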

Divi 5 (block-based)

Content is stored as Gutenberg-style block comments:

<!-- wp:divi/section {"id":"section-abc123","attrs":{"backgroundColor":{"value":"#0f5f53"}}} -->
<div class="et_pb_section ...">
  <!-- wp:divi/row ... -->
    <!-- wp:divi/column ... -->
      <!-- wp:divi/text ... -->
        <div class="et_pb_text_inner"><h1>Move With Intention</h1></div>
      <!-- /wp:divi/text -->
    <!-- /wp:divi/column -->
  <!-- /wp:divi/row -->
</div>
<!-- /wp:divi/section -->

Use extract_divi5.py → strips block wrapper, extracts inner HTML per module.
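The block format can be walked in a similar way. The following is a simplified sketch, not the contents of extract_divi5.py: `parse_divi5_blocks` and its node layout are hypothetical names, and the regex is a reduced form of the block-comment pattern that assumes the attribute JSON is followed by whitespace and `-->`.

```python
import json
import re

# <!-- wp:divi/name {json} -->, <!-- /wp:divi/name -->, or a void
# block ending in "/-->". The JSON part is optional.
BLOCK_RE = re.compile(
    r'<!--\s+(/)?wp:divi/([a-z][a-z0-9_-]*)\s+(?:(\{.*?\})\s+)?(/)?-->',
    re.DOTALL,
)


def parse_divi5_blocks(content):
    """Return top-level block nodes with parsed attrs and inner HTML."""
    root = {"name": "root", "attrs": {}, "children": [], "html": ""}
    stack = [root]
    pos = 0
    for match in BLOCK_RE.finditer(content):
        # HTML between comments belongs to the innermost open block.
        stack[-1]["html"] += content[pos:match.start()].strip()
        pos = match.end()

        closer, name, raw_attrs, void = match.groups()
        if closer:
            if len(stack) > 1 and stack[-1]["name"] == name:
                stack.pop()
            continue

        node = {
            "name": name,
            "attrs": json.loads(raw_attrs) if raw_attrs else {},
            "children": [],
            "html": "",
        }
        stack[-1]["children"].append(node)
        if not void:
            stack.append(node)
    return root["children"]
```

For a leaf module such as `divi/text`, `html` then holds the wrapper-free inner markup; for containers it holds only the leftover wrapper `<div>` tags, which can be discarded.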

Divi 5 extraction script

python3 /home/sirdrez/arisingmedia-websites/.am-webdesign-sops/wp-divi-pipeline/scripts/extract_divi5.py \
  {domain}/.planning/data/pages.json \
  {domain}/.planning/data/content/

Produces one JSON file per page: content/{slug}.json

{
  "slug": "about",
  "title": "About VibrantYou Yoga",
  "seo_title": "About VibrantYou Yoga | ...",
  "seo_description": "...",
  "sections": [
    {
      "type": "hero",
      "background_color": "#0f5f53",
      "modules": [
        { "module": "text",   "html": "<h1>Move With Intention</h1>" },
        { "module": "button", "text": "Book a Class", "url": "/contact/" }
      ]
    },
    {
      "type": "standard",
      "modules": [
        { "module": "text", "html": "<h2>Our Story</h2><p>...</p>" },
        { "module": "image", "src": "/assets/images/studio.webp", "alt": "..." }
      ]
    }
  ]
}
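Downstream steps mostly need to walk this structure module by module. A small sketch of how that might look, assuming the JSON shape shown above; `load_page` and `iter_modules` are hypothetical helper names, not part of the pipeline scripts.

```python
import json
from pathlib import Path


def load_page(path):
    """Read one content/{slug}.json file into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def iter_modules(page):
    """Yield (section_index, section_type, module) for every module."""
    for index, section in enumerate(page.get("sections", [])):
        # Assumption: a section without "type" is treated as "standard".
        section_type = section.get("type", "standard")
        for module in section.get("modules", []):
            yield index, section_type, module
```

For the "about" page above, `iter_modules` would yield four tuples: the hero text and button, then the standard section's text and image.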

ACF fields take priority

If a page has ACF fields (in pages.json[].acf), use those over block content. ACF fields are typically cleaner, pre-authored copy without Divi wrapper noise.

Convention for VYY-specific ACF keys:

  • vyy_hero_headline → <h1> in hero section
  • vyy_hero_subhead → <p class="hero-lead"> in hero
  • vyy_hero_cta_text → primary CTA button label
  • vyy_hero_cta_url → primary CTA button href

Always check acf keys before parsing content_raw.
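The priority rule can be sketched as a lookup with fallback. This is a minimal example under assumptions: `parsed` is a hypothetical dict of hero values already extracted from content_raw (keys `headline`, `subhead`, `cta_text`, `cta_url`), and an ACF value that is empty, whitespace-only, or not a string counts as missing.

```python
HERO_KEYS = {
    "headline": "vyy_hero_headline",
    "subhead": "vyy_hero_subhead",
    "cta_text": "vyy_hero_cta_text",
    "cta_url": "vyy_hero_cta_url",
}


def acf_or_fallback(page, key, fallback):
    """Prefer a non-empty ACF string; otherwise use the parsed value."""
    acf = page.get("acf")
    # Guard against a missing or non-dict "acf" entry (e.g. an empty list).
    if isinstance(acf, dict):
        value = acf.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip()
    return fallback


def hero_fields(page, parsed):
    """Resolve each hero field, ACF first, parsed content second."""
    return {
        name: acf_or_fallback(page, acf_key, parsed.get(name))
        for name, acf_key in HERO_KEYS.items()
    }
```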

Stripping Divi class/attribute noise

After extraction, run every HTML snippet through the clean_divi_html() function from divi_to_html.py:

from divi_to_html import clean_divi_html, rewrite_internal_links

cleaned = clean_divi_html(raw_html)
cleaned = rewrite_internal_links(cleaned, staging_hosts=("vibrantyou.yoga",))

This removes:

  • <!-- wp:divi/... --> block comments
  • data-et-*, data-builder-* attributes
  • et_pb_*, divi-builder-*, d5_* class tokens
  • Empty class="" attributes
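To make the four removals concrete, here is a regex-based sketch of a comparable cleaner. It is an illustration only and not the implementation of clean_divi_html(); the name `strip_divi_noise` is hypothetical, and it assumes double-quoted attributes.

```python
import re

COMMENT_RE = re.compile(r'<!--\s*/?wp:divi/.*?-->', re.DOTALL)
DATA_ATTR_RE = re.compile(r'\s+data-(?:et|builder)-[\w-]*(?:="[^"]*")?')
CLASS_RE = re.compile(r'\sclass="([^"]*)"')
NOISE_TOKEN_RE = re.compile(r'^(?:et_pb_|divi-builder-|d5_)')


def strip_divi_noise(html):
    """Remove Divi block comments, data attributes, and class tokens."""
    html = COMMENT_RE.sub("", html)
    html = DATA_ATTR_RE.sub("", html)

    def filter_classes(match):
        # Keep only class tokens that do not start with a Divi prefix.
        kept = [t for t in match.group(1).split() if not NOISE_TOKEN_RE.match(t)]
        if not kept:
            return ""  # drops class="" and fully Divi-only class lists
        joined = " ".join(kept)
        return f' class="{joined}"'

    html = CLASS_RE.sub(filter_classes, html)
    return html.strip()
```

A regex pass like this is adequate for the predictable markup Divi emits; for arbitrary hand-written HTML a real HTML parser would be the safer choice.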

What to extract per section type

| Divi module | Extract | Map to AM element |
|---|---|---|
| divi/text | inner HTML | <section>, <p>, headings as-is |
| divi/button | text, url | <a class="btn-primary"> |
| divi/image | src, alt, title | <img> → rewrite to WebP path |
| divi/blurb | icon, title, body | .am-card component |
| divi/testimonial | quote, author, company | .am-testimonial component |
| divi/video | src, poster | <video> or YouTube embed |
| divi/contact_form | field list | replace with AM form, see 08 |
| divi/accordion | Q+A pairs | <details><summary> |
| divi/fullwidth_header | title, subhead, CTA | hero section |
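A few rows of this mapping, sketched as a renderer over the module dicts from content/{slug}.json. The function names, the accordion `items` shape (`question`/`answer`), and the rule of keeping the file stem under /assets/images are assumptions for illustration; the remaining module types would follow the same pattern.

```python
from html import escape
from pathlib import PurePosixPath


def to_webp_path(src, asset_dir="/assets/images"):
    """Assumed convention: keep the file stem, serve WebP from asset_dir."""
    stem = PurePosixPath(src).stem
    return f"{asset_dir}/{stem}.webp"


def render_module(module):
    """Map one extracted module dict to AM markup."""
    kind = module.get("module")
    if kind == "text":
        return module["html"]  # already cleaned HTML, passed through as-is
    if kind == "button":
        url = escape(module["url"], quote=True)
        label = escape(module["text"])
        return f'<a class="btn-primary" href="{url}">{label}</a>'
    if kind == "image":
        src = escape(to_webp_path(module["src"]), quote=True)
        alt = escape(module.get("alt", ""), quote=True)
        return f'<img src="{src}" alt="{alt}">'
    if kind == "accordion":
        parts = []
        for item in module.get("items", []):
            question = escape(item["question"])
            parts.append(
                f'<details><summary>{question}</summary>{item["answer"]}</details>'
            )
        return "".join(parts)
    # Unknown modules are surfaced rather than silently dropped.
    raise ValueError(f"No AM mapping for module type: {kind!r}")
```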

Section background colors → AM section modifiers

Divi 5 stores backgroundColor in the block attrs JSON. Map to AM CSS modifier classes:

| Divi background | AM class modifier |
|---|---|
| #0f5f53 (dark teal) | .section--dark |
| #1a8a7a (mid teal) | .section--brand |
| #f5f5f5 / #fafafa | .section--light |
| #ffffff / none | .section--white |
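The table translates directly into a lookup. A minimal sketch, assuming the attrs JSON has the nesting shown in the Divi 5 example (`attrs.backgroundColor.value`); `section_modifier` is a hypothetical helper that returns the class name without the leading dot, and `None` for colors the table does not cover so they can be reviewed by hand.

```python
SECTION_MODIFIERS = {
    "#0f5f53": "section--dark",
    "#1a8a7a": "section--brand",
    "#f5f5f5": "section--light",
    "#fafafa": "section--light",
    "#ffffff": "section--white",
}


def section_modifier(block_attrs):
    """Map a Divi 5 section's background color to an AM modifier class."""
    attrs = (block_attrs or {}).get("attrs") or {}
    value = (attrs.get("backgroundColor") or {}).get("value")
    if not value:
        return "section--white"  # no background set
    value = value.strip().lower()
    # Expand shorthand hex such as #fff to #ffffff before lookup.
    if len(value) == 4 and value.startswith("#"):
        value = "#" + "".join(ch * 2 for ch in value[1:])
    return SECTION_MODIFIERS.get(value)  # None means unmapped, flag it
```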

Content quality pass (required before HTML build)

After extraction, review every page's content for:

  1. Cut bloated copy — WordPress sites often have 3x more text than needed. Target 30-50% reduction. One clear idea per paragraph.
  2. Remove stale metrics — "Over 500 students" only stays if it's verifiable. Otherwise remove or mark DRAFT NEEDED.
  3. Remove plugin artifacts — Gravity Forms shortcodes [gravityforms id="1"], Events Manager tags, Divi shortcode residue that survived extraction.
  4. Improve CTAs — Replace generic "Learn More" with action-specific text: "Book a Free Class", "View the Schedule", "Start Your Practice".
  5. Flag images — Note every <img> that needs a real photo vs stock.
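Parts of this pass can be pre-screened automatically before the human review. The sketch below only flags candidates; it does not rewrite anything. The patterns are assumptions: the generic-CTA set beyond "Learn More" and the "over N things" metric pattern are illustrative guesses, and Events Manager tags are not covered.

```python
import re

SHORTCODE_RE = re.compile(r'\[/?(?:gravityforms?|et_pb_\w+)\b[^\]]*\]')
LINK_RE = re.compile(r'<a\b[^>]*>(.*?)</a>', re.DOTALL | re.IGNORECASE)
IMG_RE = re.compile(r'<img\b[^>]*>', re.IGNORECASE)
METRIC_RE = re.compile(
    r'\b(?:over|more than|nearly)\s+\d[\d,]*\+?\s+\w+', re.IGNORECASE
)
# Only "learn more" comes from the checklist; the rest are assumed examples.
GENERIC_CTAS = {"learn more", "read more", "click here"}


def quality_flags(html):
    """Return (kind, snippet) pairs that a reviewer should look at."""
    flags = []
    for match in SHORTCODE_RE.finditer(html):
        flags.append(("plugin_artifact", match.group(0)))
    for match in LINK_RE.finditer(html):
        # Strip nested tags so <a><span>Learn More</span></a> is caught.
        label = re.sub(r"<[^>]+>", "", match.group(1)).strip()
        if label.lower() in GENERIC_CTAS:
            flags.append(("generic_cta", label))
    for match in IMG_RE.finditer(html):
        flags.append(("image_review", match.group(0)))
    for match in METRIC_RE.finditer(html):
        flags.append(("stale_metric", match.group(0)))
    return flags
```

Cutting bloated copy (item 1) stays a manual editorial step; a script can report word counts but cannot judge which paragraphs carry the one clear idea.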

Next step

Proceed to 04-design-system-extraction.md to convert Divi theme settings into AM CSS custom properties, then 05-content-migration.md to build the HTML templates.