
Rescuing Content from the Abyss: Automating Legacy HTML to Markdown Migration

Sebastian Schkudlara · Mar 15, 2026 · 2 mins read

You’ve just inherited a 10-year-old legacy CMS disaster. Ten thousand pages of content. Broken inline styles. Deeply nested tables. Images pointing to paths that no longer exist. And half the content is trapped inside React SPAs, which a standard scraper simply returns as empty <div> soup.

Congratulations. You have two options: six months of manual cleanup or a smarter tool.


Why Existing Tools Fail

The standard playbook doesn’t hold up in the real world. CLI converters get blocked by WAFs like Cloudflare. Basic scrapers miss content entirely when JavaScript hasn’t run. Even sophisticated pipelines lose images, mangle table structures, or strip the semantic hierarchy that makes a document navigable.

What’s actually needed is a tool that simulates a human browser, executes JavaScript, defeats bot detection, and outputs clean, GitHub-Flavored Markdown with proper YAML frontmatter — automatically.
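To make the target output concrete, here is a minimal Python sketch of a Markdown page with YAML frontmatter. It is not html2md's code: the function name `with_frontmatter` and the `title` and `source` fields are assumptions chosen for illustration, and the values are quoted with `json.dumps` because JSON strings are also valid YAML scalars.

```python
import json


def with_frontmatter(body: str, meta: dict[str, str]) -> str:
    """Prepend a YAML frontmatter block to a Markdown body (illustrative helper)."""
    # json.dumps produces double-quoted strings, which YAML also accepts,
    # so titles containing colons or quotes stay well-formed.
    lines = [
        f"{key}: {json.dumps(value, ensure_ascii=False)}"
        for key, value in meta.items()
    ]
    return "---\n" + "\n".join(lines) + "\n---\n\n" + body.lstrip("\n")


# Hypothetical page; the field names are assumptions, not html2md's schema.
page = with_frontmatter(
    "# Install Guide\n\nRun the installer.\n",
    {"title": "Install Guide: v2", "source": "https://legacy-docs.client.com/install"},
)
print(page)
```

Quoting every value is a blunt choice, but it avoids the classic frontmatter bug where an unquoted colon in a title breaks the YAML parser downstream.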

That’s exactly what html2md does.

The html2md Arsenal

  • Advanced Stealth: Puppeteer Extra Stealth bypasses aggressive Cloudflare and other WAF protections.
  • Full JS Rendering: Executes complete React, Next.js, and Angular lifecycles before extracting a single byte of content.
  • Smart Extraction: Uses Mozilla's Readability.js with a Python Trafilatura fallback for near-100% content accuracy.
  • Asset Management: Automatically downloads, locally caches, and rewrites image paths — no missing diagrams.
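
The asset-management step in the list above can be pictured with a short Python sketch. It only covers the path-rewriting half (no downloading), and it is an assumption about how such a step might work rather than html2md's actual implementation: the `assets` directory, the hash-based filenames, and the helper names are all hypothetical.

```python
import hashlib
import posixpath
import re
from urllib.parse import urlparse

# Matches Markdown images that point at absolute http(s) URLs.
IMAGE_RE = re.compile(r"!\[([^\]]*)\]\((https?://[^)\s]+)\)")


def local_name(url: str) -> str:
    """Derive a stable local filename from a remote image URL."""
    ext = posixpath.splitext(urlparse(url).path)[1] or ".bin"
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    return f"{digest}{ext}"


def rewrite_images(markdown: str, asset_dir: str = "assets") -> tuple[str, dict[str, str]]:
    """Point remote image links at local paths.

    Returns the rewritten Markdown and a url -> local path map that a
    separate download step could work through.
    """
    downloads: dict[str, str] = {}

    def replace(match: re.Match) -> str:
        alt, url = match.group(1), match.group(2)
        # The same URL always maps to the same local file, so it is cached once.
        target = downloads.setdefault(url, f"{asset_dir}/{local_name(url)}")
        return f"![{alt}]({target})"

    return IMAGE_RE.sub(replace, markdown), downloads


text, todo = rewrite_images(
    "![Flow](https://legacy-docs.client.com/img/flow.png)\n"
    "![Flow again](https://legacy-docs.client.com/img/flow.png)\n"
)
print(text)
print(todo)
```

Hashing the full URL keeps filenames collision-resistant when two pages reference different images that share a basename such as `diagram.png`.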

From Weeks to Minutes: A Real Scenario

Imagine migrating 4,000+ documentation pages from an archaic, proprietary system into a modern Markdown-based GitBook repository. No custom DOM scraping scripts. No brittle XPath selectors. Just point html2md at the root domain in Crawl Mode:

./bin/html2md --crawl https://legacy-docs.client.com --depth 5

Within an hour, it maps the entire site tree, bypasses CDN caching layers, renders dynamic content, downloads thousands of diagrams, and packages everything into a structured .zip of clean Markdown files.
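
For readers curious what a depth-limited crawl involves, here is a minimal breadth-first sketch in Python. It is an illustration of the general technique, not html2md's internals: the `crawl` function and the `fetch_links` callback are hypothetical, and the link graph is a hard-coded dictionary so the example runs without network access.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(
    root: str,
    max_depth: int,
    fetch_links: Callable[[str], Iterable[str]],
) -> dict[str, int]:
    """Breadth-first crawl of same-host pages; returns url -> depth discovered."""
    host = urlparse(root).netloc
    seen = {root: 0}
    queue = deque([root])
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if depth >= max_depth:
            continue  # pages at the depth limit are kept but not expanded
        for href in fetch_links(url):
            # Resolve relative links and drop #fragments so pages are not revisited.
            target = urldefrag(urljoin(url, href)).url
            if urlparse(target).netloc != host or target in seen:
                continue
            seen[target] = depth + 1
            queue.append(target)
    return seen


# Stand-in for real page fetching: a tiny in-memory site.
site = {
    "https://legacy-docs.client.com/": ["/a", "/b#top", "https://other.example/x"],
    "https://legacy-docs.client.com/a": ["/a/deep", "/"],
    "https://legacy-docs.client.com/a/deep": ["/a/deeper"],
}
pages = crawl("https://legacy-docs.client.com/", 2, lambda u: site.get(u, []))
for page_url, page_depth in pages.items():
    print(page_depth, page_url)
```

Breadth-first order means each page is recorded at its shallowest depth, which is what makes a depth cap behave predictably on sites with many cross-links.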

What takes weeks of manual engineering labor takes under an hour of compute time.


Stop Letting Migration Block Transformation

Data migration should not be the bottleneck of digital transformation. Whether the goal is migrating to a headless CMS, archiving a legacy knowledge base for AI ingestion, or extracting clean Markdown at scale, html2md does the heavy lifting.

The core engine is fully open-source. For enterprise-scale migrations, batch processing infrastructure, or custom pipeline deployments, production-ready hosting options are available.
