You’ve just inherited a 10-year-old legacy CMS disaster. Ten thousand pages of content. Broken inline styles. Deeply nested tables. Images pointing to paths that no longer exist. And half the content trapped inside React SPAs that a standard scraper simply returns as empty <div> soup.
Congratulations. You have two options: six months of manual cleanup or a smarter tool.
Why Existing Tools Fail
The standard playbook doesn’t hold up in the real world. CLI converters choke on WAFs like Cloudflare. Basic scrapers miss content entirely when JavaScript hasn’t fired. Even sophisticated pipelines lose images, mangle table structures, or strip the semantic hierarchy that makes a document navigable.
What’s actually needed is a tool that simulates a human browser, executes JavaScript, defeats bot detection, and outputs clean, GitHub-Flavored Markdown with proper YAML frontmatter — automatically.
That’s exactly what html2md does.
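To make the target format concrete, here is a minimal Python sketch of what "Markdown with YAML frontmatter" means in practice. It is not html2md's code: the helper `with_frontmatter` and the `title` and `source` fields are hypothetical, chosen only for illustration.

```python
import json


def with_frontmatter(markdown_body: str, meta: dict[str, str]) -> str:
    """Prefix a Markdown body with a YAML frontmatter block."""
    # json.dumps produces double-quoted strings, which are also valid YAML
    # scalars, so values containing colons or quotes stay parseable.
    lines = [
        f"{key}: {json.dumps(value, ensure_ascii=False)}"
        for key, value in meta.items()
    ]
    return "---\n" + "\n".join(lines) + "\n---\n\n" + markdown_body.lstrip("\n")


# Hypothetical page metadata, for illustration only.
page = with_frontmatter(
    "# Install Guide\n\nRun the installer.\n",
    {"title": "Install Guide: v2", "source": "https://legacy-docs.client.com/install"},
)
print(page)
```

Quoting every value is a deliberately conservative choice: legacy page titles tend to contain colons, hashes, and quotes that would otherwise break a YAML parser downstream.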
The html2md Arsenal
- Advanced Stealth: Puppeteer Extra Stealth bypasses aggressive Cloudflare and other WAF protections.
- Full JS Rendering: Executes complete React, Next.js, and Angular lifecycles before extracting a single byte of content.
- Smart Extraction: Uses Mozilla's Readability.js with a Python Trafilatura fallback for near-100% content accuracy.
- Asset Management: Automatically downloads, locally caches, and rewrites image paths — no missing diagrams.
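The asset-management step above can be sketched in a few lines of Python. This is an illustration of the general technique, not html2md's implementation: `rewrite_images`, `local_name`, and the `assets` directory are assumptions, and the sketch only plans the downloads (a URL-to-local-path map) rather than fetching anything.

```python
import hashlib
import posixpath
import re
from urllib.parse import urljoin, urlparse

# Matches Markdown images: ![alt](url) or ![alt](url "title")
IMAGE_RE = re.compile(r'!\[([^\]]*)\]\(([^)\s]+)((?:\s+"[^"]*")?)\)')


def local_name(absolute_url: str) -> str:
    """Derive a stable local filename from a URL (hash plus original extension)."""
    ext = posixpath.splitext(urlparse(absolute_url).path)[1] or ".bin"
    digest = hashlib.sha256(absolute_url.encode("utf-8")).hexdigest()[:12]
    return f"{digest}{ext}"


def rewrite_images(
    markdown: str, page_url: str, asset_dir: str = "assets"
) -> tuple[str, dict[str, str]]:
    """Point image links at local paths; also return a url -> local path map."""
    downloads: dict[str, str] = {}

    def replace(match: re.Match) -> str:
        alt, src, title = match.groups()
        absolute = urljoin(page_url, src)  # resolve relative image paths
        # Reuse the same local file when one image is referenced several times.
        local = downloads.setdefault(absolute, f"{asset_dir}/{local_name(absolute)}")
        return f"![{alt}]({local}{title})"

    return IMAGE_RE.sub(replace, markdown), downloads


text = "![Arch](/img/arch.png) and ![Arch again](../img/arch.png)"
new_text, downloads = rewrite_images(text, "https://legacy-docs.client.com/guide/intro")
print(new_text)
print(downloads)
```

Both references resolve to the same absolute URL, so the map contains a single entry and the image would be downloaded once. Naming files by URL hash avoids collisions when different directories on the legacy site each contain, say, a `diagram.png`.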
From Weeks to Minutes: A Real Scenario
Imagine migrating 4,000+ documentation pages from an archaic, proprietary system into a modern Markdown-based GitBook repository. No custom DOM scraping scripts. No brittle XPath selectors. Just point html2md at the root domain in Crawl Mode:
./bin/html2md --crawl https://legacy-docs.client.com --depth 5
Within an hour, it maps the entire site tree, bypasses CDN caching layers, renders dynamic content, downloads thousands of diagrams, and packages everything into a structured .zip of clean Markdown files.
What would take weeks of manual engineering labor takes under an hour of compute time.
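For intuition about what a depth-limited crawl does, here is a minimal Python sketch of a breadth-first traversal that stays on the root host. How html2md actually interprets `--depth` is not documented here, so treat the semantics below (depth counted in link hops from the root) as an assumption. The `crawl` function and the `get_links` callback are hypothetical, and the demo uses an in-memory link map instead of real HTTP requests.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(
    root: str, get_links: Callable[[str], Iterable[str]], max_depth: int
) -> list[str]:
    """Breadth-first crawl of same-host pages, at most max_depth hops from root."""
    host = urlparse(root).netloc
    seen = {root}
    queue = deque([(root, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for href in get_links(url):
            # Resolve relative links and drop any #fragment.
            target = urldefrag(urljoin(url, href)).url
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order


# In-memory stand-in for a site's link structure (illustrative data only).
site = {
    "https://legacy-docs.client.com/": ["/a", "/b", "https://other.example/x"],
    "https://legacy-docs.client.com/a": ["/a/deep", "/#top"],
    "https://legacy-docs.client.com/a/deep": ["/a/deeper"],
}
pages = crawl("https://legacy-docs.client.com/", lambda u: site.get(u, []), max_depth=2)
print(pages)
```

With `max_depth=2`, the page at `/a/deeper` is three hops from the root and is never visited, and the off-site link is skipped. The `seen` set is what keeps a heavily cross-linked documentation site from being fetched more than once per page.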
Stop Letting Migration Block Transformation
Data migration should not be the bottleneck of digital transformation. Whether the goal is migrating to a headless CMS, archiving a legacy knowledge base for AI ingestion, or extracting clean Markdown at scale, html2md does the heavy lifting.
The core engine is fully open source. For enterprise-scale migrations, batch processing infrastructure, or custom pipeline deployments, production-ready hosting options are available.
Sebastian Schkudlara