Website Cloning
Import an entire existing website into your sitemd project. The clone tool scrapes text, images, navigation, forms, embeds, and card layouts — mapping each to the closest sitemd component. Styling is not copied; sitemd's three theme modes apply.
This is not a pixel-perfect replica. It extracts content and structure, then produces sitemd markdown that renders with your chosen theme. The result is a working site you can edit in markdown from day one.
Quick start
Use the /clone skill from any AI coding agent:
```
/clone https://example.com
```
The agent checks for Puppeteer, installs it if missing, scrapes the site, and walks you through importing the results.
Or use the CLI directly:
```
sitemd clone https://example.com
```
How it works
The clone pipeline runs in four phases:
- **Crawl** — Discover all pages via `sitemap.xml`, `robots.txt`, and link-following from the homepage
- **Extract** — Parse each page's HTML into structured data using a headless browser (Puppeteer)
- **Map** — Convert the structured data into sitemd markdown syntax
- **Assets** — Download images locally and rewrite URLs
The tool returns a structured JSON report. It does not create any files — the agent reviews the report with you, then creates pages with `sitemd_pages_create` and updates settings by editing the settings files directly.
Puppeteer dependency
Website cloning requires a headless browser to render JavaScript-heavy pages. The clone tool uses Puppeteer, which downloads Chromium (~150MB) alongside it.
Puppeteer is not bundled with sitemd. Install it only when you need to clone:
```
npm install puppeteer
```
The /clone skill handles this automatically — it checks for Puppeteer before starting and tells the agent to install it if missing. If you never use the clone feature, your project stays lightweight.
Page discovery
The crawler finds pages in this order:
- Fetch `/sitemap.xml` and parse all `<loc>` URLs (including sub-sitemaps)
- Fetch `/robots.txt` for additional sitemap references and `Disallow` rules
- BFS from the homepage, following all same-origin `<a href>` links
- Deduplicate by normalizing URLs (trailing slashes, fragments, query params)
All three sources are combined and deduplicated before extraction begins.
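The deduplication step depends on a URL normalization rule. A minimal sketch, assuming normalization means dropping fragments, query params, and trailing slashes (the exact rules the crawler applies are not specified beyond that):

```javascript
// Hypothetical sketch of the normalization used for deduplication:
// drop #fragment and ?query, trim trailing slashes, keep origin + path.
function normalizeUrl(raw) {
  const url = new URL(raw);
  url.hash = "";   // drop fragment
  url.search = ""; // drop query params
  const path = url.pathname.replace(/\/+$/, ""); // trim trailing slashes
  return url.origin + (path || "/");
}

// Both variants collapse to the same key, so the page is crawled once:
normalizeUrl("https://example.com/blog/");       // "https://example.com/blog"
normalizeUrl("https://example.com/blog?page=1"); // "https://example.com/blog"
```

Pages discovered via sitemap, robots.txt, and link-following are keyed by this normalized form before extraction.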
Limits and rate control
| Setting | Default | Description |
|---|---|---|
| `maxPages` | 50 | Maximum pages to crawl (hard cap: 200) |
| `skipPaths` | `[]` | URL prefixes to ignore (e.g. `/admin`, `/api`) |
| Delay | 500ms | Between requests (respects `Crawl-delay` from robots.txt) |
| Timeout | 30s | Per-page load timeout |
The crawler waits for `networkidle2` on each page to capture JS-rendered content, then pauses before the next request.
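The interplay between the 500ms default and a robots.txt `Crawl-delay` can be sketched as follows — a minimal sketch assuming the larger of the two values wins (function name is illustrative, not the tool's API):

```javascript
// Hypothetical sketch: pick the inter-request pause. The 500ms default
// yields to a larger Crawl-delay (given in seconds) from robots.txt.
const DEFAULT_DELAY_MS = 500;

function requestDelayMs(crawlDelaySeconds) {
  if (typeof crawlDelaySeconds !== "number" || Number.isNaN(crawlDelaySeconds)) {
    return DEFAULT_DELAY_MS; // no directive in robots.txt
  }
  return Math.max(DEFAULT_DELAY_MS, crawlDelaySeconds * 1000);
}

requestDelayMs(undefined); // 500  — default pacing
requestDelayMs(2);         // 2000 — robots.txt asks for 2s between requests
```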
Content extraction
Each page is parsed for structured content. The extractor walks the main content area (`<main>`, `<article>`, or the largest content container) and classifies every element:
| HTML element | sitemd output |
|---|---|
| `<h1>` – `<h6>` | `#` – `######` headings |
| `<p>` | Paragraphs with bold, italic, code, and links preserved |
| `<ul>`, `<ol>` | `-` and `1.` lists with nesting |
| `<img>` | `![alt](src)` — queued for asset download |
| `<pre><code>` | Fenced code blocks with language hint from CSS class |
| `<table>` | Markdown tables |
| `<blockquote>` | `>` blockquotes |
| `<hr>` | `---` |
| `<iframe>` (YouTube, Vimeo, Spotify, etc.) | `embed: URL` |
| `<iframe>` (generic) | Inline HTML passthrough |
| `<form>` | `form:` block with extracted fields |
| Grid/flex containers with repeating card-like children | `card:` / `card-text:` / `card-image:` / `card-link:` |
| Button-styled `<a>` (class contains `btn`, `button`, `cta`) | `button: Label: URL` |
| `<details><summary>` | Inline HTML (sitemd renders it natively) |
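One row of the mapping above can be sketched concretely — headings keep their depth, only the notation changes (the function name is illustrative, not part of sitemd's API):

```javascript
// Hypothetical sketch of the <h1>-<h6> rule: a heading tag becomes a
// markdown heading with one "#" per level.
function headingToMarkdown(tagName, text) {
  const level = Number(tagName.replace(/^h/i, "")); // "h3" -> 3
  return "#".repeat(level) + " " + text;
}

headingToMarkdown("h2", "How it works"); // "## How it works"
```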
Page type detection
The extractor classifies each page by URL pattern and content heuristics:
| URL pattern | Detected type |
|---|---|
/blog/, /posts/, /articles/ |
Blog post |
/docs/, /documentation/, /guide/ |
Documentation |
/changelog |
Changelog |
/roadmap |
Roadmap |
| Everything else | Standard page |
Blog and docs pages are automatically assigned to their respective groups.
Card detection
The extractor looks for parent containers using CSS grid or flexbox with 2+ children. If each child contains an image, heading, text, or link, they're extracted as sitemd cards:
```
card: Product A
card-text: Our flagship offering with enterprise features.
card-image: /media/clone/product-a.jpg
card-link: Learn More: /products/a
```
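The "2+ children, each with card-like parts" heuristic can be sketched as a predicate over pre-extracted child data (this is an illustrative sketch, not the tool's internal code):

```javascript
// Hypothetical sketch of the card heuristic: a grid/flex container
// qualifies when it has 2+ children and every child carries at least one
// card-like part. `children` is assumed to be pre-extracted data, not
// live DOM nodes.
function looksLikeCardGrid(children) {
  if (children.length < 2) return false;
  return children.every((c) => c.image || c.heading || c.text || c.link);
}

looksLikeCardGrid([
  { heading: "Product A", text: "Flagship offering", link: "/products/a" },
  { heading: "Product B", image: "/media/b.jpg" },
]); // true — both children have card-like parts
```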
Form extraction
For each `<form>`, the extractor maps:
- `<input>` types → sitemd field types (`shorttext`, `email`, `phone`, `number`, `date`, `checkbox`, `radio`, `select`)
- `<label>` text → field labels
- `<select>` options → select/radio options
- Submit button text → `submitLabel`
- Form `action` URL → `webhook` (flagged for manual update)
The generated form uses a placeholder webhook URL. You need to replace it with your actual endpoint (Zapier, Make, n8n, a serverless function, etc.). See Forms.
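The input-type mapping can be sketched as a lookup table. The exact pairings and the `shorttext` fallback for unknown types are assumptions for illustration:

```javascript
// Hypothetical sketch of the <input type="..."> -> sitemd field type
// mapping. Unrecognized types fall back to shorttext (an assumption).
const FIELD_TYPES = {
  text: "shorttext",
  email: "email",
  tel: "phone",
  number: "number",
  date: "date",
  checkbox: "checkbox",
  radio: "radio",
};

function sitemdFieldType(inputType) {
  return FIELD_TYPES[inputType] || "shorttext";
}

sitemdFieldType("tel");    // "phone"
sitemdFieldType("search"); // "shorttext" (fallback)
```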
Site-level extraction
The homepage is also parsed for site-wide settings:
| What | Where it goes |
|---|---|
| Site title (from `<title>` or `og:site_name`) | `settings/meta.md` → `title`, `brandName` |
| Meta description | `settings/meta.md` → `description` |
| Navigation structure (`<nav>` links + dropdowns) | `settings/header.md` → `items` |
| Footer content (copyright, link groups) | `settings/footer.md` |
| Social links (detected by domain: github, twitter, linkedin, reddit, etc.) | `settings/footer.md` → `social` |
| Accent color (CSS custom properties or button backgrounds) | `settings/theme.md` → `accentColor` |
| Logo/favicon | `theme/images/` |
Asset downloading
Images referenced in page content are downloaded to `media/clone/` and URLs are rewritten to local paths. The asset handler:

- Skips images larger than 5MB
- Deduplicates by source URL (the same image on multiple pages is downloaded once)
- Resolves relative URLs against the page's base URL
- Saves logos and favicons to `theme/images/`
Disable asset downloading with `--no-assets` (CLI) or `includeAssets: false` (MCP tool).
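The resolution and dedup steps above can be sketched together — a minimal sketch assuming the absolute URL doubles as the dedup key (names are illustrative, not the tool's API):

```javascript
// Hypothetical sketch: resolve a relative src against the page URL, and
// use the absolute URL as the deduplication key.
function resolveAssetUrl(src, pageUrl) {
  return new URL(src, pageUrl).href;
}

const seen = new Set();
function shouldDownload(src, pageUrl) {
  const key = resolveAssetUrl(src, pageUrl);
  if (seen.has(key)) return false; // same image seen on another page: skip
  seen.add(key);
  return true;
}

resolveAssetUrl("../img/logo.png", "https://example.com/docs/intro");
// "https://example.com/img/logo.png"
```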
The clone report
The tool returns a JSON report — it does not write files directly. The report contains:
```json
{
  "source": "https://example.com",
  "crawled": 12,
  "site": {
    "title": "Example Co",
    "description": "We make things",
    "accentColor": "#e11d48",
    "suggestedMode": "light"
  },
  "navigation": { "header": [...], "footer": {...} },
  "groups": [{ "name": "blog", "items": [...] }],
  "pages": [
    {
      "slug": "/about",
      "title": "About Us",
      "type": "standard",
      "content": "# About Us\n\n...",
      "components": ["cards", "buttons"],
      "confidence": 0.85,
      "warnings": ["Form action URL needs updating"]
    }
  ],
  "assets": { "downloaded": 8, "skipped": 2 },
  "warnings": [...],
  "unmapped": [{ "url": "/dashboard", "reason": "HTTP 401" }]
}
```
Each page includes a confidence score (0–1) indicating how well the extraction mapped to sitemd components. Pages below 0.5 are flagged for manual review.
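Picking out the pages under the 0.5 threshold from the report is a simple filter — a sketch over the report shape shown above (the function name is illustrative):

```javascript
// Hypothetical sketch: collect slugs of pages whose extraction confidence
// falls below the review threshold.
function pagesNeedingReview(report, threshold = 0.5) {
  return report.pages
    .filter((p) => p.confidence < threshold)
    .map((p) => p.slug);
}

pagesNeedingReview({
  pages: [
    { slug: "/about", confidence: 0.85 },
    { slug: "/pricing", confidence: 0.4 },
  ],
}); // ["/pricing"]
```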
The agent then creates pages via `sitemd_pages_create`, updates settings, and reports what needs manual attention.
MCP tool
The `sitemd_clone` tool is available to any MCP-capable agent:
Parameters:
| Name | Required | Description |
|---|---|---|
| `url` | yes | Website URL to clone |
| `maxPages` | no | Max pages to crawl (default 50, max 200) |
| `includeAssets` | no | Download images locally (default `true`) |
| `skipPaths` | no | URL path prefixes to skip |
See MCP Server for setup and the full tool list.
CLI command
```
sitemd clone <url> [--max-pages N] [--skip /path1 /path2] [--no-assets]
```
When run in a terminal, the CLI pretty-prints a summary of what was found. When piped (non-TTY), it outputs the raw JSON report.
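The TTY switch described above can be sketched as follows — a minimal sketch assuming the standard Node `process.stdout.isTTY` check (function and message shape are illustrative):

```javascript
// Hypothetical sketch of the output-mode switch: pretty summary on a
// terminal, raw JSON when stdout is piped.
function renderReport(report, isTTY = process.stdout.isTTY) {
  if (isTTY) {
    return `Cloned ${report.crawled} pages from ${report.source}`;
  }
  return JSON.stringify(report); // machine-readable, e.g. for `... | jq`
}

renderReport({ source: "https://example.com", crawled: 12 }, true);
// "Cloned 12 pages from https://example.com"
```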
What needs manual attention
After cloning, the agent reports what it couldn't fully automate:
- Form webhook URLs — placeholder URLs need replacing with your actual endpoints
- Auth-gated pages — pages behind login walls (HTTP 401/403) are skipped
- Dynamic content — interactive widgets, custom CMS output, and client-rendered SPAs may produce incomplete extractions
- reCAPTCHA / third-party scripts — stripped during extraction
- Low-confidence pages — complex layouts that didn't map cleanly to sitemd components
The agent guides you through each item and suggests what to configure next.
Related
- MCP Server — all available MCP tools, including `sitemd_clone`
- Forms — configure form webhook URLs after cloning
- Navigation — adjust the imported header and footer
- Groups — organize imported blog/docs pages
- Themes — customize the look after content is imported
- Getting Started — project setup from scratch (alternative to cloning)