Website Cloning
Import an entire existing website into your sitemd project. The clone tool scrapes text, images, navigation, forms, embeds, and card layouts — mapping each to the closest sitemd component. Styling is not copied; sitemd's three theme modes apply.
This is not a pixel-perfect replica. It extracts content and structure, then produces sitemd markdown that renders with your chosen theme. The result is a working site you can edit in markdown from day one.
Quick start
Use the /clone skill from any AI coding agent:
```
/clone https://example.com
```
The agent checks for Puppeteer, installs it if missing, scrapes the site, and walks you through importing the results.
Or use the CLI directly:
```
sitemd clone https://example.com
```
How it works
The clone pipeline runs in four phases:
- **Crawl** — Discover all pages via `sitemap.xml`, `robots.txt`, and link-following from the homepage
- **Extract** — Parse each page's HTML into structured data using a headless browser (Puppeteer)
- **Map** — Convert the structured data into sitemd markdown syntax
- **Assets** — Download images locally and rewrite URLs
The tool returns a structured JSON report. It does not create any files — the agent reviews the report with you, then creates pages with `sitemd_pages_create` and updates settings by editing the settings files directly.
Puppeteer dependency
Website cloning requires a headless browser to render JavaScript-heavy pages. The clone tool uses Puppeteer, which downloads Chromium (~150MB) alongside it.
Puppeteer is not bundled with sitemd. Install it only when you need to clone:
```
npm install puppeteer
```
The /clone skill handles this automatically — it checks for Puppeteer before starting and tells the agent to install it if missing. If you never use the clone feature, your project stays lightweight.
Page discovery
The crawler finds pages in this order:
- Fetch `/sitemap.xml` and parse all `<loc>` URLs (including sub-sitemaps)
- Fetch `/robots.txt` for additional sitemap references and `Disallow` rules
- BFS from the homepage, following all same-origin `<a href>` links
- Deduplicate by normalizing URLs (trailing slashes, fragments, query params)
All three sources are combined and deduplicated before extraction begins.
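The deduplication step depends on a URL normalization rule. A minimal sketch, assuming normalization means dropping fragments, query params, and trailing slashes (the exact rules the crawler applies are not specified beyond that):

```javascript
// Hypothetical sketch of the normalization used for deduplication:
// drop #fragment and ?query, trim trailing slashes, keep origin + path.
function normalizeUrl(raw) {
  const url = new URL(raw);
  url.hash = "";   // drop fragment
  url.search = ""; // drop query params
  const path = url.pathname.replace(/\/+$/, ""); // trim trailing slashes
  return url.origin + (path || "/");
}

// Both variants collapse to the same key, so the page is crawled once:
normalizeUrl("https://example.com/blog/");       // "https://example.com/blog"
normalizeUrl("https://example.com/blog?page=1"); // "https://example.com/blog"
```

Pages discovered via sitemap, robots.txt, and link-following are keyed by this normalized form before extraction.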
Limits and rate control
| Setting | Default | Description |
|---|---|---|
| `maxPages` | 50 | Maximum pages to crawl (hard cap: 200) |
| `skipPaths` | `[]` | URL prefixes to ignore (e.g. `/admin`, `/api`) |
| Delay | 500ms | Between requests (respects `Crawl-delay` from robots.txt) |
| Timeout | 30s | Per-page load timeout |
The crawler waits for `networkidle2` on each page to capture JS-rendered content, then pauses before the next request.
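The interplay between the 500ms default and a robots.txt `Crawl-delay` can be sketched as follows — a minimal sketch assuming the larger of the two values wins (function name is illustrative, not the tool's API):

```javascript
// Hypothetical sketch: pick the inter-request pause. The 500ms default
// yields to a larger Crawl-delay (given in seconds) from robots.txt.
const DEFAULT_DELAY_MS = 500;

function requestDelayMs(crawlDelaySeconds) {
  if (typeof crawlDelaySeconds !== "number" || Number.isNaN(crawlDelaySeconds)) {
    return DEFAULT_DELAY_MS; // no directive in robots.txt
  }
  return Math.max(DEFAULT_DELAY_MS, crawlDelaySeconds * 1000);
}

requestDelayMs(undefined); // 500  — default pacing
requestDelayMs(2);         // 2000 — robots.txt asks for 2s between requests
```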
Content extraction
Each page is parsed for structured content. The extractor walks the main content area (`<main>`, `<article>`, or the largest content container) and classifies every element:
| HTML element | sitemd output |
|---|---|
| `<h1>` – `<h6>` | `#` – `######` headings |
| `<p>` | Paragraphs with bold, italic, code, and links preserved |
| `<ul>`, `<ol>` | `-` and `1.` lists with nesting |
| `<img>` | `![alt](src)` — queued for asset download |
| `<pre><code>` | Fenced code blocks with language hint from CSS class |
| `<table>` | Markdown tables |
| `<blockquote>` | `>` blockquotes |
| `<hr>` | `---` |
| `<iframe>` (YouTube, Vimeo, Spotify, etc.) | `embed: URL` |
| `<iframe>` (generic) | Inline HTML passthrough |
| `<form>` | `form:` block with extracted fields |
| Grid/flex containers with repeating card-like children | `card:` / `card-text:` / `card-image:` / `card-link:` |
| Button-styled `<a>` (class contains `btn`, `button`, `cta`) | `button: Label: URL` |
| `<details><summary>` | Inline HTML (sitemd renders it natively) |
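One row of the mapping above can be sketched concretely — headings keep their depth, only the notation changes (the function name is illustrative, not part of sitemd's API):

```javascript
// Hypothetical sketch of the <h1>-<h6> rule: a heading tag becomes a
// markdown heading with one "#" per level.
function headingToMarkdown(tagName, text) {
  const level = Number(tagName.replace(/^h/i, "")); // "h3" -> 3
  return "#".repeat(level) + " " + text;
}

headingToMarkdown("h2", "How it works"); // "## How it works"
```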
Page type detection
The extractor classifies each page by URL pattern and content heuristics:
| URL pattern | Detected type |
|---|---|
/blog/, /posts/, /articles/ |
Blog post |
/docs/, /documentation/, /guide/ |
Documentation |
/changelog |
Changelog |
/roadmap |
Roadmap |
| Everything else | Standard page |
Blog and docs pages are automatically assigned to their respective groups.
Card detection
The extractor looks for parent containers using CSS grid or flexbox with 2+ children. If each child contains an image, heading, text, or link, they're extracted as sitemd cards:
```
card: Product A
card-text: Our flagship offering with enterprise features.
card-image: /media/clone/product-a.jpg
card-link: Learn More: /products/a
```
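The "2+ children, each with card-like parts" heuristic can be sketched as a predicate over pre-extracted child data (this is an illustrative sketch, not the tool's internal code):

```javascript
// Hypothetical sketch of the card heuristic: a grid/flex container
// qualifies when it has 2+ children and every child carries at least one
// card-like part. `children` is assumed to be pre-extracted data, not
// live DOM nodes.
function looksLikeCardGrid(children) {
  if (children.length < 2) return false;
  return children.every((c) => c.image || c.heading || c.text || c.link);
}

looksLikeCardGrid([
  { heading: "Product A", text: "Flagship offering", link: "/products/a" },
  { heading: "Product B", image: "/media/b.jpg" },
]); // true — both children have card-like parts
```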
Form extraction
For each `<form>`, the extractor maps:
- `<input>` types → sitemd field types (`shorttext`, `email`, `phone`, `number`, `date`, `checkbox`, `radio`, `select`)
- `<label>` text → field labels
- `<select>` options → select/radio options
- Submit button text → `submitLabel`
- Form `action` URL → `webhook` (flagged for manual update)
The generated form uses a placeholder webhook URL. You need to replace it with your actual endpoint (Zapier, Make, n8n, a serverless function, etc.). See Forms.
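The input-type mapping can be sketched as a lookup table. The exact pairings and the `shorttext` fallback for unknown types are assumptions for illustration:

```javascript
// Hypothetical sketch of the <input type="..."> -> sitemd field type
// mapping. Unrecognized types fall back to shorttext (an assumption).
const FIELD_TYPES = {
  text: "shorttext",
  email: "email",
  tel: "phone",
  number: "number",
  date: "date",
  checkbox: "checkbox",
  radio: "radio",
};

function sitemdFieldType(inputType) {
  return FIELD_TYPES[inputType] || "shorttext";
}

sitemdFieldType("tel");    // "phone"
sitemdFieldType("search"); // "shorttext" (fallback)
```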
Site-level extraction
The homepage is also parsed for site-wide settings:
| What | Where it goes |
|---|---|
| Site title (from `<title>` or `og:site_name`) | `settings/meta.md` → `title`, `brandName` |
| Meta description | `settings/meta.md` → `description` |
| Navigation structure (`<nav>` links + dropdowns) | `settings/header.md` → `items` |
| Footer content (copyright, link groups) | `settings/footer.md` |
| Social links (detected by domain: github, twitter, linkedin, reddit, etc.) | `settings/footer.md` → `social` |
| Accent color (CSS custom properties or button backgrounds) | `settings/theme.md` → `accentColor` |
| Logo/favicon | `theme/images/` |
Asset downloading
Images referenced in page content are downloaded to `media/clone/` and URLs are rewritten to local paths. The asset handler:

- Skips images larger than 5MB
- Deduplicates by source URL (the same image on multiple pages is downloaded once)
- Resolves relative URLs against the page's base URL
- Saves logos and favicons to `theme/images/`
Disable asset downloading with `--no-assets` (CLI) or `includeAssets: false` (MCP tool).
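The resolution and dedup steps above can be sketched together — a minimal sketch assuming the absolute URL doubles as the dedup key (names are illustrative, not the tool's API):

```javascript
// Hypothetical sketch: resolve a relative src against the page URL, and
// use the absolute URL as the deduplication key.
function resolveAssetUrl(src, pageUrl) {
  return new URL(src, pageUrl).href;
}

const seen = new Set();
function shouldDownload(src, pageUrl) {
  const key = resolveAssetUrl(src, pageUrl);
  if (seen.has(key)) return false; // same image seen on another page: skip
  seen.add(key);
  return true;
}

resolveAssetUrl("../img/logo.png", "https://example.com/docs/intro");
// "https://example.com/img/logo.png"
```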
The clone report
The tool returns a JSON report — it does not write files directly. The report contains:
```json
{
  "source": "https://example.com",
  "crawled": 12,
  "site": {
    "title": "Example Co",
    "description": "We make things",
    "accentColor": "#e11d48",
    "suggestedMode": "light"
  },
  "navigation": { "header": [...], "footer": {...} },
  "groups": [{ "name": "blog", "items": [...] }],
  "pages": [
    {
      "slug": "/about",
      "title": "About Us",
      "type": "standard",
      "content": "# About Us\n\n...",
      "components": ["cards", "buttons"],
      "confidence": 0.85,
      "warnings": ["Form action URL needs updating"]
    }
  ],
  "assets": { "downloaded": 8, "skipped": 2 },
  "warnings": [...],
  "unmapped": [{ "url": "/dashboard", "reason": "HTTP 401" }]
}
```
Each page includes a confidence score (0–1) indicating how well the extraction mapped to sitemd components. Pages below 0.5 are flagged for manual review.
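Picking out the pages under the 0.5 threshold from the report is a simple filter — a sketch over the report shape shown above (the function name is illustrative):

```javascript
// Hypothetical sketch: collect slugs of pages whose extraction confidence
// falls below the review threshold.
function pagesNeedingReview(report, threshold = 0.5) {
  return report.pages
    .filter((p) => p.confidence < threshold)
    .map((p) => p.slug);
}

pagesNeedingReview({
  pages: [
    { slug: "/about", confidence: 0.85 },
    { slug: "/pricing", confidence: 0.4 },
  ],
}); // ["/pricing"]
```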
The agent then creates pages via `sitemd_pages_create`, updates settings, and reports what needs manual attention.
MCP tool
The `sitemd_clone` tool is available to any MCP-capable agent:
Parameters:
| Name | Required | Description |
|---|---|---|
| `url` | yes | Website URL to clone |
| `maxPages` | no | Max pages to crawl (default 50, max 200) |
| `includeAssets` | no | Download images locally (default `true`) |
| `skipPaths` | no | URL path prefixes to skip |
See MCP Server for setup and the full tool list.
CLI command
```
sitemd clone <url> [--max-pages N] [--skip /path1 /path2] [--no-assets]
```
When run in a terminal, the CLI pretty-prints a summary of what was found. When piped (non-TTY), it outputs the raw JSON report.
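The TTY switch described above can be sketched as follows — a minimal sketch assuming the standard Node `process.stdout.isTTY` check (function and message shape are illustrative):

```javascript
// Hypothetical sketch of the output-mode switch: pretty summary on a
// terminal, raw JSON when stdout is piped.
function renderReport(report, isTTY = process.stdout.isTTY) {
  if (isTTY) {
    return `Cloned ${report.crawled} pages from ${report.source}`;
  }
  return JSON.stringify(report); // machine-readable, e.g. for `... | jq`
}

renderReport({ source: "https://example.com", crawled: 12 }, true);
// "Cloned 12 pages from https://example.com"
```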
What needs manual attention
After cloning, the agent reports what it couldn't fully automate:
- Form webhook URLs — placeholder URLs need replacing with your actual endpoints
- Auth-gated pages — pages behind login walls (HTTP 401/403) are skipped
- Dynamic content — interactive widgets, custom CMS output, and client-rendered SPAs may produce incomplete extractions
- reCAPTCHA / third-party scripts — stripped during extraction
- Low-confidence pages — complex layouts that didn't map cleanly to sitemd components
The agent guides you through each item and suggests what to configure next.
Related
- MCP Server — all available MCP tools, including `sitemd_clone`
- Forms — configure form webhook URLs after cloning
- Navigation — adjust the imported header and footer
- Groups — organize imported blog/docs pages
- Themes — customize the look after content is imported
- Getting Started — project setup from scratch (alternative to cloning)