# Website Cloning

Import an entire existing website into your sitemd project. The clone tool scrapes text, images, navigation, forms, embeds, and card layouts — mapping each to the closest sitemd component. Styling is not copied; sitemd's three theme modes apply.

This is not a pixel-perfect replica. It extracts content and structure, then produces sitemd markdown that renders with your chosen theme. The result is a working site you can edit in markdown from day one.

## Quick start

Use the `/clone` skill from any AI coding agent:

```bash
/clone https://example.com
```

The agent checks for [Puppeteer](https://pptr.dev), installs it if missing, scrapes the site, and walks you through importing the results.

Or use the CLI directly:

```bash
sitemd clone https://example.com
```

## How it works

The clone pipeline runs in four phases:

1. **Crawl** — Discover all pages via `sitemap.xml`, `robots.txt`, and link-following from the homepage
2. **Extract** — Parse each page's HTML into structured data using a headless browser (Puppeteer)
3. **Map** — Convert the structured data into sitemd markdown syntax
4. **Assets** — Download images locally and rewrite URLs

The tool returns a structured JSON report. It does not create any files — the agent reviews the report with you, then creates pages with `sitemd_pages_create` and updates settings by editing the settings files directly.

## Puppeteer dependency

Website cloning requires a headless browser to render JavaScript-heavy pages. The clone tool uses [Puppeteer](https://pptr.dev), which downloads its own Chromium build (~150MB) on install.

Puppeteer is not bundled with sitemd. Install it only when you need to clone:

```bash
npm install puppeteer
```

The `/clone` skill handles this automatically — it checks for Puppeteer before starting and tells the agent to install it if missing. If you never use the clone feature, your project stays lightweight.

## Page discovery

The crawler finds pages in this order:

1. Fetch `/sitemap.xml` and parse all `<loc>` URLs (including sub-sitemaps)
2. Fetch `/robots.txt` for additional sitemap references and `Disallow` rules
3. BFS from the homepage, following all same-origin `<a href>` links
4. Deduplicate by normalizing URLs (trailing slashes, fragments, query params)

Results from all three sources are merged into a single deduplicated list before extraction begins.
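
Deduplication hinges on URL normalization. Here is a minimal sketch of the rules listed in step 4 — the exact normalization sitemd applies (e.g. whether query params are stripped entirely) is an assumption:

```typescript
// Normalize a URL for deduplication: drop the fragment, strip query
// parameters, and remove a trailing slash (except for the root "/").
function normalizeUrl(raw: string): string {
  const u = new URL(raw);
  u.hash = "";
  u.search = "";
  let path = u.pathname;
  if (path.length > 1 && path.endsWith("/")) path = path.slice(0, -1);
  return `${u.origin}${path}`;
}

// Dedupe the combined list from sitemap.xml, robots.txt, and link-following.
function dedupe(urls: string[]): string[] {
  return [...new Set(urls.map(normalizeUrl))];
}
```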

### Limits and rate control

| Setting | Default | Description |
| --- | --- | --- |
| `maxPages` | 50 | Maximum pages to crawl (hard cap: 200) |
| `skipPaths` | `[]` | URL prefixes to ignore (e.g. `/admin`, `/api`) |
| Delay | 500ms | Between requests (respects `Crawl-delay` from robots.txt) |
| Timeout | 30s | Per-page load timeout |

The crawler waits for `networkidle2` on each page to capture JS-rendered content, then pauses before the next request.
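Respecting `Crawl-delay` means reading it out of `robots.txt` and never dropping below the 500ms default. A simplified sketch of that parsing (the real crawler's rules are assumed):

```typescript
// Parse a Crawl-delay (seconds) from robots.txt text and convert it to a
// per-request pause in milliseconds, flooring at the 500ms default.
function requestDelayMs(robotsTxt: string, defaultMs = 500): number {
  const match = robotsTxt.match(/^crawl-delay:\s*([\d.]+)/im);
  if (!match) return defaultMs;
  return Math.max(defaultMs, Number(match[1]) * 1000);
}
```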

## Content extraction

Each page is parsed for structured content. The extractor walks the main content area (`<main>`, `<article>`, or the largest content container) and classifies every element:

| HTML element | sitemd output |
| --- | --- |
| `<h1>` – `<h6>` | `#` – `######` headings |
| `<p>` | Paragraphs with **bold**, *italic*, `code`, and [links]() preserved |
| `<ul>`, `<ol>` | `-` and `1.` lists with nesting |
| `<img>` | `![alt](src)` — queued for asset download |
| `<pre><code>` | Fenced code blocks with language hint from CSS class |
| `<table>` | Markdown tables |
| `<blockquote>` | `>` blockquotes |
| `<hr>` | `---` |
| `<iframe>` (YouTube, Vimeo, Spotify, etc.) | `embed: URL` |
| `<iframe>` (generic) | Inline HTML passthrough |
| `<form>` | `form:` block with extracted fields |
| Grid/flex containers with repeating card-like children | `card:` / `card-text:` / `card-image:` / `card-link:` |
| Button-styled `<a>` (class contains `btn`, `button`, `cta`) | `button: Label: URL` |
| `<details><summary>` | Inline HTML (sitemd renders it natively) |
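
The simplest rows in this table amount to a tag-to-prefix lookup. An illustrative sketch — the real extractor walks parsed DOM nodes, so this string-based version is only an approximation:

```typescript
// Map an element tag and its text to markdown, per the table above.
// Only the prefix-style rows are shown; nesting and inline formatting
// are handled elsewhere in the real pipeline.
function toMarkdown(tag: string, text: string): string {
  const h = tag.toLowerCase().match(/^h([1-6])$/);
  if (h) return `${"#".repeat(Number(h[1]))} ${text}`;
  switch (tag.toLowerCase()) {
    case "blockquote": return `> ${text}`;
    case "li": return `- ${text}`;
    case "hr": return "---";
    default: return text; // paragraphs pass through unchanged
  }
}
```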

### Page type detection

The extractor classifies each page by URL pattern and content heuristics:

| URL pattern | Detected type |
| --- | --- |
| `/blog/`, `/posts/`, `/articles/` | Blog post |
| `/docs/`, `/documentation/`, `/guide/` | Documentation |
| `/changelog` | Changelog |
| `/roadmap` | Roadmap |
| Everything else | Standard page |

Blog and docs pages are automatically assigned to their respective groups.
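
The URL-pattern half of the classifier can be sketched directly from the table (the content heuristics, and the exact matching rules, are assumptions):

```typescript
// Classify a page by its URL path, per the table above.
type PageType = "blog" | "docs" | "changelog" | "roadmap" | "standard";

function detectPageType(pathname: string): PageType {
  const p = pathname.toLowerCase();
  if (/\/(blog|posts|articles)\//.test(p)) return "blog";
  if (/\/(docs|documentation|guide)\//.test(p)) return "docs";
  if (p.includes("/changelog")) return "changelog";
  if (p.includes("/roadmap")) return "roadmap";
  return "standard";
}
```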

### Card detection

The extractor looks for CSS grid or flexbox containers with two or more children. If each child contains an image, heading, text, or link, the children are extracted as sitemd cards:

```markdown
card: Product A
card-text: Our flagship offering with enterprise features.
card-image: /media/clone/product-a.jpg
card-link: Learn More: /products/a
```

### Form extraction

For each `<form>`, the extractor maps:

- `<input>` types → sitemd field types (`shorttext`, `email`, `phone`, `number`, `date`, `checkbox`, `radio`, `select`)
- `<label>` text → field labels
- `<select>` options → select/radio options
- Submit button text → `submitLabel`
- Form `action` URL → `webhook` (flagged for manual update)

The generated form uses a placeholder webhook URL. You need to replace it with your actual endpoint (Zapier, Make, n8n, a serverless function, etc.). See [Forms](/docs/forms).
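
The input-type mapping above can be sketched as a lookup table — the fallback to `shorttext` for unrecognized types is an assumption, not documented sitemd behavior:

```typescript
// Map an HTML <input> type to a sitemd field type, per the list above.
// Unknown types fall back to "shorttext" (assumed).
function mapFieldType(inputType: string): string {
  const table: Record<string, string> = {
    text: "shorttext",
    email: "email",
    tel: "phone",
    number: "number",
    date: "date",
    checkbox: "checkbox",
    radio: "radio",
  };
  return table[inputType.toLowerCase()] ?? "shorttext";
}
```

(`<select>` elements map to `select` or `radio` at the element level rather than via an input type.)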

## Site-level extraction

The homepage is also parsed for site-wide settings:

| What | Where it goes |
| --- | --- |
| Site title (from `<title>` or `og:site_name`) | `settings/meta.md` → `title`, `brandName` |
| Meta description | `settings/meta.md` → `description` |
| Navigation structure (`<nav>` links + dropdowns) | `settings/header.md` → `items` |
| Footer content (copyright, link groups) | `settings/footer.md` |
| Social links (detected by domain: github, twitter, linkedin, reddit, etc.) | `settings/footer.md` → `social` |
| Accent color (CSS custom properties or button backgrounds) | `settings/theme.md` → `accentColor` |
| Logo/favicon | `theme/images/` |

## Asset downloading

Images referenced in page content are downloaded to `media/clone/` and URLs are rewritten to local paths. The asset handler:

- Skips images larger than 5MB
- Deduplicates by source URL (same image on multiple pages is downloaded once)
- Resolves relative URLs against the page's base URL
- Saves logos and favicons to `theme/images/`

Disable asset downloading with `--no-assets` (CLI) or `includeAssets: false` (MCP tool).
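
The skip/dedupe/resolve rules combine into a small gate. A sketch under stated assumptions: the byte count is supplied by the caller here, whereas the real handler would learn it from the HTTP response:

```typescript
// Decide whether to download an asset: resolve relative URLs against the
// page URL, skip anything over 5MB, and dedupe by resolved source URL so
// the same image on multiple pages is downloaded once.
const MAX_BYTES = 5 * 1024 * 1024;
const seen = new Set<string>();

function shouldDownload(src: string, pageUrl: string, sizeBytes: number): boolean {
  const resolved = new URL(src, pageUrl).toString(); // handles relative paths
  if (sizeBytes > MAX_BYTES) return false;
  if (seen.has(resolved)) return false; // already queued once
  seen.add(resolved);
  return true;
}
```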

## The clone report

The tool returns a JSON report — it does not write files directly. The report contains:

```json
{
  "source": "https://example.com",
  "crawled": 12,
  "site": {
    "title": "Example Co",
    "description": "We make things",
    "accentColor": "#e11d48",
    "suggestedMode": "light"
  },
  "navigation": { "header": [...], "footer": {...} },
  "groups": [{ "name": "blog", "items": [...] }],
  "pages": [
    {
      "slug": "/about",
      "title": "About Us",
      "type": "standard",
      "content": "# About Us\n\n...",
      "components": ["cards", "buttons"],
      "confidence": 0.85,
      "warnings": ["Form action URL needs updating"]
    }
  ],
  "assets": { "downloaded": 8, "skipped": 2 },
  "warnings": [...],
  "unmapped": [{ "url": "/dashboard", "reason": "HTTP 401" }]
}
```

Each page includes a **confidence score** (0–1) indicating how well the extraction mapped to sitemd components. Pages below 0.5 are flagged for manual review.
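
An agent consuming the report can apply that threshold directly. A minimal sketch, typed against only the fields shown in the JSON example above:

```typescript
// Collect slugs of report pages below the 0.5 confidence threshold,
// i.e. the ones flagged for manual review.
interface ReportPage {
  slug: string;
  confidence: number;
}

function pagesNeedingReview(pages: ReportPage[]): string[] {
  return pages.filter((p) => p.confidence < 0.5).map((p) => p.slug);
}
```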

The agent then creates pages via `sitemd_pages_create`, updates settings, and reports what needs manual attention.

## MCP tool

The `sitemd_clone` tool is available to any MCP-capable agent:

**Parameters:**

| Name | Required | Description |
| --- | --- | --- |
| `url` | yes | Website URL to clone |
| `maxPages` | no | Max pages to crawl (default 50, max 200) |
| `includeAssets` | no | Download images locally (default true) |
| `skipPaths` | no | URL path prefixes to skip |

See [MCP Server](/docs/mcp-server) for setup and the full tool list.

## CLI command

```bash
sitemd clone <url> [--max-pages N] [--skip /path1 /path2] [--no-assets]
```

When run in a terminal, the CLI pretty-prints a summary of what was found. When piped (non-TTY), it outputs the raw JSON report.
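
This is the standard Node pattern for the terminal-vs-pipe distinction: `process.stdout.isTTY` is `true` in an interactive terminal and `undefined` when output is piped. A sketch of the branching (sitemd's internals are assumed):

```typescript
// Choose an output mode from the stdout TTY flag, as a Node CLI would.
function outputMode(isTTY: boolean | undefined): "pretty" | "json" {
  return isTTY ? "pretty" : "json";
}

const mode = outputMode(process.stdout.isTTY);
```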

## What needs manual attention

After cloning, the agent reports what it couldn't fully automate:

- **Form webhook URLs** — placeholder URLs need replacing with your actual endpoints
- **Auth-gated pages** — pages behind login walls (HTTP 401/403) are skipped
- **Dynamic content** — interactive widgets, custom CMS output, and client-rendered SPAs may produce incomplete extractions
- **reCAPTCHA / third-party scripts** — stripped during extraction
- **Low-confidence pages** — complex layouts that didn't map cleanly to sitemd components

The agent guides you through each item and suggests what to configure next.

## Related

- [MCP Server](/docs/mcp-server) — all available MCP tools including `sitemd_clone`
- [Forms](/docs/forms) — configure form webhook URLs after cloning
- [Navigation](/docs/navigation) — adjust the imported header and footer
- [Groups](/docs/groups) — organize imported blog/docs pages
- [Themes](/docs/themes) — customize the look after content is imported
- [Getting Started](/docs/getting-started) — project setup from scratch (alternative to cloning)