crawlcrawl API Reference · v1 · v0.4

Introduction

The crawlcrawl API is REST. JSON in, JSON out. Bearer-token auth. Endpoints are versioned (/v1) and we won't break the wire format inside a major version.

Base URL:

https://66.163.122.173

A hostname-based URL with Let's Encrypt TLS lands once we have one assigned. Until then the IP works fine — agents just need to skip cert verification (it's the same machine and same key).

Quickstart

The shortest possible exchange: start a crawl, poll until done, fetch a page.

# export your key (issued per project — ask hello@crawlcrawl.com)
export CRAWLCRAWL_KEY="crk_..."

# 1. start a crawl
curl -k -X POST https://66.163.122.173/v1/crawls \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com", "max_pages": 50 }'
# → 202 {"id":43,"status":"queued","url":"https://example.com"}

# 2. poll the run
curl -k https://66.163.122.173/v1/crawls/43 \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"

# 3. when status=done, list pages and fetch one as markdown
curl -k "https://66.163.122.173/v1/crawls/43/pages" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"

curl -k "https://66.163.122.173/v1/pages/195?format=markdown" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"

Authentication

Every request needs Authorization: Bearer crk_.... Keys are scoped per project; we store only the SHA-256 hash and show plaintext exactly once at mint time. Never put a key in browser code.

To mint or rotate a key: email hello@crawlcrawl.com with your project name. Revocation is instant — keys are server-side only.

TLS

The endpoint runs Caddy with a self-signed cert that has the IP in its SubjectAltName. Until we move to a hostname + Let's Encrypt, agents must disable cert verification:

| Client | How to skip verify |
|---|---|
| curl | -k or --insecure |
| Node fetch | process.env.NODE_TLS_REJECT_UNAUTHORIZED = "0" |
| Python requests | verify=False |
| Rust reqwest | .danger_accept_invalid_certs(true) |
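
For the Python requests row, a minimal sketch. BASE and the CRAWLCRAWL_KEY variable are the Quickstart's; run 43 is just the example id.

import os
import requests
import urllib3

# verify=False prints an InsecureRequestWarning per request; silence it once.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

BASE = "https://66.163.122.173"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}"}

run = requests.get(f"{BASE}/v1/crawls/43", headers=HEADERS, verify=False).json()
print(run["status"])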

Limits

Limits are per-project and configurable per key. The defaults below match what new keys ship with.

| Limit | Default | Notes |
|---|---|---|
| Concurrent runs | 5 | Submitting a 6th while 5 are queued/running returns 429. |
| max_pages per run | 100,000 | Hard cap; the request is rejected at validation. |
| depth | 1..50 | Server-side validation — outside this range returns 400. |
| concurrency per crawl | unset (spider auto) | Sustained throughput on one LXC is ~3 pages/s; bumping concurrency above ~16 mostly queues at the worker pool. |
| Sustained system throughput | ~3 pages/s | One LXC, public IP. Workload budget today: 100k–200k pages/day across all projects. |

Errors

HTTP status codes are honest. Bodies are JSON in this envelope:

{
  "error": {
    "code":    "invalid_input",
    "message": "depth must be 1..=50"
  }
}

| Code | HTTP | Meaning |
|---|---|---|
| invalid_input | 400 | Body failed validation (bad URL, depth out of range, etc.). |
| unauthorized | 401 | Missing or wrong bearer token. |
| not_found | 404 | Resource doesn't exist, or belongs to a different project. |
| too_many_requests | 429 | You hit your concurrent-runs cap. Wait for one to finish. |
| internal | 500 | Something on our side. Will be in our logs; ping us. |
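
One way to consume the envelope from Python. start_crawl, the 10-second wait, and the BASE/HEADERS names are illustrative, not part of the API:

import os, time
import requests

BASE = "https://66.163.122.173"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}"}

def start_crawl(body):
    """POST /v1/crawls, retrying only when the concurrent-runs cap is hit."""
    while True:
        resp = requests.post(f"{BASE}/v1/crawls", json=body, headers=HEADERS, verify=False)
        if resp.status_code == 202:
            return resp.json()
        err = resp.json()["error"]
        if err["code"] == "too_many_requests":
            time.sleep(10)  # wait for a running crawl to finish, then retry
            continue
        raise RuntimeError(f"{resp.status_code} {err['code']}: {err['message']}")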

POST /v1/crawls — start a crawl

POST /v1/crawls

Enqueues a crawl. Returns 202 immediately — the worker picks it up from the Postgres queue and starts within seconds.

Body — top-level fields

| Field | Type | Description |
|---|---|---|
| url | string | Required. Must start with http:// or https://. Host is checked against the malicious-domain blocklist before queueing. |
| user_agent | string | Optional. If omitted (or starts with InternalCrawler/), we rotate to a fresh real-browser UA via ua_generator. |
| proxy_url | string? | Optional. Per-request override of project default. Supports http://, https://, socks5://. |
| webhook_url | string? | Optional. POSTed once when the run reaches done or failed. See Webhooks. |

Body — crawl params (FLAT, not nested)

These fields go at the top level of the body, not under a params key. The server uses serde flatten; a nested params object is silently ignored. See the sketch after the table below for a correct body.

| Field | Type | Default | Description |
|---|---|---|---|
| max_pages | int | 1000 | Hard cap; crawl stops when reached. Range 1..=100_000. |
| concurrency | int? | spider auto | Parallel in-flight requests. |
| depth | int | 5 | Max link-following depth from seed. Range 1..=50. |
| delay_ms | int | 250 | Delay between requests per host. |
| subdomains | bool | false | If true, follows links into subdomains of the seed host. |
| respect_robots | bool | true | Obey robots.txt. |
| store_html | bool | true | If false, we collect URL/status/bytes/metadata only — useful for cheap site maps. |
| seed_kind | string? | "url" | Either "url" (link-following) or "sitemap" (walks sitemap.xml + sitemap-index recursively). In sitemap mode you may pass either the origin (https://example.com, defaults to /sitemap.xml) or the explicit sitemap URL (https://example.com/sitemap-index.xml) — both work. Much faster than link-following on large sites. |
| headers | object? | | String→string map attached to every request in the crawl (e.g. auth, custom referer). |
| cookies | string? | | Set-Cookie-style string: k=v; k=v. |
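
A sketch of a correct flat body next to the nested shape that gets ignored (BASE/HEADERS as in the earlier snippets; the param values are examples only):

import os
import requests

BASE = "https://66.163.122.173"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}"}

# Correct: crawl params sit at the top level of the body.
body = {"url": "https://example.com", "max_pages": 500, "depth": 8, "store_html": False}

# Wrong: a nested "params" object is silently ignored, so this crawl
# would run with the defaults (max_pages=1000, depth=5) instead.
# body = {"url": "https://example.com", "params": {"max_pages": 500, "depth": 8}}

resp = requests.post(f"{BASE}/v1/crawls", json=body, headers=HEADERS, verify=False)
print(resp.json())  # → {"id": ..., "status": "queued", "url": "https://example.com"}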

Headers

| Header | Purpose |
|---|---|
| Authorization | Bearer crk_... — required. |
| Content-Type | application/json — required. |
| Idempotency-Key | Optional but recommended. See Idempotency. |

Response — 202 Accepted

{
  "id":     43,
  "status": "queued",
  "url":    "https://example.com"
}

GET /v1/crawls/{id} — run status

GET /v1/crawls/{id}

Poll every 2–5 seconds while status is queued or running. Or skip polling entirely and use a webhook.

{
  "id":            43,
  "url":           "https://aeoniti.com",
  "status":        "running",
  "page_count":    27,
  "error_count":   0,
  "enqueued_at":   "2026-05-08T16:48:36.151Z",
  "started_at":    "2026-05-08T16:48:36.401Z",
  "finished_at":   null,
  "error_message": null,
  "proxy_url":     null
}

status: queued | running | done | failed | cancelled.

Note: page_count includes pages that returned non-200 statuses (those count as "fetched"). error_count only counts crawler-side failures: DNS, TLS, connect timeout.
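
A minimal polling sketch; wait_for_run and the 3-second interval are ours, not the API's, and the terminal statuses are the ones listed above:

import os, time
import requests

BASE = "https://66.163.122.173"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}"}

def wait_for_run(run_id, interval=3.0):
    """Poll GET /v1/crawls/{id} until the run leaves queued/running."""
    while True:
        run = requests.get(f"{BASE}/v1/crawls/{run_id}", headers=HEADERS, verify=False).json()
        if run["status"] not in ("queued", "running"):
            return run  # done, failed, or cancelled
        time.sleep(interval)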

DELETE /v1/crawls/{id} — cancel and cascade

DELETE /v1/crawls/{id}

Cancels if running, then cascades — all pages rows for the run are removed. Returns 204 No Content. Subsequent GETs return 404.

GET /v1/crawls/{id}/pages — list pages

GET /v1/crawls/{id}/pages?limit=100&offset=0&status=200

| Query | Default | Notes |
|---|---|---|
| limit | 100 | Max 1000. |
| offset | 0 | Standard offset paging. |
| status | | Filter by HTTP status (e.g. only 200s). |

{
  "items": [
    { "id": 195, "url": "https://aeoniti.com/pricing",
      "status": 200, "bytes": 13650,
      "fetched_at": "2026-05-08T16:48:40.595Z" }
  ],
  "next_offset": 14
}

Save the global id per row — that's the path parameter for fetching content.
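
A paging sketch. It assumes a batch shorter than limit marks the end of the list, since the docs above don't pin down what next_offset returns on the final page:

import os
import requests

BASE = "https://66.163.122.173"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}"}

def iter_pages(run_id, status=None, limit=1000):
    """Yield page rows for a run, following offset paging."""
    offset = 0
    while True:
        params = {"limit": limit, "offset": offset}
        if status is not None:
            params["status"] = status  # e.g. 200 to keep only successes
        batch = requests.get(f"{BASE}/v1/crawls/{run_id}/pages",
                             params=params, headers=HEADERS, verify=False).json()
        yield from batch["items"]
        if len(batch["items"]) < limit:
            return  # short batch: assume we've reached the end
        offset = batch["next_offset"]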

GET /v1/pages/{id} — fetch one page

GET /v1/pages/{id}?format=markdown

id is the global page id from the list endpoint, not a 1-based index within a run.

| format | Returns |
|---|---|
| html | Raw HTML (decompressed) + metadata. Default. |
| markdown | Clean markdown via fast_html2md, link-preserving + metadata. |
| article | llm_readability boilerplate-stripped article_text + article_html + metadata. |
| both | html + markdown + article_text + article_html + metadata. |

Always returns id, url, status, bytes, fetched_at. metadata shape:

{
  "title":       "Pricing — AEONiti",
  "description": "...",
  "canonical":   "https://aeoniti.com/pricing",
  "og":      { "title": "...", "image": "...", "type": "website" },
  "twitter": { "card": "summary_large_image" },
  "json_ld": [ { "@type": "Product" } ]
}
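
A fetch sketch. It assumes the markdown content comes back under a markdown key, as the both row suggests; treat the field names as illustrative:

import os
import requests

BASE = "https://66.163.122.173"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}"}

def page_markdown(page_id):
    """Fetch one crawled page as clean markdown, plus its title."""
    resp = requests.get(f"{BASE}/v1/pages/{page_id}",
                        params={"format": "markdown"}, headers=HEADERS, verify=False)
    doc = resp.json()
    return doc["metadata"].get("title"), doc["markdown"]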

GET /v1/health · /v1/ready

No auth. Use /v1/health for liveness (the process is up); /v1/ready verifies that Postgres is reachable and the worker heartbeat is current.

curl https://66.163.122.173/v1/health # → 200 "ok"
curl https://66.163.122.173/v1/ready  # → 200 "ready" or 503 + JSON

Webhooks

If you set webhook_url on a crawl, the worker POSTs once when the run reaches a terminal state (done or failed).

Headers

Content-Type: application/json
X-Crawler-Run-Id: 43

Body

{
  "id":            43,
  "project_id":    2,
  "url":           "https://aeoniti.com",
  "status":        "done",
  "page_count":    14,
  "error_count":   0,
  "started_at":    "...",
  "finished_at":   "...",
  "error_message": null,
  "delivered_at":  "..."
}

Retry policy

  • 5 attempts total.
  • Exponential backoff: 1s, 2s, 4s, 8s.
  • 2xx → status delivered.
  • 4xx → marked dead immediately. Your endpoint rejected it; we don't retry.
  • 5xx / network / timeout → retry. After attempt 5 → dead.

Delivery state persists on the run. You can poll GET /v1/crawls/{id} and read webhook_status, webhook_attempts, webhook_last_error.

No HMAC signing yet — until that ships, treat the webhook body as advisory: re-fetch the run via GET before acting on it.
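
A minimal receiver sketch using Flask (an assumption; any HTTP server works, and /crawl-done is just an example path). Per the note above, it treats the payload as advisory and re-fetches the run before acting:

import os
import requests
from flask import Flask, request

BASE = "https://66.163.122.173"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}"}
app = Flask(__name__)

@app.post("/crawl-done")
def crawl_done():
    run_id = request.headers["X-Crawler-Run-Id"]
    # Don't trust the unsigned body; confirm the run state via the API.
    run = requests.get(f"{BASE}/v1/crawls/{run_id}", headers=HEADERS, verify=False).json()
    if run["status"] == "done":
        pass  # kick off page fetching / downstream processing here
    return "", 204  # any 2xx marks the delivery as delivered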

Idempotency

Pass Idempotency-Key: <your-uuid> on POST to make retries safe. The same key returns the original run (with the same id) instead of creating a duplicate. Keys are scoped per project and persist for 24 hours.

Use this when the network call to start the crawl might retry (queue handlers, lambda invocations, etc.).
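
A sketch of an idempotent submit. The UUID is generated once per logical crawl and reused on every retry; the body values are examples:

import os, uuid
import requests

BASE = "https://66.163.122.173"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}"}

idem_key = str(uuid.uuid4())  # store this alongside the job so retries reuse it
body = {"url": "https://example.com", "max_pages": 50}

resp = requests.post(f"{BASE}/v1/crawls", json=body, verify=False,
                     headers={**HEADERS, "Idempotency-Key": idem_key})
run = resp.json()  # retrying with the same key returns the same run id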

Common patterns

Single-page extract (URL → markdown)

{ "url": "https://example.com/post", "max_pages": 1, "depth": 1 }

Then ?format=both on the one resulting page.
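
Stitched together as a sketch (same BASE/HEADERS assumptions as the earlier snippets; the markdown field name is taken from the format table above):

import os, time
import requests

BASE = "https://66.163.122.173"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}"}

run = requests.post(f"{BASE}/v1/crawls", headers=HEADERS, verify=False,
                    json={"url": "https://example.com/post", "max_pages": 1, "depth": 1}).json()

while run["status"] in ("queued", "running"):
    time.sleep(2)
    run = requests.get(f"{BASE}/v1/crawls/{run['id']}", headers=HEADERS, verify=False).json()

page = requests.get(f"{BASE}/v1/crawls/{run['id']}/pages",
                    headers=HEADERS, verify=False).json()["items"][0]
doc = requests.get(f"{BASE}/v1/pages/{page['id']}", params={"format": "both"},
                   headers=HEADERS, verify=False).json()
print(doc["markdown"][:200])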

Crawl whole site, get clean text only

{
  "url": "https://customer.com",
  "max_pages": 500, "depth": 8,
  "store_html": false
}

Then ?format=article per page → embeddings input.

Fast site map via sitemap.xml

// either form works — origin defaults to /sitemap.xml,
// or pass the explicit URL (handy for /sitemap-index.xml etc.)
{
  "url": "https://customer.com/sitemap.xml",
  "seed_kind": "sitemap",
  "max_pages": 5000,
  "store_html": false
}

Authenticated crawl

{
  "url": "https://app.customer.com",
  "headers": { "Authorization": "Bearer abc123" },
  "cookies": "session=XYZ; csrf=ABC",
  "respect_robots": false
}

Async with webhook (recommended for production)

{
  "url": "https://customer.com",
  "max_pages": 200,
  "concurrency": 4,
  "webhook_url": "https://you.example.com/crawl-done"
}

Known limits

| Limit | Workaround |
|---|---|
| No JS rendering — SPAs that need client-side render won't yield content beyond the shell | Use seed_kind: "sitemap" if the site publishes one. Headless-chrome support is on the roadmap. |
| Self-signed TLS — agents must skip cert verify | Will swap to Let's Encrypt once a hostname points at 66.163.122.173. |
| Webhooks are unsigned | Re-verify by calling GET /v1/crawls/{id} from your handler. |
| Single LXC, single Postgres — no HA | Daily pg_dump backups, weekly restore-test drill. R2 off-box backup landing soon. |
| store_html=false means pages can't be re-fetched as markdown later | Decide upfront per crawl. |
| 404/403/429 from target are pages, not errors | Filter the page list by status if you only want successes. |

Roadmap

Things that aren't built yet, in rough priority order:

  • POST /v1/map — domain URL discovery via sitemap + robots
  • HMAC-signed webhooks
  • SSE streaming for live run logs
  • Off-box backup push (Cloudflare R2)
  • Hostname + Let's Encrypt cert
  • JS rendering via headless chrome (feature flag)
  • Schema-driven structured extraction (the /extract endpoint)
  • Self-host bundle (Docker compose + helm chart)

Last updated 2026-05-08. API version v0.4.0.