Introduction
The crawlcrawl API is REST. JSON in, JSON out. Bearer-token auth. Endpoints are versioned (/v1) and we won't break the wire format inside a major version.
Base URL: https://66.163.122.173/v1
A hostname-based URL with Let's Encrypt TLS lands once we have one assigned. Until then the IP works fine; agents just need to skip cert verification (it's the same machine and same key). See TLS below.
Quickstart
The shortest possible exchange: start a crawl, poll until done, fetch a page.
```bash
# export your key (issued per project — ask hello@crawlcrawl.com)
export CRAWLCRAWL_KEY="crk_..."

# 1. start a crawl
curl -k -X POST https://66.163.122.173/v1/crawls \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com", "max_pages": 50 }'
# → 202 {"id":43,"status":"queued","url":"https://example.com"}

# 2. poll the run
curl -k https://66.163.122.173/v1/crawls/43 \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"

# 3. when status=done, list pages and fetch one as markdown
curl -k "https://66.163.122.173/v1/crawls/43/pages" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
curl -k "https://66.163.122.173/v1/pages/195?format=markdown" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
```
Authentication
Every request needs Authorization: Bearer crk_.... Keys are scoped per project; we store only the SHA-256 hash and show plaintext exactly once at mint time. Never put a key in browser code.
To mint or rotate a key: email hello@crawlcrawl.com with your project name. Revocation is instant — keys are server-side only.
TLS
The endpoint runs Caddy with a self-signed cert that has the IP in its SubjectAltName. Until we move to a hostname + Let's Encrypt, agents must disable cert verification:
| Client | How to skip verify |
|---|---|
| curl | -k or --insecure |
| Node fetch | process.env.NODE_TLS_REJECT_UNAUTHORIZED = "0" |
| Python requests | verify=False |
| Rust reqwest | .danger_accept_invalid_certs(true) |
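For example, a minimal Python sketch using requests; verify=False also triggers an InsecureRequestWarning on every call, silenced below:

```python
# Sketch: hit the API with certificate verification disabled (Python requests).
import requests
import urllib3

# verify=False emits an InsecureRequestWarning per request; silence it once.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

r = requests.get(
    "https://66.163.122.173/v1/health",  # no auth needed on /health
    verify=False,  # self-signed cert with the IP in its SubjectAltName
)
print(r.status_code, r.text)  # → 200 ok
```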
Limits
Limits are per-project and configurable per key. The defaults below match what new keys ship with.
| Limit | Default | Notes |
|---|---|---|
| Concurrent runs | 5 | Submitting a 6th while 5 are queued/running returns 429. |
| max_pages per run | 100,000 | Hard cap; the request is rejected at validation. |
| depth | 1..50 | Server-side validation — outside this range returns 400. |
| concurrency per crawl | unset (spider auto) | Sustained throughput on one LXC is ~3 pages/s; bumping concurrency above ~16 mostly queues at the worker pool. |
| Sustained system throughput | ~3 pages/s | One LXC, public IP. Workload budget today: 100k–200k pages/day across all projects. |
Errors
HTTP status codes are honest. Bodies are JSON in this envelope:
{ "error": { "code": "invalid_input", "message": "depth must be 1..=50" } }
| Code | HTTP | Meaning |
|---|---|---|
| invalid_input | 400 | Body failed validation (bad URL, depth out of range, etc.). |
| unauthorized | 401 | Missing or wrong bearer token. |
| not_found | 404 | Resource doesn't exist, or belongs to a different project. |
| too_many_requests | 429 | You hit your concurrent-runs cap. Wait for one to finish. |
| internal | 500 | Something on our side. Will be in our logs; ping us. |
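A sketch of handling the envelope, assuming Python requests; the 30-second wait on 429 is an arbitrary choice, not a server hint:

```python
# Sketch: parse the error envelope and wait out the concurrent-runs cap.
import os
import time

import requests

def start_crawl(body: dict) -> dict:
    headers = {"Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}"}
    while True:
        r = requests.post("https://66.163.122.173/v1/crawls",
                          headers=headers, json=body, verify=False)
        if r.status_code == 429:
            time.sleep(30)  # too_many_requests: wait for a run to finish
            continue
        if r.status_code >= 400:
            err = r.json()["error"]  # envelope shape from above
            raise RuntimeError(f"{err['code']}: {err['message']}")
        return r.json()
```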
POST /v1/crawls — start a crawl
Enqueues a crawl. Returns 202 immediately — the worker picks it up from the Postgres queue and starts within seconds.
Body — top-level fields
| Field | Type | Description |
|---|---|---|
| url | string | Required. Must start with http:// or https://. Host is checked against the malicious-domain blocklist before queueing. |
| user_agent | string | Optional. If omitted (or starts with InternalCrawler/), we rotate to a fresh real-browser UA via ua_generator. |
| proxy_url | string? | Optional. Per-request override of project default. Supports http://, https://, socks5://. |
| webhook_url | string? | Optional. POSTed once when the run reaches done or failed. See Webhooks. |
Body — crawl params (FLAT, not nested)
These fields go at the top level of the body — not under a params key. The server uses serde flatten; nested params are silently ignored.
| Field | Type | Default | Description |
|---|---|---|---|
| max_pages | int | 1000 | Hard cap; crawl stops when reached. Range 1..=100_000. |
| concurrency | int? | spider auto | Parallel in-flight requests. |
| depth | int | 5 | Max link-following depth from seed. Range 1..=50. |
| delay_ms | int | 250 | Delay between requests per host. |
| subdomains | bool | false | If true, follows links into subdomains of the seed host. |
| respect_robots | bool | true | Obey robots.txt. |
| store_html | bool | true | If false, we collect URL/status/bytes/metadata only — useful for cheap site maps. |
| seed_kind | string? | "url" | Either "url" (link-following) or "sitemap" (walks sitemap.xml + sitemap-index recursively). In sitemap mode you may pass either the origin (https://example.com, defaults to /sitemap.xml) or the explicit sitemap URL (https://example.com/sitemap-index.xml) — both work. Much faster than link-following on large sites. |
| headers | object? | — | String→string map attached to every request in the crawl (e.g. auth, custom referer). |
| cookies | string? | — | Set-Cookie-style string: k=v; k=v. |
Headers
| Header | Purpose |
|---|---|
| Authorization | Bearer crk_... — required. |
| Content-Type | application/json — required. |
| Idempotency-Key | Optional but recommended. See Idempotency. |
Response — 202 Accepted
{ "id": 43, "status": "queued", "url": "https://example.com" }
GET /v1/crawls/{id} — run status
Poll every 2–5 seconds while status is queued or running. Or skip polling entirely and use a webhook.
{ "id": 43, "url": "https://aeoniti.com", "status": "running", "page_count": 27, "error_count": 0, "enqueued_at": "2026-05-08T16:48:36.151Z", "started_at": "2026-05-08T16:48:36.401Z", "finished_at": null, "error_message": null, "proxy_url": null }
status ∈ queued | running | done | failed | cancelled.
Note: page_count includes pages that returned non-200 statuses (those count as "fetched"). error_count only counts crawler-side failures: DNS, TLS, connect timeout.
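A polling sketch in Python (requests assumed):

```python
# Sketch: poll every few seconds until the run reaches a terminal state.
import os
import time

import requests

API = "https://66.163.122.173/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}"}

def wait_for_run(run_id: int) -> dict:
    while True:
        run = requests.get(f"{API}/crawls/{run_id}",
                           headers=HEADERS, verify=False).json()
        if run["status"] in ("done", "failed", "cancelled"):
            return run
        time.sleep(3)  # within the suggested 2–5 s window
```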
DELETE /v1/crawls/{id} — cancel and cascade
Cancels if running, then cascades — all pages rows for the run are removed. Returns 204 No Content. Subsequent GETs return 404.
GET /v1/crawls/{id}/pages — list pages
| Query | Default | Notes |
|---|---|---|
| limit | 100 | Max 1000. |
| offset | 0 | Standard offset paging. |
| status | — | Filter by HTTP status (e.g. only 200s). |
{ "items": [ { "id": 195, "url": "https://aeoniti.com/pricing", "status": 200, "bytes": 13650, "fetched_at": "2026-05-08T16:48:40.595Z" } ], "next_offset": 14 }
Save the global id per row — that's the path parameter for fetching content.
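A pagination sketch in Python; it stops when a page comes back short, since the behavior of next_offset on the final page isn't specified above:

```python
# Sketch: walk a run's pages with limit/offset (requests assumed).
import os

import requests

API = "https://66.163.122.173/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}"}

def all_pages(run_id: int, limit: int = 1000):
    offset = 0
    while True:
        body = requests.get(f"{API}/crawls/{run_id}/pages",
                            params={"limit": limit, "offset": offset},
                            headers=HEADERS, verify=False).json()
        yield from body["items"]
        if len(body["items"]) < limit:
            return  # short page, no more rows
        offset = body.get("next_offset", offset + limit)
```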
GET /v1/pages/{id} — fetch one page
id is the global page id from the list endpoint, not a 1-based index within a run.
| format | Returns |
|---|---|
| html | Raw HTML (decompressed) + metadata. Default. |
| markdown | Clean markdown via fast_html2md, link-preserving + metadata. |
| article | llm_readability boilerplate-stripped article_text + article_html + metadata. |
| both | html + markdown + article_text + article_html + metadata. |
Always returns id, url, status, bytes, fetched_at. metadata shape:
{ "title": "Pricing — AEONiti", "description": "...", "canonical": "https://aeoniti.com/pricing", "og": { "title": "...", "image": "...", "type": "website" }, "twitter": { "card": "summary_large_image" }, "json_ld": [ { "@type": "Product" } ] }
GET /v1/health · /v1/ready
No auth. Use /v1/health for liveness (process up); /v1/ready verifies Postgres reachable + worker heartbeat current.
```bash
curl -k https://66.163.122.173/v1/health   # → 200 "ok"
curl -k https://66.163.122.173/v1/ready    # → 200 "ready" or 503 + JSON
```
Webhooks
If you set webhook_url on a crawl, the worker POSTs once when the run reaches a terminal state (done or failed).
Headers
```
Content-Type: application/json
X-Crawler-Run-Id: 43
```
Body
{ "id": 43, "project_id": 2, "url": "https://aeoniti.com", "status": "done", "page_count": 14, "error_count": 0, "started_at": "...", "finished_at": "...", "error_message": null, "delivered_at": "..." }
Retry policy
- 5 attempts total.
- Exponential backoff: 1s, 2s, 4s, 8s.
- 2xx → status `delivered`.
- 4xx → marked `dead` immediately. Your endpoint rejected it; we don't retry.
- 5xx / network / timeout → retry. After attempt 5 → `dead`.
Delivery state persists on the run. You can poll GET /v1/crawls/{id} and read webhook_status, webhook_attempts, webhook_last_error.
No HMAC signing yet — until that ships, treat the webhook body as advisory: re-fetch the run via GET before acting on it.
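A receiver sketch in Python (stdlib http.server plus requests, both assumptions) that follows that advice and re-fetches the run before acting:

```python
# Sketch: webhook receiver that re-verifies the unsigned payload via GET.
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

API = "https://66.163.122.173/v1"
KEY = os.environ["CRAWLCRAWL_KEY"]

class CrawlHook(BaseHTTPRequestHandler):
    def do_POST(self):
        run_id = self.headers["X-Crawler-Run-Id"]
        self.rfile.read(int(self.headers.get("Content-Length", 0)))  # advisory only
        run = requests.get(f"{API}/crawls/{run_id}",
                           headers={"Authorization": f"Bearer {KEY}"},
                           verify=False).json()  # trust this, not the body
        if run["status"] == "done":
            print(f"run {run_id}: {run['page_count']} pages crawled")
        self.send_response(200)  # 2xx marks the delivery `delivered`
        self.end_headers()

HTTPServer(("", 8080), CrawlHook).serve_forever()
```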
Idempotency
Pass Idempotency-Key: <your-uuid> on POST to make retries safe. The same key returns the original run (with the same id) instead of creating a duplicate. Keys are scoped per project and persist for 24 hours.
Use this when the network call to start the crawl might retry (queue handlers, lambda invocations, etc.).
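A sketch, assuming Python requests; the retry count and timeout are arbitrary:

```python
# Sketch: retry-safe crawl start using an Idempotency-Key.
import os
import uuid

import requests

idem_key = str(uuid.uuid4())  # persist this alongside your job record
for attempt in range(3):
    try:
        r = requests.post(
            "https://66.163.122.173/v1/crawls",
            headers={
                "Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}",
                "Idempotency-Key": idem_key,  # same key → same run, no duplicate
            },
            json={"url": "https://example.com", "max_pages": 50},
            verify=False,
            timeout=10,
        )
        print(r.json()["id"])
        break
    except requests.RequestException:
        continue  # safe: the server dedupes on the key for 24 hours
```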
Common patterns
Single-page extract (URL → markdown)
{ "url": "https://example.com/post", "max_pages": 1, "depth": 1 }
Then ?format=both on the one resulting page.
Crawl whole site, get clean text only
{ "url": "https://customer.com", "max_pages": 500, "depth": 8, "store_html": false }
Then ?format=article per page → embeddings input.
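For instance, a small Python sketch pulling article text for every successful page in a finished run (run id and limit illustrative):

```python
# Sketch: collect boilerplate-stripped text for all 200s in one run.
import os

import requests

API = "https://66.163.122.173/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}"}

pages = requests.get(f"{API}/crawls/43/pages",
                     params={"limit": 500, "status": 200},  # successes only
                     headers=HEADERS, verify=False).json()["items"]
texts = [
    requests.get(f"{API}/pages/{p['id']}",
                 params={"format": "article"},
                 headers=HEADERS, verify=False).json()["article_text"]
    for p in pages
]  # → embeddings input
```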
Fast site map via sitemap.xml
```jsonc
// either form works — origin defaults to /sitemap.xml,
// or pass the explicit URL (handy for /sitemap-index.xml etc.)
{
  "url": "https://customer.com/sitemap.xml",
  "seed_kind": "sitemap",
  "max_pages": 5000,
  "store_html": false
}
```
Authenticated crawl
{ "url": "https://app.customer.com", "headers": { "Authorization": "Bearer abc123" }, "cookies": "session=XYZ; csrf=ABC", "respect_robots": false }
Async with webhook (recommended for production)
{ "url": "https://customer.com", "max_pages": 200, "concurrency": 4, "webhook_url": "https://you.example.com/crawl-done" }
Known limits
| Limit | Workaround |
|---|---|
| No JS rendering — SPAs that need client-side render won't yield content beyond the shell | Use seed_kind: "sitemap" if the site publishes one. Headless-chrome support is on the roadmap. |
| Self-signed TLS — agents must skip cert verify | Will swap to Let's Encrypt once a hostname points at 66.163.122.173. |
| Webhooks are unsigned | Re-verify by calling GET /v1/crawls/{id} from your handler. |
| Single LXC, single Postgres — no HA | Daily pg_dump backups, weekly restore-test drill. R2 off-box backup landing soon. |
| store_html=false means pages can't be re-fetched as markdown later | Decide upfront per crawl. |
| 404/403/429 from target are pages, not errors | Filter the page list by status if you only want successes. |
Roadmap
Things that aren't built yet, in rough priority order:
- POST /v1/map — domain URL discovery via sitemap + robots
- HMAC-signed webhooks
- SSE streaming for live run logs
- Off-box backup push (Cloudflare R2)
- Hostname + Let's Encrypt cert
- JS rendering via headless chrome (feature flag)
- Schema-driven structured extraction (the /extract endpoint)
- Self-host bundle (Docker compose + helm chart)
Last updated 2026-05-08. API version v0.4.0.