crawlcrawl API Reference · v1 · v0.4

Introduction

The crawlcrawl API is REST. JSON in, JSON out. Bearer-token auth. Endpoints are versioned (/v1) and we won't break the wire format inside a major version.

Base URL:

https://66.163.122.173

A hostname-based URL with Let's Encrypt TLS lands once we have one assigned. Until then the IP works fine — agents just need to skip cert verification (it's the same machine and same key).

Quickstart

The shortest possible exchange: start a crawl, poll until done, fetch a page.

# export your key (issued per project — ask hello@crawlcrawl.com)
export CRAWLCRAWL_KEY="crk_..."

# 1. start a crawl
curl -k -X POST https://66.163.122.173/v1/crawls \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com", "max_pages": 50 }'
# → 202 {"id":43,"status":"queued","url":"https://example.com"}

# 2. poll the run
curl -k https://66.163.122.173/v1/crawls/43 \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"

# 3. when status=done, list pages and fetch one as markdown
curl -k "https://66.163.122.173/v1/crawls/43/pages" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"

curl -k "https://66.163.122.173/v1/pages/195?format=markdown" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"

Authentication

Every request needs Authorization: Bearer crk_.... Keys are scoped per project; we store only the SHA-256 hash and show plaintext exactly once at mint time. Never put a key in browser code.

To mint or rotate a key: email hello@crawlcrawl.com with your project name. Revocation is instant — keys are server-side only.

TLS

The endpoint runs Caddy with a self-signed cert that has the IP in its SubjectAltName. Until we move to a hostname + Let's Encrypt, agents must disable cert verification:

| Client | How to skip verify |
|---|---|
| curl | -k or --insecure |
| Node fetch | process.env.NODE_TLS_REJECT_UNAUTHORIZED = "0" |
| Python requests | verify=False |
| Rust reqwest | .danger_accept_invalid_certs(true) |
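
For the Python requests row, a minimal sketch. BASE and the CRAWLCRAWL_KEY variable are the Quickstart's; run 43 is just the example id.

import os
import requests
import urllib3

# verify=False prints an InsecureRequestWarning per request; silence it once.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

BASE = "https://66.163.122.173"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}"}

run = requests.get(f"{BASE}/v1/crawls/43", headers=HEADERS, verify=False).json()
print(run["status"])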

Limits

Limits are per-project and configurable per key. The defaults below match what new keys ship with.

| Limit | Default | Notes |
|---|---|---|
| Concurrent runs | 5 | Submitting a 6th while 5 are queued/running returns 429. |
| max_pages per run | 100,000 | Hard cap; the request is rejected at validation. |
| depth | 1..50 | Server-side validation — outside this range returns 400. |
| concurrency per crawl | unset (spider auto) | Sustained throughput on one LXC is ~3 pages/s; bumping concurrency above ~16 mostly queues at the worker pool. |
| Sustained system throughput | ~3 pages/s | One LXC, public IP. Workload budget today: 100k–200k pages/day across all projects. |

Errors

HTTP status codes are honest. Bodies are JSON in this envelope:

{
  "error": {
    "code":    "invalid_input",
    "message": "depth must be 1..=50"
  }
}

| Code | HTTP | Meaning |
|---|---|---|
| invalid_input | 400 | Body failed validation (bad URL, depth out of range, etc.). |
| unauthorized | 401 | Missing or wrong bearer token. |
| not_found | 404 | Resource doesn't exist, or belongs to a different project. |
| too_many_requests | 429 | You hit your concurrent-runs cap. Wait for one to finish. |
| internal | 500 | Something on our side. Will be in our logs; ping us. |
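
One way to consume the envelope from Python. start_crawl, the 10-second wait, and the BASE/HEADERS names are illustrative, not part of the API:

import os, time
import requests

BASE = "https://66.163.122.173"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}"}

def start_crawl(body):
    """POST /v1/crawls, retrying only when the concurrent-runs cap is hit."""
    while True:
        resp = requests.post(f"{BASE}/v1/crawls", json=body, headers=HEADERS, verify=False)
        if resp.status_code == 202:
            return resp.json()
        err = resp.json()["error"]
        if err["code"] == "too_many_requests":
            time.sleep(10)  # wait for a running crawl to finish, then retry
            continue
        raise RuntimeError(f"{resp.status_code} {err['code']}: {err['message']}")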

POST /v1/crawls — start a crawl

POST /v1/crawls

Enqueues a crawl. Returns 202 immediately — the worker picks it up from the Postgres queue and starts within seconds.

Body — top-level fields

| Field | Type | Description |
|---|---|---|
| url | string | Required. Must start with http:// or https://. Host is checked against the malicious-domain blocklist before queueing. |
| user_agent | string | Optional. If omitted (or starts with InternalCrawler/), we rotate to a fresh real-browser UA via ua_generator. |
| proxy_url | string? | Optional. Per-request override of project default. Supports http://, https://, socks5://. |
| webhook_url | string? | Optional. POSTed once when the run reaches done or failed. See Webhooks. |

Body — crawl params (FLAT, not nested)

These fields go at the top level of the body, not under a params key. The server uses serde flatten; a nested params object is silently ignored. See the sketch after the table below for a correct body.

| Field | Type | Default | Description |
|---|---|---|---|
| max_pages | int | 1000 | Hard cap; crawl stops when reached. Range 1..=100_000. |
| concurrency | int? | spider auto | Parallel in-flight requests. |
| depth | int | 5 | Max link-following depth from seed. Range 1..=50. |
| delay_ms | int | 250 | Delay between requests per host. |
| subdomains | bool | false | If true, follows links into subdomains of the seed host. |
| respect_robots | bool | true | Obey robots.txt. |
| store_html | bool | true | If false, we collect URL/status/bytes/metadata only — useful for cheap site maps. |
| seed_kind | string? | "url" | Either "url" (link-following) or "sitemap" (walks sitemap.xml + sitemap-index recursively). In sitemap mode you may pass either the origin (https://example.com, defaults to /sitemap.xml) or the explicit sitemap URL (https://example.com/sitemap-index.xml) — both work. Much faster than link-following on large sites. |
| headers | object? | | String→string map attached to every request in the crawl (e.g. auth, custom referer). |
| cookies | string? | | Set-Cookie-style string: k=v; k=v. |
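
A sketch of a correct flat body next to the nested shape that gets ignored (BASE/HEADERS as in the earlier snippets; the param values are examples only):

import os
import requests

BASE = "https://66.163.122.173"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}"}

# Correct: crawl params sit at the top level of the body.
body = {"url": "https://example.com", "max_pages": 500, "depth": 8, "store_html": False}

# Wrong: a nested "params" object is silently ignored, so this crawl
# would run with the defaults (max_pages=1000, depth=5) instead.
# body = {"url": "https://example.com", "params": {"max_pages": 500, "depth": 8}}

resp = requests.post(f"{BASE}/v1/crawls", json=body, headers=HEADERS, verify=False)
print(resp.json())  # → {"id": ..., "status": "queued", "url": "https://example.com"}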

Headers

| Header | Purpose |
|---|---|
| Authorization | Bearer crk_... — required. |
| Content-Type | application/json — required. |
| Idempotency-Key | Optional but recommended. See Idempotency. |

Response — 202 Accepted

{
  "id":     43,
  "status": "queued",
  "url":    "https://example.com"
}

GET /v1/crawls/{id} — run status

GET /v1/crawls/{id}

Poll every 2–5 seconds while status is queued or running. Or skip polling entirely and use a webhook.

{
  "id":            43,
  "url":           "https://aeoniti.com",
  "status":        "running",
  "page_count":    27,
  "error_count":   0,
  "enqueued_at":   "2026-05-08T16:48:36.151Z",
  "started_at":    "2026-05-08T16:48:36.401Z",
  "finished_at":   null,
  "error_message": null,
  "proxy_url":     null
}

status: queued | running | done | failed | cancelled.

Note: page_count includes pages that returned non-200 statuses (those count as "fetched"). error_count only counts crawler-side failures: DNS, TLS, connect timeout.
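
A minimal polling sketch; wait_for_run and the 3-second interval are ours, not the API's, and the terminal statuses are the ones listed above:

import os, time
import requests

BASE = "https://66.163.122.173"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}"}

def wait_for_run(run_id, interval=3.0):
    """Poll GET /v1/crawls/{id} until the run leaves queued/running."""
    while True:
        run = requests.get(f"{BASE}/v1/crawls/{run_id}", headers=HEADERS, verify=False).json()
        if run["status"] not in ("queued", "running"):
            return run  # done, failed, or cancelled
        time.sleep(interval)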

DELETE /v1/crawls/{id} — cancel and cascade

DELETE /v1/crawls/{id}

Cancels if running, then cascades — all pages rows for the run are removed. Returns 204 No Content. Subsequent GETs return 404.

GET /v1/crawls/{id}/pages — list pages

GET /v1/crawls/{id}/pages?limit=100&offset=0&status=200

| Query | Default | Notes |
|---|---|---|
| limit | 100 | Max 1000. |
| offset | 0 | Standard offset paging. |
| status | | Filter by HTTP status (e.g. only 200s). |

{
  "items": [
    { "id": 195, "url": "https://aeoniti.com/pricing",
      "status": 200, "bytes": 13650,
      "fetched_at": "2026-05-08T16:48:40.595Z" }
  ],
  "next_offset": 14
}

Save the global id per row — that's the path parameter for fetching content.
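
A paging sketch. It assumes a batch shorter than limit marks the end of the list, since the docs above don't pin down what next_offset returns on the final page:

import os
import requests

BASE = "https://66.163.122.173"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}"}

def iter_pages(run_id, status=None, limit=1000):
    """Yield page rows for a run, following offset paging."""
    offset = 0
    while True:
        params = {"limit": limit, "offset": offset}
        if status is not None:
            params["status"] = status  # e.g. 200 to keep only successes
        batch = requests.get(f"{BASE}/v1/crawls/{run_id}/pages",
                             params=params, headers=HEADERS, verify=False).json()
        yield from batch["items"]
        if len(batch["items"]) < limit:
            return  # short batch: assume we've reached the end
        offset = batch["next_offset"]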

GET /v1/pages/{id} — fetch one page

GET /v1/pages/{id}?format=markdown

id is the global page id from the list endpoint, not a 1-based index within a run.

| format | Returns |
|---|---|
| html | Raw HTML (decompressed) + metadata. Default. |
| markdown | Clean markdown via fast_html2md, link-preserving + metadata. |
| article | llm_readability boilerplate-stripped article_text + article_html + metadata. |
| both | html + markdown + article_text + article_html + metadata. |

Always returns id, url, status, bytes, fetched_at. metadata shape:

{
  "title":       "Pricing — AEONiti",
  "description": "...",
  "canonical":   "https://aeoniti.com/pricing",
  "og":      { "title": "...", "image": "...", "type": "website" },
  "twitter": { "card": "summary_large_image" },
  "json_ld": [ { "@type": "Product" } ]
}
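
A fetch sketch. It assumes the markdown content comes back under a markdown key, as the both row suggests; treat the field names as illustrative:

import os
import requests

BASE = "https://66.163.122.173"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}"}

def page_markdown(page_id):
    """Fetch one crawled page as clean markdown, plus its title."""
    resp = requests.get(f"{BASE}/v1/pages/{page_id}",
                        params={"format": "markdown"}, headers=HEADERS, verify=False)
    doc = resp.json()
    return doc["metadata"].get("title"), doc["markdown"]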

GET /v1/health · /v1/ready

No auth. Use /v1/health for liveness (the process is up); /v1/ready verifies that Postgres is reachable and the worker heartbeat is current.

curl https://66.163.122.173/v1/health # → 200 "ok"
curl https://66.163.122.173/v1/ready  # → 200 "ready" or 503 + JSON

Webhooks

If you set webhook_url on a crawl, the worker POSTs once when the run reaches a terminal state (done or failed).

Headers

Content-Type: application/json
X-Crawler-Run-Id: 43

Body

{
  "id":            43,
  "project_id":    2,
  "url":           "https://aeoniti.com",
  "status":        "done",
  "page_count":    14,
  "error_count":   0,
  "started_at":    "...",
  "finished_at":   "...",
  "error_message": null,
  "delivered_at":  "..."
}

Retry policy

  • 5 attempts total.
  • Exponential backoff: 1s, 2s, 4s, 8s.
  • 2xx → status delivered.
  • 4xx → marked dead immediately. Your endpoint rejected it; we don't retry.
  • 5xx / network / timeout → retry. After attempt 5 → dead.

Delivery state persists on the run. You can poll GET /v1/crawls/{id} and read webhook_status, webhook_attempts, webhook_last_error.

No HMAC signing yet — until that ships, treat the webhook body as advisory: re-fetch the run via GET before acting on it.
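
A minimal receiver sketch using Flask (an assumption; any HTTP server works, and /crawl-done is just an example path). Per the note above, it treats the payload as advisory and re-fetches the run before acting:

import os
import requests
from flask import Flask, request

BASE = "https://66.163.122.173"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}"}
app = Flask(__name__)

@app.post("/crawl-done")
def crawl_done():
    run_id = request.headers["X-Crawler-Run-Id"]
    # Don't trust the unsigned body; confirm the run state via the API.
    run = requests.get(f"{BASE}/v1/crawls/{run_id}", headers=HEADERS, verify=False).json()
    if run["status"] == "done":
        pass  # kick off page fetching / downstream processing here
    return "", 204  # any 2xx marks the delivery as delivered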

Idempotency

Pass Idempotency-Key: <your-uuid> on POST to make retries safe. The same key returns the original run (with the same id) instead of creating a duplicate. Keys are scoped per project and persist for 24 hours.

Use this when the network call to start the crawl might retry (queue handlers, lambda invocations, etc.).
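
A sketch of an idempotent submit. The UUID is generated once per logical crawl and reused on every retry; the body values are examples:

import os, uuid
import requests

BASE = "https://66.163.122.173"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}"}

idem_key = str(uuid.uuid4())  # store this alongside the job so retries reuse it
body = {"url": "https://example.com", "max_pages": 50}

resp = requests.post(f"{BASE}/v1/crawls", json=body, verify=False,
                     headers={**HEADERS, "Idempotency-Key": idem_key})
run = resp.json()  # retrying with the same key returns the same run id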

Common patterns

Single-page extract (URL → markdown)

{ "url": "https://example.com/post", "max_pages": 1, "depth": 1 }

Then ?format=both on the one resulting page.
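
Stitched together as a sketch (same BASE/HEADERS assumptions as the earlier snippets; the markdown field name is taken from the format table above):

import os, time
import requests

BASE = "https://66.163.122.173"
HEADERS = {"Authorization": f"Bearer {os.environ['CRAWLCRAWL_KEY']}"}

run = requests.post(f"{BASE}/v1/crawls", headers=HEADERS, verify=False,
                    json={"url": "https://example.com/post", "max_pages": 1, "depth": 1}).json()

while run["status"] in ("queued", "running"):
    time.sleep(2)
    run = requests.get(f"{BASE}/v1/crawls/{run['id']}", headers=HEADERS, verify=False).json()

page = requests.get(f"{BASE}/v1/crawls/{run['id']}/pages",
                    headers=HEADERS, verify=False).json()["items"][0]
doc = requests.get(f"{BASE}/v1/pages/{page['id']}", params={"format": "both"},
                   headers=HEADERS, verify=False).json()
print(doc["markdown"][:200])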

Crawl whole site, get clean text only

{
  "url": "https://customer.com",
  "max_pages": 500, "depth": 8,
  "store_html": false
}

Then ?format=article per page → embeddings input.

Fast site map via sitemap.xml

// either form works — origin defaults to /sitemap.xml,
// or pass the explicit URL (handy for /sitemap-index.xml etc.)
{
  "url": "https://customer.com/sitemap.xml",
  "seed_kind": "sitemap",
  "max_pages": 5000,
  "store_html": false
}

Authenticated crawl

{
  "url": "https://app.customer.com",
  "headers": { "Authorization": "Bearer abc123" },
  "cookies": "session=XYZ; csrf=ABC",
  "respect_robots": false
}

Async with webhook (recommended for production)

{
  "url": "https://customer.com",
  "max_pages": 200,
  "concurrency": 4,
  "webhook_url": "https://you.example.com/crawl-done"
}

Known limits

| Limit | Workaround |
|---|---|
| No JS rendering — SPAs that need client-side render won't yield content beyond the shell | Use seed_kind: "sitemap" if the site publishes one. Headless-chrome support is on the roadmap. |
| Self-signed TLS — agents must skip cert verify | Will swap to Let's Encrypt once a hostname points at 66.163.122.173. |
| Webhooks are unsigned | Re-verify by calling GET /v1/crawls/{id} from your handler. |
| Single LXC, single Postgres — no HA | Daily pg_dump backups, weekly restore-test drill. R2 off-box backup landing soon. |
| store_html=false means pages can't be re-fetched as markdown later | Decide upfront per crawl. |
| 404/403/429 from target are pages, not errors | Filter the page list by status if you only want successes. |

Roadmap

Things that aren't built yet, in rough priority order:

  • POST /v1/map — domain URL discovery via sitemap + robots
  • HMAC-signed webhooks
  • SSE streaming for live run logs
  • Off-box backup push (Cloudflare R2)
  • Hostname + Let's Encrypt cert
  • JS rendering via headless chrome (feature flag)
  • Schema-driven structured extraction (the /extract endpoint)
  • Self-host bundle (Docker compose + helm chart)

Last updated 2026-05-08. API version v0.4.0.