v0.4 · production · single LXC, public IP egress

Crawl the web,
/* in one POST */

crawlcrawl is a small HTTP crawler API for the agents you ship. Hand it a URL or sitemap, get back clean markdown, boilerplate-stripped article text, and structured metadata. No browser farm. No proxy reseller markup.

$ curl -X POST https://66.163.122.173/v1/crawls · Rust + spider-rs + Postgres
Live shape

Paste a URL. Get pages.

measured on aeoniti.com: 14 pages / 4.2s
Honest demo — uses canned responses with the exact shape the real API returns. Open the full API → endpoint live at 66.163.122.173
The whole API

Six endpoints. That's it.

No SDK to install, no client library to keep up to date. Just JSON over HTTPS. Add a bearer token; you're done.

POST /v1/crawls

Start a crawl

URL or sitemap seed. Configure max_pages, depth, concurrency, headers, cookies. Worker leases the job from a Postgres queue and goes.

returns 202 in <50ms
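Roughly what a fuller request body looks like from Node 18+. The parameter names match the list above; the nested shapes of headers and cookies are illustrative guesses, not the documented schema.

// Sketch: start a crawl with the knobs listed above (Node 18+ fetch).
// The nested shapes of `headers` and `cookies` are assumptions, not the documented schema.
const res = await fetch("https://66.163.122.173/v1/crawls", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.CRAWLCRAWL_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://example.com/sitemap.xml", // URL or sitemap seed
    max_pages: 100,
    depth: 3,
    concurrency: 4,
    headers: { "X-Requested-By": "docs-bot" },     // assumed shape: sent on every fetch
    cookies: { session: "value-for-auth-crawls" }, // assumed shape
  }),
});
const crawl = await res.json(); // 202 → { id, status: "queued", url }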
GET /v1/crawls/{id}

Run status

Poll for queued → running → done|failed|cancelled, with page_count and error_count updated live.

or use a webhook
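A polling sketch. The status values and the page_count / error_count fields come from the description above; the interval and the rest of the response shape are illustrative.

// Poll until the run leaves queued/running; terminal states per the docs: done | failed | cancelled.
const base = "https://66.163.122.173";
const auth = { "Authorization": `Bearer ${process.env.CRAWLCRAWL_KEY}` };

async function waitForCrawl(id: string) {
  while (true) {
    const r = await fetch(`${base}/v1/crawls/${id}`, { headers: auth });
    const crawl = await r.json(); // { status, page_count, error_count, ... }
    if (["done", "failed", "cancelled"].includes(crawl.status)) return crawl;
    await new Promise((ok) => setTimeout(ok, 2000)); // or skip polling entirely and use a webhook
  }
}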
GET /v1/crawls/{id}/pages

List pages

Paginated index of every page the run touched. Filter by status code. Each row gives you a global page id.

limit + offset paging
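A pagination sketch: limit and offset are the documented knobs, while the pages envelope and the status_code filter name are assumptions for illustration.

// Page through a run 100 rows at a time, keeping only 200s.
// The `pages` envelope and the status_code filter name are assumptions; limit/offset are documented.
const base = "https://66.163.122.173";
const auth = { "Authorization": `Bearer ${process.env.CRAWLCRAWL_KEY}` };

async function listPages(crawlId: string) {
  const rows: any[] = [];
  for (let offset = 0; ; offset += 100) {
    const url = `${base}/v1/crawls/${crawlId}/pages?limit=100&offset=${offset}&status_code=200`;
    const batch = await (await fetch(url, { headers: auth })).json();
    rows.push(...batch.pages); // each row carries a global page id for GET /v1/pages/{id}
    if (batch.pages.length < 100) return rows;
  }
}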
GET /v1/pages/{id}

Fetch one page

format=html, markdown, article (boilerplate-stripped), or both. Always includes title, OG, Twitter, JSON-LD, canonical.

~5.7× zstd compression in storage
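Fetching a single page as article text might look like this; only the format values are documented here, the response field names are assumptions.

// Pull one page as boilerplate-stripped article text plus metadata.
const pageId = "a-page-id-from-the-pages-listing";
const r = await fetch(`https://66.163.122.173/v1/pages/${pageId}?format=article`, {
  headers: { "Authorization": `Bearer ${process.env.CRAWLCRAWL_KEY}` },
});
const page = await r.json();
console.log(page.title);   // assumed field
console.log(page.article); // assumed field: the stripped article text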
DELETE /v1/crawls/{id}

Cancel + cascade

Stops a running crawl and cascade-deletes its page rows, so test runs clean up without leaving orphans.

204 on success
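Cancelling is a single call; beyond the documented 204, the snippet is ordinary fetch boilerplate.

// Cancel a run; the server cascade-deletes its page rows.
const crawlId = "a-crawl-id-to-cancel";
const r = await fetch(`https://66.163.122.173/v1/crawls/${crawlId}`, {
  method: "DELETE",
  headers: { "Authorization": `Bearer ${process.env.CRAWLCRAWL_KEY}` },
});
console.log(r.status); // 204 on success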
GET /v1/health · /v1/ready

Health

/health is process-only (no DB), /ready checks Postgres + worker heartbeat. Good for k8s and uptime probes.

no auth required
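Both probes are plain unauthenticated GETs; the 200 in the comment is an assumption, not a documented contract.

// Liveness vs readiness, no token needed.
const live = await fetch("https://66.163.122.173/v1/health"); // process-only, no DB
const ready = await fetch("https://66.163.122.173/v1/ready"); // Postgres + worker heartbeat
console.log(live.status, ready.status); // assumed: 200 when healthy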
Drop in anywhere

No SDK. Just HTTP.

REST in, JSON out, bearer auth. If you can write a fetch call, you have a client. We won't ship a JS lib that breaks every release.

// no install — Node 18+ has fetch
// self-signed cert until LE; remove this line once we move to a hostname
process.env.NODE_TLS_REJECT_UNAUTHORIZED = "0";

const r = await fetch("https://66.163.122.173/v1/crawls", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.CRAWLCRAWL_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://example.com",
    max_pages: 50,
    concurrency: 4,
  }),
});
const { id } = await r.json(); // → { id, status: "queued", url }
Honesty section

What this is — and what it isn't.

Today
✓ HTTP fetching at scale
✓ Clean markdown via fast_html2md
✓ Article extraction via llm_readability
✓ Sitemap-driven crawl mode
✓ Webhook delivery + 5× retry
✓ Idempotency keys
✓ Custom headers + cookies (auth crawls)
✓ Per-project quotas + audit log

On the roadmap
JS rendering (headless Chrome)
Schema-driven AI extraction
/v1/map — domain URL discovery
SSE log streaming for live runs
HMAC signing on webhooks
Hash-pinned reproducible snapshots
Self-host bundle (Docker / k8s)
Multi-region egress IPs

No SOC 2. No "trusted by 12,000 teams". No 800 req/s burst. We have one LXC, a public IP, and a worker that crawls about 3 pages/sec sustained. That's enough for the workload it serves today.

What people use it for

Built for agents that read the web.

01 / RAG

Knowledge bases

Crawl docs sites and changelogs into chunked markdown. Article mode strips nav/footer/sidebar so embeddings index meaning, not chrome.
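If you are chunking for embeddings, a naive splitter over the article text is often enough to start. The 1,500-character budget and paragraph-boundary rule below are illustrative choices, not API behavior.

// Naive chunker for article markdown pulled from /v1/pages/{id}?format=article.
function chunkMarkdown(markdown: string, maxChars = 1500): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const para of markdown.split(/\n{2,}/)) {
    if (current.length + para.length > maxChars && current) {
      chunks.push(current.trim());
      current = "";
    }
    current += para + "\n\n";
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}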

02 / Agents

Web-browsing tools

Give your agent a web_read tool that returns markdown plus extracted metadata, not 4MB of script tags. Token bills stay sane.
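A web_read tool is just the calls above stitched together: seed a one-page crawl, wait, fetch markdown. The wrapper, the polling interval, and the response field names are illustrative.

// Sketch of a web_read tool: one URL in, markdown out.
const base = "https://66.163.122.173";
const auth = {
  "Authorization": `Bearer ${process.env.CRAWLCRAWL_KEY}`,
  "Content-Type": "application/json",
};

async function webRead(url: string): Promise<string> {
  // 1. seed a single-page crawl
  const start = await fetch(`${base}/v1/crawls`, {
    method: "POST",
    headers: auth,
    body: JSON.stringify({ url, max_pages: 1 }),
  });
  const { id } = await start.json();

  // 2. wait for the run to finish
  let status = "queued";
  while (status === "queued" || status === "running") {
    await new Promise((ok) => setTimeout(ok, 1000));
    status = (await (await fetch(`${base}/v1/crawls/${id}`, { headers: auth })).json()).status;
  }

  // 3. grab the first page as markdown (envelope and field names assumed)
  const list = await (await fetch(`${base}/v1/crawls/${id}/pages?limit=1`, { headers: auth })).json();
  const page = await (await fetch(`${base}/v1/pages/${list.pages[0].id}?format=markdown`, { headers: auth })).json();
  return page.markdown;
}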

03 / Monitoring

Site change detection

Crawl on a schedule, diff content hashes, fire a webhook when it changes. No cron loops in your code.
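The pages API is not documented here to return content hashes, so a scheduler can hash the markdown itself; a sketch, assuming you persist a url-to-hash map between runs.

// Detect changes between two scheduled runs by hashing each page's markdown locally.
import { createHash } from "node:crypto";

function contentHash(markdown: string): string {
  return createHash("sha256").update(markdown).digest("hex");
}

// previous: url → hash persisted from the last run (storage is up to you)
function changedUrls(previous: Map<string, string>, pages: { url: string; markdown: string }[]) {
  return pages
    .filter((p) => previous.get(p.url) !== contentHash(p.markdown))
    .map((p) => p.url);
}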

04 / Onboarding

Prospect research

Hand a domain to an agent, let it map the site (sitemap mode), then summarize the company in one pass. ~4s for small sites.

05 / SEO / AEO

Audit feeds

Pull every page's title, canonical, OG, and JSON-LD in one pass. Feed it into your audit pipeline. Use format=html if you need raw markup back.

06 / Compliance

Robots-aware by default

respect_robots=true is the default, and a malicious-domain blocklist rejects matching seeds outright. Override only when you control the target.

One POST.
Pages out the other end.

No dashboard. No npm install. No "contact sales". Just an HTTP endpoint that does its job.

Read the API → Get a key