v0.4 · production · single LXC, public IP egress

Crawl the web,
/* in one POST */

crawlcrawl is a small HTTP crawler API for the agents you ship. Hand it a URL or sitemap, get back clean markdown, boilerplate-stripped article text, and structured metadata. No browser farm. No proxy reseller markup.

$ curl -X POST https://66.163.122.173/v1/crawls · Rust + spider-rs + Postgres
Live shape

Paste a URL. Get pages.

measured on aeoniti.com: 14 pages / 4.2s
Honest demo — uses canned responses with the exact shape the real API returns. Open the full API → endpoint live at 66.163.122.173
The whole API

Six endpoints. That's it.

No SDK to install, no client library to keep up to date. Just JSON over HTTPS. Add a bearer token; you're done.

POST /v1/crawls

Start a crawl

URL or sitemap seed. Configure max_pages, depth, concurrency, headers, cookies. Worker leases the job from a Postgres queue and goes.

returns 202 in <50ms
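Roughly what a fuller request body looks like from Node 18+. The parameter names match the list above; the nested shapes of headers and cookies are illustrative guesses, not the documented schema.

// Sketch: start a crawl with the knobs listed above (Node 18+ fetch).
// The nested shapes of `headers` and `cookies` are assumptions, not the documented schema.
const res = await fetch("https://66.163.122.173/v1/crawls", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.CRAWLCRAWL_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://example.com/sitemap.xml", // URL or sitemap seed
    max_pages: 100,
    depth: 3,
    concurrency: 4,
    headers: { "X-Requested-By": "docs-bot" },     // assumed shape: sent on every fetch
    cookies: { session: "value-for-auth-crawls" }, // assumed shape
  }),
});
const crawl = await res.json(); // 202 → { id, status: "queued", url }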
GET /v1/crawls/{id}

Run status

Poll for queued → running → done|failed|cancelled, with page_count and error_count updated live.

or use a webhook
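A polling sketch. The status values and the page_count / error_count fields come from the description above; the interval and the rest of the response shape are illustrative.

// Poll until the run leaves queued/running; terminal states per the docs: done | failed | cancelled.
const base = "https://66.163.122.173";
const auth = { "Authorization": `Bearer ${process.env.CRAWLCRAWL_KEY}` };

async function waitForCrawl(id: string) {
  while (true) {
    const r = await fetch(`${base}/v1/crawls/${id}`, { headers: auth });
    const crawl = await r.json(); // { status, page_count, error_count, ... }
    if (["done", "failed", "cancelled"].includes(crawl.status)) return crawl;
    await new Promise((ok) => setTimeout(ok, 2000)); // or skip polling entirely and use a webhook
  }
}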
GET /v1/crawls/{id}/pages

List pages

Paginated index of every page the run touched. Filter by status code. Each row gives you a global page id.

limit + offset paging
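A pagination sketch: limit and offset are the documented knobs, while the pages envelope and the status_code filter name are assumptions for illustration.

// Page through a run 100 rows at a time, keeping only 200s.
// The `pages` envelope and the status_code filter name are assumptions; limit/offset are documented.
const base = "https://66.163.122.173";
const auth = { "Authorization": `Bearer ${process.env.CRAWLCRAWL_KEY}` };

async function listPages(crawlId: string) {
  const rows: any[] = [];
  for (let offset = 0; ; offset += 100) {
    const url = `${base}/v1/crawls/${crawlId}/pages?limit=100&offset=${offset}&status_code=200`;
    const batch = await (await fetch(url, { headers: auth })).json();
    rows.push(...batch.pages); // each row carries a global page id for GET /v1/pages/{id}
    if (batch.pages.length < 100) return rows;
  }
}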
GET /v1/pages/{id}

Fetch one page

format=html, markdown, article (boilerplate-stripped), or both. Always includes title, OG, Twitter, JSON-LD, canonical.

~5.7× zstd compression in storage
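Fetching a single page as article text might look like this; only the format values are documented here, the response field names are assumptions.

// Pull one page as boilerplate-stripped article text plus metadata.
const pageId = "a-page-id-from-the-pages-listing";
const r = await fetch(`https://66.163.122.173/v1/pages/${pageId}?format=article`, {
  headers: { "Authorization": `Bearer ${process.env.CRAWLCRAWL_KEY}` },
});
const page = await r.json();
console.log(page.title);   // assumed field
console.log(page.article); // assumed field: the stripped article text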
DELETE /v1/crawls/{id}

Cancel + cascade

Stops a running crawl and cascade-deletes its page rows, so test runs clean up without leaving orphans.

204 on success
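Cancelling is a single call; beyond the documented 204, the snippet is ordinary fetch boilerplate.

// Cancel a run; the server cascade-deletes its page rows.
const crawlId = "a-crawl-id-to-cancel";
const r = await fetch(`https://66.163.122.173/v1/crawls/${crawlId}`, {
  method: "DELETE",
  headers: { "Authorization": `Bearer ${process.env.CRAWLCRAWL_KEY}` },
});
console.log(r.status); // 204 on success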
GET /v1/health · /v1/ready

Health

/health is process-only (no DB), /ready checks Postgres + worker heartbeat. Good for k8s and uptime probes.

no auth required
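Both probes are plain unauthenticated GETs; the 200 in the comment is an assumption, not a documented contract.

// Liveness vs readiness, no token needed.
const live = await fetch("https://66.163.122.173/v1/health"); // process-only, no DB
const ready = await fetch("https://66.163.122.173/v1/ready"); // Postgres + worker heartbeat
console.log(live.status, ready.status); // assumed: 200 when healthy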
Drop in anywhere

No SDK. Just HTTP.

REST in, JSON out, bearer auth. If you can write a fetch call, you have a client. We won't ship a JS lib that breaks every release.

// no install — Node 18+ has fetch
// self-signed cert until LE; remove this line once we move to a hostname
process.env.NODE_TLS_REJECT_UNAUTHORIZED = "0";

const r = await fetch("https://66.163.122.173/v1/crawls", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.CRAWLCRAWL_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://example.com",
    max_pages: 50,
    concurrency: 4,
  }),
});
const { id } = await r.json(); // → { id, status: "queued", url }
Honesty section

What this is — and what it isn't.

Today
✓ HTTP fetching at scale
✓ Clean markdown via fast_html2md
✓ Article extraction via llm_readability
✓ Sitemap-driven crawl mode
✓ Webhook delivery + 5× retry
✓ Idempotency keys
✓ Custom headers + cookies (auth crawls)
✓ Per-project quotas + audit log

On the roadmap
JS rendering (headless Chrome)
Schema-driven AI extraction
/v1/map — domain URL discovery
SSE log streaming for live runs
HMAC signing on webhooks
Hash-pinned reproducible snapshots
Self-host bundle (Docker / k8s)
Multi-region egress IPs

No SOC 2. No "trusted by 12,000 teams". No 800 req/s burst. We have one LXC, a public IP, and a worker that crawls about 3 pages/sec sustained. That's enough for the workload it serves today.

What people use it for

Built for agents that read the web.

01 / RAG

Knowledge bases

Crawl docs sites and changelogs into chunked markdown. Article mode strips nav/footer/sidebar so embeddings index meaning, not chrome.
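If you are chunking for embeddings, a naive splitter over the article text is often enough to start. The 1,500-character budget and paragraph-boundary rule below are illustrative choices, not API behavior.

// Naive chunker for article markdown pulled from /v1/pages/{id}?format=article.
function chunkMarkdown(markdown: string, maxChars = 1500): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const para of markdown.split(/\n{2,}/)) {
    if (current.length + para.length > maxChars && current) {
      chunks.push(current.trim());
      current = "";
    }
    current += para + "\n\n";
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}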

02 / Agents

Web-browsing tools

Give your agent a web_read tool that returns markdown plus extracted metadata, not 4MB of script tags. Token bills stay sane.
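A web_read tool is just the calls above stitched together: seed a one-page crawl, wait, fetch markdown. The wrapper, the polling interval, and the response field names are illustrative.

// Sketch of a web_read tool: one URL in, markdown out.
const base = "https://66.163.122.173";
const auth = {
  "Authorization": `Bearer ${process.env.CRAWLCRAWL_KEY}`,
  "Content-Type": "application/json",
};

async function webRead(url: string): Promise<string> {
  // 1. seed a single-page crawl
  const start = await fetch(`${base}/v1/crawls`, {
    method: "POST",
    headers: auth,
    body: JSON.stringify({ url, max_pages: 1 }),
  });
  const { id } = await start.json();

  // 2. wait for the run to finish
  let status = "queued";
  while (status === "queued" || status === "running") {
    await new Promise((ok) => setTimeout(ok, 1000));
    status = (await (await fetch(`${base}/v1/crawls/${id}`, { headers: auth })).json()).status;
  }

  // 3. grab the first page as markdown (envelope and field names assumed)
  const list = await (await fetch(`${base}/v1/crawls/${id}/pages?limit=1`, { headers: auth })).json();
  const page = await (await fetch(`${base}/v1/pages/${list.pages[0].id}?format=markdown`, { headers: auth })).json();
  return page.markdown;
}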

03 / Monitoring

Site change detection

Crawl on a schedule, diff content hashes, fire a webhook when it changes. No cron loops in your code.
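The pages API is not documented here to return content hashes, so a scheduler can hash the markdown itself; a sketch, assuming you persist a url-to-hash map between runs.

// Detect changes between two scheduled runs by hashing each page's markdown locally.
import { createHash } from "node:crypto";

function contentHash(markdown: string): string {
  return createHash("sha256").update(markdown).digest("hex");
}

// previous: url → hash persisted from the last run (storage is up to you)
function changedUrls(previous: Map<string, string>, pages: { url: string; markdown: string }[]) {
  return pages
    .filter((p) => previous.get(p.url) !== contentHash(p.markdown))
    .map((p) => p.url);
}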

04 / Onboarding

Prospect research

Hand a domain to an agent, let it map the site (sitemap mode), then summarize the company in one pass. ~4s for small sites.

05 / SEO / AEO

Audit feeds

Pull every page's title, canonical, OG, and JSON-LD in one pass. Feed it into your audit pipeline. Use format=html if you need raw markup back.

06 / Compliance

Robots-aware by default

respect_robots=true is the default, and a malicious-domain blocklist rejects matching seeds outright. Override only when you control the target.

One POST.
Pages out the other end.

No dashboard. No npm install. No "contact sales". Just an HTTP endpoint that does its job.

Read the API → Get a key