crawlcrawl is a small HTTP crawler API for the agents you ship. Hand it a URL or sitemap, get back clean markdown, boilerplate-stripped article text, and structured metadata. No browser farm. No proxy reseller markup.
# Press "Run →" to see the API shape.
No SDK to install, no client library to keep up to date. Just JSON over HTTPS. Add a bearer token; you're done.
URL or sitemap seed. Configure max_pages, depth, concurrency, headers, cookies. Worker leases the job from a Postgres queue and goes.
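For illustration, a fuller create body might look like this. `url`, `max_pages`, and `concurrency` match the quickstart below; the exact shape of `depth`, `headers`, and `cookies` is our assumption, not a documented schema.

```js
// Hypothetical full config for POST /v1/crawls. The depth, headers, and
// cookies shapes are assumptions based on the option names above.
const body = {
  url: "https://example.com",
  max_pages: 200,
  depth: 3,                                    // assumed: link-follow depth from the seed
  concurrency: 4,
  headers: { "User-Agent": "my-agent/1.0" },   // assumed: sent on every page fetch
  cookies: { session: "abc123" },              // assumed: sent on every page fetch
};
```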
Poll for queued → running → done|failed|cancelled, with page_count and error_count updated live.
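A minimal polling sketch. The `GET /v1/crawls/:id` path is inferred from the create endpoint in the quickstart below, so treat it as an assumption; the status values and counters are the ones named above.

```js
// Poll until the crawl reaches a terminal state. GET /v1/crawls/:id is an
// assumed path; done/failed/cancelled and the counters come from the docs above.
async function waitForCrawl(id) {
  for (;;) {
    const r = await fetch(`https://66.163.122.173/v1/crawls/${id}`, {
      headers: { Authorization: `Bearer ${process.env.CRAWLCRAWL_KEY}` },
    });
    const crawl = await r.json();
    if (["done", "failed", "cancelled"].includes(crawl.status)) return crawl;
    console.log(`${crawl.status}: ${crawl.page_count} pages, ${crawl.error_count} errors`);
    await new Promise((res) => setTimeout(res, 2000)); // wait 2s between polls
  }
}
```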
Paginated index of every page the run touched. Filter by status code. Each row gives you a global page id.
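A listing sketch, assuming the index hangs off the crawl at `/v1/crawls/:id/pages` with `status` and `page` query params; the path, param names, and response envelope are all guesses.

```js
// List pages for a run, filtered to 200s. Path, query params, and the
// { pages } envelope are assumptions; `id` is from the create call below.
const r = await fetch(
  `https://66.163.122.173/v1/crawls/${id}/pages?status=200&page=1`,
  { headers: { Authorization: `Bearer ${process.env.CRAWLCRAWL_KEY}` } },
);
const { pages } = await r.json();
for (const p of pages) console.log(p.id, p.url, p.status);
```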
format=html, markdown, article (boilerplate-stripped), or both. Always includes title, OG, Twitter, JSON-LD, canonical.
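Fetching one page's content, assuming a `/v1/pages/:id` path keyed by the global page id from the index above; the response field names are guesses too.

```js
// Fetch one page's extracted content. /v1/pages/:id and the response
// fields are assumptions; the format values are the documented ones.
const pageId = pages[0].id; // a global page id from the index sketch above
const r = await fetch(
  `https://66.163.122.173/v1/pages/${pageId}?format=article`,
  { headers: { Authorization: `Bearer ${process.env.CRAWLCRAWL_KEY}` } },
);
const page = await r.json();
console.log(page.metadata.title); // assumed field names
console.log(page.content);        // boilerplate-stripped article text
```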
Stops a running crawl and cascade-deletes its page rows. Cleans up after testing without leaving orphans.
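A cancel sketch, assuming the obvious `DELETE` on the crawl resource:

```js
// Stop a crawl and drop its page rows. DELETE on the crawl URL is an assumption.
await fetch(`https://66.163.122.173/v1/crawls/${id}`, {
  method: "DELETE",
  headers: { Authorization: `Bearer ${process.env.CRAWLCRAWL_KEY}` },
});
```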
/health is process-only (no DB), /ready checks Postgres + worker heartbeat. Good for k8s and uptime probes.
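Probe checks from Node, assuming the probe endpoints are unauthenticated:

```js
// /health answers if the process is up; /ready also checks Postgres + worker heartbeat.
const alive = (await fetch("https://66.163.122.173/health")).ok;
const ready = (await fetch("https://66.163.122.173/ready")).ok;
```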
REST in, JSON out, bearer auth. If you can write a fetch call, you have a client. We won't ship a JS lib that breaks every release.
```js
// no install: Node 18+ has fetch
// self-signed cert until LE; remove this line once we move to a hostname
process.env.NODE_TLS_REJECT_UNAUTHORIZED = "0";

const r = await fetch("https://66.163.122.173/v1/crawls", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.CRAWLCRAWL_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://example.com",
    max_pages: 50,
    concurrency: 4,
  }),
});

const { id } = await r.json(); // → { id, status: "queued", url }
```
No SOC 2. No "trusted by 12,000 teams". No 800 req/s burst. We have one LXC, a public IP, and a worker that crawls about 3 pages/sec sustained. That's enough for the workload it serves today.
Crawl docs sites and changelogs into chunked markdown. Article mode strips nav/footer/sidebar so embeddings index meaning, not chrome.
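A chunking sketch for the article markdown. The 1,000-character budget and the heading-boundary split are arbitrary choices on our side, not part of the API.

```js
// Split article markdown into ~1k-char chunks, breaking ahead of headings
// so each chunk stays on one topic. Size and split rule are arbitrary.
function chunkMarkdown(md, maxLen = 1000) {
  const chunks = [];
  let current = "";
  for (const block of md.split(/\n(?=#)/)) {
    if (current && current.length + block.length > maxLen) {
      chunks.push(current);
      current = "";
    }
    current += block + "\n";
  }
  if (current.trim()) chunks.push(current);
  return chunks;
}
```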
Give your agent a web_read tool that returns markdown plus extracted metadata, not 4MB of script tags. Token bills stay sane.
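A `web_read` tool sketch that composes the earlier snippets; it reuses `waitForCrawl` from the polling sketch, and the page paths carry the same assumptions.

```js
// One-page crawl, wait, return markdown + metadata. Paths other than
// POST /v1/crawls are the same assumptions as the sketches above.
const BASE = "https://66.163.122.173";
const auth = { Authorization: `Bearer ${process.env.CRAWLCRAWL_KEY}` };

async function webRead(url) {
  const create = await fetch(`${BASE}/v1/crawls`, {
    method: "POST",
    headers: { ...auth, "Content-Type": "application/json" },
    body: JSON.stringify({ url, max_pages: 1 }),
  });
  const { id } = await create.json();
  await waitForCrawl(id); // from the polling sketch above
  const index = await fetch(`${BASE}/v1/crawls/${id}/pages`, { headers: auth });
  const [first] = (await index.json()).pages; // assumed envelope
  const page = await fetch(`${BASE}/v1/pages/${first.id}?format=markdown`, { headers: auth });
  return page.json(); // markdown + metadata, not 4MB of script tags
}
```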
Crawl on a schedule, diff content hashes, fire a webhook when it changes. No cron loops in your code.
Hand a domain to an agent, let it map the site (sitemap mode), then summarize the company in one pass. ~4s for small sites.
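A sitemap-seeded run might look like the sketch below; whether the API auto-detects a sitemap URL in the seed or wants an explicit flag is an assumption.

```js
// Seed from a sitemap. We assume a sitemap URL in `url` triggers sitemap mode;
// an explicit flag may exist instead.
const r = await fetch("https://66.163.122.173/v1/crawls", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.CRAWLCRAWL_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ url: "https://example.com/sitemap.xml", max_pages: 100 }),
});
```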
Pull every page's title, canonical, OG, and JSON-LD in one pass. Feed it into your audit pipeline. Use format=html if you need raw markup back.
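An audit-loop sketch, continuing from the pages-index snippet (`pages`) with the same assumed paths and field names:

```js
// Collect per-page metadata for an audit. Paths and field names follow
// the assumptions in the earlier sketches.
const rows = [];
for (const p of pages) {
  const r = await fetch(`https://66.163.122.173/v1/pages/${p.id}`, {
    headers: { Authorization: `Bearer ${process.env.CRAWLCRAWL_KEY}` },
  });
  const { metadata } = await r.json();
  rows.push({ url: p.url, title: metadata.title, canonical: metadata.canonical });
}
```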
respect_robots=true is the default. Malicious-domain blocklist refuses seeds proactively. Override only when you control the target.
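`respect_robots` is documented above; overriding it looks like this. The staging hostname is a placeholder for something you actually own.

```js
// Only disable robots compliance for hosts you control.
const r = await fetch("https://66.163.122.173/v1/crawls", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.CRAWLCRAWL_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://staging.example.com", // placeholder: a host you own
    respect_robots: false,              // default is true
  }),
});
```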
No dashboard. No npm install. No "contact sales". Just an HTTP endpoint that does its job.