API Reference · v1

Introduction

Building a client SDK or importing into Postman / Stoplight? Use the machine-readable spec instead of this narrative page: live spec view · GET /openapi.json.

The crawlcrawl API is REST. JSON in, JSON out. Bearer-token auth. Endpoints are versioned (/v1) and we won't break the wire format inside a major version.

Base URL:

https://api.crawlcrawl.com

Public hostname with valid TLS — standard cert verification works, no client tweaks required.

Quickstart

The shortest possible exchange: start a crawl, poll until done, fetch a page.

# export your key (issued per project — ask [email protected])
export CRAWLCRAWL_KEY="crk_..."

# 1. start a crawl
curl -X POST https://api.crawlcrawl.com/v1/crawls \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com", "max_pages": 50 }'
# → 202 {"id":43,"status":"queued","url":"https://example.com"}

# 2. poll the run
curl https://api.crawlcrawl.com/v1/crawls/43 \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"

# 3. when status=done, list pages and fetch one as markdown
curl "https://api.crawlcrawl.com/v1/crawls/43/pages" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"

curl "https://api.crawlcrawl.com/v1/pages/195?format=markdown" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"

Authentication

Every request needs Authorization: Bearer crk_.... Keys are scoped per project; we store only the SHA-256 hash and show plaintext exactly once at mint time. Never put a key in browser code.

To mint or rotate a key: email [email protected] with your project name. Revocation is instant — keys are server-side only.

TLS

HTTPS only on api.crawlcrawl.com. No special client config required; standard cert verification works.

Limits

Limits are per-project and configurable per key. The defaults below match what new keys ship with.

LimitDefaultNotes
Concurrent runs5Submitting a 6th while 5 are queued/running returns 429.
max_pages per run100,000Hard cap; the request is rejected at validation.
depth1..50Server-side validation — outside this range returns 400.
concurrency per crawlunset (auto)Server-side adaptive concurrency picks the right level for the workload. Bumping above ~16 has diminishing returns on most sites.

Errors

HTTP status codes are honest. Bodies are JSON in this envelope:

{
  "error": {
    "code":    "invalid_input",
    "message": "depth must be 1..=50"
  }
}
CodeHTTPMeaning
invalid_input400Body failed validation (bad URL, depth out of range, etc.).
unauthorized401Missing or wrong bearer token.
not_found404Resource doesn't exist, or belongs to a different project.
too_many_requests429You hit your concurrent-runs cap. Wait for one to finish.
internal500Something on our side. Will be in our logs; ping us.

POST /v1/crawls — start a crawl

POST /v1/crawls

Enqueues a crawl. Returns 202 immediately; the run starts within seconds.

curl -X POST https://api.crawlcrawl.com/v1/crawls \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com", "max_pages": 50, "depth": 2 }'
# → 202 { id, status: "queued", url, enqueued_at }

Body — top-level fields

FieldTypeDescription
urlstringRequired. Must start with http:// or https://. Host is checked against the malicious-domain blocklist before queueing.
user_agentstringOptional. If omitted (or starts with InternalCrawler/), we rotate to a fresh real-browser UA via ua_generator.
proxy_urlstring?Optional. Per-request override of project default. Supports http://, https://, socks5://.
proxy_poolstring?Optional. Route the whole crawl through a managed proxy pool without bringing your own credentials. One of "datacenter" | "isp" | "residential" | "mobile". The server resolves the upstream credential and proxy URL automatically. Combine with proxy_country for geo-routing. Mutually exclusive with manual proxy_url; if both are set, proxy_url wins.
proxy_countrystring?Optional, only with proxy_pool. ISO-2 country code (e.g. "us", "de", "in") — fetches as if from that country. 199+ countries supported. Critical for geo-localized SEO audits.
webhook_urlstring?Optional. POSTed once when the run reaches done or failed. See Webhooks.

Body — crawl params (FLAT, not nested)

These fields go at the top level of the body — not under a params key. The server uses serde flatten; nested params are silently ignored.

FieldTypeDefaultDescription
max_pagesint1000Hard cap; crawl stops when reached. Range 1..=100_000.
concurrencyint?autoParallel in-flight requests.
depthint5Max link-following depth from seed. Range 1..=50.
delay_msint250Delay between requests per host.
subdomainsboolfalseIf true, follows links into subdomains of the seed host.
respect_robotsbooltrueObey robots.txt.
store_htmlbooltrueIf false, we collect URL/status/bytes/metadata only — useful for cheap site maps.
seed_kindstring?"url"Either "url" (link-following) or "sitemap" (walks sitemap.xml + sitemap-index recursively). In sitemap mode you may pass either the origin (https://example.com, defaults to /sitemap.xml) or the explicit sitemap URL (https://example.com/sitemap-index.xml) — both work. Much faster than link-following on large sites.
headersobject?String→string map attached to every request in the crawl (e.g. auth, custom referer).
cookiesstring?Set-Cookie-style string: k=v; k=v.

Headers

HeaderPurpose
AuthorizationBearer crk_... — required.
Content-Typeapplication/json — required.
Idempotency-KeyOptional but recommended. See Idempotency.

Response — 202 Accepted

{
  "id":     43,
  "status": "queued",
  "url":    "https://example.com"
}

GET /v1/crawls/{id} — run status

GET /v1/crawls/{id}

Poll every 2–5 seconds while status is queued or running. Or skip polling entirely and use a webhook.

curl https://api.crawlcrawl.com/v1/crawls/43 \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# →
{
  "id":            43,
  "url":           "https://aeoniti.com",
  "status":        "running",
  "page_count":    27,
  "error_count":   0,
  "enqueued_at":   "2026-05-08T16:48:36.151Z",
  "started_at":    "2026-05-08T16:48:36.401Z",
  "finished_at":   null,
  "error_message": null,
  "proxy_url":     null
}

statusqueued | running | done | failed | cancelled.

Note: page_count includes pages that returned non-200 statuses (those count as "fetched"). error_count only counts crawler-side failures: DNS, TLS, connect timeout.

DELETE /v1/crawls/{id} — cancel and cascade

DELETE /v1/crawls/{id}

Cancels if running, then cascades — all pages rows for the run are removed. Returns 204 No Content. Subsequent GETs return 404.

curl -X DELETE https://api.crawlcrawl.com/v1/crawls/43 \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 204 No Content

GET /v1/crawls/{id}/pages — list pages

GET /v1/crawls/{id}/pages?limit=100&offset=0&status=200
curl "https://api.crawlcrawl.com/v1/crawls/43/pages?limit=100&status=200" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
QueryDefaultNotes
limit100Max 1000.
offset0Standard offset paging.
statusFilter by HTTP status (e.g. only 200s).
{
  "items": [
    { "id": 195, "url": "https://aeoniti.com/pricing",
      "status": 200, "bytes": 13650,
      "fetched_at": "2026-05-08T16:48:40.595Z" }
  ],
  "next_offset": 14
}

Save the global id per row — that's the path parameter for fetching content.

GET /v1/pages/{id} — fetch one page

GET /v1/pages/{id}?format=markdown

id is the global page id from the list endpoint, not a 1-based index within a run.

curl "https://api.crawlcrawl.com/v1/pages/195?format=markdown" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
formatReturns
htmlRaw HTML (decompressed) + metadata. Default.
markdownClean, link-preserving markdown with structured metadata.
articleBoilerplate-stripped article_text + article_html + metadata.
bothhtml + markdown + article_text + article_html + metadata.

Always returns id, url, status, bytes, fetched_at. metadata shape:

{
  "title":       "Pricing — AEONiti",
  "description": "...",
  "canonical":   "https://aeoniti.com/pricing",
  "og":      { "title": "...", "image": "...", "type": "website" },
  "twitter": { "card": "summary_large_image" },
  "json_ld": [ { "@type": "Product" } ]
}

GET /v1/health · /v1/ready

No auth. Use /v1/health for liveness; /v1/ready verifies the full request path is operational end-to-end.

curl https://api.crawlcrawl.com/v1/health # → 200 "ok"
curl https://api.crawlcrawl.com/v1/ready  # → 200 "ready" or 503 + JSON

Webhooks

If you set webhook_url on a crawl, the worker POSTs once when the run reaches a terminal state (done or failed).

Headers

Content-Type: application/json
X-Crawler-Run-Id: 43

Body

{
  "id":            43,
  "project_id":    2,
  "url":           "https://aeoniti.com",
  "status":        "done",
  "page_count":    14,
  "error_count":   0,
  "started_at":    "...",
  "finished_at":   "...",
  "error_message": null,
  "delivered_at":  "..."
}

Retry policy

  • 5 attempts total.
  • Exponential backoff: 1s, 2s, 4s, 8s.
  • 2xx → status delivered.
  • 4xx → marked dead immediately. Your endpoint rejected it; we don't retry.
  • 5xx / network / timeout → retry. After attempt 5 → dead.

Delivery state persists on the run. You can poll GET /v1/crawls/{id} and read webhook_status, webhook_attempts, webhook_last_error.

No HMAC signing yet — until that ships, treat the webhook body as advisory: re-fetch the run via GET before acting on it.

Idempotency

Pass Idempotency-Key: <your-uuid> on POST to make retries safe. The same key returns the original run (with the same id) instead of creating a duplicate. Keys are scoped per project and persist for 24 hours.

Use this when the network call to start the crawl might retry (queue handlers, lambda invocations, etc.).

Common patterns

Single-page extract (URL → markdown)

{ "url": "https://example.com/post", "max_pages": 1, "depth": 1 }

Then ?format=both on the one resulting page.

Crawl whole site, get clean text only

{
  "url": "https://customer.com",
  "max_pages": 500, "depth": 8,
  "store_html": false
}

Then ?format=article per page → embeddings input.

Fast site map via sitemap.xml

// either form works — origin defaults to /sitemap.xml,
// or pass the explicit URL (handy for /sitemap-index.xml etc.)
{
  "url": "https://customer.com/sitemap.xml",
  "seed_kind": "sitemap",
  "max_pages": 5000,
  "store_html": false
}

Authenticated crawl

{
  "url": "https://app.customer.com",
  "headers": { "Authorization": "Bearer abc123" },
  "cookies": "session=XYZ; csrf=ABC",
  "respect_robots": false
}

Async with webhook (recommended for production)

{
  "url": "https://customer.com",
  "max_pages": 200,
  "concurrency": 4,
  "webhook_url": "https://you.example.com/crawl-done"
}

Behaviour notes

BehaviourHow to handle it
For SPAs that need client-side rendering on the multi-page pathUse POST /v1/cloud/crawl (every page runs through JS rendering + anti-bot). On the single-URL /v1/scan path, set cloud_mode: "auto" or "browser" for per-request escalation.
Sitemap-first crawls are the fastest path for sites that publish oneUse seed_kind: "sitemap" for sites with a published sitemap.
Pass store_html=true if you want to re-render pages laterDecide upfront per crawl; pages without stored HTML cannot be re-fetched as markdown.
4xx/5xx pages from the target are returned as pages, not errorsFilter the page list by status if you only want successes.

POST /v1/scan — synchronous single-URL scan

POST /v1/scan

Sub-second URL → markdown + signals + AI-bot policy + llms.txt. No queue, no polling.

Body

curl -X POST https://api.crawlcrawl.com/v1/scan \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com", "only_main_content": true }'
# → 200 { url, status, markdown, metadata, signals, llms_txt, ai_bot_policy }
FieldTypeDefaultDescription
urlstringRequired.
user_agentstring?rotated UACustom UA.
max_age_secondsint?If a stored page row for this URL exists within this many seconds, return it (cache hit).
cloud_modestring?project defaultnone | auto | unblocker | browser.
metadata_onlyboolfalseSkip markdown/article extraction; return only metadata + signals subset. ~5× faster.
only_main_contentboolfalseStrip boilerplate (nav, footer, sidebars) before converting to markdown.
include_linksboolfalseReturn full anchor list [{to, anchor, rel, nofollow}].
screenshot_inlineboolfalseInclude base64 PNG (counts as +1 cloud op). Surfaces screenshot_error on failure.

POST /v1/scan/bulk — parallel multi-URL scan

POST/v1/scan/bulk

Fan out 1–100 URLs in a single call. Each URL is scanned in parallel server-side; each item in the response carries the same payload as /v1/scan (markdown, article, article_html, metadata, signals, llms.txt, AI-bot policy). Use this when you already know the URLs you want and need them fetched fast.

curl -X POST https://api.crawlcrawl.com/v1/scan/bulk \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com",
      "https://example.com/about",
      "https://example.com/pricing"
    ],
    "concurrency": 25
  }'

Body

FieldTypeDefaultDescription
urlsstring[]Required. 1–100 absolute URLs. Each one is SSRF-checked; private/loopback/metadata hosts are rejected before any URL fires.
concurrencyint?64Server-side parallelism. Min 1, max 64. Default is the full cap — lower it only if the target site rate-limits aggressively against your IP.
metadata_onlyboolfalseSkip markdown / article / article_html extraction. Each item still returns metadata, signals, content_hash, status, bytes. ~3× faster wall time, ~48× smaller response — use this when your pipeline only needs AEO/SEO signals.
max_age_secondsint?Per-URL cache window. If a stored page row exists for a URL and is younger than this, that row is returned (no refetch). Applied to every URL in the batch.
user_agentstring?rotated UAUA override applied uniformly to every URL in the batch.

Response

# 200 — one item per submitted URL, in submission order
{
  "items": [
    {
      "url": "https://example.com",
      "ok": true,
      "result": {
        "url":           "https://example.com",
        "final_url":     "https://example.com/",
        "status":        200,
        "bytes":         5821,
        "content_hash":  "a1b2c3...",
        "markdown":      "# Example...",
        "article":       "Example...",         // readability-extracted plain text
        "article_html":  "<article>...",      // readability-extracted clean HTML
        "metadata":      { "title": "...", "description": "...", "og": {...} },
        "signals":       { "headings": [...], "jsonld": [...], ... },
        "ai_bot_policy": {...},
        "llms_txt":      "...",
        "llms_full_txt": "..."
      }
    },
    {
      "url": "https://example.com/blocked",
      "ok": false,
      "error": "fetch failed: status 403"
    }
  ]
}

Discovery + bulk-fetch pattern

If you have a seed URL and want all article content in parallel (instead of polling one page at a time), combine /v1/crawls for discovery with /v1/scan/bulk for parallel content fetch:

# 1) Discover the URLs. store_html=false makes this cheap; we only want the link graph.
RUN=$(curl -s -X POST https://api.crawlcrawl.com/v1/crawls \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -d '{ "url": "https://example.com", "max_pages": 25, "store_html": false }' \
  | jq -r .id)

# 2) Poll until done (typically 5-30s depending on site size).
while [ "$(curl -s https://api.crawlcrawl.com/v1/crawls/$RUN \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" | jq -r .status)" != "done" ]; do
  sleep 2
done

# 3) Pull the discovered URLs.
URLS=$(curl -s "https://api.crawlcrawl.com/v1/crawls/$RUN/pages?limit=25" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" | jq -c '[.items[].url]')

# 4) Fetch all 25 in ONE bulk call instead of 25 serial GETs.
curl -X POST https://api.crawlcrawl.com/v1/scan/bulk \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d "{ \"urls\": $URLS, \"concurrency\": 25 }"
# → 200 { items: [...] } — every result already includes article + markdown + metadata.

This avoids issuing one GET /v1/pages/{id}?format=article per page in series — the bulk endpoint's result items already carry article and article_html alongside the markdown.

Notes & limits

  • Hard cap: 100 URLs per request. Submitting more returns 400.
  • Concurrency cap: 64. Requests above the cap are clamped to the cap (no error).
  • Every URL counts as 1 page against your daily/monthly quota, regardless of cache state.
  • Item order matches submission order, so you can correlate by index if needed.
  • Failures are partial — the batch never aborts on one bad URL. Always check the ok field per item.
  • Error strings are sanitised to never leak internal state. Common ones: "fetch failed: ...", "urls[n] failed SSRF check ...", "daily page cap exceeded".
GET/v1/crawls/{id}/links

Lazy-extracts anchors from stored HTML. Supports ?from_page_id=N to scope to one page, plus ?limit&offset.

curl "https://api.crawlcrawl.com/v1/crawls/43/links?limit=500" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200 { items: [{ from_page_id, to_url, anchor, rel, nofollow }], next_offset }

GET /v1/crawls/{id}/orphans — pages with no inbound link

GET/v1/crawls/{id}/orphans

Returns crawled pages that no other crawled page links to. Excludes the seed.

curl https://api.crawlcrawl.com/v1/crawls/43/orphans \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200 { items: [{ id, url, status, fetched_at }] }

GET /v1/crawls/{old}/diff/{new} — crawl diff

GET/v1/crawls/{old_id}/diff/{new_id}

Compares two runs' pages by content_hash. Returns added, removed, changed arrays plus an unchanged count. Both runs must belong to your project.

curl https://api.crawlcrawl.com/v1/crawls/9123/diff/9128 \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# →
{
  "old_run_id": 9123, "new_run_id": 9128,
  "summary": { "added": 2, "removed": 0, "changed": 3, "unchanged": 42 },
  "changed": [
    { "url": "https://x.com/pricing",
      "old_hash": "abc...", "new_hash": "def..." }
  ]
}

POST /v1/cloud/scrape

POST/v1/cloud/scrape
curl -X POST https://api.crawlcrawl.com/v1/cloud/scrape \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://protected.example", "return_format": "markdown", "chrome": true }'
# → 200 { content, cost_usd, elapsed_ms }

# Body shape:
{ "url": "https://protected.example",
  "return_format": "markdown",  // markdown | raw | text | commonmark
  "chrome": true,             // optional browser-rendered fetch
  "force_backend": "auto"     // auto (default) | local | vendor | proxy
}

Direct anti-bot scrape. Returns content, cost_usd, elapsed_ms.

Cheap path: set "force_backend": "proxy" to route through the bandwidth-billed HTTP proxy pool instead of full Chrome render. Substantially cheaper per page for plain HTML; no JS execution. Additional params:

{ "url": "https://target.example/page",
  "force_backend": "proxy",
  "proxy_pool": "isp",        // datacenter | isp (default) | residential | mobile
  "proxy_country": "de"      // optional ISO-2 country for geo-routing
}
# → 200 { content, status, backend: "proxy", pool, bytes_transferred, cost_usd, elapsed_ms }

See /v1/cloud/proxy-fetch for the standalone endpoint with the same proxy mechanism and full pool/pricing table.

POST /v1/cloud/crawl

POST/v1/cloud/crawl

Multi-page anti-bot crawl. Returns a list of {url, content} entries.

curl -X POST https://api.crawlcrawl.com/v1/cloud/crawl \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://target.example", "limit": 50, "return_format": "markdown", "chrome": false }'
# → 200 { items: [{ url, content, status }], cost_usd, elapsed_ms }
POST/v1/cloud/search

SERP-style search. Returns URLs with full page content when return_format is set.

curl -X POST https://api.crawlcrawl.com/v1/cloud/search \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "query": "vector database open source 2026", "limit": 10, "return_format": "markdown" }'
# → 200 { results: [{ url, title, content }], cost_usd, elapsed_ms }
POST/v1/cloud/links

Fast anti-bot link discovery. Returns up to limit outbound links from the URL.

curl -X POST https://api.crawlcrawl.com/v1/cloud/links \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://target.example", "limit": 200 }'
# → 200 { links: ["https://..."], cost_usd, elapsed_ms }

POST /v1/cloud/unblock

POST/v1/cloud/unblock

Bypass anti-bot protections (Cloudflare, Akamai, PerimeterX, captchas). Returns the unblocked response body. Use when a target site fights back and a plain scrape gets a challenge page.

curl -X POST https://api.crawlcrawl.com/v1/cloud/unblock \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://protected.example/data", "return_format": "markdown", "country": "us" }'
# → 200 { content, cost_usd, elapsed_ms }

# Body shape:
{ "url": "https://protected.example/data",
  "return_format": "markdown",  // markdown | raw | text
  "country": "us"             // optional, 190+ countries supported
}

POST /v1/cloud/render

POST/v1/cloud/render

Run a URL in a real Chrome browser and return the resolved page. Use for single-page apps, hydration-heavy sites, and post-JavaScript HTML. Expect 30–60s elapsed; this is a heavy call.

curl -X POST https://api.crawlcrawl.com/v1/cloud/render \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://spa.example.com", "wait_for": "networkidle", "return_format": "raw" }'
# → 200 { content, cost_usd, elapsed_ms }

# Body shape:
{ "url": "https://spa.example.com",
  "wait_for": "networkidle",    // load | domcontentloaded | networkidle
  "return_format": "raw"         // raw HTML, markdown, or text
}

Render replaces the deprecated /v1/cloud/screenshot path. Set return_format: "png" if you specifically need image bytes.

POST /v1/cloud/transform

POST/v1/cloud/transform

HTML or PDF bytes in, clean markdown out. No URL fetch — you supply the raw content. Use when you already have the document and just want the formatting cleanup.

curl -X POST https://api.crawlcrawl.com/v1/cloud/transform \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "data": "<html>...</html>", "input_kind": "html", "return_format": "markdown" }'
# → 200 { content, status, return_format, engine, elapsed_ms, cost_usd }

# Body shape:
{ "data": "<html>...",           // HTML string or base64-encoded PDF bytes
  "input_kind": "html",           // "html" (default) | "pdf_base64"
  "return_format": "markdown",    // "markdown" (default) | "text" | "raw"
  "readability": true             // strip nav/footer/ads via readability heuristics, default true
}

POST /v1/cloud/screenshot — deprecated

POST/v1/cloud/screenshot

Deprecated. Use POST /v1/cloud/render with return_format: "png" instead. Existing integrations continue to work; new builds should target render.

POST /v1/cloud/proxy-fetch

POST/v1/cloud/proxy-fetch

Cheap HTML fetch through a rotating proxy pool. Bills by bytes transferred — no chrome render, no JS execution. ~13× cheaper than /v1/cloud/scrape for plain HTML where you don't need a rendered DOM.

curl -X POST https://api.crawlcrawl.com/v1/cloud/proxy-fetch \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://target.example/page", "pool": "isp", "country": "us" }'
# → 200 { url, status, content, bytes_transferred, elapsed_ms, cost_usd, pool, country }

# Body shape:
{ "url": "https://target.example/page",
  "pool": "isp",        // datacenter | isp (default) | residential | mobile
  "country": "us"       // optional ISO-2 country code for geo-routing
}

Pool selection guide — pricing observed against live billing 2026-05-19, accurate within 5%:

Pool Per request Per GB When to use
datacenter$0.0000070$0.05Cheapest. Public sites with no anti-bot.
isp$0.0000050$0.22Default. Balanced speed + reliability.
residential$0.0000050$0.12Anti-bot retail/social. (LinkedIn still wins.)
mobile$0.0000050$0.20Sites that whitelist mobile carriers.

Indicative end-to-end cost for a 200 KB page:

  • isp: ~$0.000049/page  →  $4.90 per 100K pages
  • residential: ~$0.000028/page  →  $2.80 per 100K pages
  • Compare: /v1/cloud/scrape on the same page is ~$0.000296  →  $29.60 per 100K pages.

Limitations — what you give up for the price:

  • No JS rendering. Single-page apps return shell HTML. Use /v1/cloud/render or chrome:true on /v1/cloud/scrape for hydrated DOM.
  • No automatic anti-bot bypass. Sites with deep fingerprinting (e.g. LinkedIn HTTP 999) still block residential IPs. Use /v1/cloud/unblock for those.
  • Cost is byte-based, billed locally to ~5% accuracy. Final reconciliation is against your spider invoice; cost_usd in the response is a conservative estimate.

GET /v1/cloud/scrapers

GET/v1/cloud/scrapers

List the per-domain pre-configured scrapers available on the account. Each entry returns a domain, supported paths, and the structured-data shape you should expect from /v1/cloud/fetch.

curl https://api.crawlcrawl.com/v1/cloud/scrapers \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200 { scrapers: [{ domain, paths, data_shape }] }

POST /v1/cloud/fetch/{domain}[/{path}]

POST/v1/cloud/fetch/{domain}
POST/v1/cloud/fetch/{domain}/{path}

Run a pre-built scraper for a popular site. The output shape is fixed per domain, so you skip selector maintenance and parser scripting. Use GET /v1/cloud/scrapers to discover what's available.

curl -X POST https://api.crawlcrawl.com/v1/cloud/fetch/example.com/search \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "params": { "query": "crawler API", "limit": 25 } }'
# → 200 { data: [...], cost_usd, elapsed_ms }

GET /v1/cloud/balance

GET/v1/cloud/balance

Account-wide remaining cloud credit balance.

curl https://api.crawlcrawl.com/v1/cloud/balance \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200 { balance_usd, currency: "USD" }

Actors — SEO and content intelligence

Opinionated bundles for the most common SEO-agency and content-team workflows. Each actor is a single POST with a URL in the body. Bills as standard pages on your plan: one page per URL processed. No new SKU, no separate billing.

POST /v1/actors/extract-article

POST/v1/actors/extract-article

Clean article body in text or markdown, plus full metadata: title, author, date, description, sitename, language, image, categories, tags, word count, reading time. Same extraction algorithm Mozilla Reader View ships, ported to Rust.

curl -X POST https://api.crawlcrawl.com/v1/actors/extract-article \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com/blog/post", "format": "markdown" }'
# → 200 { actor, url, elapsed_ms, data: { format, content, word_count,
#                                       reading_time_seconds, metadata, comments } }

# Body shape:
{
  "url": "https://example.com/blog/post",
  "format": "markdown",           // or "text"
  "include_comments": false,        // Reddit/HN-style sites
  "wpm": 238                         // reading-time math
}

Typical latency 200ms to 1.5s depending on source-site weight and TLS handshake.

POST /v1/actors/audit-onpage

POST/v1/actors/audit-onpage

25+ on-page SEO checks in a single call. Returns a structured findings list, severity-graded (error / warning / info), with per-finding samples and a summary. Rules cover title length and keyword stuffing, meta description presence and length, H1 count, heading hierarchy, image alt and dimensions (CLS), canonical tag, viewport tag, robots noindex, Open Graph and Twitter cards, favicon, JSON-LD presence, mixed content, inline-style overuse, content thin-ness, and text-to-HTML ratio.

curl -X POST https://api.crawlcrawl.com/v1/actors/audit-onpage \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://your-client-site.com/" }'

# Body shape: { "url": "https://..." }
# → 200 {
#     actor, url, elapsed_ms,
#     data: {
#       summary: { errors, warnings, info },
#       findings: [ { rule, severity, message, count?, samples[] }, ... ],
#       stats: { word_count, link_count, image_count, heading_count,
#                html_bytes, text_bytes, content_ratio, has_json_ld }
#     }
#   }

Sub-second on most pages. Real example: 5,773-word site with 145 images audited in 1.1s, surfacing 3 errors and 3 warnings the team could fix the same afternoon.

POST/v1/actors/check-links
curl -X POST https://api.crawlcrawl.com/v1/actors/check-links \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://your-client-site.com/blog/post", "cloud_retry": true }'

Broken-link discovery. Two input modes: pass a single url and we extract all outbound links from the page and check each one, or pass a pre-collected list of up to 200 URLs in urls. Per-link result includes final URL, HTTP status code, status label (ok / broken / excluded), and an error reason when applicable. Each checked URL counts as one page.

Set cloud_retry: true to re-fetch every link classified as broken through a cloud backend. This rescues the classic false positives in any link audit: LinkedIn returning HTTP 999 to non-browser user agents, Cloudflare anti-bot challenges, paywalled news sites that 403 generic crawlers, and any URL that requires JavaScript to resolve. Each rescued link is flagged recovered_via_cloud: true in the per-link result. Capped at 50 retries per request.

Retry mode choice — pick speed/cost vs JS-rendering capability:

  • cloud_retry_mode: "chrome" (default) — full headless browser. Best for Cloudflare JS challenges and SPA URLs. ~$0.0017/retry, ~60s for a typical 78-link page.
  • cloud_retry_mode: "proxy" — residential proxy pool. Substantially cheaper per retry than Chrome render and lower latency, but no JS execution. Fixes datacenter-IP blocks; doesn't fix deep fingerprinting.
# Extract from a page with chrome-based rescue (default)
{ "url": "https://your-client-site.com/blog/post", "max_concurrent": 16, "cloud_retry": true }

# Same, but use the cheap proxy retry path
{ "url": "https://your-client-site.com/blog/post",
  "cloud_retry": true,
  "cloud_retry_mode": "proxy",
  "proxy_pool": "residential" // default for proxy retries
}

# Or check a known list
{ "urls": ["https://a.com", "https://b.com", "..."] }

# → 200 { actor, url, elapsed_ms,
#       data: { source_url, total, ok, broken, excluded,
#               cloud_retried, cloud_recovered, cloud_cost_usd,
#               items: [ { url, status_code, status,
#                          recovered_via_cloud, error }, ... ] } }

POST /v1/actors/structured-data

POST/v1/actors/structured-data

Extract six metadata syntaxes from any page in one call: JSON-LD, Microdata, RDFa, OpenGraph, Dublin Core, and Microformats. Uniform output shape across syntaxes, so a downstream Schema.org consumer sees one consistent structure. Built on extruct, the same library Zyte and most production scraping stacks use.

curl -X POST https://api.crawlcrawl.com/v1/actors/structured-data \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://your-client-site.com/product/123" }'
# → 200 {
#     actor, url, elapsed_ms,
#     data: {
#       counts: { "json-ld": N, microdata: N, rdfa: N,
#                  opengraph: N, dublincore: N, microformat: N },
#       data: { "json-ld": [ ... ], microdata: [ ... ], ... }
#     }
#   }

Verifies that Schema.org rich results are emitted correctly on every page in a client's catalog. Combine with POST /v1/crawls to audit thousands of product pages in one job.

POST /v1/actors/render-diff

POST/v1/actors/render-diff

Compare what a non-rendering crawler sees against what a real browser sees after JavaScript runs. Returns the difference in links, headings, JSON-LD blocks, and visible text. The headline metric ai_bot_blind_pct quantifies how much of the page is invisible to AI crawlers that don't execute JavaScript — the single most useful number for AEO audits in 2026.

curl -X POST https://api.crawlcrawl.com/v1/actors/render-diff \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://your-client-site.com/spa-page" }'
# → 200 {
#     actor, url, elapsed_ms,
#     data: {
#       static:   { word_count, link_count, heading_count, json_ld_count, html_bytes },
#       rendered: { word_count, link_count, heading_count, json_ld_count, html_bytes },
#       delta: {
#         links_only_after_render: [ ... up to 25 samples ... ],
#         headings_only_after_render: [ ... ],
#         json_ld_only_after_render: N,
#         text_chars_only_after_render: N,
#         ai_bot_blind_pct: 0.42,
#         assessment: "moderate — significant content requires JS"
#       },
#       rendered_cost_usd: 0.0017
#     }
#   }

Run this on every important landing page in a quarterly AEO audit. Pages with ai_bot_blind_pct above 0.2 should ship server-rendered HTML for the content that matters to LLMs.

POST/v1/actors/internal-link-graph

Run PageRank and weakly-connected-component analysis over the internal-link graph of an existing crawl. Surfaces which URLs are the actual authority hubs on your site, which pages donate or receive the most internal links, the anchor text distribution per target, and any orphan clusters disjoint from the main site graph. The single most actionable output of an internal-linking audit.

curl -X POST https://api.crawlcrawl.com/v1/actors/internal-link-graph \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "crawl_id": 30676, "top_n": 50 }'
# → 200 {
#     actor, url: "crawl:30676", elapsed_ms,
#     data: {
#       pagerank:              [ { url, score }, ... ],
#       top_donors:            [ { url, edges }, ... ],
#       top_recipients:        [ { url, edges }, ... ],
#       anchor_text_by_target: [ { url, anchors: [ {text, count}, ... ] }, ... ],
#       orphan_clusters:       [ { size, urls: [...] }, ... ],
#       stats: { pages, edges, components, iterations, converged }
#     }
#   }

Pair this with a weekly crawl to track how internal authority shifts as content is added. Cap is 100,000 pages per call — suitable for every agency client portfolio.

POST /v1/actors/sitemap-audit

POST/v1/actors/sitemap-audit

"Sitemap Health" check. Auto-discovers your sitemap(s) via robots.txt and the /sitemap.xml fallback, recurses sitemap-indexes one level deep, then probes every URL for HTTP status, canonical, X-Robots-Tag, and meta-robots. Buckets every URL into ok, redirect, client_error, server_error, noindex, canonicalised_away, network_error — the seven buckets every SEO weekly report needs.

# Preview the sitemap before paying for probes (returns the URL list, no billing)
curl -X POST https://api.crawlcrawl.com/v1/actors/sitemap-audit \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://your-client-site.com", "dry_run": true }'

# Full audit. Each probed URL = 1 page-credit. Default cap is 100 — set max_urls explicitly
# for larger audits. Hard ceiling is 50,000.
curl -X POST https://api.crawlcrawl.com/v1/actors/sitemap-audit \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://your-client-site.com", "max_urls": 5000 }'
# → 200 {
#     actor, url, elapsed_ms,
#     data: {
#       sitemaps_found:       [ "...sitemap.xml", "...sitemap-products.xml", ... ],
#       total_urls_in_sitemap, total_urls_probed,
#       pages_billed,         // explicit billing signal: equals total_urls_probed
#       summary: { ok, redirect, client_error, server_error,
#                  noindex, canonicalised_away, network_error },
#       items: [ { url, status_code, bucket, canonical, x_robots_tag,
#                  meta_robots, lastmod }, ... ]
#     }
#   }

Billing: 1 page-credit per probed URL. Default max_urls is 100 (low on purpose to avoid surprise bills). Hard ceiling is 50,000. Use "dry_run": true to discover sitemaps + count URLs without probing — useful for previewing cost before kicking off a real audit.

cron field — recurring monitors

Add cron to your POST /v1/crawls body to register a recurring monitor instead of a one-shot run. Standard 5-field UTC cron expression.

{
  "url": "https://competitor.com/pricing",
  "max_pages": 1,
  "cron": "0 */6 * * *",                // every 6 hours
  "webhook_url": "https://yourapp.com/changes",
  "webhook_events": ["crawl.diff_detected"],
  "return_only_changed": true
}
# → 201 { monitor_id, schedule, next_fires_at, url, info }

Monitor cap per tier (3 / 25 / 200). Tick fires at the closest UTC schedule match. previous_run_id is auto-linked between consecutive ticks.

webhook_events

Array of event types to deliver. Defaults to ["crawl.done"].

EventWhen it fires
crawl.doneRun reaches terminal state (done or failed).
crawl.diff_detectedRecurring monitor: previous run exists, diff is non-zero. Body includes diff: {added, removed, changed, unchanged} + previous_run_id.

return_only_changed

For recurring monitors only. When true, skip webhook delivery if the diff vs previous run shows zero changes. webhook_status is set to 'suppressed_no_change'. First run in a chain (no previous) sets 'suppressed_no_baseline'.

GET /v1/crons — list active monitors

GET/v1/crons

Returns each monitor's id, schedule, url, enabled, last_fired_at, next_fires_at, last_run_id.

curl https://api.crawlcrawl.com/v1/crons \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200 { items: [{ id, schedule, url, enabled, last_fired_at, next_fires_at, last_run_id }] }

PATCH /v1/crons/{id} — edit a monitor

PATCH/v1/crons/{id}

Update a monitor's schedule, enabled state, or webhook target without recreating it. Partial payload — only the fields you send are changed.

curl -X PATCH https://api.crawlcrawl.com/v1/crons/42 \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "schedule": "0 9 * * 1", "enabled": false }'

# Body shape:
{ "schedule": "0 9 * * 1",         // new cron (UTC); omit to keep current
  "enabled": false,                // pause without deleting; omit to keep current
  "webhook_url": "https://yourapp.com/new"
}
# → 200 { id, schedule, enabled, next_fires_at, ... }

DELETE /v1/crons/{id} — remove a monitor

DELETE/v1/crons/{id}
curl -X DELETE https://api.crawlcrawl.com/v1/crons/42 \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 204 No Content

GET /v1/usage — current period usage

GET/v1/usage

Returns tier name, configured caps, today and month-to-date counters, percentages, and seconds until UTC midnight reset. Two integer counters: pages (standard fetches) and cloud_pages (anti-bot edge network).

curl https://api.crawlcrawl.com/v1/usage \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200
{
  "tier": "pro",
  "caps": {
    "pages_per_day":       2000,
    "pages_per_month":     50000,
    "cloud_pages_per_day": 1000
  },
  "today": {
    "date":            "2026-05-17",
    "pages":           184,
    "cloud_pages":     7,                  // 5-result search = 5; 1 cloud_scrape = 1; 1 cloud_render = 1
    "cloud_cost_usd":  0.0021,
    "pages_pct":       9.2,
    "cloud_pages_pct": 0.7
  },
  "month": {
    "pages":           1278,
    "cloud_pages":     144,
    "cloud_cost_usd":  0.0087,
    "pages_pct":       2.6
  },
  "retry_after_seconds": 37889
}

Billing is append-only: every chargeable call writes a row to usage_events with an idempotency token, so retries never double-bill. Search bills per result returned; everything else bills 1 page (or 1 page per page crawled for multi-page crawls).

GET /v1/usage/history

GET/v1/usage/history?days=30

Daily buckets for the last N days (max 365).

curl "https://api.crawlcrawl.com/v1/usage/history?days=30" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200 { days: [{ date, pages, cloud_pages, cloud_cost_usd }] }

GET /v1/keys — list API keys

GET/v1/keys

Hash prefix only — plaintext is never recoverable. Includes created_at, last_used_at, revoked_at, expires_at.

curl https://api.crawlcrawl.com/v1/keys \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200 { items: [{ prefix, label, created_at, last_used_at, revoked_at, expires_at }] }

POST /v1/keys/rotate

POST/v1/keys/rotate
curl -X POST https://api.crawlcrawl.com/v1/keys/rotate \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "label": "ci-runner", "grace_seconds": 86400 }'
# → 200 { api_key, prefix, label, grace_seconds }
# Plaintext shown ONCE.

DELETE /v1/keys/{prefix}

DELETE/v1/keys/{prefix}

Refuses if it would leave the project with no active key (returns 409).

curl -X DELETE https://api.crawlcrawl.com/v1/keys/crk_abcd \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 204 No Content  (or 409 if this is your last active key)

GET /v1/logs — audit feed

GET/v1/logs?limit=100&status_min=400&status_max=499

Last N requests for this project with method, path, status, duration_ms, client_ip, request_id.

curl "https://api.crawlcrawl.com/v1/logs?limit=100&status_min=400&status_max=499" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200 { items: [{ ts, method, path, status, duration_ms, client_ip, request_id }] }

GET /v1/webhook/secret

GET/v1/webhook/secret

Returns the project's HMAC-SHA256 secret in hex. Lazy-generated on first call. Use to verify X-Crawler-Signature on incoming webhooks.

curl https://api.crawlcrawl.com/v1/webhook/secret \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200 { secret_hex: "a1b2..." }

GET /v1/robots-policy

GET/v1/robots-policy?url=https://example.com

Returns parsed AI-bot policy + raw robots.txt + llms.txt for the host. Cached per-host (TTL 1h). Cheap.

curl "https://api.crawlcrawl.com/v1/robots-policy?url=https://example.com" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200 { url, robots_txt, llms_txt, ai_bot_policy: { GPTBot: "allowed", ClaudeBot: "allowed", ... } }

POST /v1/llms-txt-build

POST/v1/llms-txt-build

Crawl a domain, return a properly-formatted llms.txt file ready to publish. Synchronous (waits for crawl up to wait_seconds; default 60, max 180).

curl -X POST https://api.crawlcrawl.com/v1/llms-txt-build \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://customer.com", "max_pages": 30, "site_name": "Customer Inc", "summary": "B2B widget company", "wait_seconds": 60 }'
# → 200 text/plain  (the llms.txt content)
# → 408 Request Timeout if crawl exceeds wait_seconds (poll /v1/crawls/{run_id})

GET /v1/health/cloud

GET/v1/health/cloud

Returns whether anti-bot routing is configured, current account balance, and this project's last 24h cloud usage.

curl https://api.crawlcrawl.com/v1/health/cloud \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200 { configured, balance_usd, last_24h: { cloud_pages, cloud_cost_usd } }

Tooling

Two artifacts are kept in sync with this reference. Use either to skip the boilerplate.

ArtifactUse it forDownload
postman.json Postman / Insomnia / Bruno collection. Pre-populated requests with sample bodies and {{base_url}} + {{api_key}} variables. Download
openapi.json OpenAPI 3.1 spec. Auto-generate client libs (Python / Node / Go / Rust) via openapi-generator, register as a ChatGPT custom GPT action, or import into any tool that speaks OpenAPI. Download

Last updated 2026-05-23. API surface is versioned at /v1. Wire format is stable.