API Reference · v1

Introduction

Building a client SDK or importing into Postman / Stoplight? Use the machine-readable spec instead of this narrative page: live spec view · GET /openapi.json.

The crawlcrawl API is REST. JSON in, JSON out. Bearer-token auth. Endpoints are versioned (/v1) and we won't break the wire format inside a major version.

Base URL:

https://api.crawlcrawl.com

Public hostname with valid TLS — standard cert verification works, no client tweaks required.

Quickstart

The shortest possible exchange: start a crawl, poll until done, fetch a page.

# export your key (issued per project — ask [email protected])
export CRAWLCRAWL_KEY="crk_..."

# 1. start a crawl
curl -X POST https://api.crawlcrawl.com/v1/crawls \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com", "max_pages": 50 }'
# → 202 {"id":43,"status":"queued","url":"https://example.com"}

# 2. poll the run
curl https://api.crawlcrawl.com/v1/crawls/43 \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"

# 3. when status=done, list pages and fetch one as markdown
curl "https://api.crawlcrawl.com/v1/crawls/43/pages" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"

curl "https://api.crawlcrawl.com/v1/pages/195?format=markdown" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"

Authentication

Every request needs Authorization: Bearer crk_.... Keys are scoped per project; we store only the SHA-256 hash and show plaintext exactly once at mint time. Never put a key in browser code.

To mint or rotate a key: email [email protected] with your project name. Revocation is instant — keys are server-side only.

TLS

HTTPS only on api.crawlcrawl.com. No special client config required; standard cert verification works.

Limits

Limits are per-project and configurable per key. The defaults below match what new keys ship with.

Limit	Default	Notes
Concurrent runs	5	Submitting a 6th while 5 are queued/running returns 429.
max_pages per run	100,000	Hard cap; the request is rejected at validation.
depth	1..50	Server-side validation — outside this range returns 400.
concurrency per crawl	unset (auto)	Server-side adaptive concurrency picks the right level for the workload. Bumping above ~16 has diminishing returns on most sites.

Errors

HTTP status codes are honest. Bodies are JSON in this envelope:

{
  "error": {
    "code":    "invalid_input",
    "message": "depth must be 1..=50"
  }
}

Code	HTTP	Meaning
invalid_input	400	Body failed validation (bad URL, depth out of range, etc.).
unauthorized	401	Missing or wrong bearer token.
not_found	404	Resource doesn't exist, or belongs to a different project.
too_many_requests	429	You hit your concurrent-runs cap. Wait for one to finish.
internal	500	Something on our side. Will be in our logs; ping us.

POST /v1/crawls — start a crawl

POST /v1/crawls

Enqueues a crawl. Returns 202 immediately; the run starts within seconds.

curl -X POST https://api.crawlcrawl.com/v1/crawls \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com", "max_pages": 50, "depth": 2 }'
# → 202 { id, status: "queued", url, enqueued_at }

Body — top-level fields

Field	Type	Description
url	string	Required. Must start with `http://` or `https://`. Host is checked against the malicious-domain blocklist before queueing.
user_agent	string	Optional. If omitted (or starts with `InternalCrawler/`), we rotate to a fresh real-browser UA via `ua_generator`.
proxy_url	string?	Optional. Per-request override of project default. Supports `http://`, `https://`, `socks5://`.
proxy_pool	string?	Optional. Route the whole crawl through a managed proxy pool without bringing your own credentials. One of `"datacenter"` \| `"isp"` \| `"residential"` \| `"mobile"`. The server resolves the upstream credential and proxy URL automatically. Combine with `proxy_country` for geo-routing. Mutually exclusive with manual `proxy_url`; if both are set, `proxy_url` wins.
proxy_country	string?	Optional, only with `proxy_pool`. ISO-2 country code (e.g. `"us"`, `"de"`, `"in"`) — fetches as if from that country. 199+ countries supported. Critical for geo-localized SEO audits.
webhook_url	string?	Optional. POSTed once when the run reaches `done` or `failed`. See Webhooks.

Body — crawl params (FLAT, not nested)

These fields go at the top level of the body — not under a params key. The server uses serde flatten; nested params are silently ignored.

Field	Type	Default	Description
max_pages	int	1000	Hard cap; crawl stops when reached. Range 1..=100_000.
concurrency	int?	auto	Parallel in-flight requests.
depth	int	5	Max link-following depth from seed. Range 1..=50.
delay_ms	int	250	Delay between requests per host.
subdomains	bool	false	If true, follows links into subdomains of the seed host.
respect_robots	bool	true	Obey robots.txt.
store_html	bool	true	If false, we collect URL/status/bytes/metadata only — useful for cheap site maps.
seed_kind	string?	"url"	Either `"url"` (link-following) or `"sitemap"` (walks `sitemap.xml` + sitemap-index recursively). In sitemap mode you may pass either the origin (`https://example.com`, defaults to `/sitemap.xml`) or the explicit sitemap URL (`https://example.com/sitemap-index.xml`) — both work. Much faster than link-following on large sites.
headers	object?	—	String→string map attached to every request in the crawl (e.g. auth, custom referer).
cookies	string?	—	Set-Cookie-style string: `k=v; k=v`.

Headers

Header	Purpose
Authorization	`Bearer crk_...` — required.
Content-Type	`application/json` — required.
Idempotency-Key	Optional but recommended. See Idempotency.

Response — 202 Accepted

{
  "id":     43,
  "status": "queued",
  "url":    "https://example.com"
}

GET /v1/crawls/{id} — run status

GET /v1/crawls/{id}

Poll every 2–5 seconds while status is queued or running. Or skip polling entirely and use a webhook.

curl https://api.crawlcrawl.com/v1/crawls/43 \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# →
{
  "id":            43,
  "url":           "https://aeoniti.com",
  "status":        "running",
  "page_count":    27,
  "error_count":   0,
  "enqueued_at":   "2026-05-08T16:48:36.151Z",
  "started_at":    "2026-05-08T16:48:36.401Z",
  "finished_at":   null,
  "error_message": null,
  "proxy_url":     null
}

status ∈ queued | running | done | failed | cancelled.

Note: page_count includes pages that returned non-200 statuses (those count as "fetched"). error_count only counts crawler-side failures: DNS, TLS, connect timeout.

DELETE /v1/crawls/{id} — cancel and cascade

DELETE /v1/crawls/{id}

Cancels if running, then cascades — all pages rows for the run are removed. Returns 204 No Content. Subsequent GETs return 404.

curl -X DELETE https://api.crawlcrawl.com/v1/crawls/43 \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 204 No Content

GET /v1/crawls/{id}/pages — list pages

GET /v1/crawls/{id}/pages?limit=100&offset=0&status=200

curl "https://api.crawlcrawl.com/v1/crawls/43/pages?limit=100&status=200" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"

Query	Default	Notes
limit	100	Max 1000.
offset	0	Standard offset paging.
status	—	Filter by HTTP status (e.g. only 200s).

{
  "items": [
    { "id": 195, "url": "https://aeoniti.com/pricing",
      "status": 200, "bytes": 13650,
      "fetched_at": "2026-05-08T16:48:40.595Z" }
  ],
  "next_offset": 14
}

Save the global id per row — that's the path parameter for fetching content.

GET /v1/pages/{id} — fetch one page

GET /v1/pages/{id}?format=markdown

id is the global page id from the list endpoint, not a 1-based index within a run.

curl "https://api.crawlcrawl.com/v1/pages/195?format=markdown" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"

format	Returns
html	Raw HTML (decompressed) + metadata. Default.
markdown	Clean, link-preserving markdown with structured metadata.
article	Boilerplate-stripped `article_text` + `article_html` + metadata.
both	html + markdown + article_text + article_html + metadata.

Always returns id, url, status, bytes, fetched_at. metadata shape:

{
  "title":       "Pricing — AEONiti",
  "description": "...",
  "canonical":   "https://aeoniti.com/pricing",
  "og":      { "title": "...", "image": "...", "type": "website" },
  "twitter": { "card": "summary_large_image" },
  "json_ld": [ { "@type": "Product" } ]
}

GET /v1/health · /v1/ready

No auth. Use /v1/health for liveness; /v1/ready verifies the full request path is operational end-to-end.

curl https://api.crawlcrawl.com/v1/health # → 200 "ok"
curl https://api.crawlcrawl.com/v1/ready  # → 200 "ready" or 503 + JSON

Webhooks

If you set webhook_url on a crawl, the worker POSTs once when the run reaches a terminal state (done or failed).

Headers

Content-Type: application/json
X-Crawler-Run-Id: 43

Body

{
  "id":            43,
  "project_id":    2,
  "url":           "https://aeoniti.com",
  "status":        "done",
  "page_count":    14,
  "error_count":   0,
  "started_at":    "...",
  "finished_at":   "...",
  "error_message": null,
  "delivered_at":  "..."
}

Retry policy

5 attempts total.
Exponential backoff: 1s, 2s, 4s, 8s.
2xx → status delivered.
4xx → marked dead immediately. Your endpoint rejected it; we don't retry.
5xx / network / timeout → retry. After attempt 5 → dead.

Delivery state persists on the run. You can poll GET /v1/crawls/{id} and read webhook_status, webhook_attempts, webhook_last_error.

No HMAC signing yet — until that ships, treat the webhook body as advisory: re-fetch the run via GET before acting on it.

Idempotency

Pass Idempotency-Key: <your-uuid> on POST to make retries safe. The same key returns the original run (with the same id) instead of creating a duplicate. Keys are scoped per project and persist for 24 hours.

Use this when the network call to start the crawl might retry (queue handlers, lambda invocations, etc.).

Common patterns

Single-page extract (URL → markdown)

{ "url": "https://example.com/post", "max_pages": 1, "depth": 1 }

Then ?format=both on the one resulting page.

Crawl whole site, get clean text only

{
  "url": "https://customer.com",
  "max_pages": 500, "depth": 8,
  "store_html": false
}

Then ?format=article per page → embeddings input.

Fast site map via sitemap.xml

// either form works — origin defaults to /sitemap.xml,
// or pass the explicit URL (handy for /sitemap-index.xml etc.)
{
  "url": "https://customer.com/sitemap.xml",
  "seed_kind": "sitemap",
  "max_pages": 5000,
  "store_html": false
}

Authenticated crawl

{
  "url": "https://app.customer.com",
  "headers": { "Authorization": "Bearer abc123" },
  "cookies": "session=XYZ; csrf=ABC",
  "respect_robots": false
}

Async with webhook (recommended for production)

{
  "url": "https://customer.com",
  "max_pages": 200,
  "concurrency": 4,
  "webhook_url": "https://you.example.com/crawl-done"
}

Behaviour notes

Behaviour	How to handle it
For SPAs that need client-side rendering on the multi-page path	Use `POST /v1/cloud/crawl` (every page runs through JS rendering + anti-bot). On the single-URL `/v1/scan` path, set `cloud_mode: "auto"` or `"browser"` for per-request escalation.
Sitemap-first crawls are the fastest path for sites that publish one	Use `seed_kind: "sitemap"` for sites with a published sitemap.
Pass `store_html=true` if you want to re-render pages later	Decide upfront per crawl; pages without stored HTML cannot be re-fetched as markdown.
4xx/5xx pages from the target are returned as pages, not errors	Filter the page list by `status` if you only want successes.

POST /v1/scan — synchronous single-URL scan

POST /v1/scan

Sub-second URL → markdown + signals + AI-bot policy + llms.txt. No queue, no polling.

Body

curl -X POST https://api.crawlcrawl.com/v1/scan \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com", "only_main_content": true }'
# → 200 { url, status, markdown, metadata, signals, llms_txt, ai_bot_policy }

Field	Type	Default	Description
url	string	—	Required.
user_agent	string?	rotated UA	Custom UA.
max_age_seconds	int?	—	If a stored page row for this URL exists within this many seconds, return it (cache hit).
cloud_mode	string?	project default	`none` \| `auto` \| `unblocker` \| `browser`.
metadata_only	bool	false	Skip markdown/article extraction; return only metadata + signals subset. ~5× faster.
only_main_content	bool	false	Strip boilerplate (nav, footer, sidebars) before converting to markdown.
include_links	bool	false	Return full anchor list `[{to, anchor, rel, nofollow}]`.
screenshot_inline	bool	false	Include base64 PNG (counts as +1 cloud op). Surfaces `screenshot_error` on failure.

POST /v1/scan/bulk — parallel multi-URL scan

POST/v1/scan/bulk

Fan out 1–100 URLs in a single call. Each URL is scanned in parallel server-side; each item in the response carries the same payload as /v1/scan (markdown, article, article_html, metadata, signals, llms.txt, AI-bot policy). Use this when you already know the URLs you want and need them fetched fast.

curl -X POST https://api.crawlcrawl.com/v1/scan/bulk \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com",
      "https://example.com/about",
      "https://example.com/pricing"
    ],
    "concurrency": 25
  }'

Body

Field	Type	Default	Description
urls	string[]	—	Required. 1–100 absolute URLs. Each one is SSRF-checked; private/loopback/metadata hosts are rejected before any URL fires.
concurrency	int?	64	Server-side parallelism. Min 1, max 64. Default is the full cap — lower it only if the target site rate-limits aggressively against your IP.
metadata_only	bool	false	Skip markdown / article / article_html extraction. Each item still returns metadata, signals, content_hash, status, bytes. ~3× faster wall time, ~48× smaller response — use this when your pipeline only needs AEO/SEO signals.
max_age_seconds	int?	—	Per-URL cache window. If a stored page row exists for a URL and is younger than this, that row is returned (no refetch). Applied to every URL in the batch.
user_agent	string?	rotated UA	UA override applied uniformly to every URL in the batch.

Response

# 200 — one item per submitted URL, in submission order
{
  "items": [
    {
      "url": "https://example.com",
      "ok": true,
      "result": {
        "url":           "https://example.com",
        "final_url":     "https://example.com/",
        "status":        200,
        "bytes":         5821,
        "content_hash":  "a1b2c3...",
        "markdown":      "# Example...",
        "article":       "Example...",         // readability-extracted plain text
        "article_html":  "<article>...",      // readability-extracted clean HTML
        "metadata":      { "title": "...", "description": "...", "og": {...} },
        "signals":       { "headings": [...], "jsonld": [...], ... },
        "ai_bot_policy": {...},
        "llms_txt":      "...",
        "llms_full_txt": "..."
      }
    },
    {
      "url": "https://example.com/blocked",
      "ok": false,
      "error": "fetch failed: status 403"
    }
  ]
}

Discovery + bulk-fetch pattern

If you have a seed URL and want all article content in parallel (instead of polling one page at a time), combine /v1/crawls for discovery with /v1/scan/bulk for parallel content fetch:

# 1) Discover the URLs. store_html=false makes this cheap; we only want the link graph.
RUN=$(curl -s -X POST https://api.crawlcrawl.com/v1/crawls \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -d '{ "url": "https://example.com", "max_pages": 25, "store_html": false }' \
  | jq -r .id)

# 2) Poll until done (typically 5-30s depending on site size).
while [ "$(curl -s https://api.crawlcrawl.com/v1/crawls/$RUN \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" | jq -r .status)" != "done" ]; do
  sleep 2
done

# 3) Pull the discovered URLs.
URLS=$(curl -s "https://api.crawlcrawl.com/v1/crawls/$RUN/pages?limit=25" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" | jq -c '[.items[].url]')

# 4) Fetch all 25 in ONE bulk call instead of 25 serial GETs.
curl -X POST https://api.crawlcrawl.com/v1/scan/bulk \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d "{ \"urls\": $URLS, \"concurrency\": 25 }"
# → 200 { items: [...] } — every result already includes article + markdown + metadata.

This avoids issuing one GET /v1/pages/{id}?format=article per page in series — the bulk endpoint's result items already carry article and article_html alongside the markdown.

Notes & limits

Hard cap: 100 URLs per request. Submitting more returns 400.
Concurrency cap: 64. Requests above the cap are clamped to the cap (no error).
Every URL counts as 1 page against your daily/monthly quota, regardless of cache state.
Item order matches submission order, so you can correlate by index if needed.
Failures are partial — the batch never aborts on one bad URL. Always check the ok field per item.
Error strings are sanitised to never leak internal state. Common ones: "fetch failed: ...", "urls[n] failed SSRF check ...", "daily page cap exceeded".

GET /v1/crawls/{id}/links — link graph

GET/v1/crawls/{id}/links

Lazy-extracts anchors from stored HTML. Supports ?from_page_id=N to scope to one page, plus ?limit&offset.

curl "https://api.crawlcrawl.com/v1/crawls/43/links?limit=500" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200 { items: [{ from_page_id, to_url, anchor, rel, nofollow }], next_offset }

GET /v1/crawls/{id}/orphans — pages with no inbound link

GET/v1/crawls/{id}/orphans

Returns crawled pages that no other crawled page links to. Excludes the seed.

curl https://api.crawlcrawl.com/v1/crawls/43/orphans \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200 { items: [{ id, url, status, fetched_at }] }

GET /v1/crawls/{old}/diff/{new} — crawl diff

GET/v1/crawls/{old_id}/diff/{new_id}

Compares two runs' pages by content_hash. Returns added, removed, changed arrays plus an unchanged count. Both runs must belong to your project.

curl https://api.crawlcrawl.com/v1/crawls/9123/diff/9128 \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# →
{
  "old_run_id": 9123, "new_run_id": 9128,
  "summary": { "added": 2, "removed": 0, "changed": 3, "unchanged": 42 },
  "changed": [
    { "url": "https://x.com/pricing",
      "old_hash": "abc...", "new_hash": "def..." }
  ]
}

POST /v1/cloud/scrape

POST/v1/cloud/scrape

curl -X POST https://api.crawlcrawl.com/v1/cloud/scrape \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://protected.example", "return_format": "markdown", "chrome": true }'
# → 200 { content, cost_usd, elapsed_ms }

# Body shape:
{ "url": "https://protected.example",
  "return_format": "markdown",  // markdown | raw | text | commonmark
  "chrome": true,             // optional browser-rendered fetch
  "force_backend": "auto"     // auto (default) | local | vendor | proxy
}

Direct anti-bot scrape. Returns content, cost_usd, elapsed_ms.

Cheap path: set "force_backend": "proxy" to route through the bandwidth-billed HTTP proxy pool instead of full Chrome render. Substantially cheaper per page for plain HTML; no JS execution. Additional params:

{ "url": "https://target.example/page",
  "force_backend": "proxy",
  "proxy_pool": "isp",        // datacenter | isp (default) | residential | mobile
  "proxy_country": "de"      // optional ISO-2 country for geo-routing
}
# → 200 { content, status, backend: "proxy", pool, bytes_transferred, cost_usd, elapsed_ms }

See /v1/cloud/proxy-fetch for the standalone endpoint with the same proxy mechanism and full pool/pricing table.

POST /v1/cloud/crawl

POST/v1/cloud/crawl

Multi-page anti-bot crawl. Returns a list of {url, content} entries.

curl -X POST https://api.crawlcrawl.com/v1/cloud/crawl \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://target.example", "limit": 50, "return_format": "markdown", "chrome": false }'
# → 200 { items: [{ url, content, status }], cost_usd, elapsed_ms }

POST /v1/cloud/search

POST/v1/cloud/search

SERP-style search. Returns URLs with full page content when return_format is set.

curl -X POST https://api.crawlcrawl.com/v1/cloud/search \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "query": "vector database open source 2026", "limit": 10, "return_format": "markdown" }'
# → 200 { results: [{ url, title, content }], cost_usd, elapsed_ms }

POST /v1/cloud/links

POST/v1/cloud/links

Fast anti-bot link discovery. Returns up to limit outbound links from the URL.

curl -X POST https://api.crawlcrawl.com/v1/cloud/links \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://target.example", "limit": 200 }'
# → 200 { links: ["https://..."], cost_usd, elapsed_ms }

POST /v1/cloud/unblock

POST/v1/cloud/unblock

Bypass anti-bot protections (Cloudflare, Akamai, PerimeterX, captchas). Returns the unblocked response body. Use when a target site fights back and a plain scrape gets a challenge page.

curl -X POST https://api.crawlcrawl.com/v1/cloud/unblock \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://protected.example/data", "return_format": "markdown", "country": "us" }'
# → 200 { content, cost_usd, elapsed_ms }

# Body shape:
{ "url": "https://protected.example/data",
  "return_format": "markdown",  // markdown | raw | text
  "country": "us"             // optional, 190+ countries supported
}

POST /v1/cloud/render

POST/v1/cloud/render

Run a URL in a real Chrome browser and return the resolved page. Use for single-page apps, hydration-heavy sites, and post-JavaScript HTML. Expect 30–60s elapsed; this is a heavy call.

curl -X POST https://api.crawlcrawl.com/v1/cloud/render \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://spa.example.com", "wait_for": "networkidle", "return_format": "raw" }'
# → 200 { content, cost_usd, elapsed_ms }

# Body shape:
{ "url": "https://spa.example.com",
  "wait_for": "networkidle",    // load | domcontentloaded | networkidle
  "return_format": "raw"         // raw HTML, markdown, or text
}

Render replaces the deprecated /v1/cloud/screenshot path. Set return_format: "png" if you specifically need image bytes.

POST /v1/cloud/transform

POST/v1/cloud/transform

HTML or PDF bytes in, clean markdown out. No URL fetch — you supply the raw content. Use when you already have the document and just want the formatting cleanup.

curl -X POST https://api.crawlcrawl.com/v1/cloud/transform \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "data": "<html>...</html>", "input_kind": "html", "return_format": "markdown" }'
# → 200 { content, status, return_format, engine, elapsed_ms, cost_usd }

# Body shape:
{ "data": "<html>...",           // HTML string or base64-encoded PDF bytes
  "input_kind": "html",           // "html" (default) | "pdf_base64"
  "return_format": "markdown",    // "markdown" (default) | "text" | "raw"
  "readability": true             // strip nav/footer/ads via readability heuristics, default true
}

POST /v1/cloud/screenshot — deprecated

POST/v1/cloud/screenshot

Deprecated. Use POST /v1/cloud/render with return_format: "png" instead. Existing integrations continue to work; new builds should target render.

POST /v1/cloud/proxy-fetch

POST/v1/cloud/proxy-fetch

Cheap HTML fetch through a rotating proxy pool. Bills by bytes transferred — no chrome render, no JS execution. ~13× cheaper than /v1/cloud/scrape for plain HTML where you don't need a rendered DOM.

curl -X POST https://api.crawlcrawl.com/v1/cloud/proxy-fetch \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://target.example/page", "pool": "isp", "country": "us" }'
# → 200 { url, status, content, bytes_transferred, elapsed_ms, cost_usd, pool, country }

# Body shape:
{ "url": "https://target.example/page",
  "pool": "isp",        // datacenter | isp (default) | residential | mobile
  "country": "us"       // optional ISO-2 country code for geo-routing
}

Pool selection guide — pricing observed against live billing 2026-05-19, accurate within 5%:

Pool	Per request	Per GB	When to use
`datacenter`	$0.0000070	$0.05	Cheapest. Public sites with no anti-bot.
`isp`	$0.0000050	$0.22	Default. Balanced speed + reliability.
`residential`	$0.0000050	$0.12	Anti-bot retail/social. (LinkedIn still wins.)
`mobile`	$0.0000050	$0.20	Sites that whitelist mobile carriers.

Indicative end-to-end cost for a 200 KB page:

isp: ~$0.000049/page → $4.90 per 100K pages
residential: ~$0.000028/page → $2.80 per 100K pages
Compare: /v1/cloud/scrape on the same page is ~$0.000296 → $29.60 per 100K pages.

Limitations — what you give up for the price:

No JS rendering. Single-page apps return shell HTML. Use /v1/cloud/render or chrome:true on /v1/cloud/scrape for hydrated DOM.
No automatic anti-bot bypass. Sites with deep fingerprinting (e.g. LinkedIn HTTP 999) still block residential IPs. Use /v1/cloud/unblock for those.
Cost is byte-based, billed locally to ~5% accuracy. Final reconciliation is against your spider invoice; cost_usd in the response is a conservative estimate.

GET /v1/cloud/scrapers

GET/v1/cloud/scrapers

List the per-domain pre-configured scrapers available on the account. Each entry returns a domain, supported paths, and the structured-data shape you should expect from /v1/cloud/fetch.

curl https://api.crawlcrawl.com/v1/cloud/scrapers \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200 { scrapers: [{ domain, paths, data_shape }] }

POST /v1/cloud/fetch/{domain}[/{path}]

POST/v1/cloud/fetch/{domain}

POST/v1/cloud/fetch/{domain}/{path}

Run a pre-built scraper for a popular site. The output shape is fixed per domain, so you skip selector maintenance and parser scripting. Use GET /v1/cloud/scrapers to discover what's available.

curl -X POST https://api.crawlcrawl.com/v1/cloud/fetch/example.com/search \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "params": { "query": "crawler API", "limit": 25 } }'
# → 200 { data: [...], cost_usd, elapsed_ms }

GET /v1/cloud/balance

GET/v1/cloud/balance

Account-wide remaining cloud credit balance.

curl https://api.crawlcrawl.com/v1/cloud/balance \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200 { balance_usd, currency: "USD" }

Actors — SEO and content intelligence

Opinionated bundles for the most common SEO-agency and content-team workflows. Each actor is a single POST with a URL in the body. Bills as standard pages on your plan: one page per URL processed. No new SKU, no separate billing.

POST /v1/actors/extract-article

POST/v1/actors/extract-article

Clean article body in text or markdown, plus full metadata: title, author, date, description, sitename, language, image, categories, tags, word count, reading time. Same extraction algorithm Mozilla Reader View ships, ported to Rust.

curl -X POST https://api.crawlcrawl.com/v1/actors/extract-article \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com/blog/post", "format": "markdown" }'
# → 200 { actor, url, elapsed_ms, data: { format, content, word_count,
#                                       reading_time_seconds, metadata, comments } }

# Body shape:
{
  "url": "https://example.com/blog/post",
  "format": "markdown",           // or "text"
  "include_comments": false,        // Reddit/HN-style sites
  "wpm": 238                         // reading-time math
}

Typical latency 200ms to 1.5s depending on source-site weight and TLS handshake.

POST /v1/actors/audit-onpage

POST/v1/actors/audit-onpage

25+ on-page SEO checks in a single call. Returns a structured findings list, severity-graded (error / warning / info), with per-finding samples and a summary. Rules cover title length and keyword stuffing, meta description presence and length, H1 count, heading hierarchy, image alt and dimensions (CLS), canonical tag, viewport tag, robots noindex, Open Graph and Twitter cards, favicon, JSON-LD presence, mixed content, inline-style overuse, content thin-ness, and text-to-HTML ratio.

curl -X POST https://api.crawlcrawl.com/v1/actors/audit-onpage \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://your-client-site.com/" }'

# Body shape: { "url": "https://..." }
# → 200 {
#     actor, url, elapsed_ms,
#     data: {
#       summary: { errors, warnings, info },
#       findings: [ { rule, severity, message, count?, samples[] }, ... ],
#       stats: { word_count, link_count, image_count, heading_count,
#                html_bytes, text_bytes, content_ratio, has_json_ld }
#     }
#   }

Sub-second on most pages. Real example: 5,773-word site with 145 images audited in 1.1s, surfacing 3 errors and 3 warnings the team could fix the same afternoon.

POST /v1/actors/check-links

POST/v1/actors/check-links

curl -X POST https://api.crawlcrawl.com/v1/actors/check-links \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://your-client-site.com/blog/post", "cloud_retry": true }'

Broken-link discovery. Two input modes: pass a single url and we extract all outbound links from the page and check each one, or pass a pre-collected list of up to 200 URLs in urls. Per-link result includes final URL, HTTP status code, status label (ok / broken / excluded), and an error reason when applicable. Each checked URL counts as one page.

Set cloud_retry: true to re-fetch every link classified as broken through a cloud backend. This rescues the classic false positives in any link audit: LinkedIn returning HTTP 999 to non-browser user agents, Cloudflare anti-bot challenges, paywalled news sites that 403 generic crawlers, and any URL that requires JavaScript to resolve. Each rescued link is flagged recovered_via_cloud: true in the per-link result. Capped at 50 retries per request.

Retry mode choice — pick speed/cost vs JS-rendering capability:

cloud_retry_mode: "chrome" (default) — full headless browser. Best for Cloudflare JS challenges and SPA URLs. ~$0.0017/retry, ~60s for a typical 78-link page.
cloud_retry_mode: "proxy" — residential proxy pool. Substantially cheaper per retry than Chrome render and lower latency, but no JS execution. Fixes datacenter-IP blocks; doesn't fix deep fingerprinting.

# Extract from a page with chrome-based rescue (default)
{ "url": "https://your-client-site.com/blog/post", "max_concurrent": 16, "cloud_retry": true }

# Same, but use the cheap proxy retry path
{ "url": "https://your-client-site.com/blog/post",
  "cloud_retry": true,
  "cloud_retry_mode": "proxy",
  "proxy_pool": "residential" // default for proxy retries
}

# Or check a known list
{ "urls": ["https://a.com", "https://b.com", "..."] }

# → 200 { actor, url, elapsed_ms,
#       data: { source_url, total, ok, broken, excluded,
#               cloud_retried, cloud_recovered, cloud_cost_usd,
#               items: [ { url, status_code, status,
#                          recovered_via_cloud, error }, ... ] } }

POST /v1/actors/structured-data

POST/v1/actors/structured-data

Extract six metadata syntaxes from any page in one call: JSON-LD, Microdata, RDFa, OpenGraph, Dublin Core, and Microformats. Uniform output shape across syntaxes, so a downstream Schema.org consumer sees one consistent structure. Built on extruct, the same library Zyte and most production scraping stacks use.

curl -X POST https://api.crawlcrawl.com/v1/actors/structured-data \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://your-client-site.com/product/123" }'
# → 200 {
#     actor, url, elapsed_ms,
#     data: {
#       counts: { "json-ld": N, microdata: N, rdfa: N,
#                  opengraph: N, dublincore: N, microformat: N },
#       data: { "json-ld": [ ... ], microdata: [ ... ], ... }
#     }
#   }

Verifies that Schema.org rich results are emitted correctly on every page in a client's catalog. Combine with POST /v1/crawls to audit thousands of product pages in one job.

POST /v1/actors/render-diff

POST/v1/actors/render-diff

Compare what a non-rendering crawler sees against what a real browser sees after JavaScript runs. Returns the difference in links, headings, JSON-LD blocks, and visible text. The headline metric ai_bot_blind_pct quantifies how much of the page is invisible to AI crawlers that don't execute JavaScript — the single most useful number for AEO audits in 2026.

curl -X POST https://api.crawlcrawl.com/v1/actors/render-diff \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://your-client-site.com/spa-page" }'
# → 200 {
#     actor, url, elapsed_ms,
#     data: {
#       static:   { word_count, link_count, heading_count, json_ld_count, html_bytes },
#       rendered: { word_count, link_count, heading_count, json_ld_count, html_bytes },
#       delta: {
#         links_only_after_render: [ ... up to 25 samples ... ],
#         headings_only_after_render: [ ... ],
#         json_ld_only_after_render: N,
#         text_chars_only_after_render: N,
#         ai_bot_blind_pct: 0.42,
#         assessment: "moderate — significant content requires JS"
#       },
#       rendered_cost_usd: 0.0017
#     }
#   }

Run this on every important landing page in a quarterly AEO audit. Pages with ai_bot_blind_pct above 0.2 should ship server-rendered HTML for the content that matters to LLMs.

POST /v1/actors/internal-link-graph

POST/v1/actors/internal-link-graph

Run PageRank and weakly-connected-component analysis over the internal-link graph of an existing crawl. Surfaces which URLs are the actual authority hubs on your site, which pages donate or receive the most internal links, the anchor text distribution per target, and any orphan clusters disjoint from the main site graph. The single most actionable output of an internal-linking audit.

curl -X POST https://api.crawlcrawl.com/v1/actors/internal-link-graph \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "crawl_id": 30676, "top_n": 50 }'
# → 200 {
#     actor, url: "crawl:30676", elapsed_ms,
#     data: {
#       pagerank:              [ { url, score }, ... ],
#       top_donors:            [ { url, edges }, ... ],
#       top_recipients:        [ { url, edges }, ... ],
#       anchor_text_by_target: [ { url, anchors: [ {text, count}, ... ] }, ... ],
#       orphan_clusters:       [ { size, urls: [...] }, ... ],
#       stats: { pages, edges, components, iterations, converged }
#     }
#   }

Pair this with a weekly crawl to track how internal authority shifts as content is added. Cap is 100,000 pages per call — suitable for every agency client portfolio.

POST /v1/actors/sitemap-audit

POST/v1/actors/sitemap-audit

"Sitemap Health" check. Auto-discovers your sitemap(s) via robots.txt and the /sitemap.xml fallback, recurses sitemap-indexes one level deep, then probes every URL for HTTP status, canonical, X-Robots-Tag, and meta-robots. Buckets every URL into ok, redirect, client_error, server_error, noindex, canonicalised_away, network_error — the seven buckets every SEO weekly report needs.

# Preview the sitemap before paying for probes (returns the URL list, no billing)
curl -X POST https://api.crawlcrawl.com/v1/actors/sitemap-audit \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://your-client-site.com", "dry_run": true }'

# Full audit. Each probed URL = 1 page-credit. Default cap is 100 — set max_urls explicitly
# for larger audits. Hard ceiling is 50,000.
curl -X POST https://api.crawlcrawl.com/v1/actors/sitemap-audit \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://your-client-site.com", "max_urls": 5000 }'
# → 200 {
#     actor, url, elapsed_ms,
#     data: {
#       sitemaps_found:       [ "...sitemap.xml", "...sitemap-products.xml", ... ],
#       total_urls_in_sitemap, total_urls_probed,
#       pages_billed,         // explicit billing signal: equals total_urls_probed
#       summary: { ok, redirect, client_error, server_error,
#                  noindex, canonicalised_away, network_error },
#       items: [ { url, status_code, bucket, canonical, x_robots_tag,
#                  meta_robots, lastmod }, ... ]
#     }
#   }

Billing: 1 page-credit per probed URL. Default max_urls is 100 (low on purpose to avoid surprise bills). Hard ceiling is 50,000. Use "dry_run": true to discover sitemaps + count URLs without probing — useful for previewing cost before kicking off a real audit.

cron field — recurring monitors

Add cron to your POST /v1/crawls body to register a recurring monitor instead of a one-shot run. Standard 5-field UTC cron expression.

{
  "url": "https://competitor.com/pricing",
  "max_pages": 1,
  "cron": "0 */6 * * *",                // every 6 hours
  "webhook_url": "https://yourapp.com/changes",
  "webhook_events": ["crawl.diff_detected"],
  "return_only_changed": true
}
# → 201 { monitor_id, schedule, next_fires_at, url, info }

Monitor cap per tier (3 / 25 / 200). Tick fires at the closest UTC schedule match. previous_run_id is auto-linked between consecutive ticks.

webhook_events

Array of event types to deliver. Defaults to ["crawl.done"].

Event	When it fires
crawl.done	Run reaches terminal state (done or failed).
crawl.diff_detected	Recurring monitor: previous run exists, diff is non-zero. Body includes `diff: {added, removed, changed, unchanged}` + `previous_run_id`.

return_only_changed

For recurring monitors only. When true, skip webhook delivery if the diff vs previous run shows zero changes. webhook_status is set to 'suppressed_no_change'. First run in a chain (no previous) sets 'suppressed_no_baseline'.

GET /v1/crons — list active monitors

GET/v1/crons

Returns each monitor's id, schedule, url, enabled, last_fired_at, next_fires_at, last_run_id.

curl https://api.crawlcrawl.com/v1/crons \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200 { items: [{ id, schedule, url, enabled, last_fired_at, next_fires_at, last_run_id }] }

PATCH /v1/crons/{id} — edit a monitor

PATCH/v1/crons/{id}

Update a monitor's schedule, enabled state, or webhook target without recreating it. Partial payload — only the fields you send are changed.

curl -X PATCH https://api.crawlcrawl.com/v1/crons/42 \ -H "Authorization: Bearer $CRAWLCRAWL_KEY" \ -H "Content-Type: application/json" \ -d '{ "schedule": "0 9 * * 1", "enabled": false }' # Body shape:

{ "schedule": "0 9 * * 1",         // new cron (UTC); omit to keep current
  "enabled": false,                // pause without deleting; omit to keep current
  "webhook_url": "https://yourapp.com/new"
}
# → 200 { id, schedule, enabled, next_fires_at, ... }

DELETE /v1/crons/{id} — remove a monitor

DELETE/v1/crons/{id}

curl -X DELETE https://api.crawlcrawl.com/v1/crons/42 \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 204 No Content

GET /v1/usage — current period usage

GET/v1/usage

Returns tier name, configured caps, today and month-to-date counters, percentages, and seconds until UTC midnight reset. Two integer counters: pages (standard fetches) and cloud_pages (anti-bot edge network).

curl https://api.crawlcrawl.com/v1/usage \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200
{
  "tier": "pro",
  "caps": {
    "pages_per_day":       2000,
    "pages_per_month":     50000,
    "cloud_pages_per_day": 1000
  },
  "today": {
    "date":            "2026-05-17",
    "pages":           184,
    "cloud_pages":     7,                  // 5-result search = 5; 1 cloud_scrape = 1; 1 cloud_render = 1
    "cloud_cost_usd":  0.0021,
    "pages_pct":       9.2,
    "cloud_pages_pct": 0.7
  },
  "month": {
    "pages":           1278,
    "cloud_pages":     144,
    "cloud_cost_usd":  0.0087,
    "pages_pct":       2.6
  },
  "retry_after_seconds": 37889
}

Billing is append-only: every chargeable call writes a row to usage_events with an idempotency token, so retries never double-bill. Search bills per result returned; everything else bills 1 page (or 1 page per page crawled for multi-page crawls).

GET /v1/usage/history

GET/v1/usage/history?days=30

Daily buckets for the last N days (max 365).

curl "https://api.crawlcrawl.com/v1/usage/history?days=30" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200 { days: [{ date, pages, cloud_pages, cloud_cost_usd }] }

GET /v1/keys — list API keys

GET/v1/keys

Hash prefix only — plaintext is never recoverable. Includes created_at, last_used_at, revoked_at, expires_at.

curl https://api.crawlcrawl.com/v1/keys \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200 { items: [{ prefix, label, created_at, last_used_at, revoked_at, expires_at }] }

POST /v1/keys/rotate

POST/v1/keys/rotate

curl -X POST https://api.crawlcrawl.com/v1/keys/rotate \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "label": "ci-runner", "grace_seconds": 86400 }'
# → 200 { api_key, prefix, label, grace_seconds }
# Plaintext shown ONCE.

DELETE /v1/keys/{prefix}

DELETE/v1/keys/{prefix}

Refuses if it would leave the project with no active key (returns 409).

curl -X DELETE https://api.crawlcrawl.com/v1/keys/crk_abcd \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 204 No Content  (or 409 if this is your last active key)

GET /v1/logs — audit feed

GET/v1/logs?limit=100&status_min=400&status_max=499

Last N requests for this project with method, path, status, duration_ms, client_ip, request_id.

curl "https://api.crawlcrawl.com/v1/logs?limit=100&status_min=400&status_max=499" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200 { items: [{ ts, method, path, status, duration_ms, client_ip, request_id }] }

GET /v1/webhook/secret

GET/v1/webhook/secret

Returns the project's HMAC-SHA256 secret in hex. Lazy-generated on first call. Use to verify X-Crawler-Signature on incoming webhooks.

curl https://api.crawlcrawl.com/v1/webhook/secret \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200 { secret_hex: "a1b2..." }

GET /v1/robots-policy

GET/v1/robots-policy?url=https://example.com

Returns parsed AI-bot policy + raw robots.txt + llms.txt for the host. Cached per-host (TTL 1h). Cheap.

curl "https://api.crawlcrawl.com/v1/robots-policy?url=https://example.com" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200 { url, robots_txt, llms_txt, ai_bot_policy: { GPTBot: "allowed", ClaudeBot: "allowed", ... } }

POST /v1/llms-txt-build

POST/v1/llms-txt-build

Crawl a domain, return a properly-formatted llms.txt file ready to publish. Synchronous (waits for crawl up to wait_seconds; default 60, max 180).

curl -X POST https://api.crawlcrawl.com/v1/llms-txt-build \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://customer.com", "max_pages": 30, "site_name": "Customer Inc", "summary": "B2B widget company", "wait_seconds": 60 }'
# → 200 text/plain  (the llms.txt content)
# → 408 Request Timeout if crawl exceeds wait_seconds (poll /v1/crawls/{run_id})

GET /v1/health/cloud

GET/v1/health/cloud

Returns whether anti-bot routing is configured, current account balance, and this project's last 24h cloud usage.

curl https://api.crawlcrawl.com/v1/health/cloud \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
# → 200 { configured, balance_usd, last_24h: { cloud_pages, cloud_cost_usd } }

Tooling

Two artifacts are kept in sync with this reference. Use either to skip the boilerplate.

Artifact	Use it for	Download
postman.json	Postman / Insomnia / Bruno collection. Pre-populated requests with sample bodies and `{{base_url}}` + `{{api_key}}` variables.	Download
openapi.json	OpenAPI 3.1 spec. Auto-generate client libs (Python / Node / Go / Rust) via `openapi-generator`, register as a ChatGPT custom GPT action, or import into any tool that speaks OpenAPI.	Download

Last updated 2026-05-23. API surface is versioned at /v1. Wire format is stable.