RAG Ingestion — Crawl Any Site to Clean Markdown for Vector Databases

The problem most teams hit by month two

Building the first version of a RAG pipeline takes a week. Building the version that stays current, deduplicates intelligently, and survives source-site changes takes a quarter. Most of that quarter is plumbing: HTTP requests, HTML parsing, markdown conversion, deduplication, change monitoring, scheduling, retry logic, and the inevitable rewrite when the first design choices stop scaling.

crawlcrawl absorbs the plumbing into one API. Send one POST with a URL and a page limit; receive clean markdown for every page along with the structured-data signals the source author already wrote into the HTML. Scheduled refresh, change-detection diff, and dataset storage are included at every paid tier. The whole pipeline shrinks from a quarter of engineering work to a configuration file.

What crawlcrawl provides for RAG specifically

Most general-purpose crawlers were not designed for retrieval-augmented generation. They return HTML, leave the structure extraction to you, and have no concept of "refresh this corpus daily and tell me what changed." crawlcrawl was designed around the question RAG teams actually need answered.

Markdown by default, with headings, lists, tables, and code blocks preserved for clean chunking.
Structured signals in the same response — schema.org, Open Graph, JSON-LD, hreflang, canonical, robots-meta — so chunks can be filtered by metadata at retrieval time.
Multi-page crawl with link discovery, depth control, and sitemap-driven seeding.
JavaScript rendering when needed, automatically, without configuration on your side.
Scheduled crawls with cron-style cadence so the corpus stays fresh without you running an orchestrator.
Change-detection diff between runs so the ingestion pipeline re-embeds only the pages that actually changed.
HMAC-signed webhooks with retry for pipelines that prefer event-driven updates over polling.
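
Because headings and code blocks survive in the markdown, a heading-based splitter is often enough for a first chunking pass. The following is a minimal sketch in Python, not part of crawlcrawl itself: the function name chunk_by_heading and the chunk shape are hypothetical, and it assumes ATX-style headings (#, ##, ...) and triple-backtick code fences in the returned markdown.

```python
import re

# ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split markdown into one chunk per heading section (hypothetical helper)."""
    chunks = []
    title, lines = None, []
    in_fence = False

    def flush():
        # Emit the current section if it has a heading or any non-blank text.
        if title is not None or any(l.strip() for l in lines):
            chunks.append({"heading": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        # Track code fences so '# comment' lines inside code are not headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Text before the first heading becomes a chunk with heading None, so nothing is dropped. Production chunkers usually add a size cap and overlap on top of this.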

The simplest possible setup

curl -X POST https://api.crawlcrawl.com/v1/crawls \
  -H "Authorization: Bearer crk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "max_pages": 200
  }'

# returns
{ "id": 504, "status": "queued", "url": "https://docs.example.com" }

That is the entire ingestion call for a one-time corpus build. The run completes asynchronously; you fetch the results when ready. Each page in the response includes markdown, metadata, schema-extracted JSON-LD, and a stable content hash for downstream deduplication.
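
The content hash makes downstream deduplication a few lines of code. Here is a small Python sketch, assuming each page record is a dict that carries url and content_hash keys (the key names mirror the diff response shown later on this page; the exact shape of the pages response is an assumption, and dedupe_pages is a hypothetical helper).

```python
def dedupe_pages(pages: list[dict]) -> list[dict]:
    """Keep the first page seen for each content hash, preserving order."""
    seen = set()
    unique = []
    for page in pages:
        content_hash = page["content_hash"]  # assumed field name
        if content_hash in seen:
            continue  # identical content already kept under another URL
        seen.add(content_hash)
        unique.append(page)
    return unique
```

This collapses mirrored or aliased URLs that serve byte-identical content, so the same text is not embedded twice.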

Adding scheduled refresh

For continuous corpora that need to track upstream changes, add a cron expression and a webhook URL. The crawl now repeats on schedule; the webhook fires only when content actually changed (via return_only_changed: true); your ingestion pipeline embeds only the deltas.

curl -X POST https://api.crawlcrawl.com/v1/crawls \
  -H "Authorization: Bearer crk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "max_pages": 200,
    "cron": "0 3 * * *",
    "webhook_url": "https://yourapp.com/hooks/rag-refresh",
    "return_only_changed": true
  }'

The webhook payload includes the run ID. Your handler retrieves the pages via GET /v1/crawls/{id}/pages, runs them through your embedding model, and updates the vector store. Pages that did not change are not in the response, so the embedding cost stays proportional to actual deltas rather than total corpus size.
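
Since webhooks are HMAC-signed and retried, a handler typically does two things before touching the vector store: verify the signature and ignore runs it has already processed. The sketch below shows one way to do that in Python. Everything about the wire format here is an assumption, not documented behavior: it presumes a hex-encoded HMAC-SHA256 of the raw request body, a shared secret, and a payload field named id for the run ID. The function names are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Optional


def verify_signature(secret: bytes, raw_body: bytes, signature_hex: str) -> bool:
    """Check a hex HMAC-SHA256 signature (assumed scheme) in constant time."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


def accept_webhook(secret: bytes, raw_body: bytes, signature_hex: str,
                   processed_runs: set) -> Optional[int]:
    """Return the run ID to ingest, or None if the delivery should be skipped."""
    if not verify_signature(secret, raw_body, signature_hex):
        return None  # unsigned or tampered delivery
    run_id = json.loads(raw_body)["id"]  # assumed payload field name
    if run_id in processed_runs:
        return None  # a retry of a delivery that was already handled
    processed_runs.add(run_id)
    return run_id
```

When accept_webhook returns a run ID, the handler would go on to fetch GET /v1/crawls/{id}/pages and embed the results. In a real deployment the processed-run set would live in durable storage rather than in memory.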

Using the diff endpoint for change-aware re-ingestion

For workloads that prefer pull over push, the diff endpoint compares two runs of the same site and returns the pages that were added, removed, or changed. Many teams pair daily crawls with weekly diff sweeps to catch slower-moving changes.

GET /v1/crawls/{old}/diff/{new}

# returns
{
  "added":   [ { "url": "...", "content_hash": "..." }, ... ],
  "removed": [ { "url": "...", "content_hash": "..." }, ... ],
  "changed": [ { "url": "...", "old_hash": "...", "new_hash": "..." }, ... ]
}
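
A diff response in the shape above maps directly onto vector-store operations. The following Python sketch turns it into a work plan; plan_reingestion and the plan keys are hypothetical names, and the policy it encodes (re-embed added and changed pages, delete vectors for removed and changed pages) is one reasonable choice rather than a prescribed one.

```python
def plan_reingestion(diff: dict) -> dict:
    """Turn a diff response into lists of URLs to embed and to delete."""
    added = [page["url"] for page in diff.get("added", [])]
    changed = [page["url"] for page in diff.get("changed", [])]
    removed = [page["url"] for page in diff.get("removed", [])]
    return {
        # New and changed pages need fresh embeddings.
        "embed": added + changed,
        # Removed pages go away; changed pages lose their stale chunks first,
        # because the new version may split into different chunks.
        "delete": removed + changed,
    }
```

Running the deletes before the embeds keeps stale chunks from a changed page out of retrieval results.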

What customers run on this pattern

"Our learning platform needs current documentation from forty networking vendors. crawlcrawl pulls clean markdown with structured signals on schedule, sends webhooks when content changes, and stores datasets we query directly. Our content team went from chasing PDFs to reviewing diffs."
— Amit Tanwar, Founder, Networkers Home

Pricing for RAG ingestion

One page fetched equals one credit. Structured-data extraction, JavaScript rendering, and global routing are included; they do not multiply credit cost. Pro at $8/mo covers 5,000 pages. Studio at $42/mo covers 100,000. Agency at $167/mo covers 500,000. That is half the price of Firecrawl at every tier. A full documentation-site refresh typically needs 50 to 500 pages, so most production RAG ingestion lands cleanly inside the Pro or Studio tiers.
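
To see which tier a refresh schedule needs, multiply pages per run by runs per month. The Python sketch below does that arithmetic with the tier figures quoted above. It assumes the page allowances are monthly and that every scheduled run is charged for each page it fetches, whether or not the page changed; neither point is spelled out here, so treat the result as an estimate. The function names are hypothetical.

```python
from typing import Optional

# (tier name, price in USD per month, pages covered), as quoted on this page.
TIERS = [("Pro", 8, 5_000), ("Studio", 42, 100_000), ("Agency", 167, 500_000)]


def monthly_credits(pages_per_run: int, runs_per_month: int) -> int:
    """One page fetched equals one credit."""
    return pages_per_run * runs_per_month


def smallest_tier(credits: int) -> Optional[str]:
    """Return the cheapest tier whose allowance covers the credits, if any."""
    for name, _price, allowance in TIERS:
        if credits <= allowance:
            return name
    return None  # larger than the biggest tier listed
```

Under these assumptions, a 200-page site refreshed daily for 30 days comes to 6,000 fetches, which is just over the Pro allowance and fits in Studio, while the same site refreshed weekly stays well inside Pro.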
