Crawl any documentation site, blog, or knowledge base into clean markdown for your vector database.
Most teams build custom pipelines: HTTP requests, HTML parsing, markdown conversion, deduplication, and change monitoring. A pipeline like that takes weeks to implement and breaks whenever a source site changes its markup.
Instead, send one POST to /v1/crawls with the starting URL and a max_pages limit. We follow internal links, render JavaScript when needed, deduplicate repeated content, and return clean markdown; each page comes back as one object in a JSON array:
curl -X POST https://api.crawlcrawl.com/v1/crawls \
  -H 'Authorization: Bearer crk_...' \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://docs.example.com", "max_pages": 10}'
[
  {
    "url": "https://docs.example.com/page1",
    "markdown": "# Heading\n\nThis is a paragraph."
  },
  {
    "url": "https://docs.example.com/page2",
    "markdown": "## Subheading\n\nThis is another paragraph."
  }
]
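To pull just the markdown out of the response for ingestion, you can pipe the same request through jq. The filter and the URL-comment format below are our own choices for illustration, not part of the API:

# Print each page's markdown, prefixed with its source URL as an HTML comment.
curl -s -X POST https://api.crawlcrawl.com/v1/crawls \
  -H 'Authorization: Bearer crk_...' \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://docs.example.com", "max_pages": 10}' \
  | jq -r '.[] | "<!-- \(.url) -->\n" + .markdown + "\n"'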
To keep your index current, schedule a daily crawl with a webhook_url and return_only_changed: true; the webhook then fires only when a page's content has actually changed.
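A minimal sketch of such a request, assuming scheduling is expressed as a field on the same /v1/crawls endpoint. The schedule field name is a placeholder; webhook_url and return_only_changed are the documented options:

# "schedule" is a hypothetical field name for illustration;
# webhook_url and return_only_changed are the documented options.
curl -X POST https://api.crawlcrawl.com/v1/crawls \
  -H 'Authorization: Bearer crk_...' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://docs.example.com",
    "max_pages": 100,
    "schedule": "daily",
    "webhook_url": "https://example.com/hooks/crawl-updated",
    "return_only_changed": true
  }'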
The Agency tier supports 17,000 pages per day. A full documentation site refresh typically takes 50-500 pages, so a single day's quota covers dozens to hundreds of full site refreshes.