Crawl any documentation site, blog, or knowledge base into clean markdown for your vector database.
Most teams build custom pipelines: HTTP requests, HTML parsing, markdown conversion, deduplication, and change monitoring. A pipeline like that takes weeks to implement and breaks whenever a source site changes its markup.
Instead, send one POST to /v1/crawls with the starting URL and a max_pages limit. We follow internal links, render JavaScript when needed, deduplicate repeated content, and return clean markdown; each page comes back as one object in a JSON array:
curl -X POST https://api.crawlcrawl.com/v1/crawls \
  -H 'Authorization: Bearer crk_...' \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://docs.example.com", "max_pages": 10}'
[
  {
    "url": "https://docs.example.com/page1",
    "markdown": "# Heading\n\nThis is a paragraph."
  },
  {
    "url": "https://docs.example.com/page2",
    "markdown": "## Subheading\n\nThis is another paragraph."
  }
]
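To pull just the markdown out of the response for ingestion, you can pipe the same request through jq. The filter and the URL-comment format below are our own choices for illustration, not part of the API:

# Print each page's markdown, prefixed with its source URL as an HTML comment.
curl -s -X POST https://api.crawlcrawl.com/v1/crawls \
  -H 'Authorization: Bearer crk_...' \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://docs.example.com", "max_pages": 10}' \
  | jq -r '.[] | "<!-- \(.url) -->\n" + .markdown + "\n"'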
To keep your index current, schedule a daily crawl with a webhook_url and return_only_changed: true; the webhook then fires only when a page's content has actually changed.
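A minimal sketch of such a request, assuming scheduling is expressed as a field on the same /v1/crawls endpoint. The schedule field name is a placeholder; webhook_url and return_only_changed are the documented options:

# "schedule" is a hypothetical field name for illustration;
# webhook_url and return_only_changed are the documented options.
curl -X POST https://api.crawlcrawl.com/v1/crawls \
  -H 'Authorization: Bearer crk_...' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://docs.example.com",
    "max_pages": 100,
    "schedule": "daily",
    "webhook_url": "https://example.com/hooks/crawl-updated",
    "return_only_changed": true
  }'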
The Agency tier supports 17,000 pages per day. A full documentation site refresh typically takes 50-500 pages, so a single day's quota covers dozens to hundreds of full site refreshes.