The body, nothing else

Generic Markdown converters keep navigation menus, footers, ad blocks, "related posts" sidebars, and Disqus comments. For RAG ingestion, news aggregation, or sentiment analysis, that noise hurts retrieval recall and inflates token cost. extract-article runs trafilatura against the page and returns just the body plus structured metadata.

curl -X POST https://api.crawlcrawl.com/v1/actors/extract-article \
  -H "Authorization: Bearer crk_..." \
  -H "Content-Type: application/json" \
  -d '{"url":"https://www.geeksforgeeks.org/dsa/binary-search/"}'

# → 200
{
  "actor": "extract-article",
  "url": "https://www.geeksforgeeks.org/dsa/binary-search/",
  "elapsed_ms": 129,
  "data": {
    "metadata": {
      "title": "Binary Search - GeeksforGeeks",
      "author": "Sandeep Jain",
      "date": "2014-01-28"
    },
    "word_count": 2798,
    "text": "Binary Search is a searching algorithm ..."
  }
}
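
The same call can be made from Python with only the standard library. This is a minimal sketch, not an official client: the endpoint, bearer header, and request body come from the curl example above, while the function names `build_request` and `extract_article`, the `Content-Type` header, and the 30-second timeout are assumptions made for illustration.

```python
import json
import urllib.request

API_URL = "https://api.crawlcrawl.com/v1/actors/extract-article"


def build_request(page_url: str, api_key: str) -> urllib.request.Request:
    """Build the POST request shown in the curl example."""
    body = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            # Assumption: the API expects a JSON content type.
            "Content-Type": "application/json",
        },
    )


def extract_article(page_url: str, api_key: str, timeout: float = 30.0) -> dict:
    """Send the request and return the decoded JSON response."""
    req = build_request(page_url, api_key)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Keeping request construction separate from the network call makes the payload and headers easy to inspect without spending a page-credit.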

Why trafilatura

Trafilatura is the academic gold standard for body extraction. On the standardised CleanEval benchmark it outperforms readability.js, Newspaper3k, and Goose in precision and recall. We use the Rust port (rs-trafilatura), so a single call returns in ~130 ms with no Python sidecar.

Returned fields

metadata.title — page title, normalized.
metadata.author — extracted from meta tags, schema.org Article, or article-level byline.
metadata.date — publish date in ISO-8601. Sourced from JSON-LD datePublished, OpenGraph article:published_time, or visible byline dates in fallback order.
word_count — body word count (post-boilerplate).
text — extracted article body as plain text.
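
The fields listed above map naturally onto a small typed record. The sketch below is one way to do that in Python; the `Article` class and `parse_article` function are hypothetical names, and treating the metadata fields as optional (for pages with no byline or date) is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Article:
    title: Optional[str]
    author: Optional[str]
    published: Optional[date]
    word_count: int
    text: str


def parse_article(response: dict) -> Article:
    """Map the documented response fields onto a typed record."""
    data = response["data"]
    meta = data.get("metadata", {})
    raw_date = meta.get("date")
    return Article(
        title=meta.get("title"),
        author=meta.get("author"),
        # Assumption: dates arrive as YYYY-MM-DD, as in the sample response.
        published=date.fromisoformat(raw_date) if raw_date else None,
        word_count=data["word_count"],
        text=data["text"],
    )
```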

When to use it

RAG ingestion. Chunking the body without nav/footer noise cuts token usage 30–60% on most news sites and produces materially better retrieval. Pair with /v1/crawls for full-site ingest.
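
As a rough illustration of the chunking step, the sketch below splits the returned `text` into overlapping word windows. The function name `chunk_words` and the default window and overlap sizes are assumptions; a production pipeline would more likely chunk by tokens or sentences.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words
    with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once the current window reaches the end of the text.
        if start + size >= len(words):
            break
    return chunks
```

Because the body arrives without navigation or footer text, every chunk produced this way is article content, which is where the token savings come from.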

Editorial monitoring. Author and date in the same response make it trivial to detect new-publication events on a competitor blog without parsing HTML yourself.
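
A new-publication check can be as small as comparing `metadata.date` with the last date you recorded. The sketch below assumes the date is a plain YYYY-MM-DD string, as in the sample response; `is_new_publication` is a hypothetical helper, and pages with no date are treated as not new.

```python
from datetime import date


def is_new_publication(metadata: dict, last_seen: date) -> bool:
    """Return True when the article's publish date is later than last_seen."""
    raw = metadata.get("date")
    if not raw:
        # No date extracted: nothing to compare, so do not flag it.
        return False
    return date.fromisoformat(raw) > last_seen
```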

Sentiment / classification pipelines. A clean body is a clean input; you avoid the failure mode where your classifier learns to read footer copyright notices.

Pricing

One page-credit per call. The $42 Studio tier includes 50,000 page-credits a month, which works out to $0.00084 per call. Diffbot's Article API costs $0.10–$0.25/call, so extract-article is roughly 120 to 300× cheaper. See full pricing →
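
The comparison can be checked with a few lines of arithmetic that use only the figures quoted above. The calculation assumes every included credit is used; the constant and function names are illustrative.

```python
STUDIO_PRICE_USD = 42.0
STUDIO_CREDITS = 50_000

# Effective cost of one extract-article call on the Studio tier.
cost_per_call = STUDIO_PRICE_USD / STUDIO_CREDITS


def times_cheaper(other_cost_per_call: float) -> float:
    """How many times cheaper a Studio-tier call is than another per-call price."""
    return other_cost_per_call / cost_per_call
```

With the quoted prices, `cost_per_call` is $0.00084, and `times_cheaper` gives about 119 at $0.10 per call and about 298 at $0.25 per call.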

Article Extraction.

Build your RAG on clean bodies, not nav menus.