ACTOR · /v1/actors/extract-article

Article Extraction.

Drop in a URL. Get back author, publish date, and the article body — boilerplate-stripped, sidebar-free, comment-free. Built on trafilatura, the highest-precision body extractor in production today.

The body, nothing else

Generic markdown converters keep navigation menus, footers, ad blocks, "related posts" sidebars, and Disqus comments. For RAG ingestion, news aggregation, or sentiment analysis, that noise degrades retrieval quality and inflates token cost. extract-article runs trafilatura against the page and returns just the body plus structured metadata.

curl -X POST https://api.crawlcrawl.com/v1/actors/extract-article \
  -H "Authorization: Bearer crk_..." \
  -H "Content-Type: application/json" \
  -d '{"url":"https://www.geeksforgeeks.org/dsa/binary-search/"}'

# → 200
{
  "actor": "extract-article",
  "url": "https://www.geeksforgeeks.org/dsa/binary-search/",
  "elapsed_ms": 129,
  "data": {
    "metadata": {
      "title": "Binary Search - GeeksforGeeks",
      "author": "Sandeep Jain",
      "date": "2014-01-28"
    },
    "word_count": 2798,
    "text": "Binary Search is a searching algorithm ..."
  }
}
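
The same call can be made from application code. Below is a minimal Python sketch using only the standard library. It assumes the endpoint accepts a JSON body sent with a Content-Type: application/json header and returns the shape shown in the sample response; the helper names build_request and extract_article are illustrative and not part of any official SDK.

```python
import json
import urllib.request

API_URL = "https://api.crawlcrawl.com/v1/actors/extract-article"


def build_request(page_url: str, api_key: str) -> urllib.request.Request:
    """Build the POST request for one page (hypothetical helper)."""
    body = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            # Assumption: the API expects a JSON content type.
            "Content-Type": "application/json",
        },
    )


def extract_article(page_url: str, api_key: str, timeout: float = 30.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_request(page_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Splitting request construction from sending keeps the network call in one place, so the request can be inspected or unit-tested without hitting the API.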

Why trafilatura

Trafilatura is the academic gold standard for body extraction. On the standardised CleanEval benchmark it outperforms readability.js, Newspaper3k, and Goose in both precision and recall. We use the Rust port (rs-trafilatura), so a single call returns in ~130 ms with no Python sidecar.

Returned fields
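
Going by the sample response above, the payload carries actor, url, elapsed_ms, and a data object holding metadata (title, author, date), word_count, and text. The Python sketch below models those fields as dataclasses. The types, and the choice to treat title, author, and date as optional, are assumptions inferred from that single example rather than a published schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArticleMetadata:
    title: Optional[str]
    author: Optional[str]  # assumption: may be absent on some pages
    date: Optional[str]    # ISO date string in the sample, e.g. "2014-01-28"


@dataclass
class ArticleResult:
    actor: str
    url: str
    elapsed_ms: int
    metadata: ArticleMetadata
    word_count: int
    text: str


def parse_result(payload: dict) -> ArticleResult:
    """Map a decoded JSON response onto the dataclasses above."""
    data = payload["data"]
    meta = data.get("metadata", {})
    return ArticleResult(
        actor=payload["actor"],
        url=payload["url"],
        elapsed_ms=payload["elapsed_ms"],
        metadata=ArticleMetadata(
            title=meta.get("title"),
            author=meta.get("author"),
            date=meta.get("date"),
        ),
        word_count=data["word_count"],
        text=data["text"],
    )
```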

When to use it

RAG ingestion. Chunking the body without nav/footer noise cuts token usage by 30–60% on most news sites and produces materially better retrieval. Pair with /v1/crawls for full-site ingest.
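
As a rough illustration of the chunking step, the sketch below splits the returned text field into overlapping word windows. The function name chunk_words and the default window and overlap sizes are assumptions chosen for the example; a real pipeline would tune them to its embedding model.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of at most max_words words, sharing `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk would then be embedded and indexed alongside the url, author, and date from the same response.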

Editorial monitoring. Author and date in the same response make it trivial to detect new-publication events on a competitor blog without parsing HTML yourself.
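
One way to turn that into a new-publication check is sketched below. It assumes the caller already holds extract-article responses for a set of candidate URLs and keeps a set of URLs seen on earlier runs; find_new_articles is a hypothetical helper, and the response shape is taken from the sample above.

```python
def find_new_articles(results: list[dict], seen_urls: set[str]) -> list[dict]:
    """Return responses whose URL was not seen before, oldest first by date."""
    fresh = [r for r in results if r["url"] not in seen_urls]
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings;
    # responses with no date sort first.
    fresh.sort(key=lambda r: r["data"]["metadata"].get("date") or "")
    # Remember these URLs so the next run skips them.
    seen_urls.update(r["url"] for r in fresh)
    return fresh
```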

Sentiment / classification pipelines. A clean body is a clean input; you avoid the failure mode where your classifier learns to read footer copyright notices.

Pricing

One page-credit per call. The $42 Studio tier includes 50,000 page-credits a month, which works out to about $0.00084 per call. Diffbot's Article API, by comparison, runs $0.10–$0.25 per call, so extract-article is roughly 120× to 300× cheaper. See full pricing →
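
The comparison can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this section ($42 for 50,000 page-credits, one credit per call, and $0.10–$0.25 per call for the Diffbot Article API); the function name cost_per_call is illustrative.

```python
def cost_per_call(monthly_price: float, credits: int) -> float:
    """Effective price of one extraction when a call costs one page-credit."""
    return monthly_price / credits


studio = cost_per_call(42.0, 50_000)  # dollars per call on the Studio tier
low_ratio = 0.10 / studio             # versus the low end of the quoted range
high_ratio = 0.25 / studio            # versus the high end of the quoted range

print(f"${studio:.5f} per call, {low_ratio:.0f}x to {high_ratio:.0f}x cheaper")
```

The roughly 300× figure corresponds to the upper end of the quoted Diffbot range.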

Where it fits

Build your RAG on clean bodies, not nav menus.

$42/mo for 50,000 extractions. Up to ~300× cheaper than Diffbot Article API.

Get an API key — free