Using the Trafilatura API

Introduction

Get the latest version of the cutting-edge Trafilatura download and extraction package directly as an API.

Its component Htmldate can be used separately to extract original and updated publication dates from a URL or a web page.

The following endpoints offer limited access for testing and prototyping. They accept a URL or HTML data as input.

A single URL parameter is required; pass it as a JSON parameter in a POST request:
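A minimal sketch of such a request, using only the Python standard library. The endpoint address is a placeholder and not the real service URL; the "url" key matches the example payload on this page. The request is only prepared here, and the sending step is left as a comment.

```python
import json
import urllib.request

# Placeholder endpoint (assumption): replace with the actual demo endpoint.
ENDPOINT = "https://api.example.invalid/extract-demo"


def build_request(payload: dict, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Prepare (but do not send) a POST request carrying a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


url_request = build_request({"url": "https://www.example.org"})

# To actually send it:
# with urllib.request.urlopen(url_request, timeout=30) as response:
#     answer = json.loads(response.read().decode("utf-8"))
```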

Pass the HTML file as a string:
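A short sketch of preparing HTML input. The key name "html" is an assumption, since this page only shows the "url" key; check it against the service before relying on it. Serializing with json.dumps takes care of escaping quotes and line breaks inside the document.

```python
import json

# An HTML document held as a single string (could also be read from a file).
html_document = '<html><body>\n<p class="lead">Some text.</p>\n</body></html>'

# The key name "html" is an assumption, not confirmed by this page.
body = json.dumps({"html": html_document}).encode("utf-8")
```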

Optional: the server supports GZip compression of the request.
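As an illustration, a request body could be compressed as follows. The use of the "Content-Encoding: gzip" header to announce the compression is an assumption about what the server expects. Compression mostly pays off for HTML input, since a short URL payload gains nothing from it.

```python
import gzip
import json

payload = {"url": "https://www.example.org"}
raw_body = json.dumps(payload).encode("utf-8")

# Compress the JSON body before sending it.
compressed_body = gzip.compress(raw_body)

# Headers announcing a gzip-compressed JSON body (assumed convention).
headers = {
    "Content-Type": "application/json",
    "Content-Encoding": "gzip",
}
```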

The endpoints return JSON containing either a "result" field or an "error" field.

The result is a string in the desired format or an empty string if no text has been found.
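The answer could be handled along these lines. The function and exception names are hypothetical, and the sketch assumes the "error" field holds a message string.

```python
import json


class ExtractionError(Exception):
    """Raised when the answer carries an "error" field (hypothetical name)."""


def read_answer(raw: str) -> str:
    """Return the extracted text, which may be empty, or raise on error."""
    answer = json.loads(raw)
    if "error" in answer:
        raise ExtractionError(answer["error"])
    return answer["result"]


text = read_answer('{"result": "Some extracted text"}')
if not text:
    print("No text has been found.")
```

An empty string is a valid result rather than an error, so callers may want to treat the two cases separately.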

The API focuses on the open web and offers no access to text behind paywalls or logins. Use the demo to get a glimpse of the extracted content.
The download utility makes no effort to hide itself; sending too many requests to the same URL or website may alter the results.
The demo versions are rate-limited; for larger volumes, see the authenticated API.
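One way to stay on the safe side is to pause between consecutive requests on the client. The class below is a hypothetical helper, and the interval is an arbitrary value: this page does not state the actual rate limit.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Sleep if needed and return the number of seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


# Arbitrary interval (assumption): call throttle.wait() before each request.
throttle = Throttle(min_interval=2.0)
```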

Additional extraction parameters can be passed under the "args" key.

Example:

{
  "url": "https://www.example.org",
  "args": {
    "output_format": "xml",
    "include_links": true
  }
}
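When the request is assembled in Python, the same structure can be written as a dict and serialized with json.dumps, which turns Python's True into the JSON literal true. A minimal sketch:

```python
import json

payload = {
    "url": "https://www.example.org",
    "args": {
        "output_format": "xml",
        "include_links": True,  # serialized as JSON true
    },
}

# JSON text ready to be used as the POST body.
body = json.dumps(payload)
```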

Notable parameters:

record_id: Add an ID to the metadata.
no_fallback: Skip the backup extraction with readability-lxml and justext.
favor_precision: Prefer less text but correct extraction.
favor_recall: When unsure, prefer more text.
include_comments: Extract comments along with the main text.
output_format: Define an output format: ‘txt’, ‘csv’, ‘json’, ‘xml’, or ‘xmltei’.
include_tables: Take into account information within the HTML <table> element.
include_images: Take images into account (experimental).
include_formatting: Keep structural elements related to formatting (only valuable if output_format is set to XML).
include_links: Keep links along with their targets (experimental).
deduplicate: Remove duplicate segments and documents.
date_extraction_params: Provide extraction parameters to htmldate as a dict.
prune_xpath: Provide an XPath expression to prune the tree before extraction.
config: Directly provide a configparser configuration.
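Several of these parameters can be combined in one payload. In the sketch below, the keys inside date_extraction_params are taken from htmldate's find_date() function, and the XPath expression is purely illustrative. That the API forwards nested values unchanged to the extractor is an assumption.

```python
import json

args = {
    "output_format": "json",
    "favor_precision": True,
    "include_comments": False,
    # Illustrative XPath: drop footer elements before extraction.
    "prune_xpath": "//footer",
    # Nested options for htmldate (assumed to be passed through as-is).
    "date_extraction_params": {
        "extensive_search": True,
        "max_date": "2018-07-01",
    },
}

body = json.dumps({"url": "https://www.example.org", "args": args})
```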

For more information, see the documentation page "Usage with Python".

Input data follows the same pattern.

Parameters of interest:

The data you send through the API is your responsibility only; it is not kept.

By using the endpoints you consent to the following terms:

A list of IPs used to connect to this website may be kept for a few days to prevent abuse.
Error logs may be kept for a few weeks for debugging purposes. These merely include metadata about a request and not the actual content.

There is no tracking on this page.

For feedback and contact information, see the GitHub pages of the trafilatura repository and the htmldate repository.