Using the Trafilatura API

  1. Introduction

  2. Endpoints

  3. Limitations

  4. Options

  5. Logs and data protection

  6. Contact

Introduction

Get the latest version of the cutting-edge Trafilatura download and extraction package directly as an API.

Its component Htmldate can be used separately to extract original and updated publication dates from a URL or a web page.

Endpoints

The following endpoints offer limited access for testing and prototyping. They accept a URL or HTML data as input.

Provide URLs

A single URL parameter is required. Pass it as a JSON parameter in a POST request:

  • url parameter: {"url": "https://www.example.org"}
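
As a minimal sketch, such a request could be assembled in Python with the standard library alone. The endpoint address `API_URL` and the helper name `build_url_request` are placeholders introduced for illustration. This page does not state the actual address, so the call that sends the request is left commented out.

```python
import json
import urllib.request

# Hypothetical endpoint address: replace it with the real one.
API_URL = "https://api.example.org/extract"


def build_url_request(target_url, api_url=API_URL):
    """Build a POST request carrying the documented {"url": ...} JSON body."""
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_url_request("https://www.example.org")

# Sending is left out because API_URL above is only a placeholder:
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(json.load(response))
```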

Bring your own data

Pass the HTML file as a string:

  • htmldata parameter: {"htmldata": "<html><head/><body/></html>"}

Optional: the server supports GZip compression of the request.
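
Assuming the server recognises compressed requests through the standard `Content-Encoding: gzip` header (this page does not specify the mechanism), the HTML payload might be prepared as follows. The helper name `build_gzip_body` is hypothetical.

```python
import gzip
import json


def build_gzip_body(html):
    """Return the gzip-compressed JSON body and matching request headers."""
    payload = json.dumps({"htmldata": html}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: compression is signalled with this standard HTTP header.
        "Content-Encoding": "gzip",
    }
    return gzip.compress(payload), headers


body, headers = build_gzip_body("<html><head/><body/></html>")
```

The compressed bytes and headers can then be sent in a POST request in the same way as an uncompressed body.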

Response

The endpoints return JSON with a single field, either "result" or "error":

  • {"result": "Example text"} (extract)

  • {"result": "2021-06-20"} (htmldate)

  • {"error": "error message"}

The result is a string in the desired format or an empty string if no text has been found.
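
A client can branch on which of the two fields is present. The sketch below assumes the response body has exactly one of the shapes listed above. `ApiError` and `read_result` are hypothetical names, not part of the API.

```python
import json


class ApiError(Exception):
    """Raised when the response carries an "error" field (hypothetical helper)."""


def read_result(raw):
    """Return the "result" string, or raise ApiError with the server's message."""
    data = json.loads(raw)
    if "error" in data:
        raise ApiError(data["error"])
    return data["result"]


text = read_result('{"result": "Example text"}')
# An empty string means the request succeeded but no text was found.
empty = read_result('{"result": ""}')
```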

Limitations

Options

Trafilatura configuration

Additional parameters can be passed in an "args" object alongside the input.

Example:

{
  "url": "https://www.example.org",
  "args": {
    "output_format": "xml",
    "include_links": true
  }
}
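
When the request body is assembled in Python, `json.dumps` takes care of JSON syntax, for instance by writing the Python literal `True` as `true`. A small sketch using the same values as the example above:

```python
import json

payload = {
    "url": "https://www.example.org",
    "args": {
        "output_format": "xml",
        "include_links": True,  # Python literal, serialized as JSON true
    },
}

# The resulting string is valid JSON and can be used as the POST body.
body = json.dumps(payload)
```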

Notable parameters:

  • record_id: Add an ID to the metadata.

  • no_fallback: Skip the backup extraction with readability-lxml and justext.

  • favor_precision: Prefer less text but correct extraction.

  • favor_recall: When unsure, prefer more text.

  • include_comments: Extract comments along with the main text.

  • output_format: Define an output format: ‘txt’, ‘csv’, ‘json’, ‘xml’, or ‘xmltei’.

  • include_tables: Take into account information within the HTML <table> element.

  • include_images: Take images into account (experimental).

  • include_formatting: Keep structural elements related to formatting (only valuable if output_format is set to XML).

  • include_links: Keep links along with their targets (experimental).

  • deduplicate: Remove duplicate segments and documents.

  • date_extraction_params: Provide extraction parameters to htmldate as dict().

  • prune_xpath: Provide an XPath expression to prune the tree before extraction.

  • config: Directly provide a configparser configuration.

For more information, see the documentation page "Usage with Python".

Date extraction with Htmldate

Input data follows the same pattern.

Parameters of interest:

  • outputformat: defaults to %Y-%m-%d

  • extensive_search: defaults to True

  • original_date: True or False, original vs. updated publication date

For more information, see the htmldate documentation.
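
Assuming Htmldate parameters are passed through the same "args" object as the extraction options (this page only states that input follows the same pattern), a request body might be built like this. The helper name `build_date_payload` is hypothetical, and its defaults mirror the ones listed above.

```python
import json


def build_date_payload(url, original_date, outputformat="%Y-%m-%d",
                       extensive_search=True):
    """Build a JSON body for date extraction (placement in "args" is assumed)."""
    return json.dumps({
        "url": url,
        "args": {
            "outputformat": outputformat,
            "extensive_search": extensive_search,
            "original_date": original_date,
        },
    })


# Ask for the original publication date rather than the last update.
date_body = build_date_payload("https://www.example.org", original_date=True)
```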

Logs and data protection

You alone are responsible for the data you send through the API. It is not kept.

By using the endpoints you consent to the following terms:

There is no tracking on this page.

Contact

For feedback and contact information, see the GitHub pages of the trafilatura repository and the htmldate repository.