Using the Trafilatura API
Introduction
Get the latest version of the cutting-edge Trafilatura download and extraction package directly as an API:
Fast URL download, or use HTML file as input
Extraction of key elements like main text and metadata
Configurable output
Its component Htmldate can be used separately to extract original and updated publication dates from a URL or a web page.
Endpoints
The following endpoints offer limited access for testing and prototyping. They accept a URL or HTML data as input.
Extraction of main text and metadata: /extract-demo
Date extraction: /htmldate-demo
Provide URLs
A single URL parameter is required. Pass it as a JSON parameter in a POST request:
url parameter: {"url": "https://www.example.org"}
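As a minimal sketch, the request described above could be prepared in Python with the standard library. The base URL below is a placeholder (this page does not state the API host), and the helper name build_url_request is only illustrative:

```python
import json
import urllib.request

# Hypothetical base URL: replace it with the actual API host.
API_BASE = "https://api.example.org"


def build_url_request(target_url, endpoint="/extract-demo"):
    """Prepare a POST request carrying the URL as a JSON parameter."""
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        API_BASE + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_url_request("https://www.example.org")
# Sending is left to the caller, for example:
# urllib.request.urlopen(request, timeout=30)
```

The request object is only built here, not sent, so the sketch can be inspected without network access.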
Bring your own data
Pass the HTML file as a string:
htmldata parameter: {"htmldata": "<html><head/><body/></html>"}
Optional: the server supports GZip compression of the request.
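A possible way to compress the request body is sketched below. It assumes the server recognizes compressed requests through a Content-Encoding: gzip header, which is a common convention but is not confirmed on this page; build_gzip_body is an illustrative name:

```python
import gzip
import json


def build_gzip_body(html):
    """Serialize the htmldata payload and compress it with GZip."""
    payload = json.dumps({"htmldata": html}).encode("utf-8")
    compressed = gzip.compress(payload)
    headers = {
        "Content-Type": "application/json",
        # Assumption: the server detects compression from this header.
        "Content-Encoding": "gzip",
    }
    return compressed, headers


body, headers = build_gzip_body("<html><head/><body/></html>")
```

Compression mostly pays off for large HTML documents; for very small payloads the compressed body can be larger than the original.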
Response
The endpoints return a JSON object with a single field, either "result" or "error":
{"result": "Example text"} (extract)
{"result": "2021-06-20"} (htmldate)
{"error": "error message"}
The result is a string in the desired format or an empty string if no text has been found.
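The response format above can be handled with a small helper, sketched here under the assumption that exactly one of the two fields is present. ApiError and read_response are illustrative names, not part of the API:

```python
import json


class ApiError(Exception):
    """Raised when the response carries an "error" field (illustrative helper)."""


def read_response(raw):
    """Return the "result" string, or raise ApiError on an "error" field."""
    data = json.loads(raw)
    if "error" in data:
        raise ApiError(data["error"])
    # An empty string means that no text has been found.
    return data.get("result", "")


text = read_response('{"result": "Example text"}')
empty = read_response('{"result": ""}')
```

Callers may want to treat an empty result differently from an error, since it signals a successful request without extractable text.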
Limitations
The API focuses on the open web and offers no access to text behind paywalls or logins. Use the demo to get a glimpse of the extracted content.
The download utility makes no effort to hide itself. Sending too many requests to the same URL or website may alter the results.
The demo versions are rate-limited. For larger volumes, see the authenticated API.
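To avoid sending too many requests to the same website, a client can space out its calls per host. The following is only a sketch: the 5-second interval is an arbitrary choice, not a documented limit, and HostThrottle is an illustrative name:

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # arbitrary value, adjust as needed
        self.clock = clock
        self.sleep = sleep
        self.last_seen = {}

    def wait(self, url):
        """Sleep if needed, record the request time, return the delay applied."""
        host = urlsplit(url).netloc
        now = self.clock()
        delay = 0.0
        if host in self.last_seen:
            elapsed = now - self.last_seen[host]
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self.sleep(delay)
        self.last_seen[host] = now + delay
        return delay
```

Calling wait(url) before each API request keeps requests to different hosts unaffected while delaying repeated requests to the same one.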
Options
Trafilatura configuration
Additional parameters can be passed as arguments.
Example:
{
    "url": "https://www.example.org",
    "args": {
        "output_format": "xml",
        "include_links": true
    }
}
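When the request body is built in Python, the same structure can be written as a dictionary and serialized with json.dumps, which converts the Python boolean True to the JSON literal true. This is a small sketch using only the parameters from the example:

```python
import json

payload = {
    "url": "https://www.example.org",
    "args": {
        "output_format": "xml",
        "include_links": True,  # Python boolean, serialized as JSON true
    },
}

# The resulting string is valid JSON and can be sent as the POST body.
body = json.dumps(payload)
```

Serializing with a JSON library rather than writing the string by hand avoids problems such as trailing commas or Python-style booleans in the request.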
Notable parameters:
record_id: Add an ID to the metadata.
no_fallback: Skip the backup extraction with readability-lxml and justext.
favor_precision: Prefer less text but correct extraction.
favor_recall: When unsure, prefer more text.
include_comments: Extract comments along with the main text.
output_format: Define an output format: ‘txt’, ‘csv’, ‘json’, ‘xml’, or ‘xmltei’.
include_tables: Take into account information within the HTML <table> element.
include_images: Take images into account (experimental).
include_formatting: Keep structural elements related to formatting (only valuable if output_format is set to XML).
include_links: Keep links along with their targets (experimental).
deduplicate: Remove duplicate segments and documents.
date_extraction_params: Provide extraction parameters to htmldate as dict().
prune_xpath: Provide an XPath expression to prune the tree before extraction.
config: Directly provide a configparser configuration.
For more information, see the documentation page "Usage with Python".
Date extraction with Htmldate
Input data follows the same pattern.
Parameters of interest:
outputformat: defaults to %Y-%m-%d
extensive_search: defaults to True
original_date: True or False, original vs. updated publication date
For more see htmldate's documentation.
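A request body for /htmldate-demo could be assembled as follows. This sketch assumes that the parameters are passed under the same "args" key as for text extraction, which the page only implies by saying that input data follows the same pattern; build_htmldate_payload is an illustrative name:

```python
import json


def build_htmldate_payload(url, **args):
    """Build the JSON body for a date extraction request."""
    payload = {"url": url}
    if args:
        # Assumption: htmldate parameters go under "args" as for extraction.
        payload["args"] = args
    return json.dumps(payload)


body = build_htmldate_payload(
    "https://www.example.org",
    outputformat="%Y-%m-%d",
    original_date=True,
)
```

Leaving out the keyword arguments produces a body with the URL only, in which case the defaults listed above apply.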
Logs and data protection
The data you send through the API is your responsibility only; it is not kept.
By using the endpoints you consent to the following terms:
A list of IPs used to connect to this website may be kept for a few days to prevent abuse.
Error logs may be kept for a few weeks for debugging purposes. These merely include metadata about a request and not the actual content.
There is no tracking on this page.
Contact
For feedback and contact information, see the GitHub pages of the trafilatura repository and htmldate repository.