Extract Content

Extracts structured content and statistics from a web page, including headers, paragraphs, links, and images.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Waits in seconds before executing the node.
  • Delay After (sec) - Waits in seconds after executing the node.
  • Continue On Error - If true, the automation continues even when this node raises an error. The default value is false.

Info: If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.

Inputs

  • URL - The full URL of the web page from which to extract content.

Output

  • Content - An array of content objects, each containing the type of content (h1, h2, h3, h4, h5, p, list, etc.), the actual content text, and any associated links or images.
  • Stats - Statistics about the extracted content, including word count, headers count, links count, and images count.
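
To make the two outputs concrete, here is a minimal Python sketch of what a content object and the derived statistics might look like. The field names (`type`, `text`, `links`, `images`) and the stats keys are assumptions chosen for illustration; this page does not specify the node's exact schema, and counting words by whitespace is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Header types listed in the Content output description.
HEADER_TYPES = {"h1", "h2", "h3", "h4", "h5"}


@dataclass
class ContentItem:
    """One entry of the Content array (hypothetical field names)."""
    type: str                                   # e.g. "h2", "p", "list"
    text: str                                   # the extracted text
    links: List[str] = field(default_factory=list)
    images: List[str] = field(default_factory=list)


def compute_stats(items: List[ContentItem]) -> Dict[str, int]:
    """Derive Stats-like counters from a list of content items."""
    return {
        # Assumption: words are whitespace-separated tokens.
        "word_count": sum(len(item.text.split()) for item in items),
        "headers_count": sum(1 for item in items if item.type in HEADER_TYPES),
        "links_count": sum(len(item.links) for item in items),
        "images_count": sum(len(item.images) for item in items),
    }
```

Keeping the statistics as a pure function of the content array means they can be recomputed after filtering, for example after dropping everything except paragraphs.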

How It Works

The Extract Content node analyzes a web page and extracts structured content by:

  1. Visiting the specified URL and retrieving the HTML content
  2. Parsing the HTML to identify relevant content elements (headers, paragraphs, lists, etc.)
  3. Extracting text content, links, and images from these elements
  4. Calculating statistics about the extracted content
  5. Returning both the structured content and statistics as separate outputs
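
Steps 1 to 3 above can be approximated with Python's standard library. This is only a sketch of the general technique, not the node's actual implementation: the `ContentParser` class, the `extract` and `extract_from_url` helpers, and the dictionary keys are hypothetical, and only headers and paragraphs are tracked to keep the example short.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Subset of the elements the node handles, kept small for the sketch.
TRACKED = {"h1", "h2", "h3", "h4", "h5", "p"}


class ContentParser(HTMLParser):
    """Collects text, links, and images for each tracked element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None  # the tracked element currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in TRACKED and self._current is None:
            self._current = {"type": tag, "text": "", "links": [], "images": []}
        elif self._current is not None:
            # Links and images nested inside the open element belong to it.
            if tag == "a" and attrs.get("href"):
                self._current["links"].append(attrs["href"])
            elif tag == "img" and attrs.get("src"):
                self._current["images"].append(attrs["src"])

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._current["type"]:
            # Collapse runs of whitespace before storing the text.
            self._current["text"] = " ".join(self._current["text"].split())
            self.items.append(self._current)
            self._current = None


def extract(html):
    """Steps 2 and 3: parse HTML and return the list of content items."""
    parser = ContentParser()
    parser.feed(html)
    parser.close()
    return parser.items


def extract_from_url(url, timeout=10.0):
    """Step 1: fetch the page, then hand the HTML to extract()."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return extract(resp.read().decode(charset, errors="replace"))
```

Separating `extract` from `extract_from_url` keeps the parsing logic testable on a plain HTML string without any network access.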

Requirements

  • A valid URL to a web page accessible from the internet
  • The web page must return valid HTML content

Error Handling

The node will return specific errors in the following cases:

  • Empty or invalid URL input
  • Network errors when connecting to the specified URL
  • Errors parsing the HTML content of the page
  • Content not found on the page
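
The four error cases could be modeled as distinct exception types so that a caller can react to each one separately. The sketch below is illustrative only: the exception names, the `validate_url` and `ensure_content` helpers, and the restriction to `http`/`https` schemes are assumptions, since this page does not list the node's actual error codes or validation rules.

```python
from urllib.parse import urlparse


class ExtractError(Exception):
    """Base class for the hypothetical error categories below."""


class InvalidURLError(ExtractError):
    """Empty or invalid URL input."""


class NetworkError(ExtractError):
    """Failure while connecting to the URL."""


class ParseError(ExtractError):
    """Failure while parsing the page's HTML."""


class ContentNotFoundError(ExtractError):
    """The page yielded no extractable content."""


def validate_url(url):
    """Return the trimmed URL, or raise InvalidURLError."""
    if not url or not url.strip():
        raise InvalidURLError("URL is empty")
    url = url.strip()
    parsed = urlparse(url)
    # Assumption: only absolute http(s) URLs count as valid input.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise InvalidURLError("not an absolute http(s) URL: " + url)
    return url


def ensure_content(items):
    """Return items unchanged, or raise if nothing was extracted."""
    if not items:
        raise ContentNotFoundError("no content found on the page")
    return items
```

Validating the URL before any request is made lets the cheapest failure surface first, without touching the network.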

Usage Notes

  • The node locates the main content area of a page by finding the deepest common ancestor of the page's h2 tags, and extracts content from within that element
  • If fewer than 3 h2 tags are found, the node falls back to extracting all relevant content from the whole page
  • Content is extracted for the following HTML elements: h1, h2, h3, h4, h5, p, ol, ul, div, span, and figure
  • For each content element, associated links and images are also extracted when present
  • The stats output provides useful metrics about the extracted content, such as word count and number of links
  • The extracted content is structured to preserve the semantic meaning of the original HTML elements
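
The "deepest common ancestor of h2 tags" heuristic from the first two notes can be illustrated as follows. This is a sketch under stated assumptions rather than the node's real code: `H2PathCollector` and `main_content_root` are hypothetical names, and Python's `html.parser` does not apply the HTML5 rules for implicitly closed tags, so the result is only reliable for reasonably well-formed markup.

```python
from html.parser import HTMLParser

# Void elements never receive a closing tag, so they are not put on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class H2PathCollector(HTMLParser):
    """Records the chain of open ancestors for every h2 on the page."""

    def __init__(self):
        super().__init__()
        self._stack = []    # (tag, serial) for each currently open element
        self._serial = 0    # makes two sibling <div>s distinguishable
        self.h2_paths = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if tag == "h2":
            self.h2_paths.append(list(self._stack))
        self._serial += 1
        self._stack.append((tag, self._serial))

    def handle_endtag(self, tag):
        # Close back to the most recent matching open tag; ignore strays.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break


def main_content_root(html, min_h2=3):
    """Return the tag name of the deepest common ancestor of all h2 tags.

    Returns None when fewer than min_h2 h2 tags exist (or when they share
    no ancestor), signalling that the caller should use the whole page.
    """
    collector = H2PathCollector()
    collector.feed(html)
    collector.close()
    paths = collector.h2_paths
    if len(paths) < min_h2:
        return None
    common = []
    # Walk the ancestor chains level by level until they diverge.
    for level in zip(*paths):
        if all(node == level[0] for node in level):
            common.append(level[0])
        else:
            break
    return common[-1][0] if common else None
```

The common ancestor is simply the longest shared prefix of the ancestor chains, which avoids building a full document tree.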