Extract Content

Extracts structured content and statistics from a web page, including headers, paragraphs, links, and images.

Common Properties

Name - The custom name of the node.
Color - The custom color of the node.
Delay Before (sec) - Time to wait, in seconds, before executing the node.
Delay After (sec) - Time to wait, in seconds, after executing the node.
Continue On Error - Automation will continue regardless of any error. The default value is false.

Info: If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.

Output

Content - An array of content objects, each containing the type of content (h1, h2, h3, h4, h5, p, list, etc.), the actual content text, and any associated links or images.
Stats - Statistics about the extracted content, including word count, headers count, links count, and images count.
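
To make the shape of the two outputs concrete, here is a minimal Python sketch. The class names, field names (type, content, links, images, word_count, and so on) and the compute_stats helper are assumptions made for illustration; the node's actual property names are not specified here and may differ.

```python
from dataclasses import dataclass, field
from typing import List

# Header types counted by the Stats output (taken from the types listed above).
HEADER_TYPES = {"h1", "h2", "h3", "h4", "h5"}


@dataclass
class ContentItem:
    """One entry of the Content array (field names are hypothetical)."""
    type: str                                        # e.g. "h1", "p", "list"
    content: str                                     # the element's text
    links: List[str] = field(default_factory=list)   # URLs found in the element
    images: List[str] = field(default_factory=list)  # image sources found in the element


@dataclass
class Stats:
    """The Stats output (field names are hypothetical)."""
    word_count: int
    headers_count: int
    links_count: int
    images_count: int


def compute_stats(items: List[ContentItem]) -> Stats:
    """Derive the four statistics from a list of content items."""
    return Stats(
        word_count=sum(len(item.content.split()) for item in items),
        headers_count=sum(1 for item in items if item.type in HEADER_TYPES),
        links_count=sum(len(item.links) for item in items),
        images_count=sum(len(item.images) for item in items),
    )


sample = [
    ContentItem("h1", "Getting Started"),
    ContentItem("p", "Read the full guide first.", links=["https://example.com/guide"]),
]
stats = compute_stats(sample)
```

In this sketch the word count is a simple whitespace split; the node may count words differently.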

The Extract Content node analyzes a web page and extracts structured content by:

Visiting the specified URL and retrieving the HTML content
Parsing the HTML to identify relevant content elements (headers, paragraphs, lists, etc.)
Extracting text content, links, and images from these elements
Calculating statistics about the extracted content
Returning both the structured content and statistics as separate outputs
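
The steps above can be sketched with the Python standard library. This is an illustrative approximation, not the node's implementation: the ContentParser class, the extract_from_html and extract_content functions, and the dictionary keys are all hypothetical, and only headers and paragraphs are handled to keep the example short.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5"}
CONTENT_TAGS = HEADER_TAGS | {"p"}  # simplified: lists, div, span, figure omitted


class ContentParser(HTMLParser):
    """Collects text, links, and images for each header or paragraph."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None  # the content element currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is None:
            if tag in CONTENT_TAGS:
                self._current = {"type": tag, "content": "", "links": [], "images": []}
        elif tag == "a" and attrs.get("href"):
            self._current["links"].append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self._current["images"].append(attrs["src"])

    def handle_data(self, data):
        if self._current is not None:
            self._current["content"] += data

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._current["type"]:
            # Collapse runs of whitespace before storing the text.
            self._current["content"] = " ".join(self._current["content"].split())
            self.items.append(self._current)
            self._current = None


def extract_from_html(html):
    """Parse HTML and return (content, stats) as two separate values."""
    parser = ContentParser()
    parser.feed(html)
    parser.close()
    items = parser.items
    stats = {
        "word_count": sum(len(i["content"].split()) for i in items),
        "headers_count": sum(1 for i in items if i["type"] in HEADER_TAGS),
        "links_count": sum(len(i["links"]) for i in items),
        "images_count": sum(len(i["images"]) for i in items),
    }
    return items, stats


def extract_content(url):
    """Fetch the page at `url`, then extract content and statistics."""
    with urlopen(url, timeout=30) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")
    return extract_from_html(html)


page = '<h1>Title</h1><p>Hello <a href="/x">world</a> <img src="a.png"></p>'
content, stats = extract_from_html(page)
```

Calling extract_content with a real URL would perform the fetch step as well; the sample above skips the network and parses an inline string instead.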

The node will return specific errors in the following cases:

Usage Notes

The node locates the main content area of a page by finding the deepest common ancestor of the page's h2 tags, and extracts content from that element
If fewer than 3 h2 tags are found, the node falls back to extracting all relevant content from the whole page
Content is extracted for the following HTML elements: h1, h2, h3, h4, h5, p, ol, ul, div, span, and figure
For each content element, associated links and images are also extracted when present
The stats output provides useful metrics about the extracted content, such as word count and number of links
The extracted content is structured to preserve the semantic meaning of the original HTML elements
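
The main-content heuristic described in these notes (deepest common ancestor of the h2 tags, with a whole-page fallback below 3 h2 tags) can be sketched as follows. The function names and the MIN_H2_COUNT constant are hypothetical, and xml.etree.ElementTree is used only because it ships with Python; it requires well-formed markup, so real-world HTML would need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

MIN_H2_COUNT = 3  # below this, fall back to the whole page


def path_from_root(node, parents):
    """Return the chain of elements from the root down to `node`."""
    path = [node]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    path.reverse()
    return path


def find_main_content(root):
    """Return the deepest element that contains every h2, or the root as a fallback."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    h2_nodes = list(root.iter("h2"))
    if len(h2_nodes) < MIN_H2_COUNT:
        return root

    # Ancestor chain of each h2, excluding the h2 itself.
    chains = [path_from_root(h2, parents)[:-1] for h2 in h2_nodes]

    # Walk the chains level by level; the last level shared by all is the answer.
    common = root
    for level in zip(*chains):
        if all(node is level[0] for node in level):
            common = level[0]
        else:
            break
    return common


doc = ET.fromstring(
    "<html><body>"
    "<nav><a href='/'>Home</a></nav>"
    "<div id='wrap'><article id='main'>"
    "<h2>A</h2><p>x</p><section><h2>B</h2></section><h2>C</h2>"
    "</article></div>"
    "</body></html>"
)
main = find_main_content(doc)

sparse = ET.fromstring("<html><body><div><h2>A</h2><h2>B</h2></div></body></html>")
fallback = find_main_content(sparse)
```

In the first sample the three h2 tags share the article element, so the navigation links outside it are excluded. The second sample has only two h2 tags, so the whole document is returned.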