Extract Content
Extracts structured content and statistics from a web page, including headers, paragraphs, links, and images.
Common Properties
- Name - The custom name of the node.
- Color - The custom color of the node.
- Delay Before (sec) - The number of seconds to wait before executing the node.
- Delay After (sec) - The number of seconds to wait after executing the node.
- Continue On Error - When true, the automation continues even if this node raises an error. The default value is false.
Note: If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.
Inputs
- URL - The full URL of the web page from which to extract content.
Output
- Content - An array of content objects, each containing the type of content (h1, h2, h3, h4, h5, p, list, etc.), the actual content text, and any associated links or images.
- Stats - Statistics about the extracted content, including word count, headers count, links count, and images count.
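To make the shape of the Content output concrete, here is a minimal Python sketch of how a downstream step might model and consume it. The field names (`type`, `text`, `links`, `images`) and the `outline` helper are assumptions made for illustration; the actual key names returned by the node may differ.

```python
from dataclasses import dataclass, field

HEADER_TYPES = ("h1", "h2", "h3", "h4", "h5")


@dataclass
class ContentItem:
    """Hypothetical model of one entry in the Content array."""
    type: str                                        # e.g. "h1", "p", "list"
    text: str                                        # the extracted text
    links: list[str] = field(default_factory=list)   # associated link URLs
    images: list[str] = field(default_factory=list)  # associated image URLs


def outline(items):
    """Build an indented outline from the header items only."""
    lines = []
    for item in items:
        if item.type in HEADER_TYPES:
            level = int(item.type[1])                # "h2" -> 2
            lines.append("  " * (level - 1) + item.text)
    return lines


items = [
    ContentItem("h1", "Guide"),
    ContentItem("p", "Intro text", links=["https://example.com"]),
    ContentItem("h2", "Setup"),
]
print("\n".join(outline(items)))
```

Because each item carries its element type, a consumer can rebuild the page outline or filter for paragraphs without parsing HTML again.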
How It Works
The Extract Content node analyzes a web page and extracts structured content by:
- Visiting the specified URL and retrieving the HTML content
- Parsing the HTML to identify relevant content elements (headers, paragraphs, lists, etc.)
- Extracting text content, links, and images from these elements
- Calculating statistics about the extracted content
- Returning both the structured content and statistics as separate outputs
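The parsing, extraction, and statistics steps above can be sketched with Python's standard `html.parser` module. This is an illustration under simplifying assumptions, not the node's actual implementation: it starts from HTML that has already been retrieved, handles only headers and paragraphs, and uses hypothetical dictionary keys for the content objects and stats.

```python
from html.parser import HTMLParser

HEADERS = {"h1", "h2", "h3", "h4", "h5"}
BLOCKS = HEADERS | {"p"}  # simplified: lists and other elements are omitted


class ContentExtractor(HTMLParser):
    """Collects one content object per header or paragraph element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None  # the block element currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in BLOCKS:
            self._current = {"type": tag, "text": "", "links": [], "images": []}
        elif self._current is not None:
            # Links and images are attached to the enclosing block element.
            if tag == "a" and attrs.get("href"):
                self._current["links"].append(attrs["href"])
            elif tag == "img" and attrs.get("src"):
                self._current["images"].append(attrs["src"])

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._current["type"]:
            # Collapse runs of whitespace before storing the text.
            self._current["text"] = " ".join(self._current["text"].split())
            self.items.append(self._current)
            self._current = None


def extract(html):
    """Return (content, stats) for an HTML string."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    items = parser.items
    stats = {
        "words": sum(len(i["text"].split()) for i in items),
        "headers": sum(1 for i in items if i["type"] in HEADERS),
        "links": sum(len(i["links"]) for i in items),
        "images": sum(len(i["images"]) for i in items),
    }
    return items, stats


content, stats = extract(
    '<h1>Title</h1><p>Hello <a href="/x">link</a> <img src="a.png"></p>'
)
print(stats)
```

Returning the content and the stats as two separate values mirrors the node's two outputs.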
Requirements
- A valid URL to a web page accessible from the internet
- The web page must return valid HTML content
Error Handling
The node will return specific errors in the following cases:
- Empty or invalid URL input
- Network errors when connecting to the specified URL
- Errors parsing the HTML content of the page
- Content not found on the page
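The error cases above could be separated along these lines. This is a sketch only: the exception class names and helper functions are hypothetical and are not the error codes the node actually returns. Parsing errors are left out because they depend on the parser in use.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


class ExtractError(Exception):
    """Hypothetical base class for the cases listed above."""


class InvalidURLError(ExtractError):
    pass


class NetworkError(ExtractError):
    pass


class ContentNotFoundError(ExtractError):
    pass


def validate_url(url):
    """Reject empty input and anything that is not a full http(s) URL."""
    if not url or not url.strip():
        raise InvalidURLError("URL is empty")
    try:
        parsed = urlparse(url.strip())
    except ValueError as exc:
        raise InvalidURLError(f"malformed URL: {url!r}") from exc
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise InvalidURLError(f"not a full http(s) URL: {url!r}")
    return parsed.geturl()


def fetch_html(url, timeout=10):
    """Fetch a page, mapping connection failures to NetworkError."""
    checked = validate_url(url)
    try:
        with urlopen(checked, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError as exc:  # URLError and timeouts are OSError subclasses
        raise NetworkError(f"could not fetch {checked!r}: {exc}") from exc


def require_content(items):
    """Raise when extraction produced no content objects."""
    if not items:
        raise ContentNotFoundError("no content elements found on the page")
    return items
```

Validating the URL before any network call keeps the "invalid input" case distinct from genuine connection failures, which is useful when a flow handles the two differently.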
Usage Notes
- The node extracts content from the main content area of a page by finding the deepest common ancestor of h2 tags
- If fewer than 3 h2 tags are found, the node falls back to extracting all relevant content from the page
- Content is extracted for the following HTML elements: h1, h2, h3, h4, h5, p, ol, ul, div, span, and figure
- For each content element, associated links and images are also extracted when present
- The Stats output reports word, header, link, and image counts for the extracted content
- Each content object keeps its element type (for example, h2 or p), so the semantic structure of the original HTML is preserved
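The main-content heuristic from the first two notes (deepest common ancestor of the h2 tags, with a fallback when fewer than 3 are found) can be illustrated as follows. This sketch is an assumption about how such a heuristic might work, not the node's code: it records the chain of open ancestors for every h2, then takes the longest shared prefix of those chains. The class and function names are hypothetical, and implicitly closed tags such as an unclosed `<p>` are handled only approximately.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and so never become ancestors.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class H2PathCollector(HTMLParser):
    """Records the chain of open ancestor elements for every h2."""

    def __init__(self):
        super().__init__()
        self._stack = []    # open elements as (unique id, tag, id attribute)
        self._next_id = 0
        self.h2_paths = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "h2":
            self.h2_paths.append(list(self._stack))
        self._stack.append((self._next_id, tag, dict(attrs).get("id")))
        self._next_id += 1

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for pos in range(len(self._stack) - 1, -1, -1):
            if self._stack[pos][1] == tag:
                del self._stack[pos:]
                break


def main_content_root(html, min_h2=3):
    """Return (tag, id attribute) of the deepest common ancestor of all
    h2 elements, or None to signal the whole-page fallback."""
    collector = H2PathCollector()
    collector.feed(html)
    collector.close()
    paths = collector.h2_paths
    if len(paths) < min_h2:
        return None
    common = paths[0]
    for path in paths[1:]:
        n = 0
        while n < min(len(common), len(path)) and common[n] == path[n]:
            n += 1
        common = common[:n]
    return common[-1][1:] if common else None


page = (
    '<html><body><nav><a href="/">Home</a></nav>'
    '<div id="main"><section><h2>A</h2></section>'
    '<section><h2>B</h2></section><h2>C</h2></div></body></html>'
)
print(main_content_root(page))
```

In the sample page the three h2 tags share the `div` with id `main` as their deepest common ancestor, so navigation outside that element would be skipped. A page with only two h2 tags returns `None`, which corresponds to the fallback of extracting from the whole page.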