Extract Text

Extracts all text content from HTML elements, removing HTML tags and returning plain text.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Waits in seconds before executing the node.
  • Delay After (sec) - Waits in seconds after executing the node.
  • Continue On Error - Automation will continue regardless of any error. The default value is false.
Info: If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.

Inputs

  • HTML Element - The HTML content from which to extract text.

Options

This node does not have any options.

Output

  • Text - The extracted plain text content from the HTML element.

How It Works

The Extract Text node parses HTML content and extracts all text content while removing HTML tags and markup. When executed, the node:

  1. Retrieves the HTML Element input variable
  2. Validates that the HTML content is not empty
  3. Loads the HTML content into a goquery document
  4. Uses the Text() method to extract all text content from the document
  5. Removes all HTML tags, scripts, styles, and other markup
  6. Preserves the text content and whitespace as much as possible
  7. Sets the extracted plain text as the output variable
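
The steps above can be sketched in Python using only the standard library. This is an illustrative approximation, not the node's implementation: the node is described as using goquery, and the names `extract_text` and `_TextCollector` are hypothetical. The sketch leaves entity references untouched to mirror the behavior described in the usage notes, and it treats whitespace-only input as empty, which is an assumption.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Hypothetical helper: gathers text in document order, skipping script/style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=False so entity references are reported separately
        # and can be written back unchanged instead of being decoded.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        # Named reference such as &amp; is kept as-is.
        if self._skip_depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        # Numeric reference such as &#39; is kept as-is.
        if self._skip_depth == 0:
            self.parts.append(f"&#{name};")


def extract_text(html_element: str) -> str:
    # Steps 1-2: read the input and reject empty content.
    if not html_element or not html_element.strip():
        raise ValueError("HTML Element input cannot be empty")
    # Steps 3-6: parse the markup and collect text, dropping tags, scripts, and styles.
    collector = _TextCollector()
    collector.feed(html_element)
    collector.close()
    # Step 7: return the plain text.
    return "".join(collector.parts)


if __name__ == "__main__":
    print(extract_text("<div><h1>Title</h1><p>Hello <b>world</b></p></div>"))
    # prints: TitleHello world
```

In the usage line, the heading and paragraph text run together because no structure from the original HTML is kept.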

Requirements

  • Valid HTML content
  • Non-empty HTML Element input

Error Handling

The node will return specific errors in the following cases:

  • Empty or invalid HTML Element input - "HTML Element input cannot be empty"
  • Malformed HTML that cannot be parsed
  • Issues with goquery document creation
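
As a rough illustration of the first case, the check below raises an error carrying the documented message when the input is missing or blank. `ExtractTextError` and `require_html_input` are hypothetical names, and treating whitespace-only or non-string input as empty is an assumption; the node's actual error type is not described here.

```python
class ExtractTextError(Exception):
    """Hypothetical error type used only for this sketch."""


def require_html_input(html_element):
    # Reject missing, non-string, or whitespace-only input (assumed behavior).
    if not isinstance(html_element, str) or not html_element.strip():
        raise ExtractTextError("HTML Element input cannot be empty")
    return html_element


if __name__ == "__main__":
    try:
        require_html_input("")
    except ExtractTextError as err:
        # A flow would typically route this to its own error handling.
        print(f"Extract Text failed: {err}")
```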

Usage Notes

  • Extracts text from every element in the provided HTML, including deeply nested elements
  • Removes all HTML tags, scripts, and styles
  • Preserves text content and basic whitespace, but not the original formatting or structure of the HTML
  • Keeps the extracted text in the order the content appears in the HTML
  • Works with partial HTML fragments as well as complete HTML documents
  • Does not decode HTML entities; they appear as-is in the output
  • Useful for web scraping, content analysis, and cleaning HTML into plain text for further processing
  • The output text can be passed to other text manipulation nodes
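
Because entities and whitespace pass through unchanged, a downstream step may want to tidy the result. The following is a minimal sketch of such post-processing in Python, assuming the extracted text is available as a string; `tidy_extracted_text` is a hypothetical helper, not part of the node.

```python
import html
import re


def tidy_extracted_text(text: str) -> str:
    # Decode entity references such as &amp; or &#39; that were left as-is.
    decoded = html.unescape(text)
    # Collapse runs of whitespace (spaces, tabs, newlines) into a single space.
    collapsed = re.sub(r"\s+", " ", decoded)
    # Trim leading and trailing whitespace.
    return collapsed.strip()


if __name__ == "__main__":
    print(tidy_extracted_text("  Tom &amp; Jerry\n\n  &#39;quoted&#39;  "))
    # prints: Tom & Jerry 'quoted'
```

Whether to collapse whitespace depends on the use case; preformatted content may need its line breaks kept.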