
Normalize Text

Normalizes text using lemmatization, stemming, or POS (Part-of-Speech) tagging operations. This preprocessing node helps standardize text before analysis by reducing words to their base forms or adding grammatical context.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Waits in seconds before executing the node.
  • Delay After (sec) - Waits in seconds after executing the node.
  • Continue On Error - Automation will continue regardless of any error. The default value is false.
info

If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.

Inputs

  • Text - Text content to normalize using the selected operation. Can be a string containing articles, documents, or any text data.

Options

  • Operation - Select the text normalization operation. Options:

    • Lemmatization - Reduces words to their dictionary base form (lemma) using linguistic analysis. Example: "running" → "run", "better" → "good"
    • Stemming - Reduces words to their root form using algorithmic rules. Example: "running" → "run", "better" → "better"
    • POS Tagging - Adds part-of-speech tags to each word. Example: "running fast" → "running/VERB fast/ADV"

    Default: Lemmatization.

  • Language - Language model to use for lemmatization and POS tagging. Supported languages: Catalan, Chinese, Croatian, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Japanese, Korean, Lithuanian, Macedonian, Norwegian Bokmål, Polish, Portuguese, Romanian, Russian, Slovenian, Spanish, Swedish, Turkish, Ukrainian. Default: English.

    Note: Stemming uses a different set of language codes internally.

  • Advanced Model - You can enter a spaCy model name for advanced use cases. This option will override the Language choice. Leave empty to use the Language selection.

Output

  • Lemmatized Text - The normalized text result (despite the name, this contains the result of whichever operation is selected).
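
A minimal sketch of reading the output in a downstream step, assuming the node's result is mapped to msg.result as in the examples on this page (the property name in your flow may differ):

// After Normalize Text
// Assumption: the output is mapped to msg.result, as in the examples below
const normalized = msg.result;   // lemmatized, stemmed, or POS-tagged text, depending on Operation
msg.text = normalized;           // reuse it as input for downstream text nodes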

How It Works

The Normalize Text node uses NLP libraries to standardize text:

Lemmatization

  • Uses spaCy language models for linguistic analysis
  • Considers word context and meaning
  • Produces valid dictionary words
  • More accurate but slower than stemming
  • Best for: semantic analysis, content understanding

Stemming

  • Uses NLTK's Snowball Stemmer
  • Applies algorithmic rules to strip word endings
  • Faster but may produce non-words
  • Best for: quick preprocessing, search applications

POS Tagging

  • Uses spaCy language models to identify word roles
  • Adds tags like NOUN, VERB, ADJ, ADV
  • Useful for grammatical analysis and filtering
  • Best for: understanding text structure, filtering by word type

Language Model Information

The node automatically downloads required language models on first use. Models are cached locally for subsequent runs.

  • Model Size: Large models (~500MB per language) for high accuracy
  • First Run: May take time to download models
  • Subsequent Runs: Fast, uses cached models

Practical Examples

Example 1: Basic Lemmatization for Keyword Matching

Normalize text before counting keywords to handle word variations:

// Original text with various word forms
msg.text = "The company is optimizing their processes. They optimized performance and will optimize efficiency.";

// Configure Normalize Text:
// Operation: Lemmatization
// Language: English

// After Normalize Text
msg.normalizedText = msg.result;
// Result: "the company be optimize their process . they optimize performance and will optimize efficiency ."

// Now count keywords on normalized text
msg.text = msg.normalizedText;
msg.keywords = ["optimize"];

// After Count Occurrences
console.log(`"optimize" appears ${msg.counts.optimize} times`); // Will count all variations

Example 2: Stemming for Fast Text Processing

Use stemming for quick preprocessing in large-scale analysis:

// Content to analyze
msg.text = "Running businesses requires careful planning. The runner runs daily to maintain fitness.";

// Configure Normalize Text:
// Operation: Stemming
// Language: English (uses 'english' stemmer internally)

// After Normalize Text
console.log("Stemmed:", msg.result);
// Result: "run busi requir care plan . the runner run daili to maintain fit ."

// Note: "running", "runner", "runs" all become "run"
// "businesses" becomes "busi" (not a real word, but consistent)

Example 3: POS Tagging for Content Quality Analysis

Analyze text structure and extract specific word types:

msg.text = "Artificial intelligence revolutionizes modern technology rapidly and efficiently.";

// Configure Normalize Text:
// Operation: POS Tagging
// Language: English

// After Normalize Text
const tagged = msg.result;
// Result: "artificial/ADJ intelligence/NOUN revolutionizes/VERB modern/ADJ technology/NOUN rapidly/ADV and/CCONJ efficiently/ADV ./PUNCT"

// Parse and analyze
const words = tagged.split(' ');
msg.wordTypes = {
  nouns: [],
  verbs: [],
  adjectives: [],
  adverbs: []
};

words.forEach(word => {
  const [text, pos] = word.split('/');
  if (pos === 'NOUN') msg.wordTypes.nouns.push(text);
  if (pos === 'VERB') msg.wordTypes.verbs.push(text);
  if (pos === 'ADJ') msg.wordTypes.adjectives.push(text);
  if (pos === 'ADV') msg.wordTypes.adverbs.push(text);
});

console.log("Nouns:", msg.wordTypes.nouns.join(", "));
console.log("Verbs:", msg.wordTypes.verbs.join(", "));
console.log("Adjectives:", msg.wordTypes.adjectives.join(", "));
console.log("Adverbs:", msg.wordTypes.adverbs.join(", "));

Example 4: Multi-language Lemmatization

Process content in different languages:

// Spanish content
msg.text = "Los desarrolladores están desarrollando nuevas aplicaciones. El desarrollo es rápido.";

// Configure Normalize Text:
// Operation: Lemmatization
// Language: Spanish

// After Normalize Text
console.log("Spanish normalized:", msg.result);
// "desarrolladores", "desarrollando", "desarrollo" → "desarrollar"

// Turkish content
msg.text = "Yazılımcılar yeni uygulamalar geliştiriyor. Geliştirme süreci hızlı.";

// Configure Normalize Text:
// Operation: Lemmatization
// Language: Turkish

// After Normalize Text
console.log("Turkish normalized:", msg.result);

Example 5: Preprocessing for Better Frequency Analysis

Normalize text before frequency analysis to group word variations:

// Collect documents
const rawDocuments = [
  {
    name: "Doc 1",
    text: "Machine learning algorithms are learning from data. Deep learning is a subset."
  },
  {
    name: "Doc 2",
    text: "Learning machines can learn patterns. The learned patterns help predictions."
  }
];

// Normalize each document
msg.normalizedDocuments = [];

for (let doc of rawDocuments) {
  msg.text = doc.text;

  // Configure Normalize Text:
  // Operation: Lemmatization
  // Language: English

  // After Normalize Text
  msg.normalizedDocuments.push({
    name: doc.name,
    text: msg.result
  });
}

// Now run Frequency Analysis on normalized documents
msg.documents = msg.normalizedDocuments;

// After Frequency Analysis
console.log("Keywords from normalized text:");
msg.results.rows.forEach(kw => {
  console.log(`${kw.Keyword}: ${kw['Total Count']}`);
});
// "learn" will include all forms: learning, learn, learned, learns

Example 6: Extract Only Nouns for Topic Modeling

Use POS tagging to extract nouns as topics:

msg.text = "Cloud computing services enable scalable infrastructure. Modern applications utilize distributed systems.";

// Configure Normalize Text:
// Operation: POS Tagging
// Language: English

// After Normalize Text
const tagged = msg.result;

// Extract only nouns
const words = tagged.split(' ');
msg.topics = words
  .filter(word => word.includes('/NOUN'))
  .map(word => word.split('/')[0])
  .filter(word => word.length > 3); // Filter short words

console.log("Topics (nouns):", msg.topics.join(", "));
// Result: "cloud, computing, services, infrastructure, applications, systems"

// Use these nouns as keywords for content analysis
msg.keywords = msg.topics;

Example 7: Comparing Normalization Methods

See the difference between lemmatization and stemming:

const testText = "The children are playing with beautiful toys in the running water.";

// First: Lemmatization
msg.text = testText;
// Operation: Lemmatization
// After Normalize Text
const lemmatized = msg.result;

// Second: Stemming
msg.text = testText;
// Operation: Stemming
// After Normalize Text
const stemmed = msg.result;

console.log("Original:", testText);
console.log("Lemmatized:", lemmatized);
// "the child be play with beautiful toy in the run water ."
console.log("Stemmed:", stemmed);
// "the children ar play with beauti toy in the run water ."

// Lemmatization: "children" → "child" (correct form)
// Stemming: "children" → "children" (unchanged)
// Lemmatization: "beautiful" → "beautiful" (unchanged, already base form)
// Stemming: "beautiful" → "beauti" (not a real word)

Example 8: Advanced Model Usage

Use specific spaCy models for specialized needs:

msg.text = "Advanced natural language processing techniques.";

// Configure Normalize Text:
// Operation: Lemmatization
// Advanced Model: "en_core_web_sm" (smaller, faster model)
// The Advanced Model overrides the Language selection, so Language can be left empty

// After Normalize Text
const result = msg.result;

// Or use a custom trained model
// Advanced Model: "en_core_web_trf" (transformer-based, most accurate)

Tips for Effective Use

  1. Choosing the Right Operation

    • Lemmatization: When accuracy matters, semantic analysis, content quality
    • Stemming: When speed matters, search indexing, large-scale processing
    • POS Tagging: When you need grammatical context, filtering by word type
  2. Language Selection

    • Always select the correct language for your content
    • Multi-language workflows require separate normalization per language
    • Check if your language is supported before processing
  3. Performance Considerations

    • First run downloads models (one-time delay)
    • Lemmatization and POS tagging slower than stemming
    • Process long texts in batches if needed
  4. Result Handling

    • Lemmatization produces lowercase text
    • Stemming may produce non-words (this is normal)
    • POS tagging adds tags with "/" separator
  5. Integration Workflows

    • Normalize → Count Occurrences: Handle word variations
    • Normalize → Frequency Analysis: Better keyword extraction
    • Normalize → TF-IDF: More accurate term importance
    • POS Tag → Filter → Analyze: Extract specific word types

Common Errors and Solutions

Error: "Language and Advanced Model can not be empty at the same time"

Cause: No language model has been selected.

Solution: Select a language from the Language dropdown or provide a model name in the Advanced Model field.

Error: "Text cannot be empty"

Cause: Empty or null text provided.

Solution: Ensure msg.text contains actual text content before the node runs.
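
If empty input is possible in your flow, a small guard before the node avoids this error. A minimal sketch; the msg.skipNormalization flag is a hypothetical name used for routing, not part of the node:

// Hypothetical guard placed before Normalize Text
// Skips normalization when there is no usable text, preventing "Text cannot be empty"
if (!msg.text || typeof msg.text !== "string" || msg.text.trim().length === 0) {
  msg.skipNormalization = true;   // hypothetical flag: route around the node with a condition/switch step
} else {
  msg.skipNormalization = false;
}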

Error: "Unsupported Operation"

Cause: Invalid operation value or empty selection.

Solution: Select one of the three operations: Lemmatization, Stemming, or POS Tagging.

Issue: Model Download Taking Long Time

Cause: Large language models (500MB+) are being downloaded on first use.

Solution:

  • This is normal for first run per language
  • Be patient, models are cached afterward
  • Consider using Advanced Model with smaller models (e.g., "en_core_web_sm")

Issue: Stemming Produces Strange Results

Cause: Stemming uses algorithmic rules that may create non-words.

Solution:

  • This is expected behavior for stemming
  • Use lemmatization if you need valid dictionary words
  • Stemming is still useful for grouping variations consistently

Issue: POS Tags Not as Expected

Cause: Model may misidentify some words, especially in ambiguous contexts.

Solution:

  • This is a limitation of NLP models
  • Use larger, more accurate models via Advanced Model
  • Consider context when interpreting tags

Understanding Operations

Lemmatization Details

  • Algorithm: Dictionary and morphological analysis
  • Speed: Slower (requires full NLP pipeline)
  • Accuracy: High
  • Output: Valid words
  • Use Case: "running, runs, ran" → "run"

Stemming Details

  • Algorithm: Rule-based suffix stripping
  • Speed: Fast (simple string operations)
  • Accuracy: Moderate
  • Output: May produce non-words
  • Use Case: "running, runs, runner" → "run"

POS Tagging Details

  • Algorithm: Statistical model prediction
  • Speed: Slower (requires full NLP pipeline)
  • Accuracy: High (90%+ for most languages)
  • Output: word/TAG format
  • Use Case: Grammatical analysis, filtering

Common POS Tags

  • NOUN: Noun (person, place, thing)
  • VERB: Verb (action, state)
  • ADJ: Adjective (describes noun)
  • ADV: Adverb (describes verb, adjective)
  • PRON: Pronoun (I, you, he, she)
  • DET: Determiner (a, an, the)
  • ADP: Adposition (in, on, at)
  • CCONJ: Coordinating conjunction (and, or, but)
  • PUNCT: Punctuation
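
As a quick sanity check of a tagged result, you can tally how often each tag appears. A minimal sketch, assuming msg.result holds the word/TAG output shown in Example 3:

// Count how many times each POS tag occurs in the tagged output
const tagCounts = {};
msg.result.split(' ').forEach(token => {
  const pos = token.split('/')[1];   // take the TAG part of word/TAG
  if (pos) tagCounts[pos] = (tagCounts[pos] || 0) + 1;
});
msg.tagCounts = tagCounts;
console.log("Tag distribution:", JSON.stringify(tagCounts));
// e.g. { ADJ: 2, NOUN: 2, VERB: 1, ADV: 2, CCONJ: 1, PUNCT: 1 } for the Example 3 sentence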

Performance Considerations

  • Lemmatization: 100-500 words/second (depends on language and model)
  • Stemming: 10,000+ words/second
  • POS Tagging: 100-500 words/second

Model Download Times (first run only):

  • Large models: 2-5 minutes
  • Subsequent runs: No download needed

Memory Usage:

  • Models in memory: 200-500MB per language
  • Models are cached locally

Recommendations:

  • Use stemming for very large datasets (millions of words)
  • Use lemmatization for accuracy-critical applications
  • Batch process documents if memory is limited

Related Nodes

  • Count Occurrences - Count normalized keywords to handle variations
  • Frequency Analysis - Better results with normalized input
  • TF-IDF Analysis - More accurate with normalized text
  • Cluster Keywords - Group normalized keywords for better clustering