
Normalize Text

Normalizes text using lemmatization, stemming, or POS (Part-of-Speech) tagging operations. This preprocessing node helps standardize text before analysis by reducing words to their base forms or adding grammatical context.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Waits in seconds before executing the node.
  • Delay After (sec) - Waits in seconds after executing the node.
  • Continue On Error - Automation will continue regardless of any error. The default value is false.
info

If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.

Inputs

  • Text - Text content to normalize using the selected operation. Can be a string containing articles, documents, or any text data.

Options

  • Operation - Select the text normalization operation. Options:

    • Lemmatization - Reduces words to their dictionary base form (lemma) using linguistic analysis. Example: "running" → "run", "better" → "good"
    • Stemming - Reduces words to their root form using algorithmic rules. Example: "running" → "run", "better" → "better"
    • POS Tagging - Adds part-of-speech tags to each word. Example: "running fast" → "running/VERB fast/ADV"

    Default: Lemmatization.

  • Language - Language model to use for lemmatization and POS tagging. Supported languages: Catalan, Chinese, Croatian, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Japanese, Korean, Lithuanian, Macedonian, Norwegian Bokmål, Polish, Portuguese, Romanian, Russian, Slovenian, Spanish, Swedish, Turkish, Ukrainian. Default: English.

    Note: Stemming uses a different set of language codes internally.

  • Advanced Model - You can enter a spaCy model name for advanced use cases. This option will override the Language choice. Leave empty to use the Language selection.

Output

  • Lemmatized Text - The normalized text result (despite the name, this contains the result of whichever operation is selected).
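
A minimal sketch of reading the output in a downstream step, assuming the node's result is mapped to msg.result as in the examples on this page (the property name in your flow may differ):

// After Normalize Text
// Assumption: the output is mapped to msg.result, as in the examples below
const normalized = msg.result;   // lemmatized, stemmed, or POS-tagged text, depending on Operation
msg.text = normalized;           // reuse it as input for downstream text nodes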

How It Works

The Normalize Text node uses NLP libraries to standardize text:

Lemmatization

  • Uses spaCy language models for linguistic analysis
  • Considers word context and meaning
  • Produces valid dictionary words
  • More accurate but slower than stemming
  • Best for: semantic analysis, content understanding

Stemming

  • Uses NLTK's Snowball Stemmer
  • Applies algorithmic rules to strip word endings
  • Faster but may produce non-words
  • Best for: quick preprocessing, search applications

POS Tagging

  • Uses spaCy language models to identify word roles
  • Adds tags like NOUN, VERB, ADJ, ADV
  • Useful for grammatical analysis and filtering
  • Best for: understanding text structure, filtering by word type

Language Model Information

The node automatically downloads required language models on first use. Models are cached locally for subsequent runs.

  • Model Size: Large models (~500MB per language) for high accuracy
  • First Run: May take time to download models
  • Subsequent Runs: Fast, uses cached models

Practical Examples

Example 1: Basic Lemmatization for Keyword Matching

Normalize text before counting keywords to handle word variations:

// Original text with various word forms
msg.text = "The company is optimizing their processes. They optimized performance and will optimize efficiency.";

// Configure Normalize Text:
// Operation: Lemmatization
// Language: English

// After Normalize Text
msg.normalizedText = msg.result;
// Result: "the company be optimize their process . they optimize performance and will optimize efficiency ."

// Now count keywords on normalized text
msg.text = msg.normalizedText;
msg.keywords = ["optimize"];

// After Count Occurrences
console.log(`"optimize" appears ${msg.counts.optimize} times`); // Will count all variations

Example 2: Stemming for Fast Text Processing

Use stemming for quick preprocessing in large-scale analysis:

// Content to analyze
msg.text = "Running businesses requires careful planning. The runner runs daily to maintain fitness.";

// Configure Normalize Text:
// Operation: Stemming
// Language: English (uses 'english' stemmer internally)

// After Normalize Text
console.log("Stemmed:", msg.result);
// Result: "run busi requir care plan . the runner run daili to maintain fit ."

// Note: "running", "runner", "runs" all become "run"
// "businesses" becomes "busi" (not a real word, but consistent)

Example 3: POS Tagging for Content Quality Analysis

Analyze text structure and extract specific word types:

msg.text = "Artificial intelligence revolutionizes modern technology rapidly and efficiently.";

// Configure Normalize Text:
// Operation: POS Tagging
// Language: English

// After Normalize Text
const tagged = msg.result;
// Result: "artificial/ADJ intelligence/NOUN revolutionizes/VERB modern/ADJ technology/NOUN rapidly/ADV and/CCONJ efficiently/ADV ./PUNCT"

// Parse and analyze
const words = tagged.split(' ');
msg.wordTypes = {
  nouns: [],
  verbs: [],
  adjectives: [],
  adverbs: []
};

words.forEach(word => {
  const [text, pos] = word.split('/');
  if (pos === 'NOUN') msg.wordTypes.nouns.push(text);
  if (pos === 'VERB') msg.wordTypes.verbs.push(text);
  if (pos === 'ADJ') msg.wordTypes.adjectives.push(text);
  if (pos === 'ADV') msg.wordTypes.adverbs.push(text);
});

console.log("Nouns:", msg.wordTypes.nouns.join(", "));
console.log("Verbs:", msg.wordTypes.verbs.join(", "));
console.log("Adjectives:", msg.wordTypes.adjectives.join(", "));
console.log("Adverbs:", msg.wordTypes.adverbs.join(", "));

Example 4: Multi-language Lemmatization

Process content in different languages:

// Spanish content
msg.text = "Los desarrolladores están desarrollando nuevas aplicaciones. El desarrollo es rápido.";

// Configure Normalize Text:
// Operation: Lemmatization
// Language: Spanish

// After Normalize Text
console.log("Spanish normalized:", msg.result);
// "desarrolladores", "desarrollando", "desarrollo" → "desarrollar"

// Turkish content
msg.text = "Yazılımcılar yeni uygulamalar geliştiriyor. Geliştirme süreci hızlı.";

// Configure Normalize Text:
// Operation: Lemmatization
// Language: Turkish

// After Normalize Text
console.log("Turkish normalized:", msg.result);

Example 5: Preprocessing for Better Frequency Analysis

Normalize text before frequency analysis to group word variations:

// Collect documents
const rawDocuments = [
  {
    name: "Doc 1",
    text: "Machine learning algorithms are learning from data. Deep learning is a subset."
  },
  {
    name: "Doc 2",
    text: "Learning machines can learn patterns. The learned patterns help predictions."
  }
];

// Normalize each document
msg.normalizedDocuments = [];

for (let doc of rawDocuments) {
  msg.text = doc.text;

  // Configure Normalize Text:
  // Operation: Lemmatization
  // Language: English

  // After Normalize Text
  msg.normalizedDocuments.push({
    name: doc.name,
    text: msg.result
  });
}

// Now run Frequency Analysis on normalized documents
msg.documents = msg.normalizedDocuments;

// After Frequency Analysis
console.log("Keywords from normalized text:");
msg.results.rows.forEach(kw => {
  console.log(`${kw.Keyword}: ${kw['Total Count']}`);
});
// "learn" will include all forms: learning, learn, learned, learns

Example 6: Extract Only Nouns for Topic Modeling

Use POS tagging to extract nouns as topics:

msg.text = "Cloud computing services enable scalable infrastructure. Modern applications utilize distributed systems.";

// Configure Normalize Text:
// Operation: POS Tagging
// Language: English

// After Normalize Text
const tagged = msg.result;

// Extract only nouns
const words = tagged.split(' ');
msg.topics = words
  .filter(word => word.includes('/NOUN'))
  .map(word => word.split('/')[0])
  .filter(word => word.length > 3); // Filter short words

console.log("Topics (nouns):", msg.topics.join(", "));
// Result: "cloud, computing, services, infrastructure, applications, systems"

// Use these nouns as keywords for content analysis
msg.keywords = msg.topics;

Example 7: Comparing Normalization Methods

See the difference between lemmatization and stemming:

const testText = "The children are playing with beautiful toys in the running water.";

// First: Lemmatization
msg.text = testText;
// Operation: Lemmatization
// After Normalize Text
const lemmatized = msg.result;

// Second: Stemming
msg.text = testText;
// Operation: Stemming
// After Normalize Text
const stemmed = msg.result;

console.log("Original:", testText);
console.log("Lemmatized:", lemmatized);
// "the child be play with beautiful toy in the run water ."
console.log("Stemmed:", stemmed);
// "the children ar play with beauti toy in the run water ."

// Lemmatization: "children" → "child" (correct form)
// Stemming: "children" → "children" (unchanged)
// Lemmatization: "beautiful" → "beautiful" (unchanged, already base form)
// Stemming: "beautiful" → "beauti" (not a real word)

Example 8: Advanced Model Usage

Use specific spaCy models for specialized needs:

msg.text = "Advanced natural language processing techniques.";

// Configure Normalize Text:
// Operation: Lemmatization
// Advanced Model: "en_core_web_sm" (smaller, faster model)
// The Advanced Model overrides the Language selection, so Language can be left empty

// After Normalize Text
const result = msg.result;

// Or use a custom trained model
// Advanced Model: "en_core_web_trf" (transformer-based, most accurate)

Tips for Effective Use

  1. Choosing the Right Operation

    • Lemmatization: When accuracy matters, semantic analysis, content quality
    • Stemming: When speed matters, search indexing, large-scale processing
    • POS Tagging: When you need grammatical context, filtering by word type
  2. Language Selection

    • Always select the correct language for your content
    • Multi-language workflows require separate normalization per language
    • Check if your language is supported before processing
  3. Performance Considerations

    • First run downloads models (one-time delay)
    • Lemmatization and POS tagging slower than stemming
    • Process long texts in batches if needed
  4. Result Handling

    • Lemmatization produces lowercase text
    • Stemming may produce non-words (this is normal)
    • POS tagging adds tags with "/" separator
  5. Integration Workflows

    • Normalize → Count Occurrences: Handle word variations
    • Normalize → Frequency Analysis: Better keyword extraction
    • Normalize → TF-IDF: More accurate term importance
    • POS Tag → Filter → Analyze: Extract specific word types

Common Errors and Solutions

Error: "Language and Advanced Model can not be empty at the same time"

Cause: No language model has been selected.

Solution: Select a language from the Language dropdown or provide a model name in the Advanced Model field.

Error: "Text cannot be empty"

Cause: Empty or null text provided.

Solution: Ensure msg.text contains actual text content before the node runs.
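
If empty input is possible in your flow, a small guard before the node avoids this error. A minimal sketch; the msg.skipNormalization flag is a hypothetical name used for routing, not part of the node:

// Hypothetical guard placed before Normalize Text
// Skips normalization when there is no usable text, preventing "Text cannot be empty"
if (!msg.text || typeof msg.text !== "string" || msg.text.trim().length === 0) {
  msg.skipNormalization = true;   // hypothetical flag: route around the node with a condition/switch step
} else {
  msg.skipNormalization = false;
}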

Error: "Unsupported Operation"

Cause: Invalid operation value or empty selection.

Solution: Select one of the three operations: Lemmatization, Stemming, or POS Tagging.

Issue: Model Download Taking Long Time

Cause: Large language models (500MB+) are being downloaded on first use.

Solution:

  • This is normal for first run per language
  • Be patient, models are cached afterward
  • Consider using Advanced Model with smaller models (e.g., "en_core_web_sm")

Issue: Stemming Produces Strange Results

Cause: Stemming uses algorithmic rules that may create non-words.

Solution:

  • This is expected behavior for stemming
  • Use lemmatization if you need valid dictionary words
  • Stemming is still useful for grouping variations consistently

Issue: POS Tags Not as Expected

Cause: Model may misidentify some words, especially in ambiguous contexts.

Solution:

  • This is a limitation of NLP models
  • Use larger, more accurate models via Advanced Model
  • Consider context when interpreting tags

Understanding Operations

Lemmatization Details

  • Algorithm: Dictionary and morphological analysis
  • Speed: Slower (requires full NLP pipeline)
  • Accuracy: High
  • Output: Valid words
  • Use Case: "running, runs, ran" → "run"

Stemming Details

  • Algorithm: Rule-based suffix stripping
  • Speed: Fast (simple string operations)
  • Accuracy: Moderate
  • Output: May produce non-words
  • Use Case: "running, runs, runner" → "run"

POS Tagging Details

  • Algorithm: Statistical model prediction
  • Speed: Slower (requires full NLP pipeline)
  • Accuracy: High (90%+ for most languages)
  • Output: word/TAG format
  • Use Case: Grammatical analysis, filtering

Common POS Tags

  • NOUN: Noun (person, place, thing)
  • VERB: Verb (action, state)
  • ADJ: Adjective (describes noun)
  • ADV: Adverb (describes verb, adjective)
  • PRON: Pronoun (I, you, he, she)
  • DET: Determiner (a, an, the)
  • ADP: Adposition (in, on, at)
  • CCONJ: Coordinating conjunction (and, or, but)
  • PUNCT: Punctuation
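
As a quick sanity check of a tagged result, you can tally how often each tag appears. A minimal sketch, assuming msg.result holds the word/TAG output shown in Example 3:

// Count how many times each POS tag occurs in the tagged output
const tagCounts = {};
msg.result.split(' ').forEach(token => {
  const pos = token.split('/')[1];   // take the TAG part of word/TAG
  if (pos) tagCounts[pos] = (tagCounts[pos] || 0) + 1;
});
msg.tagCounts = tagCounts;
console.log("Tag distribution:", JSON.stringify(tagCounts));
// e.g. { ADJ: 2, NOUN: 2, VERB: 1, ADV: 2, CCONJ: 1, PUNCT: 1 } for the Example 3 sentence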

Performance Considerations

  • Lemmatization: 100-500 words/second (depends on language and model)
  • Stemming: 10,000+ words/second
  • POS Tagging: 100-500 words/second

Model Download Times (first run only):

  • Large models: 2-5 minutes
  • Subsequent runs: No download needed

Memory Usage:

  • Models in memory: 200-500MB per language
  • Models are cached locally

Recommendations:

  • Use stemming for very large datasets (millions of words)
  • Use lemmatization for accuracy-critical applications
  • Batch process documents if memory is limited

Related Nodes

  • Count Occurrences - Count normalized keywords to handle variations
  • Frequency Analysis - Better results with normalized input
  • TF-IDF Analysis - More accurate with normalized text
  • Cluster Keywords - Group normalized keywords for better clustering