Normalize Text
Normalizes text using lemmatization, stemming, or POS (Part-of-Speech) tagging operations. This preprocessing node helps standardize text before analysis by reducing words to their base forms or adding grammatical context.
Common Properties
- Name - The custom name of the node.
- Color - The custom color of the node.
- Delay Before (sec) - Waits in seconds before executing the node.
- Delay After (sec) - Waits in seconds after executing the node.
- Continue On Error - Automation will continue regardless of any error. The default value is false.
Note: If the Continue On Error property is set to true, no error is caught when the project is executed, even if a Catch node is used.
Inputs
- Text - Text content to normalize using the selected operation. Can be a string containing articles, documents, or any text data.
Options
- Operation - Select the text normalization operation. Options:
  - Lemmatization - Reduces words to their dictionary base form (lemma) using linguistic analysis. Example: "running" → "run", "better" → "good"
  - Stemming - Reduces words to their root form using algorithmic rules. Example: "running" → "run", "better" → "better"
  - POS Tagging - Adds part-of-speech tags to each word. Example: "running fast" → "running/VERB fast/ADV"
  Default: Lemmatization.
- Language - Language model to use for lemmatization and POS tagging. Supported languages: Catalan, Chinese, Croatian, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Japanese, Korean, Lithuanian, Macedonian, Norwegian Bokmål, Polish, Portuguese, Romanian, Russian, Slovenian, Spanish, Swedish, Turkish, Ukrainian. Default: English.
  Note: Stemming uses a different set of language codes internally.
- Advanced Model - Optionally enter a spaCy model name for advanced use cases. When set, it overrides the Language selection; leave it empty to use the Language choice.
Output
- Lemmatized Text - The normalized text result (despite the name, this contains the result of whichever operation is selected).
How It Works
The Normalize Text node uses NLP libraries to standardize text:
Lemmatization
- Uses spaCy language models for linguistic analysis
- Considers word context and meaning
- Produces valid dictionary words
- More accurate but slower than stemming
- Best for: semantic analysis, content understanding
Stemming
- Uses NLTK's Snowball Stemmer
- Applies algorithmic rules to strip word endings
- Faster but may produce non-words
- Best for: quick preprocessing, search applications
POS Tagging
- Uses spaCy language models to identify word roles
- Adds tags like NOUN, VERB, ADJ, ADV
- Useful for grammatical analysis and filtering
- Best for: understanding text structure, filtering by word type
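To make the stemming trade-off above concrete, here is a toy suffix-stripping sketch. It is illustrative only, not the Snowball algorithm this node actually uses, but it shows why rule-based stripping is fast yet can produce non-words:

```javascript
// Illustrative only: a naive suffix-stripping stemmer (NOT the Snowball
// algorithm used by the node) showing why stemming is fast but can
// produce non-words.
function naiveStem(word) {
  // Longest suffixes first, so "businesses" matches "nesses" before "es".
  const suffixes = ["nesses", "ingly", "ing", "ness", "ies", "ed", "es", "s", "ly"];
  const w = word.toLowerCase();
  for (const suf of suffixes) {
    // Keep at least a 3-letter stem to avoid over-stripping.
    if (w.endsWith(suf) && w.length - suf.length >= 3) {
      return w.slice(0, -suf.length);
    }
  }
  return w;
}

console.log(naiveStem("businesses")); // "busi" -- not a real word, but consistent
console.log(naiveStem("running"));    // "runn" -- real stemmers also handle doubled consonants
```

Real stemmers apply many more rules (measure conditions, doubled-consonant handling), but the core idea is the same: cheap string operations with no dictionary lookup, which is what makes stemming so much faster than lemmatization.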
Language Model Information
The node automatically downloads required language models on first use. Models are cached locally for subsequent runs.
- Model Size: Large models (~500MB per language) for high accuracy
- First Run: May take time to download models
- Subsequent Runs: Fast, uses cached models
Practical Examples
Example 1: Basic Lemmatization for Keyword Matching
Normalize text before counting keywords to handle word variations:
```javascript
// Original text with various word forms
msg.text = "The company is optimizing their processes. They optimized performance and will optimize efficiency.";

// Configure Normalize Text:
//   Operation: Lemmatization
//   Language: English

// After Normalize Text
msg.normalizedText = msg.result;
// Result: "the company be optimize their process . they optimize performance and will optimize efficiency ."

// Now count keywords on normalized text
msg.text = msg.normalizedText;
msg.keywords = ["optimize"];

// After Count Occurrences
console.log(`"optimize" appears ${msg.counts.optimize} times`); // Will count all variations
```
Example 2: Stemming for Fast Text Processing
Use stemming for quick preprocessing in large-scale analysis:
```javascript
// Content to analyze
msg.text = "Running businesses requires careful planning. The runner runs daily to maintain fitness.";

// Configure Normalize Text:
//   Operation: Stemming
//   Language: English (uses the 'english' stemmer internally)

// After Normalize Text
console.log("Stemmed:", msg.result);
// Result: "run busi requir care plan . the runner run daili to maintain fit ."
// Note: "running" and "runs" both become "run", while "runner" is left unchanged
// "businesses" becomes "busi" (not a real word, but consistent)
```
Example 3: POS Tagging for Content Quality Analysis
Analyze text structure and extract specific word types:
```javascript
msg.text = "Artificial intelligence revolutionizes modern technology rapidly and efficiently.";

// Configure Normalize Text:
//   Operation: POS Tagging
//   Language: English

// After Normalize Text
const tagged = msg.result;
// Result: "artificial/ADJ intelligence/NOUN revolutionizes/VERB modern/ADJ technology/NOUN rapidly/ADV and/CCONJ efficiently/ADV ./PUNCT"

// Parse and analyze
const words = tagged.split(' ');
msg.wordTypes = {
  nouns: [],
  verbs: [],
  adjectives: [],
  adverbs: []
};

words.forEach(word => {
  const [text, pos] = word.split('/');
  if (pos === 'NOUN') msg.wordTypes.nouns.push(text);
  if (pos === 'VERB') msg.wordTypes.verbs.push(text);
  if (pos === 'ADJ') msg.wordTypes.adjectives.push(text);
  if (pos === 'ADV') msg.wordTypes.adverbs.push(text);
});

console.log("Nouns:", msg.wordTypes.nouns.join(", "));
console.log("Verbs:", msg.wordTypes.verbs.join(", "));
console.log("Adjectives:", msg.wordTypes.adjectives.join(", "));
console.log("Adverbs:", msg.wordTypes.adverbs.join(", "));
```
Example 4: Multi-language Lemmatization
Process content in different languages:
```javascript
// Spanish content
msg.text = "Los desarrolladores están desarrollando nuevas aplicaciones. El desarrollo es rápido.";
// Configure Normalize Text:
//   Operation: Lemmatization
//   Language: Spanish

// After Normalize Text
console.log("Spanish normalized:", msg.result);
// "desarrolladores", "desarrollando", "desarrollo" → "desarrollar"

// Turkish content
msg.text = "Yazılımcılar yeni uygulamalar geliştiriyor. Geliştirme süreci hızlı.";
// Configure Normalize Text:
//   Operation: Lemmatization
//   Language: Turkish

// After Normalize Text
console.log("Turkish normalized:", msg.result);
```
Example 5: Preprocessing for Better Frequency Analysis
Normalize text before frequency analysis to group word variations:
```javascript
// Collect documents
const rawDocuments = [
  {
    name: "Doc 1",
    text: "Machine learning algorithms are learning from data. Deep learning is a subset."
  },
  {
    name: "Doc 2",
    text: "Learning machines can learn patterns. The learned patterns help predictions."
  }
];

// Normalize each document
msg.normalizedDocuments = [];
for (let doc of rawDocuments) {
  msg.text = doc.text;
  // Configure Normalize Text:
  //   Operation: Lemmatization
  //   Language: English

  // After Normalize Text
  msg.normalizedDocuments.push({
    name: doc.name,
    text: msg.result
  });
}

// Now run Frequency Analysis on normalized documents
msg.documents = msg.normalizedDocuments;

// After Frequency Analysis
console.log("Keywords from normalized text:");
msg.results.rows.forEach(kw => {
  console.log(`${kw.Keyword}: ${kw['Total Count']}`);
});
// "learn" will include all forms: learning, learn, learned, learns
```
Example 6: Extract Only Nouns for Topic Modeling
Use POS tagging to extract nouns as topics:
```javascript
msg.text = "Cloud computing services enable scalable infrastructure. Modern applications utilize distributed systems.";
// Configure Normalize Text:
//   Operation: POS Tagging
//   Language: English

// After Normalize Text
const tagged = msg.result;

// Extract only nouns
const words = tagged.split(' ');
msg.topics = words
  .filter(word => word.includes('/NOUN'))
  .map(word => word.split('/')[0])
  .filter(word => word.length > 3); // Filter short words

console.log("Topics (nouns):", msg.topics.join(", "));
// Result: "cloud, computing, services, infrastructure, applications, systems"

// Use these nouns as keywords for content analysis
msg.keywords = msg.topics;
```
Example 7: Comparing Normalization Methods
See the difference between lemmatization and stemming:
```javascript
const testText = "The children are playing with beautiful toys in the running water.";

// First: Lemmatization
msg.text = testText;
//   Operation: Lemmatization
// After Normalize Text
const lemmatized = msg.result;

// Second: Stemming
msg.text = testText;
//   Operation: Stemming
// After Normalize Text
const stemmed = msg.result;

console.log("Original:", testText);
console.log("Lemmatized:", lemmatized);
// "the child be play with beautiful toy in the run water ."
console.log("Stemmed:", stemmed);
// "the children ar play with beauti toy in the run water ."

// Lemmatization: "children" → "child" (correct form)
// Stemming: "children" → "children" (unchanged)
// Lemmatization: "beautiful" → "beautiful" (unchanged, already base form)
// Stemming: "beautiful" → "beauti" (not a real word)
```
Example 8: Advanced Model Usage
Use specific spaCy models for specialized needs:
```javascript
msg.text = "Advanced natural language processing techniques.";
// Configure Normalize Text:
//   Operation: Lemmatization
//   Advanced Model: "en_core_web_sm" (smaller, faster model)
//   Note: when Advanced Model is set, the Language selection is ignored

// After Normalize Text
const result = msg.result;

// Or use a transformer-based model for the highest accuracy:
//   Advanced Model: "en_core_web_trf"
```
Tips for Effective Use
- Choosing the Right Operation
  - Lemmatization: When accuracy matters, semantic analysis, content quality
  - Stemming: When speed matters, search indexing, large-scale processing
  - POS Tagging: When you need grammatical context, filtering by word type
- Language Selection
  - Always select the correct language for your content
  - Multi-language workflows require separate normalization per language
  - Check if your language is supported before processing
- Performance Considerations
  - First run downloads models (one-time delay)
  - Lemmatization and POS tagging are slower than stemming
  - Process long texts in batches if needed
- Result Handling
  - Lemmatization produces lowercase text
  - Stemming may produce non-words (this is normal)
  - POS tagging adds tags with a "/" separator
- Integration Workflows
  - Normalize → Count Occurrences: Handle word variations
  - Normalize → Frequency Analysis: Better keyword extraction
  - Normalize → TF-IDF: More accurate term importance
  - POS Tag → Filter → Analyze: Extract specific word types
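As a minimal sketch of the Normalize → Count Occurrences pattern, a token-level count over the node's lemmatized output (which is lowercase and space-separated, as shown in Example 1) might look like this; `countLemma` is a hypothetical helper, not part of the node:

```javascript
// Sketch of the Normalize -> Count Occurrences pattern: count one lemma
// in lemmatized output, which is lowercase and space-separated.
function countLemma(normalizedText, lemma) {
  return normalizedText.split(/\s+/).filter(token => token === lemma).length;
}

// Lemmatized output from Example 1:
const normalized = "the company be optimize their process . they optimize performance and will optimize efficiency .";
console.log(countLemma(normalized, "optimize")); // 3
```

Because every surface form ("optimizing", "optimized", "optimize") has already been collapsed to one lemma, a single exact-match count covers all variations.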
Common Errors and Solutions
Error: "Language and Advanced Model can not be empty at the same time"
Cause: No language model has been selected.
Solution: Select a language from the Language dropdown or provide a model name in Advanced Model field.
Error: "Text cannot be empty"
Cause: Empty or null text provided.
Solution: Ensure msg.text contains actual text content before the node runs.
Error: "Unsupported Operation"
Cause: Invalid operation value or empty selection.
Solution: Select one of the three operations: Lemmatization, Stemming, or POS Tagging.
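A defensive guard run before the node can surface the empty-text and invalid-operation cases early with clearer context. This is a sketch; `validateNormalizeInput` is a hypothetical helper, not part of the node:

```javascript
// Hypothetical pre-flight check before the Normalize Text node, mirroring
// the node's own "Text cannot be empty" and "Unsupported Operation" errors.
const VALID_OPERATIONS = ["Lemmatization", "Stemming", "POS Tagging"];

function validateNormalizeInput(msg, operation) {
  if (typeof msg.text !== "string" || msg.text.trim().length === 0) {
    throw new Error("Text cannot be empty");
  }
  if (!VALID_OPERATIONS.includes(operation)) {
    throw new Error("Unsupported Operation: " + operation);
  }
  return msg;
}
```

Failing fast in a script node before Normalize Text makes the workflow's error handling explicit instead of relying on the node's runtime errors.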
Issue: Model Download Taking Long Time
Cause: Large language models (500MB+) are being downloaded on first use.
Solution:
- This is normal for first run per language
- Be patient, models are cached afterward
- Consider using Advanced Model with smaller models (e.g., "en_core_web_sm")
Issue: Stemming Produces Strange Results
Cause: Stemming uses algorithmic rules that may create non-words.
Solution:
- This is expected behavior for stemming
- Use lemmatization if you need valid dictionary words
- Stemming is still useful for grouping variations consistently
Issue: POS Tags Not as Expected
Cause: Model may misidentify some words, especially in ambiguous contexts.
Solution:
- This is a limitation of NLP models
- Use larger, more accurate models via Advanced Model
- Consider context when interpreting tags
Understanding Operations
Lemmatization Details
- Algorithm: Dictionary and morphological analysis
- Speed: Slower (requires full NLP pipeline)
- Accuracy: High
- Output: Valid words
- Use Case: "running, runs, ran" → "run"
Stemming Details
- Algorithm: Rule-based suffix stripping
- Speed: Fast (simple string operations)
- Accuracy: Moderate
- Output: May produce non-words
- Use Case: "running, runs, runner" → "run"
POS Tagging Details
- Algorithm: Statistical model prediction
- Speed: Slower (requires full NLP pipeline)
- Accuracy: High (90%+ for most languages)
- Output: word/TAG format
- Use Case: Grammatical analysis, filtering
Common POS Tags
- NOUN: Noun (person, place, thing)
- VERB: Verb (action, state)
- ADJ: Adjective (describes noun)
- ADV: Adverb (describes verb, adjective)
- PRON: Pronoun (I, you, he, she)
- DET: Determiner (a, an, the)
- ADP: Adposition (in, on, at)
- CCONJ: Coordinating conjunction (and, or, but)
- PUNCT: Punctuation
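A small generic helper over the word/TAG format above can extract words by any set of tags (generalizing Examples 3 and 6). This is a sketch that assumes the words themselves contain no "/":

```javascript
// Collect words from word/TAG output whose tag is in a wanted set.
// Assumes tokens never contain "/" themselves.
function filterByPos(taggedText, tags) {
  const wanted = new Set(tags);
  return taggedText
    .split(/\s+/)
    .map(token => token.split("/"))
    .filter(([, pos]) => wanted.has(pos))
    .map(([word]) => word);
}

const tagged = "artificial/ADJ intelligence/NOUN revolutionizes/VERB rapidly/ADV ./PUNCT";
console.log(filterByPos(tagged, ["NOUN", "VERB"])); // -> ["intelligence", "revolutionizes"]
```

Passing a different tag set (e.g. `["ADJ", "ADV"]`) reuses the same helper for descriptive-word extraction.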
Performance Considerations
- Lemmatization: 100-500 words/second (depends on language and model)
- Stemming: 10,000+ words/second
- POS Tagging: 100-500 words/second
Model Download Times (first run only):
- Large models: 2-5 minutes
- Subsequent runs: No download needed
Memory Usage:
- Models in memory: 200-500MB per language
- Models are cached locally
Recommendations:
- Use stemming for very large datasets (millions of words)
- Use lemmatization for accuracy-critical applications
- Batch process documents if memory is limited
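The batching recommendation above can be sketched as a simple word-count splitter; `batchText` is a hypothetical helper, and in practice you would feed each chunk through the node in turn:

```javascript
// Split a long text into chunks of at most maxWords words each, so a
// large document can be normalized one chunk at a time.
function batchText(text, maxWords) {
  const words = text.split(/\s+/).filter(Boolean);
  const batches = [];
  for (let i = 0; i < words.length; i += maxWords) {
    batches.push(words.slice(i, i + maxWords).join(" "));
  }
  return batches;
}

console.log(batchText("one two three four five six seven", 3).length); // 3
```

Note that splitting on a fixed word count can cut across sentences, which may slightly reduce lemmatization and POS accuracy at chunk boundaries; splitting on sentence boundaries instead avoids this.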
Related Nodes
- Count Occurrences - Count normalized keywords to handle variations
- Frequency Analysis - Better results with normalized input
- TF-IDF Analysis - More accurate with normalized text
- Cluster Keywords - Group normalized keywords for better clustering