
TF-IDF Analysis

Performs TF-IDF (Term Frequency-Inverse Document Frequency) analysis on a set of documents to identify important keywords. The algorithm highlights terms that are distinctive to specific documents while filtering out common terms that appear across all documents.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Waits in seconds before executing the node.
  • Delay After (sec) - Waits in seconds after executing the node.
  • Continue On Error - Automation will continue regardless of any error. The default value is false.
info

If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.

Inputs

  • Documents - An array of JSON objects, each containing name and text fields. Example format:
    [
      {"name": "Document 1", "text": "Your document text here"},
      {"name": "Document 2", "text": "Another document text"}
    ]

Options

  • Top N - Specifies the number of top terms to retrieve from the TF-IDF analysis. Default: 25.

  • Min N-gram - Sets the lower boundary of the range of n-values for different n-gram sizes to be extracted. For instance, min_ngram=1 means the analysis will include unigrams (single words). Default: 1.

  • Max N-gram - Sets the upper boundary for the range of n-values for different n-gram sizes. A max_ngram=3 would mean the analysis includes trigrams (phrases of three words) at most. Default: 3.

  • Min DF - Minimum document frequency threshold. Terms appearing in fewer documents than this percentage are ignored. For example, a value of 0.01 (1%) ignores terms that appear in less than 1% of your documents. Default: 0.01.

  • Max DF - Maximum document frequency threshold. Terms appearing in more documents than this percentage will be filtered out. A value of 0.5 (50%) filters out terms that appear in more than half of your documents, as they are likely too common to be useful for differentiation. Default: 0.5.

  • Norm - Standardizes each output to have a unit norm. Options:

    • L1 normalization - Sum of absolute values is 1
    • L2 normalization - Sum of squares of vector elements is 1 (suitable for cosine similarity)
    • No normalization - No normalization applied

    Default: L2 normalization. (A short example of L1 and L2 normalization follows the Options list.)

  • Language - Language for stop word filtering. Stop words are ignored when the text is vectorized, as these words usually carry little meaningful information for analysis (e.g., "the", "is", "in" in English). Supported languages: Catalan, Chinese, Croatian, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Japanese, Korean, Lithuanian, Macedonian, Norwegian Bokmål, Polish, Portuguese, Romanian, Russian, Slovenian, Spanish, Swedish, Turkish, Ukrainian, Hindi, Urdu. Default: English.

  • Language ISO - You can enter the ISO code of a language for advanced use cases (for example, "tr" for Turkish). This option overrides the Language choice.

  • Smooth IDF - Applies smoothing to the IDF calculation to prevent division by zero. Default: true.
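
To make the Norm option concrete, here is a small, hedged sketch (standalone JavaScript, not the node's internal code) showing how L1 and L2 normalization rescale a vector of raw scores:

// Illustrative only: how L1 and L2 normalization rescale a raw score vector.
const scores = [3, 4, 0];

// L1: divide by the sum of absolute values, so the values sum to 1.
const l1Sum = scores.reduce((s, v) => s + Math.abs(v), 0);
const l1 = scores.map(v => v / l1Sum);        // [0.4286..., 0.5714..., 0]

// L2: divide by the square root of the sum of squares, so the squares sum to 1.
const l2Norm = Math.sqrt(scores.reduce((s, v) => s + v * v, 0));
const l2 = scores.map(v => v / l2Norm);       // [0.6, 0.8, 0]

console.log(l1, l2);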

Output

  • TF-IDF Results - An object containing the analysis results with the following structure (a usage snippet follows the example):
    {
      "columns": ["Keyword", "Document 1", "Document 2", "N-Gram", "Document Count", "Total Count", "TF-IDF Score", "Total Score"],
      "rows": [
        {
          "Keyword": "keyword phrase",
          "Document 1": 0.523,
          "Document 2": 0.0,
          "N-Gram": 2,
          "Document Count": 1,
          "Total Count": 2.156,
          "TF-IDF Score": 1.523,
          "Total Score": 3.046
        }
      ]
    }
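
As a hedged usage sketch, the snippet below reads this structure to print, for each document column, the keyword with the highest per-document score. It assumes the node output is stored in msg.results (as in the examples further down) and that the result contains at least one row:

// Separate the fixed metric columns from the per-document columns.
const { columns, rows } = msg.results;
const metricColumns = ["Keyword", "N-Gram", "Document Count", "Total Count", "TF-IDF Score", "Total Score"];
const documentColumns = columns.filter(c => !metricColumns.includes(c));

// For each document, find the keyword with the highest score in that document.
documentColumns.forEach(doc => {
  const best = rows.reduce((top, row) => (row[doc] > top[doc] ? row : top), rows[0]);
  console.log(`${doc}: ${best.Keyword} (${best[doc]})`);
});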

How It Works

TF-IDF Analysis identifies keywords that are important to specific documents while filtering out common terms. The algorithm works in the following steps (a simplified code sketch follows the list):

  1. Tokenizes documents into n-grams (single words, two-word phrases, three-word phrases based on your settings)
  2. Filters stop words based on the selected language
  3. Calculates TF-IDF scores for each term in each document
  4. Applies document frequency filtering to remove very common and very rare terms
  5. Validates n-grams by ensuring they exist exactly in at least one document
  6. Computes a Total Score by multiplying N-Gram length, Document Count, and TF-IDF Score
  7. Ranks keywords by Total Score to identify the most significant terms
  8. Returns Top N keywords with their scores across all documents
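
The sketch below is a minimal, self-contained illustration of these steps for unigrams only. It assumes one common smoothed-IDF formulation, ln((1 + N) / (1 + df)) + 1, and skips stop-word filtering, document frequency thresholds, and normalization; it is not the node's actual implementation:

// Simplified TF-IDF scoring and ranking (unigrams only).
const docs = [
  { name: "Doc 1", text: "seo tips and seo tools" },
  { name: "Doc 2", text: "marketing tips for small businesses" }
];

// Step 1: tokenize each document.
const tokenized = docs.map(d => d.text.toLowerCase().split(/\s+/));
const vocab = [...new Set(tokenized.flat())];

// Steps 3-6: score each term.
const scored = vocab.map(term => {
  const counts = tokenized.map(tokens => tokens.filter(t => t === term).length);
  const docCount = counts.filter(c => c > 0).length;
  const idf = Math.log((1 + docs.length) / (1 + docCount)) + 1;   // smoothed IDF (assumed formulation)
  const tfidfScore = counts
    .map((c, i) => (c / tokenized[i].length) * idf)
    .reduce((a, b) => a + b, 0);
  const ngramLength = term.split(" ").length;                      // always 1 here
  return {
    Keyword: term,
    "Document Count": docCount,
    "TF-IDF Score": tfidfScore,
    "Total Score": ngramLength * docCount * tfidfScore             // step 6
  };
});

// Steps 7-8: rank by Total Score and keep the Top N (here, 5).
scored.sort((a, b) => b["Total Score"] - a["Total Score"]);
console.log(scored.slice(0, 5));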

Practical Examples

Example 1: Analyzing Blog Posts for SEO

Analyze multiple blog posts to identify unique keywords for each:

// Prepare documents
msg.documents = [
  {
    name: "Blog Post 1",
    text: "Best practices for SEO optimization in 2024. Learn advanced SEO techniques and strategies."
  },
  {
    name: "Blog Post 2",
    text: "Content marketing strategies for small businesses. Effective marketing tips and tools."
  }
];

// Configure TF-IDF node
// Top N: 10
// Min N-gram: 1
// Max N-gram: 2
// Min DF: 0.01
// Max DF: 0.5
// Language: English

// Access results
const keywords = msg.results.rows;
keywords.forEach(keyword => {
  console.log(`${keyword.Keyword}: Score ${keyword['Total Score']}`);
});

Example 2: Competitive Content Analysis

Compare your content against competitor pages:

msg.documents = [
  {
    name: "My Page",
    text: msg.myPageContent
  },
  {
    name: "Competitor 1",
    text: msg.competitor1Content
  },
  {
    name: "Competitor 2",
    text: msg.competitor2Content
  }
];

// After TF-IDF analysis
const results = msg.results.rows;
const myKeywords = results.filter(r => r['My Page'] > 0);
const competitorKeywords = results.filter(r => r['Competitor 1'] > 0 || r['Competitor 2'] > 0);
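
As a follow-up sketch under the same assumptions (the document names used above), keywords that score for a competitor but not for your page can be treated as potential content gaps:

// Keywords competitors score for that "My Page" does not (potential content gaps).
const contentGaps = results
  .filter(r => r['My Page'] === 0 && (r['Competitor 1'] > 0 || r['Competitor 2'] > 0))
  .sort((a, b) => b['Total Score'] - a['Total Score']);

contentGaps.forEach(r => console.log(`Gap: ${r.Keyword} (Total Score ${r['Total Score']})`));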

Example 3: Multi-language Analysis

Analyze Turkish content:

msg.documents = [
  {
    name: "Türkçe Makale 1",
    text: "SEO optimizasyonu için en iyi uygulamalar..."
  },
  {
    name: "Türkçe Makale 2",
    text: "Dijital pazarlama stratejileri..."
  }
];

// Set Language: Turkish
// or use Language ISO: "tr"

Tips for Effective Use

  1. Document Preparation

    • Clean your text data before analysis (remove HTML tags, excessive whitespace); see the cleanup sketch after this list
    • Ensure documents are in the same language
    • Longer documents generally provide better analysis results
  2. N-gram Settings

    • Use Min=1, Max=3 for general analysis to capture single words and phrases
    • Increase Max N-gram to 4 or 5 for industry-specific terminology
    • Single words (unigrams) are good for broad topics, while phrases capture specific concepts
  3. Document Frequency Tuning

    • Adjust Min DF based on corpus size: larger corpus = higher Min DF
    • Lower Max DF (0.3-0.4) for more distinctive keywords
    • Increase Max DF if too many relevant terms are being filtered
  4. Top N Selection

    • Start with 25-50 keywords for initial analysis
    • Increase for comprehensive keyword research
    • Decrease for focused topic extraction
  5. Normalization Choice

    • Use L2 for cosine similarity and general comparisons
    • Use L1 for probability-like distributions
    • Use No normalization if you need raw TF-IDF scores
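
For the document-preparation tip above, the snippet below is one possible cleanup pass that strips HTML tags and collapses whitespace before the documents reach the node. The regular expressions are a basic example rather than a full HTML sanitizer, and msg.rawPages is a hypothetical input array:

// Basic cleanup before analysis: remove HTML tags and collapse whitespace.
function cleanText(raw) {
  return raw
    .replace(/<[^>]*>/g, " ")   // strip HTML tags
    .replace(/\s+/g, " ")       // collapse runs of whitespace
    .trim();
}

// msg.rawPages is a hypothetical array of { name, html } objects.
msg.documents = msg.rawPages.map(page => ({
  name: page.name,
  text: cleanText(page.html)
}));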

Common Errors and Solutions

Error: "Language and Language ISO can not be empty at the same time"

Cause: No language has been selected.

Solution: Select a language from the Language dropdown or provide an ISO code in the Language ISO field.

Issue: Too Many Common Words in Results

Cause: Max DF threshold is too high.

Solution: Lower the Max DF value to 0.3 or 0.4 to filter out more common terms.

Issue: No Keywords Found

Cause: Min DF threshold is too high for your document set, or documents are too short.

Solution:

  • Lower the Min DF value
  • Ensure you have enough documents (at least 10 recommended)
  • Check that documents contain sufficient text content

Issue: Results Missing Expected Keywords

Cause: Keywords might be stop words or filtered by document frequency thresholds.

Solutions:

  • Verify the keyword isn't a stop word in the selected language
  • Lower Min DF if terms appear in few documents
  • Increase Max DF if terms appear in many documents
  • Check whether the exact phrase exists in your documents (a quick check is sketched below)
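
For the last point, a quick way to confirm that a phrase literally appears in your documents is a simple case-insensitive check, for example:

// Case-insensitive check for an exact phrase in each document.
const phrase = "content marketing";   // example phrase to verify
msg.documents.forEach(doc => {
  const found = doc.text.toLowerCase().includes(phrase.toLowerCase());
  console.log(`${doc.name}: ${found ? "contains" : "does not contain"} "${phrase}"`);
});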

Issue: Non-meaningful N-grams Appearing

Cause: N-grams are constructed from adjacent words but may not form meaningful phrases.

Solution: The algorithm already filters n-grams that don't exist exactly in documents. Consider:

  • Adjusting Min DF to require terms appear in more documents
  • Reducing Max N-gram value
  • Manually reviewing and filtering results based on Total Score

Performance Considerations

  • Processing time increases with:
    • Number of documents
    • Document length
    • Max N-gram value
    • Top N value
  • For large document sets (100+ documents), consider:
    • Processing in batches (see the sketch at the end of this section)
    • Increasing Min DF to reduce vocabulary size
    • Decreasing Max N-gram
  • Memory usage scales with vocabulary size and number of documents
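
For batch processing, the sketch below splits msg.documents into fixed-size batches that can be fed to the node one run at a time; the batch size of 50 is an arbitrary example value:

// Split a large document set into batches of 50 for separate runs.
const batchSize = 50;
const batches = [];
for (let i = 0; i < msg.documents.length; i += batchSize) {
  batches.push(msg.documents.slice(i, i + batchSize));
}
// Run the TF-IDF node once per batch, e.g. by looping over `batches`.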

Related Nodes

  • Frequency Analysis - Simpler alternative using raw counts instead of TF-IDF weighting
  • Cluster Keywords - Group similar keywords from TF-IDF results
  • Normalize Text - Preprocess text before analysis for better results
  • Count Occurrences - Verify keyword presence in specific documents