Frequency Analysis

Performs frequency analysis on documents to extract and rank keywords based on occurrence counts. This node provides a simpler alternative to TF-IDF analysis, focusing on raw frequency counts to identify commonly used terms across documents.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Waits the specified number of seconds before executing the node.
  • Delay After (sec) - Waits the specified number of seconds after executing the node.
  • Continue On Error - The automation continues regardless of any error. The default value is false.
info

If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.

Inputs

  • Documents - An array of JSON objects, each containing name and text fields. Example format:

    [
      {"name": "Document 1", "text": "Your document text here"},
      {"name": "Document 2", "text": "Another document text"}
    ]
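Because a malformed entry can cause the analysis to fail or silently skip a document, it can help to validate the array before passing it to the node. The sketch below is illustrative and not part of the node itself; the helper name `validateDocuments` is hypothetical.

```javascript
// Hypothetical helper (not part of the node): check that every entry
// is an object with non-empty string `name` and `text` fields.
function validateDocuments(documents) {
  if (!Array.isArray(documents)) return false;
  return documents.every(
    (doc) =>
      doc !== null &&
      typeof doc === "object" &&
      typeof doc.name === "string" &&
      typeof doc.text === "string" &&
      doc.text.trim().length > 0
  );
}
```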

Options

  • Top N - Specifies the number of top terms to retrieve from the frequency analysis. Default: 25.
  • Min N-gram - Sets the lower boundary of the range of n-values for different n-gram sizes to be extracted. For instance, min_ngram=1 means the analysis will include unigrams (single words). Default: 1.
  • Max N-gram - Sets the upper boundary for the range of n-values for different n-gram sizes. A max_ngram=3 would mean the analysis includes trigrams (phrases of three words) at most. Default: 3.
  • Min DF - Minimum document frequency threshold as a percentage. Terms appearing in fewer documents than this percentage will be ignored. A value of 0.01 (1%) means ignoring terms that appear in less than 1% of your documents. Default: 0.01.
  • Max DF - Maximum document frequency threshold as a percentage. Terms appearing in more documents than this percentage will be filtered out. A value of 0.5 (50%) filters out terms that appear in more than half of your documents. Default: 0.5.
  • Stop Words Language - Language for stop word filtering. Stop words are ignored during the process of vectorizing text, as these words usually carry less meaningful information for analysis. Supported languages: Arabic, Azerbaijani, Basque, Bengali, Catalan, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hinglish, Hungarian, Indonesian, Italian, Kazakh, Nepali, Norwegian, Portuguese, Romanian, Russian, Slovene, Spanish, Swedish, Tajik, Turkish. Default: English.
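Conceptually, Min DF and Max DF compare each term's document frequency (the fraction of documents containing it) against the two thresholds. This sketch shows that behavior in simplified form; the node's internal implementation may differ, and the function name is an assumption for illustration.

```javascript
// Simplified sketch of Min DF / Max DF filtering.
// `termDocCounts` maps each term to the number of documents containing it.
function filterByDocumentFrequency(termDocCounts, totalDocs, minDf, maxDf) {
  const kept = {};
  for (const [term, docCount] of Object.entries(termDocCounts)) {
    const df = docCount / totalDocs; // fraction of documents with this term
    if (df >= minDf && df <= maxDf) kept[term] = docCount;
  }
  return kept;
}
```

With the defaults (Min DF 0.01, Max DF 0.5) over 100 documents, a term must appear in at least 1 and at most 50 documents to survive.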

Output

  • Frequency Results - An object containing the analysis results with the following structure:

    {
      "columns": ["Keyword", "Document 1", "Document 2", "N-Gram", "Document Count", "Total Count", "Total Score"],
      "rows": [
        {
          "Keyword": "keyword phrase",
          "Document 1": 5,
          "Document 2": 3,
          "N-Gram": 2,
          "Document Count": 2,
          "Total Count": 8,
          "Total Score": 32
        }
      ]
    }

How It Works

Frequency Analysis extracts keywords based on how often they appear across documents:

  1. Tokenizes documents into n-grams (single words, two-word phrases, three-word phrases based on your settings)
  2. Filters stop words based on the selected language
  3. Counts occurrences of each term in each document
  4. Applies document frequency filtering to remove very common and very rare terms
  5. Validates n-grams by ensuring they exist exactly in at least one document
  6. Computes a Total Score by multiplying N-Gram length, Document Count, and Total Count
  7. Ranks keywords by Total Score to identify the most frequently used terms
  8. Returns Top N keywords with their counts across all documents
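The scoring in steps 6-7 can be sketched directly from the formula above (Total Score = N-Gram length × Document Count × Total Count). This is a minimal illustration of the ranking, not the node's actual code:

```javascript
// Sketch of steps 6-7: compute Total Score for each row and rank descending.
function rankByTotalScore(rows) {
  return rows
    .map((row) => ({
      ...row,
      "Total Score": row["N-Gram"] * row["Document Count"] * row["Total Count"],
    }))
    .sort((a, b) => b["Total Score"] - a["Total Score"]);
}
```

For the sample row in the Output section, the score works out to 2 × 2 × 8 = 32, matching the example.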

Frequency Analysis vs TF-IDF Analysis

Feature     | Frequency Analysis                   | TF-IDF Analysis
------------|--------------------------------------|-----------------------------------------------
Metric      | Raw counts                           | Weighted importance scores
Best For    | Finding common terms, basic analysis | Finding distinctive terms, advanced analysis
Speed       | Faster                               | Slightly slower
Complexity  | Simpler to understand                | More sophisticated
Use Case    | General keyword extraction           | Competitive analysis, distinguishing documents

When to use Frequency Analysis:

  • Quick keyword extraction from similar documents
  • Finding most commonly discussed topics
  • Simpler analysis needs
  • When you want actual occurrence counts

When to use TF-IDF Analysis:

  • Comparing diverse documents
  • Finding unique keywords per document
  • More sophisticated SEO analysis
  • When you need importance scores

Practical Examples

Example 1: Basic Keyword Extraction

Extract most common keywords from blog posts:

// Prepare documents
msg.documents = [
  {
    name: "Blog Post 1",
    text: "Content marketing is essential for digital success. Good content attracts customers."
  },
  {
    name: "Blog Post 2",
    text: "Digital marketing strategies include content creation and social media marketing."
  },
  {
    name: "Blog Post 3",
    text: "Successful content marketing requires planning and consistent content delivery."
  }
];

// Configure Frequency Analysis node
// Top N: 10
// Min N-gram: 1
// Max N-gram: 2
// Stop Words Language: English

// After Frequency Analysis
const results = msg.results.rows;

console.log("Top Keywords by Frequency:");
results.forEach((keyword, index) => {
  console.log(`${index + 1}. "${keyword.Keyword}" - Total: ${keyword['Total Count']}, Documents: ${keyword['Document Count']}`);
});

// Extract just the keywords
msg.topKeywords = results.map(r => r.Keyword);

Example 2: Finding Common Themes Across Documents

Identify recurring themes in customer feedback:

// Customer feedback from surveys
msg.documents = msg.surveyResponses.map((response, index) => ({
  name: `Response ${index + 1}`,
  text: response.feedback
}));

// Configure Frequency Analysis:
// Top N: 15
// Min N-gram: 1
// Max N-gram: 3
// Min DF: 0.1 (appear in at least 10% of responses)
// Max DF: 0.8 (filter very common terms)

// After Frequency Analysis
const keywords = msg.results.rows;

// Group by theme based on Total Score
msg.themes = {
  high_priority: [],
  medium_priority: [],
  low_priority: []
};

keywords.forEach(keyword => {
  if (keyword['Total Score'] > 50) {
    msg.themes.high_priority.push(keyword.Keyword);
  } else if (keyword['Total Score'] > 20) {
    msg.themes.medium_priority.push(keyword.Keyword);
  } else {
    msg.themes.low_priority.push(keyword.Keyword);
  }
});

console.log("Customer Feedback Themes:");
console.log("High Priority:", msg.themes.high_priority.join(", "));
console.log("Medium Priority:", msg.themes.medium_priority.join(", "));

Example 3: Content Audit

Analyze existing content to understand keyword focus:

// Collect all published articles
const articles = msg.publishedArticles; // From database or CMS

msg.documents = articles.map(article => ({
  name: article.title,
  text: article.content
}));

// Configure Frequency Analysis:
// Top N: 30
// Min N-gram: 2
// Max N-gram: 3
// Stop Words Language: English

// After Frequency Analysis
const results = msg.results;

// Create audit report
msg.contentAudit = {
  totalArticles: articles.length,
  topKeywords: results.rows.slice(0, 10).map(r => ({
    keyword: r.Keyword,
    totalCount: r['Total Count'],
    appearsIn: r['Document Count'],
    coverage: (r['Document Count'] / articles.length * 100).toFixed(1) + '%'
  })),
  recommendations: []
};

// Identify gaps
results.rows.forEach(keyword => {
  const coverage = keyword['Document Count'] / articles.length;
  if (coverage < 0.3 && keyword['Total Count'] > 5) {
    msg.contentAudit.recommendations.push(
      `Consider creating more content about "${keyword.Keyword}" (only in ${(coverage * 100).toFixed(0)}% of articles)`
    );
  }
});

console.log("Content Audit Report:");
console.log(JSON.stringify(msg.contentAudit, null, 2));

Example 4: Competitive Keyword Analysis

Compare keyword usage across competitor websites:

// Scrape competitor pages
const competitors = [
  { name: "Competitor A", pages: msg.competitorAPages },
  { name: "Competitor B", pages: msg.competitorBPages },
  { name: "Competitor C", pages: msg.competitorCPages }
];

msg.competitorAnalysis = [];

for (let competitor of competitors) {
  // Prepare documents for this competitor
  msg.documents = competitor.pages.map((page, index) => ({
    name: `${competitor.name} Page ${index + 1}`,
    text: page.content
  }));

  // Run Frequency Analysis for each competitor
  // After Frequency Analysis
  msg.competitorAnalysis.push({
    name: competitor.name,
    topKeywords: msg.results.rows.slice(0, 10),
    totalPages: competitor.pages.length
  });
}

// Compare competitors
console.log("Competitor Keyword Comparison:");
msg.competitorAnalysis.forEach(comp => {
  console.log(`\n${comp.name} (${comp.totalPages} pages):`);
  comp.topKeywords.forEach((kw, index) => {
    console.log(`  ${index + 1}. ${kw.Keyword}: ${kw['Total Count']} times`);
  });
});

// Find unique keywords per competitor
msg.uniqueToCompetitor = {};
msg.competitorAnalysis.forEach(comp => {
  const allOtherKeywords = msg.competitorAnalysis
    .filter(c => c.name !== comp.name)
    .flatMap(c => c.topKeywords.map(k => k.Keyword));

  const unique = comp.topKeywords
    .filter(kw => !allOtherKeywords.includes(kw.Keyword))
    .map(kw => kw.Keyword);

  if (unique.length > 0) {
    msg.uniqueToCompetitor[comp.name] = unique;
  }
});

Example 5: Multi-language Content Analysis

Analyze content in different languages:

// Turkish content analysis
msg.documents = [
  {
    name: "Türkçe Makale 1",
    text: "SEO optimizasyonu dijital pazarlama için çok önemlidir. İçerik pazarlama stratejileri başarılı olmalıdır."
  },
  {
    name: "Türkçe Makale 2",
    text: "Dijital pazarlama ve içerik üretimi modern işletmeler için gereklidir."
  }
];

// Configure Frequency Analysis:
// Stop Words Language: Turkish
// Top N: 15
// Min N-gram: 1
// Max N-gram: 2

// After Frequency Analysis
const turkishKeywords = msg.results.rows;

console.log("Türkçe İçerik Anahtar Kelimeler:");
turkishKeywords.forEach(kw => {
  console.log(`${kw.Keyword}: ${kw['Total Count']}`);
});

Example 6: Trend Analysis Over Time

Track keyword frequency changes over time:

// Organize articles by time period
const periods = [
  { name: "Q1 2024", articles: msg.q1Articles },
  { name: "Q2 2024", articles: msg.q2Articles },
  { name: "Q3 2024", articles: msg.q3Articles },
  { name: "Q4 2024", articles: msg.q4Articles }
];

msg.trendAnalysis = [];
const trackKeywords = ["ai", "machine learning", "automation", "cloud computing"];

for (let period of periods) {
  msg.documents = period.articles.map((article, index) => ({
    name: `Article ${index + 1}`,
    text: article.content
  }));

  // Run Frequency Analysis
  // After Frequency Analysis
  const results = msg.results.rows;

  // Extract frequency for tracked keywords
  const periodData = {
    period: period.name,
    keywords: {}
  };

  trackKeywords.forEach(keyword => {
    const found = results.find(r => r.Keyword.toLowerCase() === keyword.toLowerCase());
    periodData.keywords[keyword] = found ? found['Total Count'] : 0;
  });

  msg.trendAnalysis.push(periodData);
}

// Display trends
console.log("Keyword Trends Over Time:");
trackKeywords.forEach(keyword => {
  console.log(`\n${keyword}:`);
  msg.trendAnalysis.forEach(period => {
    console.log(`  ${period.period}: ${period.keywords[keyword]}`);
  });
});

Tips for Effective Use

  1. Document Preparation

    • Clean text data (remove HTML, excessive whitespace)
    • Ensure documents are in the same language
    • Combine very short documents for better results
  2. N-gram Settings

    • Use 1-2 for general topics and quick analysis
    • Use 1-3 for detailed phrase extraction
    • Higher max n-gram for industry-specific terminology
  3. Document Frequency Tuning

    • Adjust Min DF based on corpus size
    • Lower Max DF (0.3-0.4) to filter common terms
    • Higher Min DF for large document sets
  4. Stop Words

    • Always select the correct language
    • Stop word filtering significantly improves results
    • Consider domain-specific terms that act like stop words
  5. Result Interpretation

    • Higher Total Score = more important keyword
    • Document Count shows keyword reach
    • N-Gram value indicates phrase length
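The document-preparation tip (strip HTML, collapse excessive whitespace) can be sketched as a small cleaning helper. The regex-based tag stripping below is a simplification, adequate for plain article text but not a full HTML parser; the function name is an assumption:

```javascript
// Illustrative cleanup before analysis: drop HTML tags, collapse whitespace.
function cleanText(raw) {
  return raw
    .replace(/<[^>]*>/g, " ") // replace HTML tags with spaces
    .replace(/\s+/g, " ")     // collapse runs of whitespace (incl. newlines)
    .trim();
}
```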

Common Errors and Solutions

Issue: Too Many Common Words

Cause: Max DF threshold is too high or stop words not filtering correctly.

Solution:

  • Lower Max DF to 0.3 or 0.4
  • Verify correct language is selected
  • Check that documents are properly cleaned

Issue: No Keywords Found

Cause: Min DF too high for document set size or documents too short.

Solution:

  • Lower Min DF value (try 0.01)
  • Ensure you have at least 10 documents
  • Verify documents contain sufficient text

Issue: Results Show Single Characters or Fragments

Cause: Text may contain special characters or poor tokenization.

Solution:

  • Clean text before analysis
  • Remove special characters and numbers
  • Use Min N-gram = 1, Max N-gram = 2 for better word extraction

Issue: Missing Expected Keywords

Cause: Keywords might be stop words or filtered by frequency thresholds.

Solution:

  • Verify keyword isn't a stop word in selected language
  • Lower Min DF if keyword appears in few documents
  • Increase Max DF if keyword appears in many documents

Performance Considerations

  • Faster than TF-IDF Analysis
  • Processing time scales with:
    • Number of documents
    • Document length
    • Max N-gram value
    • Top N value
  • Recommended limits:
    • 100+ documents: increase Min DF
    • Very long documents: process in batches
    • Large Top N values: may impact memory
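For the batching recommendation above, a simple helper can split a large document set into fixed-size chunks, each of which is fed through the node separately. This is a generic sketch; `batchDocuments` is a hypothetical name:

```javascript
// Split a large document array into batches of at most `batchSize`
// so each Frequency Analysis run stays within reasonable limits.
function batchDocuments(documents, batchSize) {
  const batches = [];
  for (let i = 0; i < documents.length; i += batchSize) {
    batches.push(documents.slice(i, i + batchSize));
  }
  return batches;
}
```

Note that batching changes document-frequency statistics: Min DF and Max DF are evaluated per batch, so results across batches are not directly comparable without normalization.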

Related Nodes

  • TF-IDF Analysis - More sophisticated keyword importance analysis
  • Count Occurrences - Count specific keywords in text
  • Cluster Keywords - Group extracted keywords by similarity
  • Normalize Text - Preprocess text before frequency analysis
  • Gap Analysis - Compare keyword frequencies across texts