Get Document Info

Get metadata and information about a document without fully parsing its content. This node provides a quick way to inspect document properties, check file support, and validate documents before processing.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Waits in seconds before executing the node.
  • Delay After (sec) - Waits in seconds after executing the node.
  • Continue On Error - Automation will continue regardless of any error. The default value is false.
Note: If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.

Inputs

  • File Path - Path to the document file to inspect.

Output

  • Document Info - Object containing file metadata and properties:
    • filename - Name of the file
    • extension - File extension (e.g., ".pdf", ".docx")
    • size - File size in bytes
    • sizeFormatted - Human-readable file size (e.g., "1.5 MB")
    • mimeType - MIME type of the file
    • fileType - Descriptive file type (e.g., "PDF Document")
    • isSupported - Whether the format is supported by the package
    • (PDF only) pageCount - Number of pages
    • (PDF only) title - Document title from metadata
    • (PDF only) author - Document author from metadata
    • (PDF only) subject - Document subject from metadata
    • (PDF only) creator - PDF creator application
    • (PDF only) isEncrypted - Whether the PDF is password-protected
    • (DOCX only) paragraphCount - Number of paragraphs
    • (DOCX only) tableCount - Number of tables
  • Is Supported - Boolean indicating whether the file format is supported for processing.
  • File Type - String describing the detected file type.
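The relationship between size and sizeFormatted can be illustrated with a small helper. This is only a sketch: formatSize is a hypothetical name, and the 1024-based units and one-decimal rounding are assumptions inferred from the "1.5 MB" and "1.0 MB" examples on this page, not the node's documented behavior.

```javascript
// Hypothetical helper; the node's actual rounding and unit rules may differ.
function formatSize(bytes) {
  const units = ["B", "KB", "MB", "GB", "TB"];
  let value = bytes;
  let unitIndex = 0;
  // Divide by 1024 until the value fits the current unit.
  while (value >= 1024 && unitIndex < units.length - 1) {
    value /= 1024;
    unitIndex += 1;
  }
  // Whole bytes need no decimals; larger units get one decimal place.
  const digits = unitIndex === 0 ? 0 : 1;
  return `${value.toFixed(digits)} ${units[unitIndex]}`;
}

console.log(formatSize(1048576)); // "1.0 MB"
console.log(formatSize(1572864)); // "1.5 MB"
```

For threshold checks, compare against the numeric size field; sizeFormatted is better suited to log and warning messages.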

Supported File Types

The node recognizes the following file types:

Extension      File Type
.pdf           PDF Document
.docx          Word Document
.doc           Word Document (Legacy)
.pptx          PowerPoint Presentation
.ppt           PowerPoint Presentation (Legacy)
.xlsx          Excel Spreadsheet
.xls           Excel Spreadsheet (Legacy)
.txt           Plain Text
.md            Markdown
.html, .htm    HTML Document
.xml           XML Document
.json          JSON File
.csv           CSV File
.tsv           TSV File
.rtf           Rich Text Format
.odt           OpenDocument Text
.epub          EPUB Book
.eml           Email Message
.msg           Outlook Message
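If the same mapping is needed inside a flow, for example to label files before the node runs, the list above can be expressed as a lookup table. The following is a sketch only: describeFileType is a hypothetical helper, the case-insensitive match is an assumption, and "Unknown" is a placeholder because the node's label for unrecognized extensions is not documented here.

```javascript
// Extension-to-type pairs copied from the list above.
const FILE_TYPES = new Map([
  [".pdf", "PDF Document"],
  [".docx", "Word Document"],
  [".doc", "Word Document (Legacy)"],
  [".pptx", "PowerPoint Presentation"],
  [".ppt", "PowerPoint Presentation (Legacy)"],
  [".xlsx", "Excel Spreadsheet"],
  [".xls", "Excel Spreadsheet (Legacy)"],
  [".txt", "Plain Text"],
  [".md", "Markdown"],
  [".html", "HTML Document"],
  [".htm", "HTML Document"],
  [".xml", "XML Document"],
  [".json", "JSON File"],
  [".csv", "CSV File"],
  [".tsv", "TSV File"],
  [".rtf", "Rich Text Format"],
  [".odt", "OpenDocument Text"],
  [".epub", "EPUB Book"],
  [".eml", "Email Message"],
  [".msg", "Outlook Message"],
]);

// Hypothetical helper: returns the lowercased extension and its description.
function describeFileType(filePath) {
  // Take the last path segment, accepting both "/" and "\" separators.
  const name = filePath.split(/[\\/]/).pop();
  const dot = name.lastIndexOf(".");
  // No dot, or only a leading dot (e.g. ".gitignore"), means no extension.
  const extension = dot > 0 ? name.slice(dot).toLowerCase() : "";
  // "Unknown" is a placeholder label, not a documented value.
  const fileType = FILE_TYPES.get(extension) ?? "Unknown";
  return { extension, fileType };
}

console.log(describeFileType("/documents/report.pdf"));
```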

How It Works

The Get Document Info node provides lightweight document inspection. When executed, the node:

  1. Validates the provided file path
  2. Retrieves basic file system information (size, name, extension)
  3. Determines MIME type and file format
  4. Checks if the format is supported by the Document Processor package
  5. For PDFs, extracts metadata including page count, title, author, and encryption status
  6. For DOCX files, extracts paragraph and table counts
  7. Returns a comprehensive document information object
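Steps 1 to 4 can be approximated in plain Node.js as shown below. This is a sketch under stated assumptions, not the node's implementation: getBasicInfo and MIME_TYPES are hypothetical names, the MIME table is a small illustrative subset, the fallback MIME type is assumed, and treating "has a known extension" as "supported" is an assumed rule. Steps 5 and 6 require a PDF or DOCX parsing library and are left out.

```javascript
const fs = require("fs");
const path = require("path");

// Illustrative subset only; the package's real table is not documented here.
const MIME_TYPES = {
  ".pdf": "application/pdf",
  ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  ".txt": "text/plain",
  ".json": "application/json",
  ".csv": "text/csv",
};

// Hypothetical stand-in for steps 1-4 and 7 of the list above.
function getBasicInfo(filePath) {
  // Step 1: validate the provided path.
  if (!filePath) throw new Error("File path is required");
  // Step 2: file system information (statSync throws if the file is missing).
  const stats = fs.statSync(filePath);
  const extension = path.extname(filePath).toLowerCase();
  // Step 3: MIME type, with an assumed generic fallback.
  const mimeType = MIME_TYPES[extension] || "application/octet-stream";
  // Step 4: assumed support rule based on the table above.
  const isSupported = extension in MIME_TYPES;
  // Step 7: return the collected information.
  return {
    filename: path.basename(filePath),
    extension,
    size: stats.size,
    mimeType,
    isSupported,
  };
}
```

Calling getBasicInfo("/documents/report.pdf") on an existing file returns an object with the same basic fields as Document Info, minus the PDF-specific and DOCX-specific properties.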

Requirements

  • Valid file path to any file (doesn't need to be a supported document)
  • Read access to the file

Error Handling

The node will return specific errors in the following cases:

  • Empty or missing file path
  • File not found at the specified path
  • Insufficient permissions to read the file

Note: The node will NOT error on unsupported file types; it will simply return isSupported: false.
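The three error cases can also be checked up front, for instance to show a friendlier message before the node runs. The sketch below assumes a Node.js environment with the fs module; checkFilePath is a hypothetical helper, and the wording of the permissions message is an assumption (only the first two messages appear under Common Errors and Solutions).

```javascript
const fs = require("fs");

// Hypothetical pre-check: returns a reason string, or null when the file is readable.
function checkFilePath(filePath) {
  // Case 1: empty or missing file path.
  if (typeof filePath !== "string" || filePath.trim() === "") {
    return "File path is required";
  }
  // Case 2: nothing exists at the specified path.
  if (!fs.existsSync(filePath)) {
    return "File does not exist";
  }
  // Case 3: the file exists but cannot be read by the current user.
  try {
    fs.accessSync(filePath, fs.constants.R_OK);
  } catch (err) {
    return "Insufficient permissions to read the file"; // assumed wording
  }
  return null;
}

console.log(checkFilePath("")); // "File path is required"
```

An unsupported extension is deliberately not treated as a failure here, which matches the node's behavior of returning isSupported: false instead of raising an error.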

Usage Examples

Example 1: Validate Document Before Processing

// Input
msg.filePath = "/documents/report.pdf";

// Get Document Info output
msg.documentInfo = {
  "filename": "report.pdf",
  "extension": ".pdf",
  "size": 1048576,
  "sizeFormatted": "1.0 MB",
  "mimeType": "application/pdf",
  "fileType": "PDF Document",
  "isSupported": true,
  "pageCount": 15,
  "title": "Annual Report 2024",
  "author": "John Doe",
  "isEncrypted": false
};
msg.isSupported = true;
msg.fileType = "PDF Document";

// Use with Switch node to route based on support

Example 2: Filter Large Documents

// After Get Document Info

// Check file size before processing
if (msg.documentInfo.size > 10 * 1024 * 1024) { // 10 MB
  msg.isLargeFile = true;
  msg.warning = `Large file detected: ${msg.documentInfo.sizeFormatted}`;
}

// Use Switch node to handle large files differently

Example 3: Document Validation Pipeline

Get Document Info → Switch (check isSupported)
├─ Yes → Read Document → Process
└─ No → Log Error → Skip

Example 4: Batch Document Analysis

// File paths to iterate over with a Loop node
const fileList = [
  "/docs/file1.pdf",
  "/docs/file2.docx",
  "/docs/file3.txt"
];

// Inside the loop, after Get Document Info has run for the current file:
// create the results array on the first iteration, then append one entry per file
if (!msg.documentStats) msg.documentStats = [];

msg.documentStats.push({
  filename: msg.documentInfo.filename,
  type: msg.documentInfo.fileType,
  size: msg.documentInfo.sizeFormatted,
  pageCount: msg.documentInfo.pageCount || 'N/A', // pageCount is PDF only
  supported: msg.documentInfo.isSupported
});

// Result: Array of document statistics
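Once the loop has finished, the collected array can be reduced to a short inventory report. The sketch below uses made-up sample entries in the shape built by Example 4; summarizeStats is a hypothetical helper, and the "Unknown" type on the unsupported entry is an assumed placeholder value.

```javascript
// Sample data in the shape produced by Example 4 (values are illustrative).
const documentStats = [
  { filename: "file1.pdf", type: "PDF Document", size: "1.0 MB", pageCount: 15, supported: true },
  { filename: "file2.docx", type: "Word Document", size: "240.0 KB", pageCount: "N/A", supported: true },
  { filename: "file3.bin", type: "Unknown", size: "12.0 KB", pageCount: "N/A", supported: false },
];

// Hypothetical helper: counts entries per type and lists unsupported files.
function summarizeStats(stats) {
  const summary = { total: stats.length, supported: 0, unsupported: [], byType: {}, totalPages: 0 };
  for (const entry of stats) {
    summary.byType[entry.type] = (summary.byType[entry.type] || 0) + 1;
    if (entry.supported) {
      summary.supported += 1;
    } else {
      summary.unsupported.push(entry.filename);
    }
    // pageCount is 'N/A' for entries without a page count, so add only real numbers.
    if (typeof entry.pageCount === "number") {
      summary.totalPages += entry.pageCount;
    }
  }
  return summary;
}

const summary = summarizeStats(documentStats);
console.log(summary);
```

Because Example 4 stores the formatted size string rather than the byte count, a total size cannot be computed from this array; store msg.documentInfo.size as well if that figure is needed.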

Example 5: Check for Encrypted PDFs

// After Get Document Info

if (msg.documentInfo.extension === '.pdf' && msg.documentInfo.isEncrypted) {
  msg.error = `Cannot process encrypted PDF: ${msg.documentInfo.filename}`;
  msg.requiresPassword = true;
  // Route to error handler or password input flow
}

Example 6: Estimate Processing Time

// After Get Document Info

const sizeInMB = msg.documentInfo.size / (1024 * 1024);
const pageCount = msg.documentInfo.pageCount || 1;

// Rough estimation
let estimatedSeconds = 0;
if (msg.documentInfo.extension === '.pdf') {
  estimatedSeconds = pageCount * 2; // ~2 seconds per page
} else if (msg.documentInfo.extension === '.docx') {
  estimatedSeconds = (msg.documentInfo.paragraphCount || 100) * 0.01;
}

msg.estimatedProcessingTime = `~${Math.ceil(estimatedSeconds)} seconds`;

Example 7: Document Type Routing

// After Get Document Info

const extension = msg.documentInfo.extension;

if (extension === '.pdf') {
  msg.processingRoute = 'pdf-pipeline';
} else if (['.docx', '.doc'].includes(extension)) {
  msg.processingRoute = 'word-pipeline';
} else if (['.csv', '.xlsx'].includes(extension)) {
  msg.processingRoute = 'spreadsheet-pipeline';
} else if (msg.documentInfo.isSupported) {
  msg.processingRoute = 'generic-pipeline';
} else {
  msg.processingRoute = 'unsupported';
}

// Use Switch node with msg.processingRoute

Tips for Effective Use

  1. Pre-Processing Validation:

    • Always check isSupported before attempting to process documents
    • Validate file size to prevent memory issues with very large files
    • Check for encrypted PDFs which require password handling
  2. Performance Optimization:

    • Get Document Info is much faster than full document parsing
    • Use it to filter or prioritize documents before heavy processing
    • Check page count to estimate processing time
  3. Metadata Extraction:

    • PDF metadata (title, author, subject) can be useful for categorization
    • Not all PDFs have metadata filled in
    • DOCX metadata is more consistently available
  4. File Type Detection:

    • The node detects based on file extension and MIME type
    • Extension is the primary detection method
    • Renamed files may be misdetected if extension doesn't match content
  5. Error Recovery:

    • Use this node in Try-Catch blocks when processing user-uploaded files
    • Check isSupported to provide user-friendly error messages
    • Validate before heavy processing to save resources
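Tip 4 notes that renamed files may be misdetected because the extension is the primary detection method. One way to guard against this in a flow is to compare the first bytes of the file with a known signature. The sketch below is an optional extra check, not something the node is documented to do: matchesSignature and SIGNATURES are hypothetical names. The signatures themselves are well known: PDF files normally begin with "%PDF-", and DOCX files are ZIP containers that begin with "PK" followed by the bytes 0x03 0x04.

```javascript
// Leading bytes for two formats; other formats are left out of this sketch.
const SIGNATURES = new Map([
  [".pdf", [0x25, 0x50, 0x44, 0x46, 0x2d]], // "%PDF-"
  [".docx", [0x50, 0x4b, 0x03, 0x04]],      // "PK\x03\x04" (ZIP container)
]);

// Hypothetical check: true on match, false on mismatch,
// null when no signature is known for the extension.
function matchesSignature(extension, headerBytes) {
  const expected = SIGNATURES.get(extension.toLowerCase());
  if (!expected) return null;
  if (headerBytes.length < expected.length) return false;
  return expected.every((byte, i) => headerBytes[i] === byte);
}

console.log(matchesSignature(".pdf", Buffer.from("%PDF-1.7")));       // true
console.log(matchesSignature(".pdf", Buffer.from("PK\u0003\u0004"))); // false
```

How the header bytes are read depends on the environment; in Node.js, reading the first few bytes of the file into a Buffer is sufficient. A false result for a ".pdf" file suggests the extension does not match the content.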

Common Errors and Solutions

Error: "File path is required"

Cause: No file path was provided to the node.
Solution: Ensure the File Path input is connected or set with a valid path.

Error: "File does not exist"

Cause: The specified file path is invalid or the file doesn't exist.
Solution:

  • Verify the file path is correct
  • Check that the file exists at that location
  • Ensure the file hasn't been moved or deleted
  • Use absolute paths to avoid ambiguity

Warning: isSupported is false

Cause: The file format is not supported by the Document Processor package.
Solution:

  • Check the Supported File Types table above
  • Convert the file to a supported format
  • Use alternative processing methods for unsupported formats

Issue: Missing PDF Metadata

Cause: The PDF doesn't have embedded metadata or PyMuPDF is not available.
Solution:

  • This is normal for some PDFs (metadata is optional)
  • The node will still provide basic file information
  • pageCount may still be available through alternative methods

Issue: pageCount not available

Cause: pageCount is only reported for PDF files. DOCX files report paragraphCount and tableCount instead.
Solution:

  • This is expected for formats like DOCX, TXT, MD, and HTML
  • Use the Extract Text node and estimate pages from character count if needed
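Estimating pages from character count can be as simple as the following sketch. estimatePageCount is a hypothetical helper, and the figure of 3000 characters per page is an assumption; real page density varies with layout, font size, and language.

```javascript
// Assumed average number of characters on one page.
const CHARS_PER_PAGE = 3000;

// Hypothetical helper: rounds up so that any non-empty text counts as at least one page.
function estimatePageCount(text, charsPerPage = CHARS_PER_PAGE) {
  if (!text) return 0;
  return Math.ceil(text.length / charsPerPage);
}

console.log(estimatePageCount("a".repeat(7500))); // 3
```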

Integration with Other Nodes

Document Validation Flow

Get Document Info → Switch (isSupported)
├─ True → Switch (check size) → Read Document
└─ False → Log Warning → Skip Document

Batch Processing with Filtering

Loop (files) → Get Document Info → Switch (filter criteria) → Process Document → Collect Results

Document Classification

Get Document Info → Function (analyze metadata) → Route to Processor → Database Insert
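The Function step in this flow could look like the sketch below. Everything in it is illustrative: classifyDocument and RULES are hypothetical names, and the categories and keywords are placeholders to be replaced with ones that match your documents. Since title and subject are PDF-only and often empty, the filename is included as a fallback.

```javascript
// Hypothetical keyword rules; the first matching rule wins.
const RULES = [
  { category: "finance", keywords: ["invoice", "annual report", "budget"] },
  { category: "legal", keywords: ["contract", "agreement"] },
];

// Builds one lowercase search string from the available metadata fields.
function classifyDocument(documentInfo) {
  const haystack = [documentInfo.title, documentInfo.subject, documentInfo.filename]
    .filter(Boolean) // drop missing fields
    .join(" ")
    .toLowerCase();
  const match = RULES.find((rule) => rule.keywords.some((keyword) => haystack.includes(keyword)));
  return match ? match.category : "uncategorized";
}

console.log(classifyDocument({ filename: "report.pdf", title: "Annual Report 2024" })); // "finance"
```

Simple substring matching is easy to reason about but can produce false positives, so keep keywords specific.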

Quality Check Pipeline

Upload File → Get Document Info → Validate (size, type, encryption) → Accept/Reject
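The Validate step of this pipeline can be sketched as a single function that collects every reason for rejection. validateDocument is a hypothetical helper, and the 10 MB default is an assumption borrowed from the threshold used in Example 2; adjust it to your own limits.

```javascript
// Assumed default limit, matching the threshold in Example 2.
const MAX_SIZE_BYTES = 10 * 1024 * 1024;

// Hypothetical validator: returns an accept/reject decision with reasons.
function validateDocument(documentInfo, maxSizeBytes = MAX_SIZE_BYTES) {
  const reasons = [];
  if (!documentInfo.isSupported) {
    reasons.push(`Unsupported file type: ${documentInfo.fileType}`);
  }
  if (documentInfo.size > maxSizeBytes) {
    reasons.push(`File too large: ${documentInfo.sizeFormatted}`);
  }
  // isEncrypted is only present for PDFs; a missing value is treated as not encrypted.
  if (documentInfo.isEncrypted === true) {
    reasons.push("PDF is password-protected");
  }
  return { accepted: reasons.length === 0, reasons };
}

const result = validateDocument({
  filename: "report.pdf",
  size: 1048576,
  sizeFormatted: "1.0 MB",
  fileType: "PDF Document",
  isSupported: true,
  isEncrypted: false,
});
console.log(result.accepted); // true
```

Returning all reasons at once, rather than stopping at the first failure, lets the Accept/Reject branch give the uploader complete feedback in one message.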

Performance Considerations

  • Speed: Very fast - only reads file metadata, not full content
  • Memory: Minimal memory usage
  • Use Case: Ideal for validation before heavy processing
  • Batch Operations: Can quickly scan hundreds of files

Comparison with Other Nodes

Feature      Get Document Info       Read Document                Extract Text
Speed        Very Fast               Slow                         Medium
Memory       Minimal                 High                         Medium
Full Parse   No                      Yes                          Yes
Metadata     Yes                     Limited                      No
Content      No                      Structured elements          Plain text
Best For     Validation, filtering   Structure-aware processing   Text extraction

Best Practices

  1. Always Validate: Use this node before processing unknown documents
  2. Check File Size: Prevent memory issues by checking size first
  3. Handle Unsupported: Provide clear feedback when isSupported is false
  4. Leverage Metadata: Use PDF/DOCX metadata for document classification
  5. Batch Scanning: Quickly scan large document collections for inventory
  6. Error Prevention: Catch encrypted PDFs and unsupported formats early