Get Document Info
Get metadata and information about a document without fully parsing its content. This node provides a quick way to inspect document properties, check file support, and validate documents before processing.
Common Properties
- Name - The custom name of the node.
- Color - The custom color of the node.
- Delay Before (sec) - Waits in seconds before executing the node.
- Delay After (sec) - Waits in seconds after executing the node.
- Continue On Error - Automation will continue regardless of any error. The default value is false.
If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.
Inputs
- File Path - Path to the document file to inspect.
Output
- Document Info - Object containing file metadata and properties:
  - filename - Name of the file
  - extension - File extension (e.g., ".pdf", ".docx")
  - size - File size in bytes
  - sizeFormatted - Human-readable file size (e.g., "1.5 MB")
  - mimeType - MIME type of the file
  - fileType - Descriptive file type (e.g., "PDF Document")
  - isSupported - Whether the format is supported by the package
  - (PDF only) pageCount - Number of pages
  - (PDF only) title - Document title from metadata
  - (PDF only) author - Document author from metadata
  - (PDF only) subject - Document subject from metadata
  - (PDF only) creator - PDF creator application
  - (PDF only) isEncrypted - Whether the PDF is password-protected
  - (DOCX only) paragraphCount - Number of paragraphs
  - (DOCX only) tableCount - Number of tables
- Is Supported - Boolean indicating whether the file format is supported for processing.
- File Type - String describing the detected file type.
Supported File Types
The node recognizes the following file types:
| Extension | File Type |
|---|---|
| .pdf | PDF Document |
| .docx | Word Document |
| .doc | Word Document (Legacy) |
| .pptx | PowerPoint Presentation |
| .ppt | PowerPoint Presentation (Legacy) |
| .xlsx | Excel Spreadsheet |
| .xls | Excel Spreadsheet (Legacy) |
| .txt | Plain Text |
| .md | Markdown |
| .html, .htm | HTML Document |
| .xml | XML Document |
| .json | JSON File |
| .csv | CSV File |
| .tsv | TSV File |
| .rtf | Rich Text Format |
| .odt | OpenDocument Text |
| .epub | EPUB Book |
| .eml | Email Message |
| .msg | Outlook Message |
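The table can be read as a simple extension lookup. The following is a minimal sketch of that idea, not the package's actual implementation: `FILE_TYPES` and `describeFileType` are hypothetical names, and lowercasing the extension before the lookup is an assumption.

```javascript
const path = require("path");

// Hypothetical lookup mirroring the table above; not the package's code.
const FILE_TYPES = {
  ".pdf": "PDF Document",
  ".docx": "Word Document",
  ".doc": "Word Document (Legacy)",
  ".pptx": "PowerPoint Presentation",
  ".ppt": "PowerPoint Presentation (Legacy)",
  ".xlsx": "Excel Spreadsheet",
  ".xls": "Excel Spreadsheet (Legacy)",
  ".txt": "Plain Text",
  ".md": "Markdown",
  ".html": "HTML Document",
  ".htm": "HTML Document",
  ".xml": "XML Document",
  ".json": "JSON File",
  ".csv": "CSV File",
  ".tsv": "TSV File",
  ".rtf": "Rich Text Format",
  ".odt": "OpenDocument Text",
  ".epub": "EPUB Book",
  ".eml": "Email Message",
  ".msg": "Outlook Message",
};

// Returns the descriptive type for a file name, or null when the
// extension does not appear in the table.
function describeFileType(fileName) {
  const ext = path.extname(fileName).toLowerCase(); // assumption: case-insensitive match
  return Object.prototype.hasOwnProperty.call(FILE_TYPES, ext) ? FILE_TYPES[ext] : null;
}
```

A `null` result here corresponds to the case where the node reports `isSupported: false`.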
How It Works
The Get Document Info node provides lightweight document inspection. When executed, the node:
- Validates the provided file path
- Retrieves basic file system information (size, name, extension)
- Determines MIME type and file format
- Checks if the format is supported by the Document Processor package
- For PDFs, extracts metadata including page count, title, author, and encryption status
- For DOCX files, extracts paragraph and table counts
- Returns a comprehensive document information object
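The file-system part of these steps can be sketched in plain Node.js. This is an illustration under stated assumptions, not the node's implementation: `getBasicInfo`, `formatSize`, and `KNOWN_TYPES` are hypothetical names, the type map is a small subset, and the fallback values for unknown extensions are assumptions. The PDF and DOCX metadata steps are omitted because they need a document parser.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical subset of supported extensions with their MIME types.
const KNOWN_TYPES = {
  ".pdf": { mimeType: "application/pdf", fileType: "PDF Document" },
  ".txt": { mimeType: "text/plain", fileType: "Plain Text" },
  ".json": { mimeType: "application/json", fileType: "JSON File" },
};

// Formats a byte count with one decimal place, e.g. 1048576 -> "1.0 MB".
function formatSize(bytes) {
  const units = ["B", "KB", "MB", "GB"];
  let value = bytes;
  let i = 0;
  while (value >= 1024 && i < units.length - 1) {
    value /= 1024;
    i += 1;
  }
  return i === 0 ? `${bytes} B` : `${value.toFixed(1)} ${units[i]}`;
}

// Collects name, extension, size, and type information without reading content.
function getBasicInfo(filePath) {
  if (!filePath) throw new Error("File path is required");
  const stats = fs.statSync(filePath); // throws if the file is missing
  const extension = path.extname(filePath).toLowerCase();
  const known = KNOWN_TYPES[extension];
  return {
    filename: path.basename(filePath),
    extension,
    size: stats.size,
    sizeFormatted: formatSize(stats.size),
    mimeType: known ? known.mimeType : "application/octet-stream", // assumed fallback
    fileType: known ? known.fileType : "Unknown", // assumed fallback
    isSupported: Boolean(known),
  };
}
```

Only `fs.statSync` touches the file, which is why this kind of inspection stays cheap regardless of document length.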
Requirements
- Valid file path to any file (doesn't need to be a supported document)
- Read access to the file
Error Handling
The node will return specific errors in the following cases:
- Empty or missing file path
- File not found at the specified path
- Insufficient permissions to read the file
Note: The node will NOT error on unsupported file types; it will simply return isSupported: false.
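The three error cases can be checked ahead of time in a script. The sketch below uses Node.js `fs` calls and is only an approximation of what the node does internally; `checkFilePath` is a hypothetical helper, and the wording of the permissions message is an assumption.

```javascript
const fs = require("fs");

// Returns null when the path passes all three checks, otherwise a message
// describing the first check that failed.
function checkFilePath(filePath) {
  if (typeof filePath !== "string" || filePath.trim() === "") {
    return "File path is required";
  }
  if (!fs.existsSync(filePath)) {
    return "File does not exist";
  }
  try {
    fs.accessSync(filePath, fs.constants.R_OK); // throws when the file is not readable
  } catch (err) {
    return "Insufficient permissions to read the file"; // assumed wording
  }
  return null;
}
```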
Usage Examples
Example 1: Validate Document Before Processing
// Input
msg.filePath = "/documents/report.pdf"
// Get Document Info output
msg.documentInfo = {
  "filename": "report.pdf",
  "extension": ".pdf",
  "size": 1048576,
  "sizeFormatted": "1.0 MB",
  "mimeType": "application/pdf",
  "fileType": "PDF Document",
  "isSupported": true,
  "pageCount": 15,
  "title": "Annual Report 2024",
  "author": "John Doe",
  "isEncrypted": false
}
msg.isSupported = true
msg.fileType = "PDF Document"
// Use with Switch node to route based on support
Example 2: Filter Large Documents
// After Get Document Info
// Check file size before processing
if (msg.documentInfo.size > 10 * 1024 * 1024) { // 10 MB
  msg.isLargeFile = true;
  msg.warning = `Large file detected: ${msg.documentInfo.sizeFormatted}`;
}
// Use Switch node to handle large files differently
Example 3: Document Validation Pipeline
Get Document Info → Switch (check isSupported)
├─ Yes → Read Document → Process
└─ No → Log Error → Skip
Example 4: Batch Document Analysis
// In a Loop over file paths
const fileList = [
  "/docs/file1.pdf",
  "/docs/file2.docx",
  "/docs/file3.txt"
];
// Get Document Info for each file
if (!msg.documentStats) msg.documentStats = [];
msg.documentStats.push({
  filename: msg.documentInfo.filename,
  type: msg.documentInfo.fileType,
  size: msg.documentInfo.sizeFormatted,
  pageCount: msg.documentInfo.pageCount || 'N/A',
  supported: msg.documentInfo.isSupported
});
// Result: Array of document statistics
Example 5: Check for Encrypted PDFs
// After Get Document Info
if (msg.documentInfo.extension === '.pdf' && msg.documentInfo.isEncrypted) {
  msg.error = `Cannot process encrypted PDF: ${msg.documentInfo.filename}`;
  msg.requiresPassword = true;
  // Route to error handler or password input flow
}
Example 6: Estimate Processing Time
// After Get Document Info
const sizeInMB = msg.documentInfo.size / (1024 * 1024);
const pageCount = msg.documentInfo.pageCount || 1;
// Rough estimation
let estimatedSeconds = 0;
if (msg.documentInfo.extension === '.pdf') {
  estimatedSeconds = pageCount * 2; // ~2 seconds per page
} else if (msg.documentInfo.extension === '.docx') {
  estimatedSeconds = (msg.documentInfo.paragraphCount || 100) * 0.01;
}
msg.estimatedProcessingTime = `~${Math.ceil(estimatedSeconds)} seconds`;
Example 7: Document Type Routing
// After Get Document Info
const extension = msg.documentInfo.extension;
if (extension === '.pdf') {
  msg.processingRoute = 'pdf-pipeline';
} else if (['.docx', '.doc'].includes(extension)) {
  msg.processingRoute = 'word-pipeline';
} else if (['.csv', '.xlsx'].includes(extension)) {
  msg.processingRoute = 'spreadsheet-pipeline';
} else if (msg.documentInfo.isSupported) {
  msg.processingRoute = 'generic-pipeline';
} else {
  msg.processingRoute = 'unsupported';
}
// Use Switch node with msg.processingRoute
Tips for Effective Use
- Pre-Processing Validation:
  - Always check isSupported before attempting to process documents
  - Validate file size to prevent memory issues with very large files
  - Check for encrypted PDFs, which require password handling
- Performance Optimization:
  - Get Document Info is much faster than full document parsing
  - Use it to filter or prioritize documents before heavy processing
  - Check page count to estimate processing time
- Metadata Extraction:
  - PDF metadata (title, author, subject) can be useful for categorization
  - Not all PDFs have metadata filled in
  - DOCX metadata is more consistently available
- File Type Detection:
  - The node detects based on file extension and MIME type
  - Extension is the primary detection method
  - Renamed files may be misdetected if the extension doesn't match the content
- Error Recovery:
  - Use this node in Try-Catch blocks when processing user-uploaded files
  - Check isSupported to provide user-friendly error messages
  - Validate before heavy processing to save resources
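Because detection relies mainly on the extension, a renamed file can slip through. One way to guard against that is to compare the extension with the file's leading bytes. The sketch below is an optional extra check, not something the node is documented to do: `readSignature` and `extensionMatchesContent` are hypothetical helpers, and only two signatures are covered (well-formed PDFs start with `%PDF-`, and DOCX files are ZIP archives that normally start with the bytes `PK\x03\x04`).

```javascript
const fs = require("fs");

// Reads up to `length` bytes from the start of the file.
function readSignature(filePath, length = 5) {
  const fd = fs.openSync(filePath, "r");
  try {
    const buffer = Buffer.alloc(length);
    const bytesRead = fs.readSync(fd, buffer, 0, length, 0);
    return buffer.subarray(0, bytesRead);
  } finally {
    fs.closeSync(fd);
  }
}

// Returns false when the extension claims PDF or DOCX but the leading bytes disagree.
function extensionMatchesContent(filePath, extension) {
  const sig = readSignature(filePath);
  if (extension === ".pdf") {
    return sig.toString("latin1").startsWith("%PDF-");
  }
  if (extension === ".docx") {
    return sig.subarray(0, 4).equals(Buffer.from([0x50, 0x4b, 0x03, 0x04]));
  }
  return true; // no signature defined for other types in this sketch
}
```

In a flow, this could run after Get Document Info with `msg.filePath` and `msg.documentInfo.extension` as arguments.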
Common Errors and Solutions
Error: "File path is required"
Cause: No file path was provided to the node.
Solution: Ensure the File Path input is connected or set with a valid path.
Error: "File does not exist"
Cause: The specified file path is invalid or the file doesn't exist.
Solution:
- Verify the file path is correct
- Check that the file exists at that location
- Ensure the file hasn't been moved or deleted
- Use absolute paths to avoid ambiguity
Warning: isSupported is false
Cause: The file format is not supported by the Document Processor package.
Solution:
- Check the Supported File Types table above
- Convert the file to a supported format
- Use alternative processing methods for unsupported formats
Issue: Missing PDF Metadata
Cause: The PDF doesn't have embedded metadata or PyMuPDF is not available.
Solution:
- This is normal for some PDFs (metadata is optional)
- The node will still provide basic file information
- pageCount may still be available through alternative methods
Issue: pageCount not available
Cause: Page count is only extracted for PDF files; DOCX files report paragraphCount and tableCount instead.
Solution:
- This is expected for formats like TXT, MD, HTML
- Use Extract Text node and estimate pages from character count if needed
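A character-based page estimate can be as simple as the sketch below. `estimatePageCount` is a hypothetical helper, and the default of 3000 characters per page is an assumption that should be tuned for the documents at hand.

```javascript
// Rough page estimate from extracted text.
// Assumption: about 3000 characters per page.
function estimatePageCount(text, charsPerPage = 3000) {
  if (!text) return 0;
  return Math.ceil(text.length / charsPerPage);
}
```

The text would come from a downstream node such as Extract Text; the property it is stored in depends on how the flow is configured.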
Integration with Other Nodes
Document Validation Flow
Get Document Info → Switch (isSupported)
├─ True → Switch (check size) → Read Document
└─ False → Log Warning → Skip Document
Batch Processing with Filtering
Loop (files) → Get Document Info → Switch (filter criteria) → Process Document → Collect Results
Document Classification
Get Document Info → Function (analyze metadata) → Route to Processor → Database Insert
Quality Check Pipeline
Upload File → Get Document Info → Validate (size, type, encryption) → Accept/Reject
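The Validate step in the last flow could be written as a small function over the Document Info object. This is a sketch under assumptions: `validateDocument` is a hypothetical helper, the 10 MB limit is borrowed from Example 2, and the rejection messages are illustrative.

```javascript
// Accepts or rejects a document based on type support, size, and encryption.
// `info` is expected to have the shape of the Document Info output.
function validateDocument(info, maxBytes = 10 * 1024 * 1024) {
  const reasons = [];
  if (!info.isSupported) {
    reasons.push(`Unsupported file type: ${info.fileType}`);
  }
  if (info.size > maxBytes) {
    reasons.push(`File too large: ${info.sizeFormatted}`);
  }
  if (info.isEncrypted) {
    reasons.push("PDF is encrypted");
  }
  return { accepted: reasons.length === 0, reasons };
}
```

Collecting every reason instead of stopping at the first lets the flow report all problems to the user at once. In a Function node this could be called as `validateDocument(msg.documentInfo)`, with a Switch node routing on `accepted`.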
Performance Considerations
- Speed: Very fast - only reads file metadata, not full content
- Memory: Minimal memory usage
- Use Case: Ideal for validation before heavy processing
- Batch Operations: Can quickly scan hundreds of files
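When scanning many files, the collected Document Info objects can be rolled up into a per-type inventory. The sketch below assumes the objects have already been gathered into an array (for example `msg.documentStats` built in a loop); `summarizeInventory` is a hypothetical helper.

```javascript
// Groups Document Info objects by fileType and totals their count and size.
function summarizeInventory(infos) {
  const summary = {};
  for (const info of infos) {
    const key = info.fileType;
    if (!summary[key]) {
      summary[key] = { count: 0, totalBytes: 0 };
    }
    summary[key].count += 1;
    summary[key].totalBytes += info.size;
  }
  return summary;
}
```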
Comparison with Other Nodes
| Feature | Get Document Info | Read Document | Extract Text |
|---|---|---|---|
| Speed | Very Fast | Slow | Medium |
| Memory | Minimal | High | Medium |
| Full Parse | No | Yes | Yes |
| Metadata | Yes | Limited | No |
| Content | No | Structured elements | Plain text |
| Best For | Validation, filtering | Structure-aware processing | Text extraction |
Best Practices
- Always Validate: Use this node before processing unknown documents
- Check File Size: Prevent memory issues by checking size first
- Handle Unsupported: Provide clear feedback when isSupported is false
- Leverage Metadata: Use PDF/DOCX metadata for document classification
- Batch Scanning: Quickly scan large document collections for inventory
- Error Prevention: Catch encrypted PDFs and unsupported formats early