Document Processor
Parse and process documents for RAG (Retrieval-Augmented Generation) applications. Extract text, tables, and metadata from PDF, DOCX, PPTX, HTML, and more.
Supported Formats
The Document Processor package supports the following document formats:
| Format | Extensions |
|---|---|
| PDF | .pdf |
| Microsoft Word | .docx, .doc |
| Microsoft PowerPoint | .pptx, .ppt |
| Microsoft Excel | .xlsx, .xls |
| Plain Text | .txt |
| Markdown | .md |
| HTML | .html, .htm |
| Rich Text | .rtf |
| OpenDocument | .odt |
| EPUB | .epub |
| Email | .eml, .msg |
| CSV/TSV | .csv, .tsv |
| JSON/XML | .json, .xml |
Package Overview
This package is designed specifically for processing documents in AI and RAG workflows. It provides nodes for extracting content, chunking text for embeddings, and preparing data for vector databases.
Key Features
- Universal document parsing with automatic format detection
- Intelligent text chunking optimized for embedding models
- Table extraction with multiple output formats (HTML, JSON, text)
- Metadata extraction without full document parsing
- Direct integration with OpenAI embeddings and LanceDB
- Support for semantic chunking by document structure
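To make the chunking feature above more concrete, the following Python sketch shows one common approach: fixed-size word windows with overlap, so that text near a boundary appears in both neighboring chunks. This is an assumption about the general technique, not the package's actual algorithm; `chunk_words`, `chunk_size`, and `overlap` are hypothetical names.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into chunks of at most chunk_size words, sharing overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the end of the text,
        # so no trailing chunk consists only of overlap words.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Word counts are only a proxy for the token limits of an embedding model, so in practice the chunk size would be chosen with the target model's limit in mind.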
Common Use Cases
RAG Pipeline
Build a complete document ingestion pipeline for RAG applications: read documents, chunk the text, generate embeddings, and store the results in a vector database.
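The shape of such a pipeline can be sketched in plain Python. Everything below is a stand-in chosen to keep the example self-contained: `embed` is a toy bag-of-words vector rather than a real embedding model, and `InMemoryStore` is a hypothetical class that takes the place of a vector database. None of these names come from the package.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy embedding: L2-normalized word counts (stand-in for a real model)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(value * b.get(word, 0.0) for word, value in a.items())


class InMemoryStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self) -> None:
        self.rows: List[Tuple[str, Dict[str, float]]] = []

    def add(self, chunk: str) -> None:
        self.rows.append((chunk, embed(chunk)))

    def search(self, query: str, k: int = 1) -> List[str]:
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


def ingest(documents: List[str], store: InMemoryStore, chunk_size: int = 8) -> None:
    """Read each document, split it into word chunks, embed, and store."""
    for document in documents:
        words = document.split()
        for start in range(0, len(words), chunk_size):
            store.add(" ".join(words[start:start + chunk_size]))
```

For example, after calling `ingest` on a list of already-extracted document texts, `store.search("when are invoices due", k=3)` returns the three stored chunks most similar to the question. A real pipeline would replace the toy pieces with an embedding model and a vector database.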
Document Analysis
Extract and analyze tables, text, and metadata from various document formats for automated processing workflows.
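Table extraction is listed with three output formats (HTML, JSON, text). The sketch below shows what the same extracted table could look like in each of them. The function names and the header-plus-rows input shape are assumptions made for illustration; the actual output of the Extract Tables node may be structured differently.

```python
import json
from html import escape
from typing import List


def table_to_html(header: List[str], rows: List[List[str]]) -> str:
    """Render a table as a minimal HTML <table> element."""
    head = "".join(f"<th>{escape(cell)}</th>" for cell in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def table_to_json(header: List[str], rows: List[List[str]]) -> str:
    """Render a table as a JSON array with one object per row."""
    return json.dumps([dict(zip(header, row)) for row in rows])


def table_to_text(header: List[str], rows: List[List[str]]) -> str:
    """Render a table as pipe-separated plain text, one row per line."""
    return "\n".join(" | ".join(row) for row in [header] + rows)
```

JSON tends to suit downstream automation, HTML preserves the table structure for display, and plain text is the simplest form to include in a text chunk.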
Content Extraction
Quickly extract text content from PDFs, Word documents, and other formats for further processing or analysis.
Knowledge Base Creation
Process large document collections into searchable chunks ready for semantic search and question answering.
Nodes
- Chunk Text: Robomotion.DocumentProcessor.ChunkText
- Extract Tables: Robomotion.DocumentProcessor.ExtractTables
- Extract Text: Robomotion.DocumentProcessor.ExtractText
- Get Document Info: Robomotion.DocumentProcessor.GetDocumentInfo
- Prepare for Embeddings: Robomotion.DocumentProcessor.PrepareForEmbeddings
- Read Document: Robomotion.DocumentProcessor.ReadDocument