Document Processor

Parse and process documents for RAG (Retrieval-Augmented Generation) applications. Extract text, tables, and metadata from PDF, DOCX, PPTX, HTML, and other formats.

Supported Formats

The Document Processor package supports the following document formats:

Format	Extensions
PDF	`.pdf`
Microsoft Word	`.docx`, `.doc`
Microsoft PowerPoint	`.pptx`, `.ppt`
Microsoft Excel	`.xlsx`, `.xls`
Plain Text	`.txt`
Markdown	`.md`
HTML	`.html`, `.htm`
Rich Text	`.rtf`
OpenDocument	`.odt`
EPUB	`.epub`
Email	`.eml`, `.msg`
CSV/TSV	`.csv`, `.tsv`
JSON/XML	`.json`, `.xml`

Package Overview

This package is designed specifically for processing documents in AI and RAG workflows. It provides nodes for extracting content, chunking text for embeddings, and preparing data for vector databases.

Key Features

Universal document parsing with automatic format detection
Intelligent text chunking optimized for embedding models
Table extraction with multiple output formats (HTML, JSON, text)
Metadata extraction without full document parsing
Direct integration with OpenAI embeddings and LanceDB
Support for semantic chunking by document structure
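
To make the chunking feature concrete, the sketch below splits text into fixed-size, overlapping character windows. It is a generic technique shown under stated assumptions, not the package's algorithm: the function name `chunk_text` and the parameters `chunk_size` and `overlap` are hypothetical, and real chunkers usually count tokens rather than characters.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into character windows of chunk_size that share `overlap` characters.

    Hypothetical helper for illustration; sizes are in characters, not tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not text:
        return []

    step = chunk_size - overlap
    # Stop early enough that the final window is not just the tail
    # already covered by the previous window's overlap.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]
```

Overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of storing some text twice.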

Common Use Cases

RAG Pipeline

Build a complete document ingestion pipeline for RAG applications: read documents, chunk the text, generate embeddings, and store them in a vector database.
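
The stages of such a pipeline can be sketched in a few lines. Everything here is a stand-in chosen for illustration: `toy_embed` is a hashed bag-of-words vector, not a real embedding model, the in-memory list replaces a vector database, and `Record`, `split_paragraphs`, and `ingest` are hypothetical names rather than package nodes.

```python
import math
import zlib
from dataclasses import dataclass


@dataclass
class Record:
    source: str          # document the chunk came from
    text: str            # chunk text
    vector: list[float]  # embedding of the chunk


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # crc32 is stable across runs, unlike the built-in hash() for strings.
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def split_paragraphs(text: str) -> list[str]:
    """Chunk by blank-line-separated paragraphs, dropping empty ones."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def ingest(documents: dict[str, str]) -> list[Record]:
    """Read -> chunk -> embed -> store, with a plain list as the store."""
    store: list[Record] = []
    for source, text in documents.items():
        for chunk in split_paragraphs(text):
            store.append(Record(source, chunk, toy_embed(chunk)))
    return store
```

In a real deployment the embedding call and the list would be swapped for an embedding service and a vector database; the control flow stays the same.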

Document Analysis

Extract and analyze tables, text, and metadata from various document formats for automated processing workflows.
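
For downstream analysis, an extracted table is typically rendered in one of the three output formats listed under Key Features (HTML, JSON, text). The sketch below assumes a table has already been extracted as a header row plus data rows of strings; the function names are hypothetical and the exact output shape of the package's table node may differ.

```python
import html
import json


def table_to_json(header: list[str], rows: list[list[str]]) -> str:
    """Serialize rows as a JSON array of objects keyed by column name."""
    return json.dumps([dict(zip(header, row)) for row in rows])


def table_to_html(header: list[str], rows: list[list[str]]) -> str:
    """Render a minimal HTML table, escaping cell contents."""
    head = "".join(f"<th>{html.escape(h)}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(c)}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def table_to_text(header: list[str], rows: list[list[str]]) -> str:
    """Render tab-separated text, one line per row, header first."""
    return "\n".join("\t".join(r) for r in [header, *rows])
```

JSON suits programmatic processing, HTML preserves cell structure for display or for models that read markup, and tab-separated text is the most compact form for embedding.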

Content Extraction

Quickly extract text content from PDFs, Word documents, and other formats for further processing or analysis.

Knowledge Base Creation

Process large document collections into searchable chunks ready for semantic search and question answering.
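
Once chunks carry embedding vectors, semantic search reduces to ranking them by similarity to a query vector. The following is a minimal brute-force sketch using cosine similarity; `cosine` and `top_k` are hypothetical helpers, and a vector database would replace the linear scan with an index for large collections.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(
    query_vec: list[float],
    chunks: list[tuple[str, list[float]]],
    k: int = 3,
) -> list[str]:
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

The query must be embedded with the same model as the chunks, otherwise the similarity scores are meaningless.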

📄️ Chunk Text

📄️ Extract Tables

📄️ Extract Text

📄️ Get Document Info

📄️ Prepare for Embeddings

📄️ Read Document
