Document Processor

Parse and process documents for RAG (Retrieval-Augmented Generation) applications. Extract text, tables, and metadata from PDF, DOCX, PPTX, HTML, and more.

Supported Formats

The Document Processor package supports the following document formats:

Format                  Extensions
PDF                     .pdf
Microsoft Word          .docx, .doc
Microsoft PowerPoint    .pptx, .ppt
Microsoft Excel         .xlsx, .xls
Plain Text              .txt
Markdown                .md
HTML                    .html, .htm
Rich Text               .rtf
OpenDocument            .odt
EPUB                    .epub
Email                   .eml, .msg
CSV/TSV                 .csv, .tsv
JSON/XML                .json, .xml
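
The format list above can be read as a lookup from file extension to format name. The sketch below shows that lookup in plain Python. The names EXTENSION_TO_FORMAT and detect_format are hypothetical and not part of the package, and the package's own automatic detection may inspect file contents rather than only the extension.

```python
from pathlib import Path
from typing import Optional

# Extension -> format name, transcribed from the supported-formats list.
# Hypothetical helper for illustration; not the package's API.
EXTENSION_TO_FORMAT = {
    ".pdf": "PDF",
    ".docx": "Microsoft Word",
    ".doc": "Microsoft Word",
    ".pptx": "Microsoft PowerPoint",
    ".ppt": "Microsoft PowerPoint",
    ".xlsx": "Microsoft Excel",
    ".xls": "Microsoft Excel",
    ".txt": "Plain Text",
    ".md": "Markdown",
    ".html": "HTML",
    ".htm": "HTML",
    ".rtf": "Rich Text",
    ".odt": "OpenDocument",
    ".epub": "EPUB",
    ".eml": "Email",
    ".msg": "Email",
    ".csv": "CSV/TSV",
    ".tsv": "CSV/TSV",
    ".json": "JSON/XML",
    ".xml": "JSON/XML",
}


def detect_format(path: str) -> Optional[str]:
    """Return the format name for a path, or None if the extension is not listed."""
    return EXTENSION_TO_FORMAT.get(Path(path).suffix.lower())
```

A lookup like this is case-insensitive on the extension, so "Report.PDF" and "report.pdf" resolve to the same format, and unlisted extensions return None so the caller can decide how to handle them.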

Package Overview

This package is designed specifically for processing documents in AI and RAG workflows. It provides nodes for extracting content, chunking text for embeddings, and preparing data for vector databases.

Key Features

  • Universal document parsing with automatic format detection
  • Intelligent text chunking optimized for embedding models
  • Table extraction with multiple output formats (HTML, JSON, text)
  • Metadata extraction without full document parsing
  • Direct integration with OpenAI embeddings and LanceDB
  • Support for semantic chunking by document structure

Common Use Cases

RAG Pipeline

Build a complete document ingestion pipeline for RAG applications: read documents, chunk the text, generate embeddings, and store them in a vector database.
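
The four stages can be sketched end to end with only the standard library. Everything named here is a stand-in chosen for illustration: chunk_words, embed, InMemoryStore, and ingest are hypothetical, the bag-of-words "embedding" replaces a real embedding model such as OpenAI's, and the in-memory list replaces a vector database such as LanceDB.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text: str) -> Dict[str, float]:
    """Toy embedding: sparse word counts. A stand-in for a real model."""
    return dict(Counter(re.findall(r"\w+", text.lower())))


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryStore:
    """Stand-in for a vector database: keeps (vector, text) rows in a list."""

    def __init__(self) -> None:
        self.rows: List[Tuple[Dict[str, float], str]] = []

    def add(self, vector: Dict[str, float], text: str) -> None:
        self.rows.append((vector, text))

    def search(self, query: Dict[str, float], k: int = 3) -> List[str]:
        ranked = sorted(self.rows, key=lambda row: cosine(query, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]


def ingest(documents: List[str], store: InMemoryStore) -> None:
    """Read -> chunk -> embed -> store, for already-extracted document text."""
    for doc in documents:
        for chunk in chunk_words(doc):
            store.add(embed(chunk), chunk)
```

At query time the same embed function is applied to the question and the store returns the closest chunks, for example store.search(embed("when are invoices due"), k=1).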

Document Analysis

Extract and analyze tables, text, and metadata from various document formats for automated processing workflows.
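
As an illustration of what the three table output formats (HTML, JSON, text) might look like for the same data, the sketch below renders one extracted table three ways. It assumes a table arrives as a list of rows with the header first; that shape and the function names are hypothetical, not the package's actual output.

```python
import html
import json
from typing import List

Table = List[List[str]]  # assumed shape: first row is the header


def table_to_json(rows: Table) -> str:
    """Render each body row as an object keyed by the header cells."""
    header, *body = rows
    return json.dumps([dict(zip(header, row)) for row in body])


def table_to_html(rows: Table) -> str:
    """Render a minimal <table>, escaping cell contents."""
    header, *body = rows
    parts = ["<tr>" + "".join(f"<th>{html.escape(c)}</th>" for c in header) + "</tr>"]
    for row in body:
        parts.append("<tr>" + "".join(f"<td>{html.escape(c)}</td>" for c in row) + "</tr>")
    return "<table>" + "".join(parts) + "</table>"


def table_to_text(rows: Table) -> str:
    """Render space-aligned columns, two spaces between columns."""
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    lines = []
    for row in rows:
        line = "  ".join(cell.ljust(width) for cell, width in zip(row, widths))
        lines.append(line.rstrip())
    return "\n".join(lines)
```

JSON suits downstream code, HTML preserves the cell structure for models that read markup well, and aligned text is the most compact form to embed alongside surrounding prose.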

Content Extraction

Quickly extract text content from PDFs, Word documents, and other formats for further processing or analysis.
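
For a sense of what text extraction involves for one of the simpler formats, here is a minimal HTML-to-text sketch built on Python's standard html.parser module. It is not how the package itself parses HTML. The class and function names are hypothetical, and binary formats such as PDF or DOCX need dedicated parsers.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")  # block elements start a new line

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(self._skip_depth - 1, 0)
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(markup: str) -> str:
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Inline tags such as <b> do not break lines, so "Hello <b>world</b>." comes out as a single line of text.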

Knowledge Base Creation

Process large document collections into searchable chunks ready for semantic search and question answering.