Docling
Open-source document conversion built for RAG pipelines
Docling is profiled here as a RAG framework tool for engineering teams.
Description
Docling is an open-source document conversion toolkit that IBM Research released in 2024 and donated to the LF AI & Data Foundation in 2025. It parses PDFs, Office files, images, and HTML into a unified DoclingDocument representation that preserves layout, reading order, and table structure, then exports Markdown, HTML, or JSON for downstream pipelines. Conversion runs entirely on local hardware, which keeps sensitive documents inside the network boundary. IBM also publishes Granite-Docling, a compact vision-language model trained for document conversion that handles complex page layouts end to end.
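The conversion flow described above can be sketched in a few lines of Python. This is a minimal sketch, assuming the docling package is installed and that its models are already available locally (they are normally fetched on first use). The helper names output_paths and convert_document, the "converted" output directory, and the "report.pdf" path are hypothetical and chosen for illustration only.

```python
import json
from pathlib import Path


def output_paths(source: str, out_dir: str = "converted") -> dict[str, Path]:
    """Derive Markdown and JSON output paths from the source file name (hypothetical helper)."""
    stem = Path(source).stem or "document"
    base = Path(out_dir)
    return {"md": base / f"{stem}.md", "json": base / f"{stem}.json"}


def convert_document(source: str, out_dir: str = "converted") -> dict[str, Path]:
    """Convert one document locally and write Markdown and JSON exports (hypothetical helper)."""
    # Imported lazily so this module still loads where docling is not installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    doc = result.document  # the unified DoclingDocument

    paths = output_paths(source, out_dir)
    paths["md"].parent.mkdir(parents=True, exist_ok=True)
    # Markdown is convenient for chunking and embedding in a RAG pipeline.
    paths["md"].write_text(doc.export_to_markdown(), encoding="utf-8")
    # The dict export is the serialized DoclingDocument, written here as JSON.
    paths["json"].write_text(
        json.dumps(doc.export_to_dict(), indent=2), encoding="utf-8"
    )
    return paths


if __name__ == "__main__":
    # "report.pdf" is a placeholder; substitute a real local file.
    print(convert_document("report.pdf"))
```

Writing both exports side by side is one possible design: the Markdown file feeds text-oriented steps, while the JSON file preserves the full DoclingDocument for steps that need it.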
Key Capabilities:
- PDF understanding with layout analysis and TableFormer table extraction
- OCR support for scanned documents
- Unified DoclingDocument format with Markdown, HTML, and JSON export
- Vision-language model pipeline for end-to-end page conversion
- Native integrations with LangChain, LlamaIndex, and Haystack
- MIT license, with support for fully local, air-gapped execution
Alternative tools
- Unstructured
Turn raw documents into LLM-ready structured data
- Voyage AI
Retrieval-optimized embedding and reranking models
- Chroma
Developer-first embedding database that runs anywhere
- Qdrant
Rust-based vector search engine with rich filtering
- Weaviate
Open-source vector database with native hybrid search
- Pinecone
Managed vector database for production retrieval workloads
