MinerU
Open-source engine converting documents to clean Markdown
MinerU is listed in this tools directory as a RAG framework tool for engineering teams.
Description
MinerU is an open-source document parsing engine from OpenDataLab at the Shanghai AI Laboratory, originally built to prepare scientific literature for model pre-training. It converts PDFs, images, and office files into Markdown and JSON while preserving headings, tables, equations, and reading order through a pipeline of vision and OCR models. MinerU runs locally or through a cloud API, supports over a hundred languages, and ships under an open-source license based on Apache 2.0 that eases commercial adoption.
Key Capabilities:
- Conversion of PDFs, images, and office files into Markdown and JSON
- Equation recognition that outputs LaTeX from scientific documents
- Table and layout extraction that preserves structure and reading order
- A vision-language and OCR pipeline for high-accuracy parsing
- Support for over a hundred languages
- Local execution plus a cloud API, SDKs, and an MCP server
Alternative tools
- Reducto
Document ingestion API with structure-preserving extraction
- LlamaParse
Document parser built for retrieval and LLM pipelines
- Deep Lake
Database for AI that stores tensors and embeddings
- Model2Vec
Library that distills sentence transformers into fast static embeddings
- Mixedbread
Embedding and reranking models with a hosted API
- RAGFlow
Open-source RAG engine with deep document understanding
