github.com/QuivrHQ/MegaParse @megaparse-sdk-v0.1.12 sqlite

285 symbols · 1,227 edges · 66 files · 63 documented (22%)
README

MegaParse - Your Parser for Every Type of Document

MegaParse is a powerful and versatile parser that can handle various types of documents with ease. Whether you're dealing with text, PDFs, PowerPoint presentations, or Word documents, MegaParse has got you covered. It focuses on parsing without information loss.

Key Features 🎯

  • Versatile Parser: MegaParse is a powerful and versatile parser that can handle various types of documents with ease.
  • No Information Loss: Focuses on losing no information during parsing.
  • Fast and Efficient: Designed with speed and efficiency at its core.
  • Wide File Compatibility: Supports text, PDF, PowerPoint presentations, Excel, CSV, and Word documents.
  • Open Source: Freedom is beautiful, and so is MegaParse. Open source and free to use.

Support

  • Files: ✅ PDF ✅ PowerPoint ✅ Word
  • Content: ✅ Tables ✅ TOC ✅ Headers ✅ Footers ✅ Images

Example

https://github.com/QuivrHQ/MegaParse/assets/19614572/1b4cdb73-8dc2-44ef-b8b4-a7509bc8d4f3

Installation

Requires Python >= 3.11.

pip install megaparse

Usage

  1. Add your OpenAI or Anthropic API key to the .env file

  2. Install poppler on your computer (images and PDFs)

  3. Install tesseract on your computer (images and PDFs)

  4. If you are on a Mac, you also need to install libmagic: brew install libmagic
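The setup steps above can be checked before the first parse. The sketch below is not part of MegaParse; `missing_prerequisites` is a hypothetical helper. It assumes poppler is detectable through its `pdftoppm` binary, that tesseract is on the PATH under its own name, and that the Anthropic key is stored as `ANTHROPIC_API_KEY` (the name the Anthropic SDK reads by default). Keys kept in a .env file only show up here once something has loaded that file into the environment.

```python
import os
import shutil


def missing_prerequisites(env=os.environ, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # pdftoppm ships with poppler; tesseract is the OCR binary itself.
    for tool, package in (("pdftoppm", "poppler"), ("tesseract", "tesseract")):
        if which(tool) is None:
            problems.append(f"{package}: '{tool}' not found on PATH")
    # Either provider key is enough (ANTHROPIC_API_KEY is an assumed name).
    if not (env.get("OPENAI_API_KEY") or env.get("ANTHROPIC_API_KEY")):
        problems.append("no OPENAI_API_KEY or ANTHROPIC_API_KEY in the environment")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("missing:", problem)
```

The `env` and `which` parameters exist only so the check can be exercised without touching the real system.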

Use MegaParse as is:

from megaparse import MegaParse

# Parse a document with the default settings
megaparse = MegaParse()
response = megaparse.load("./test.pdf")
print(response)

Use MegaParse Vision

import os

from langchain_openai import ChatOpenAI
from megaparse.parser.megaparse_vision import MegaParseVision

# A multimodal chat model; the key is read from the OPENAI_API_KEY environment variable
model = ChatOpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))  # type: ignore
parser = MegaParseVision(model=model)
response = parser.convert("./test.pdf")
print(response)

Note: MegaParse Vision supports multimodal models such as Claude 3.5, Claude 4, GPT-4o, and GPT-4.
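Both entry points shown above take a file path and return something printable, so the same loop can drive either one over a folder of documents. The helper below is a sketch, not a MegaParse API: `save_parsed` is a hypothetical name, and it assumes the parse result can be turned into text with `str()` and saved as a .md file.

```python
from pathlib import Path
from typing import Callable, Iterable


def save_parsed(parse: Callable[[str], object], paths: Iterable[str], out_dir: str) -> list[Path]:
    """Run `parse` on each path and write the result to <out_dir>/<stem>.md."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for path in paths:
        result = parse(path)
        target = out / (Path(path).stem + ".md")
        # str() is an assumption: it treats whatever the parser returns as text.
        target.write_text(str(result), encoding="utf-8")
        written.append(target)
    return written
```

Usage would look like `save_parsed(megaparse.load, ["./test.pdf"], "./out")`, or `save_parsed(parser.convert, ...)` for the vision parser.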

Use as an API

There is a Makefile for you: simply run make dev at the root of the project and you are good to go.

See localhost:8000/docs for more info on the different endpoints!
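The endpoints can also be listed from a script instead of the browser. This is a sketch under two assumptions: the server is a FastAPI app (which the /docs page suggests) and it serves its schema at FastAPI's default `/openapi.json` URL. `list_endpoints` and `fetch_spec` are hypothetical helper names, and no specific MegaParse route is assumed.

```python
import json
from urllib.request import urlopen


def list_endpoints(spec: dict) -> list[tuple[str, str]]:
    """Return sorted (METHOD, path) pairs from an OpenAPI document."""
    http_methods = {"get", "post", "put", "patch", "delete", "head", "options"}
    pairs = []
    for path, operations in spec.get("paths", {}).items():
        for method in operations:
            # Path items may also hold non-method keys such as "parameters".
            if method.lower() in http_methods:
                pairs.append((method.upper(), path))
    return sorted(pairs)


def fetch_spec(base_url: str = "http://localhost:8000") -> dict:
    # /openapi.json is FastAPI's default schema URL; assumed unchanged here.
    with urlopen(f"{base_url}/openapi.json", timeout=5) as resp:
        return json.load(resp)
```

With the dev server running, `for method, path in list_endpoints(fetch_spec()): print(method, path)` prints one line per route.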

Benchmark

| Parser                        | similarity_ratio |
|-------------------------------|------------------|
| megaparse_vision              | 0.87             |
| unstructured_with_check_table | 0.77             |
| unstructured                  | 0.59             |
| llama_parser                  | 0.33             |

Higher is better.

Note: Want to evaluate and compare your MegaParse module with ours? Add your config in evaluations/script.py and then run python evaluations/script.py. If it scores better, open a PR. Let's go higher together.
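To get a feel for what a similarity ratio between 0 and 1 means, here is a minimal sketch. It assumes the score behaves like `difflib.SequenceMatcher.ratio()` from the standard library, computed between a parser's output and a reference text; the exact metric used in evaluations/script.py may differ.

```python
from difflib import SequenceMatcher


def similarity_ratio(parsed: str, reference: str) -> float:
    """Score in [0, 1]; 1.0 means the two texts are identical."""
    return SequenceMatcher(None, parsed, reference).ratio()


reference = "# Title\n\n| a | b |\n|---|---|\n| 1 | 2 |\n"
print(similarity_ratio(reference, reference))  # 1.0
print(similarity_ratio("# Title\n\n1 2\n", reference) < 1.0)  # True
```

Under this assumption, a parser that drops table structure still earns partial credit for the text it keeps, which would explain why scores spread across the range rather than clustering at 0 or 1.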

Under Construction 🚧

  • Improve table checker
  • Create Checkers to add modular postprocessing ⚙️
  • Add structured output, let's get computers talking 🤖

Core symbols (most depended-on inside this repo)

| Symbol                    | Called by | File                                                |
|---------------------------|-----------|-----------------------------------------------------|
| extract_page_strategies   | 6         | libs/megaparse/src/megaparse/megaparse.py           |
| aload                     | 6         | libs/megaparse/src/megaparse/megaparse.py           |
| determine_global_strategy | 6         | libs/megaparse/src/megaparse/utils/strategy.py      |
| check_supported_extension | 6         | libs/megaparse/src/megaparse/parser/base.py         |
| close                     | 6         | libs/megaparse_sdk/megaparse_sdk/__init__.py        |
| load                      | 5         | libs/megaparse/src/megaparse/megaparse.py           |
| to_numpy                  | 5         | libs/megaparse_sdk/megaparse_sdk/schema/document.py |
| parse_file                | 4         | libs/megaparse_sdk/megaparse_sdk/client.py          |

Shape

Method 131
Class 91
Function 60
Route 3

Languages

Python 100%

Modules by API surface

libs/megaparse_sdk/megaparse_sdk/schema/document.py · 46 symbols
libs/megaparse_sdk/megaparse_sdk/client.py · 16 symbols
libs/megaparse_sdk/tests/test_nats_client.py · 11 symbols
libs/megaparse/src/megaparse/parser/megaparse_vision.py · 11 symbols
libs/megaparse_sdk/megaparse_sdk/schema/mp_exceptions.py · 10 symbols
libs/megaparse/tests/conftest.py · 10 symbols
libs/megaparse/src/megaparse/formatter/table_formatter/vision_table_formatter.py · 10 symbols
libs/megaparse/src/megaparse/api/exceptions/megaparse_exceptions.py · 10 symbols
libs/megaparse/src/megaparse/api/app.py · 9 symbols
libs/megaparse_sdk/megaparse_sdk/schema/mp_inputs.py · 7 symbols
libs/megaparse/tests/pdf/test_pdf_processing.py · 7 symbols
libs/megaparse/src/megaparse/parser/doctr_parser.py · 7 symbols

Dependencies (from manifests, versioned)

aiohappyeyeballs 2.4.3 · 1×
aiohttp 3.11.5 · 1×
aiosignal 1.3.1 · 1×
annotated-types 0.7.0 · 1×
anthropic 0.39.0 · 1×
antlr4-python3-runtime 4.9.3 · 1×
anyascii 0.3.2 · 1×
anyio 4.6.2.post1 · 1×
appnope 0.1.4 · 1×
asttokens 2.4.1 · 1×
attrs 24.2.0 · 1×
backoff 2.2.1 · 1×

For agents

$ claude mcp add MegaParse \
  -- python -m otcore.mcp_server <graph>
