
github.com/shcherbak-ai/contextgem @v0.25.1 sqlite

835 symbols · 5,069 edges · 197 files · 790 documented (95%)
README


ContextGem: Effortless LLM extraction from documents


ContextGem: 2nd Product of the week

ContextGem is a free, open-source LLM framework that makes it radically easier to extract structured data and insights from documents — with minimal code.


💎 Why ContextGem?

Reliable structured extraction from documents typically involves writing extraction prompts, designing validation models, mapping outputs back to source references, orchestrating multi-step pipelines, and tracking usage across LLMs. ContextGem handles all of this through powerful abstractions — you describe what to extract in natural language, and the framework handles how.

The result: structured data with precise paragraph- and sentence-level references, automatic justifications, hierarchical multi-aspect extraction, and a unified, serializable document storage model — all from minimal code.

📖 Read more on the project motivation in the documentation.

⭐ Key features

  • Automated dynamic prompts
  • 📐 Automated data modelling
  • 📍 Granular reference mapping
  • 💭 Built-in justifications
  • 🪆 Nested context extraction
  • 🔗 Unified declarative pipeline

💡 What you can build

With minimal code, you can:

  • Extract structured data from documents (text, images)
  • Identify and analyze key aspects (topics, themes, categories) within documents
  • Extract specific concepts (entities, facts, conclusions, assessments) from documents
  • Build complex extraction workflows through a simple, intuitive API
  • Create multi-level extraction pipelines (aspects containing concepts, hierarchical aspects)

ContextGem extraction example

📦 Installation

Using uv (recommended):

uv add contextgem

Or using pip:

pip install -U contextgem

🚀 Quick start

The following example uses ContextGem to extract anomalies from a legal document. Spotting anomalies is a complex task that requires contextual understanding: unlike traditional RAG approaches, which might miss subtle inconsistencies, ContextGem analyzes the entire document context to identify content that doesn't belong, complete with source references and justifications.

# Quick Start Example - Extracting anomalies from a document, with source references and justifications

import os

from contextgem import Document, DocumentLLM, StringConcept


# Sample document text (shortened for brevity)
doc = Document(
    raw_text=(
        "Consultancy Agreement\n"
        "This agreement between Company A (Supplier) and Company B (Customer)...\n"
        "The term of the agreement is 1 year from the Effective Date...\n"
        "The Supplier shall provide consultancy services as described in Annex 2...\n"
        "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
        "The purple elephant danced gracefully on the moon while eating ice cream.\n"  # 💎 anomaly
        "Time-traveling dinosaurs will review all deliverables before acceptance.\n"  # 💎 another anomaly
        "This agreement is governed by the laws of Norway...\n"
    ),
)

# Attach a document-level concept
doc.concepts = [
    StringConcept(
        name="Anomalies",  # in longer contexts, this concept is hard to capture with RAG
        description="Anomalies in the document",
        add_references=True,
        reference_depth="sentences",
        add_justifications=True,
        justification_depth="brief",
        # see the docs for more configuration options
    )
    # add more concepts to the document, if needed
    # see the docs for available concepts: StringConcept, JsonObjectConcept, etc.
]
# Or use `doc.add_concepts([...])`

# Define an LLM for extracting information from the document
llm = DocumentLLM(
    model="openai/gpt-4o-mini",  # or another provider/LLM
    api_key=os.environ.get(
        "CONTEXTGEM_OPENAI_API_KEY"
    ),  # your API key for the LLM provider
    # see the docs for more configuration options
)

# Extract information from the document
doc = llm.extract_all(doc)  # or use async version `await llm.extract_all_async(doc)`

# Access extracted information in the document object
anomalies_concept = doc.concepts[0]
# or `doc.get_concept_by_name("Anomalies")`
for item in anomalies_concept.extracted_items:
    print("Anomaly:")
    print(f"  {item.value}")
    print("Justification:")
    print(f"  {item.justification}")
    print("Reference paragraphs:")
    for p in item.reference_paragraphs:
        print(f"  - {p.raw_text}")
    print("Reference sentences:")
    for s in item.reference_sentences:
        print(f"  - {s.raw_text}")
    print()
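
The attributes printed in the loop above (`value`, `justification`, `reference_paragraphs`, `reference_sentences`, and `raw_text` on each reference) can also be flattened into plain dictionaries, for example to store results as JSON. The sketch below is a hypothetical helper, not part of ContextGem. It assumes only those attribute names and is exercised with stand-in objects, so it runs without an API key.

```python
import json
from types import SimpleNamespace


def item_to_dict(item):
    """Flatten one extracted item into JSON-friendly data.

    Hypothetical helper: relies only on the attributes used in the
    quick start loop (value, justification, reference_paragraphs,
    reference_sentences, raw_text).
    """
    return {
        "value": item.value,
        "justification": item.justification,
        "paragraphs": [p.raw_text for p in item.reference_paragraphs],
        "sentences": [s.raw_text for s in item.reference_sentences],
    }


# Stand-in object mimicking the shape of an extracted item (not real LLM output)
sentence = "Time-traveling dinosaurs will review all deliverables before acceptance."
sample_item = SimpleNamespace(
    value="Time-traveling dinosaurs review deliverables",
    justification="Unrelated to a consultancy agreement",
    reference_paragraphs=[SimpleNamespace(raw_text=sentence)],
    reference_sentences=[SimpleNamespace(raw_text=sentence)],
)

record = item_to_dict(sample_item)
print(json.dumps(record, indent=2))
```

With real results, the same function could be mapped over `anomalies_concept.extracted_items`.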



🧠 How it works

📝 Step 1: Define extraction context

📄 Document
Create a Document that contains text and/or visual content representing your document (contract, invoice, report, CV, etc.), from which an LLM extracts information (aspects and/or concepts).
from contextgem import Document
document = Document(raw_text="Non-Disclosure Agreement...")

🎯 Step 2: Define what to extract

🔍 Aspects
Define Aspects to extract text segments from the document (sections, topics, themes). You can organize content hierarchically and combine with concepts for comprehensive analysis.

💡 Concepts
Define Concepts to extract specific data points with intelligent inference: entities, insights, structured objects, classifications, numerical calculations, dates, ratings, and assessments.
from contextgem import Aspect, BooleanConcept

# Extract document sections
aspect = Aspect(
    name="Term and termination",
    description="Clauses on contract term and termination",
)
# Extract specific data points
concept = BooleanConcept(
    name="NDA check",
    description="Is the contract an NDA?",
)
# Add these to the document instance for further extraction
document.add_aspects([aspect])
document.add_concepts([concept])

🔄 Alternative: Configure Extraction Pipeline
Create a reusable collection of predefined aspects and concepts that enables consistent extraction across multiple documents.
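
The reuse idea can be approximated in plain Python even without the pipeline feature: build the aspects and concepts in factory functions and attach fresh instances to each document through the `add_aspects` and `add_concepts` methods shown in Step 2. `apply_extraction_plan` and its factory arguments are hypothetical names for illustration, and this is not how the framework's pipeline is implemented. The stub class exists only so the sketch runs standalone.

```python
from typing import Callable, Iterable, List


def apply_extraction_plan(
    documents: Iterable,
    make_aspects: Callable[[], list],
    make_concepts: Callable[[], list],
) -> List:
    """Attach freshly built aspects and concepts to every document."""
    prepared = []
    for document in documents:
        # Factories are called once per document so no instance is shared
        document.add_aspects(make_aspects())
        document.add_concepts(make_concepts())
        prepared.append(document)
    return prepared


class StubDocument:
    """Stand-in exposing the two methods used above (not a ContextGem class)."""

    def __init__(self):
        self.aspects = []
        self.concepts = []

    def add_aspects(self, aspects):
        self.aspects.extend(aspects)

    def add_concepts(self, concepts):
        self.concepts.extend(concepts)


docs = apply_extraction_plan(
    [StubDocument(), StubDocument()],
    make_aspects=lambda: ["Term and termination"],
    make_concepts=lambda: ["NDA check"],
)
print([(d.aspects, d.concepts) for d in docs])
```

With real objects, the factories would return lists of `Aspect(...)` and `BooleanConcept(...)` instances like those defined in Step 2.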

🧠 Step 3: Run LLM extraction

🤖 LLM
Configure a cloud or local LLM that will extract aspects and/or concepts from the document. DocumentLLM supports fallback models and role-based task routing for optimal performance.

🤖🤖 Alternative: LLM Group (advanced)
Configure a group of LLMs with unique roles for complex extraction workflows. You can route different aspects and/or concepts to specialized LLMs (e.g., simple extraction vs. reasoning tasks).
from contextgem import DocumentLLM
llm = DocumentLLM(
    model="openai/gpt-5-mini",  # or another provider/LLM
    api_key="...",
)
document = llm.extract_all(document)
# print(document.aspects[0].extracted_items)
# print(document.concepts[0].extracted_items)
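
Step 3 needs a provider API key, and the quick start reads it with `os.environ.get`, which silently returns `None` when the variable is unset. A small guard can surface that problem before any request is made. `require_env` is a hypothetical helper name, not a ContextGem function. A minimal sketch:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Fail early instead of passing None or an empty key to the LLM client
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

It could then be used as `api_key=require_env("CONTEXTGEM_OPENAI_API_KEY")` in place of the `os.environ.get(...)` call.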

📖 Learn more about ContextGem's core components and their practical examples in the documentation.

📚 Usage Examples

🌟 Basic usage:

🚀 Advanced usage:

  • [Extracting Aspects Containing Concepts](https://contextgem.dev/advanced_usage/#extracting-as

Core symbols (most depended-on inside this repo)

  • _docx_xpath · called by 92 · contextgem/internal/converters/docx/utils.py
  • debug · called by 77 · contextgem/internal/loggers.py
  • check_instance_serialization_and_cloning · called by 60 · tests/utils.py
  • check_locals_memory_usage · called by 57 · tests/memory_profiling.py
  • info · called by 56 · contextgem/internal/loggers.py
  • extract_concepts_from_document · called by 55 · contextgem/internal/base/llms.py
  • add_concepts · called by 53 · contextgem/internal/base/attrs.py
  • get_concept_by_name · called by 41 · contextgem/internal/base/attrs.py

Shape

Method 406
Function 289
Class 134
Route 6

Languages

Python 100%

Modules by API surface

tests/test_all.py · 113 symbols
contextgem/internal/base/llms.py · 94 symbols
tests/test_units.py · 91 symbols
contextgem/internal/base/concepts.py · 53 symbols
tests/test_properties.py · 51 symbols
contextgem/internal/base/attrs.py · 38 symbols
contextgem/internal/converters/docx/base.py · 29 symbols
tests/utils.py · 26 symbols
contextgem/internal/utils.py · 24 symbols
contextgem/internal/base/documents.py · 18 symbols
contextgem/internal/loggers.py · 16 symbols
contextgem/internal/base/serialization.py · 16 symbols

Dependencies (from manifests, versioned)

genai-prices 0.0.64 · 1×
litellm 1.87.1 · 1×
openai 2.41.0 · 1×

For agents

$ claude mcp add contextgem \
  -- python -m otcore.mcp_server <graph>
