github.com/google/langextract @v1.6.0 sqlite

1,434 symbols · 3,563 edges · 89 files · 843 documented (59%)

README

LangExtract

Introduction
Why LangExtract?
Quick Start
Installation
API Key Setup for Cloud Models
Adding Custom Model Providers
Using OpenAI Models
Using Local LLMs with Ollama
More Examples
Romeo and Juliet Full Text Extraction
Medication Extraction
Radiology Report Structuring: RadExtract
Community Providers
Contributing
Testing
Disclaimer

Introduction

LangExtract is a Python library that uses LLMs to extract structured information from unstructured text documents based on user-defined instructions. It processes materials such as clinical notes or reports, identifying and organizing key details while ensuring the extracted data corresponds to the source text.

Why LangExtract?

Precise Source Grounding: Maps every extraction to its exact location in the source text, enabling visual highlighting for easy traceability and verification.
Reliable Structured Outputs: Enforces a consistent output schema based on your few-shot examples, leveraging controlled generation in supported models like Gemini to guarantee robust, structured results.
Optimized for Long Documents: Overcomes the "needle-in-a-haystack" challenge of large document extraction by using an optimized strategy of text chunking, parallel processing, and multiple passes for higher recall.
Interactive Visualization: Instantly generates a self-contained, interactive HTML file to visualize and review thousands of extracted entities in their original context.
Flexible LLM Support: Supports your preferred models, from cloud-based LLMs like the Google Gemini family to local open-source models via the built-in Ollama interface.
Adaptable to Any Domain: Define extraction tasks for any domain using just a few examples. LangExtract adapts to your needs without requiring any model fine-tuning.
Leverages LLM World Knowledge: Use precise prompt wording and few-shot examples to control how much the extraction task draws on the LLM's knowledge. The accuracy of any inferred information and its adherence to the task specification depend on the selected LLM, the complexity of the task, the clarity of the prompt instructions, and the nature of the prompt examples.

Quick Start

Note: Using cloud-hosted models like Gemini requires an API key. See the API Key Setup section for instructions on how to get and configure your key.

Extract structured information with just a few lines of code.

1. Define Your Extraction Task

First, create a prompt that clearly describes what you want to extract. Then, provide a high-quality example to guide the model.

import langextract as lx
import textwrap

# 1. Define the prompt and extraction rules
prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

# 2. Provide a high-quality example to guide the model
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor"}
            ),
        ]
    )
]

Note: Examples drive model behavior. Each extraction_text should ideally be verbatim from the example's text (no paraphrasing), listed in order of appearance. LangExtract raises Prompt alignment warnings by default if examples don't follow this pattern—resolve these for best results.
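
The alignment rule in the note above can be checked by hand before any model call. This is a minimal sketch, not LangExtract's own validation: `check_example_alignment` is a hypothetical helper that takes the example text and a plain list of extraction strings, and reports any string that does not occur verbatim after the previous match.

```python
def check_example_alignment(text, extraction_texts):
    """Return a list of problems; an empty list means the example is aligned.

    Hypothetical helper for illustration only. It checks that each
    extraction string occurs verbatim in `text`, in order of appearance
    and without overlapping the previous match.
    """
    problems = []
    cursor = 0  # search position; advances past each match to enforce order
    for extraction_text in extraction_texts:
        position = text.find(extraction_text, cursor)
        if position == -1:
            if extraction_text in text:
                problems.append(f"out of order: {extraction_text!r}")
            else:
                problems.append(f"not verbatim: {extraction_text!r}")
            continue
        cursor = position + len(extraction_text)
    return problems


example_text = (
    "ROMEO. But soft! What light through yonder window breaks? "
    "It is the east, and Juliet is the sun."
)
print(check_example_alignment(
    example_text, ["ROMEO", "But soft!", "Juliet is the sun"]))
print(check_example_alignment(example_text, ["But soft!", "ROMEO"]))
```

The comparison is case-sensitive, so "Romeo" would be reported as not verbatim against the example text above.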

Grounding: LLMs may occasionally extract content from few-shot examples rather than the input text. LangExtract automatically detects this: extractions that cannot be located in the source text will have char_interval = None. Filter these out with [e for e in result.extractions if e.char_interval] to keep only grounded results.
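
The filter described above can be tried without an API key. The sketch below uses `StubExtraction`, a hypothetical stand-in class rather than the LangExtract data type, and represents the interval as a plain tuple; only the `char_interval` attribute name and the list comprehension come from the text above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class StubExtraction:
    """Stand-in for an extraction object; not the LangExtract class."""
    extraction_class: str
    extraction_text: str
    char_interval: Optional[Tuple[int, int]] = None  # None means not located


def split_grounded(extractions):
    """Separate extractions that have a source location from those without."""
    grounded = [e for e in extractions if e.char_interval]
    ungrounded = [e for e in extractions if not e.char_interval]
    return grounded, ungrounded


source = "Lady Juliet gazed longingly at the stars"
start = source.find("Juliet")
extractions = [
    StubExtraction("character", "Juliet", (start, start + len("Juliet"))),
    StubExtraction("character", "ROMEO"),  # leaked from a few-shot example
]
grounded, ungrounded = split_grounded(extractions)
print([e.extraction_text for e in grounded])
print([e.extraction_text for e in ungrounded])
```

Keeping the ungrounded list around, rather than discarding it, can help when reviewing whether the few-shot examples are leaking into results.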

2. Run the Extraction

Provide your input text and the prompt materials to the lx.extract function.

# The input text to be processed
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"

# Run the extraction
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-3.5-flash",
)

For advanced constraints beyond examples, such as enum values on extraction attributes, Gemini and OpenAI support output_schema with or without few-shot examples. See Custom output schemas.

Model Selection: gemini-3.5-flash is the recommended default, offering strong extraction quality for LangExtract's schema-constrained workflows. For high-volume or cost-sensitive workloads, consider the current stable Flash-Lite model, gemini-3.1-flash-lite; for highly complex tasks requiring deeper reasoning, evaluate a current Gemini Pro model from the official model documentation. For large-scale or production use, a paid Gemini tier is suggested to increase throughput and avoid rate limits. See the rate-limit documentation for details.

Model Lifecycle: Note that Gemini models have a lifecycle with defined retirement dates. Users should consult the official model version documentation to stay informed about the latest stable and legacy versions.

3. Visualize the Results

The extractions can be saved to a .jsonl file, a popular format for working with language model data. LangExtract can then generate an interactive HTML visualization from this file to review the entities in context.

# Save the results to a JSONL file
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")

# Generate the visualization from the file
html_content = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w") as f:
    if hasattr(html_content, 'data'):
        f.write(html_content.data)  # For Jupyter/Colab
    else:
        f.write(html_content)

This creates an animated and interactive HTML file:

Romeo and Juliet Basic Visualization
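
Because JSONL stores one JSON object per line, a saved results file can also be inspected with the standard library. This is a generic sketch: `read_jsonl` is a hypothetical helper, and the record fields in the sample file are placeholders, not the schema LangExtract writes.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Write a tiny stand-in file so the sketch runs on its own.
# These fields are placeholders, not LangExtract's output schema.
sample_records = [
    {"id": "doc-1", "items": ["a", "b"]},
    {"id": "doc-2", "items": []},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        for record in sample_records:
            f.write(json.dumps(record) + "\n")
    loaded = list(read_jsonl(sample_path))

print(len(loaded), "records")
```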

Note on LLM Knowledge Utilization: This example demonstrates extractions that stay close to the text evidence - extracting "longing" for Lady Juliet's emotional state and identifying "yearning" from "gazed longingly at the stars." The task could be modified to generate attributes that draw more heavily from the LLM's world knowledge (e.g., adding "identity": "Capulet family daughter" or "literary_context": "tragic heroine"). The balance between text-evidence and knowledge-inference is controlled by your prompt instructions and example attributes.

Scaling to Longer Documents

For larger texts, you can process entire documents directly from URLs with parallel processing and enhanced sensitivity:

# Process Romeo & Juliet directly from Project Gutenberg
result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-3.5-flash",
    extraction_passes=3,    # Improves recall through multiple passes
    max_workers=20,         # Parallel processing for speed
    max_char_buffer=1000    # Smaller contexts for better accuracy
)

This approach can extract hundreds of entities from full novels while maintaining high accuracy. The interactive visualization handles large result sets, making it easy to explore the entities in the output JSONL file. See the full Romeo and Juliet extraction example for detailed results and performance insights.
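
To make the roles of `max_char_buffer` and `max_workers` more concrete, here is a simplified sketch of chunking followed by parallel processing. It is not LangExtract's chunker or scheduler: `chunk_text` and `fake_extract` are hypothetical, and `fake_extract` stands in for a model call by returning capitalized words.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text, max_chars):
    """Split text into chunks of at most max_chars, breaking at whitespace.

    Illustrative only; a single word longer than max_chars is kept whole.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def fake_extract(chunk):
    """Placeholder for a model call: returns capitalized words in the chunk."""
    return [w.strip(".,") for w in chunk.split() if w[0].isupper()]


text = "Juliet gazed at the stars. Romeo waited below the window for Juliet."
chunks = chunk_text(text, max_chars=30)

# map() preserves input order, so results line up with their chunks.
with ThreadPoolExecutor(max_workers=4) as pool:
    per_chunk = list(pool.map(fake_extract, chunks))

print(chunks)
print(per_chunk)
```

Smaller chunks give each call less text to search, at the cost of more calls; running them in a thread pool keeps wall-clock time down when each call is I/O bound.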

Vertex AI Batch Processing

Save costs on large-scale tasks by enabling Vertex AI Batch API with language_model_params that include vertexai=True, project, location, and a batch config.

See this example for Vertex AI Batch API usage.
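
As a rough illustration of the parameters named above, the dictionary below shows one possible shape. The `vertexai`, `project`, and `location` keys come from the text; the name of the `batch` key and its `{"enabled": True}` value are assumptions, so check the linked example before relying on them.

```python
# Sketch only. The "batch" key name and its {"enabled": True} value are
# assumptions, not confirmed by the text above.
language_model_params = {
    "vertexai": True,
    "project": "your-project-id",  # placeholder
    "location": "your-location",   # placeholder
    "batch": {"enabled": True},    # assumed form of the batch config
}
print(sorted(language_model_params))
```

The dictionary would be passed as `language_model_params=` to `lx.extract`, in the same way as in the Vertex AI authentication option.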

Installation

From PyPI

pip install langextract

Recommended for most users. For isolated environments, consider using a virtual environment:

python -m venv langextract_env
source langextract_env/bin/activate  # On Windows: langextract_env\Scripts\activate
pip install langextract

From Source

LangExtract uses modern Python packaging with pyproject.toml for dependency management.

Installing with -e puts the package in development mode, allowing you to modify the code without reinstalling.

git clone https://github.com/google/langextract.git
cd langextract

# For basic installation:
pip install -e .

# For development (includes linting tools):
pip install -e ".[dev]"

# For testing (includes pytest):
pip install -e ".[test]"

Docker

docker build -t langextract .
docker run --rm -e LANGEXTRACT_API_KEY="your-api-key" langextract python your_script.py

API Key Setup for Cloud Models

When using LangExtract with cloud-hosted models (like Gemini or OpenAI), you'll need to set up an API key. On-device models don't require an API key. For developers using local LLMs, LangExtract offers built-in support for Ollama and can be extended to other third-party APIs by updating the inference endpoints.

API Key Sources

Get API keys from:

AI Studio for Gemini models
Vertex AI for enterprise use
OpenAI Platform for OpenAI models

Setting up API key in your environment

Option 1: Environment Variable

export LANGEXTRACT_API_KEY="your-api-key-here"
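
A script can confirm that the variable is visible before starting a long run. The following is a minimal sketch: `get_api_key` is a hypothetical helper, not part of LangExtract, and it only reads the `LANGEXTRACT_API_KEY` name shown above.

```python
import os


def get_api_key(env=os.environ):
    """Return the API key from LANGEXTRACT_API_KEY or raise a clear error."""
    key = env.get("LANGEXTRACT_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "LANGEXTRACT_API_KEY is not set; export it before running.")
    return key


# Demonstrated with an explicit mapping so the sketch runs without a real key.
print(get_api_key({"LANGEXTRACT_API_KEY": "your-api-key-here"}))
```

Calling `get_api_key()` with no argument reads the real process environment.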

Option 2: .env File (Recommended)

Add your API key to a .env file:

# Add API key to .env file
cat >> .env << 'EOF'
LANGEXTRACT_API_KEY=your-api-key-here
EOF

# Keep your API key secure
echo '.env' >> .gitignore

In your Python code:

import langextract as lx

result = lx.extract(
    text_or_documents=input_text,
    prompt_description="Extract information...",
    examples=[...],
    model_id="gemini-3.5-flash"
)

Option 3: Direct API Key (Not Recommended for Production)

You can also provide the API key directly in your code, though this is not recommended for production use:

result = lx.extract(
    text_or_documents=input_text,
    prompt_description="Extract information...",
    examples=[...],
    model_id="gemini-3.5-flash",
    api_key="your-api-key-here"  # Only use this for testing/development
)

Option 4: Vertex AI (Service Accounts)

Use Vertex AI for authentication with service accounts:

result = lx.extract(
    text_or_documents=input_text,
    prompt_description="Extract information...",
    examples=[...],
    model_id="gemini-3.5-flash",
    language_model_params={
        "vertexai": True,
        "project": "your-project-id",
        "location": "global"  # or regional endpoint
    }
)

Adding Custom Model Providers

LangExtract supports custom LLM providers via a lightweight plugin system. You can add support for new models without changing core code.

Add new model support independently of the core library
Distribute your provider as a separate Python package
Keep custom dependencies isolated
Override or extend built-in providers via priority-based resolution

See the detailed guide in Provider System Documentation to learn how to:

Register a provider with @router.register(...) from langextract.providers
Publish an entry point for discovery
Optionally provide a schema with get_schema_class() for structured output
Integrate with the factory via create_model(...)
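
The idea of priority-based resolution can be illustrated with a small standalone registry. This is not LangExtract's router: `register`, `resolve_provider`, and both provider classes are hypothetical, and the real decorator's arguments are described in the Provider System Documentation.

```python
import re

_REGISTRY = []  # list of (compiled pattern, priority, provider class)


def register(pattern, priority=0):
    """Decorator that records a provider class for model IDs matching pattern."""
    def decorator(cls):
        _REGISTRY.append((re.compile(pattern), priority, cls))
        return cls
    return decorator


def resolve_provider(model_id):
    """Return the matching provider class with the highest priority."""
    matches = [(priority, cls) for regex, priority, cls in _REGISTRY
               if regex.search(model_id)]
    if not matches:
        raise LookupError(f"no provider registered for {model_id!r}")
    return max(matches, key=lambda match: match[0])[1]


@register(r"^gemini", priority=0)
class BuiltInProvider:
    """Hypothetical built-in provider."""


@register(r"^gemini-custom", priority=10)
class CustomProvider:
    """Hypothetical plugin that overrides the built-in for some model IDs."""


print(resolve_provider("gemini-custom-1").__name__)
print(resolve_provider("gemini-flash").__name__)
```

Both patterns match "gemini-custom-1", and the higher priority decides the winner, which is how a plugin can override a built-in provider without touching core code.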

Using OpenAI Models

LangExtract supports OpenAI models (requires optional dependency: pip install langextract[openai]):

```python
import langextract as lx

# OPENAI_API_KEY in the environment is picked up automatically; pass
# api_key=
```

Core symbols most depended-on inside this repo

tokenize · called by 47 · langextract/core/tokenizer.py
resolve · called by 29 · langextract/resolver.py
infer · called by 24 · langextract/providers/ollama.py
infer · called by 23 · langextract/providers/openai.py
tokenize · called by 21 · langextract/core/tokenizer.py
annotate_text · called by 16 · langextract/annotation.py
infer · called by 14 · langextract/providers/gemini.py
build_prompt · called by 12 · langextract/prompting.py

Shape

Method 808

Function 256

Class 218

Route 152

Languages

Python 100%

Modules by API surface

tests/inference_test.py · 75 symbols
tests/test_kwargs_passthrough.py · 68 symbols
tests/factory_test.py · 61 symbols
tests/schema_test.py · 56 symbols
tests/tokenizer_test.py · 54 symbols
tests/openai_batch_test.py · 50 symbols
tests/gemini_retry_test.py · 48 symbols
tests/test_gemini_batch_api.py · 47 symbols
tests/provider_plugin_test.py · 47 symbols
tests/resolver_test.py · 44 symbols
tests/init_test.py · 43 symbols
tests/provider_schema_test.py · 36 symbols

Dependencies from manifests, versioned

PyYAML 6.0 · 1×
absl-py 1.0.0 · 1×
aiohttp 3.8.0 · 1×
async_timeout 4.0.0 · 1×
exceptiongroup 1.1.0 · 1×
google-cloud-storage 2.14.0 · 1×
google-genai 1.39.0 · 1×
langextract · 1×
ml-collections 0.1.0 · 1×
more-itertools 8.0.0 · 1×
numpy 1.20.0 · 1×
pandas 1.3.0 · 1×

For agents

$ claude mcp add langextract \
  -- python -m otcore.mcp_server <graph>
