
github.com/e-p-armstrong/augmentoolkit @v3.0.0 sqlite

723 symbols 2,791 edges 128 files 169 documented · 23%
README

Augmentoolkit - Data for Domain-expert AI

Augmentoolkit creates domain-expert datasets that update an AI's brain (basically, its knowledge cutoff), so that the AI becomes an expert in an area of your choosing.

You upload documents, press a button, and get a fully trained custom LLM. Now every aspect of your AI's behavior and understanding is under your control. Better still, Augmentoolkit optionally works offline on your computer -- no external API key required* for datagen† on most hardware.

Maybe you want AI to know the latest research papers in your field, or perhaps you want an LLM that understands your passion deeply and has learned from the same sources as you. Possibly, you dream of creating a lore expert for your favorite obscure fictional universe. Whatever the application is, Augmentoolkit lets you take text and make an LLM's brain inherently learn the information contained within. It also automatically creates a RAG-ready dataset (and can start up an inference server) if you want some traditional grounding as well.

Get started now (the interface will guide you through generating your first dataset):

MacOS (interface)

git clone https://github.com/e-p-armstrong/augmentoolkit.git
cd augmentoolkit
bash macos.sh # NOTE: Will attempt to install valkey via brew if not found.

Linux (interface)

git clone https://github.com/e-p-armstrong/augmentoolkit.git
cd augmentoolkit
bash linux.sh # NOTE: will build Valkey from source if a Redis/Valkey server is not running

Or for local inference

git clone https://github.com/e-p-armstrong/augmentoolkit.git
cd augmentoolkit
bash local_linux.sh normal # or you can write "small" or a custom model name to serve the quantized version (for more consumer hardware) or a model of your choice, respectively

Windows (interface)

[!NOTE]

The interface requires Valkey (or Redis) to be installed and running MANUALLY. The CLI is honestly easier to get running on Windows. Running the bat command will give you install instructions. Alternatively, see docs/quickstart.md

git clone https://github.com/e-p-armstrong/augmentoolkit.git
cd augmentoolkit
./windows.bat # see above note about valkey/redis

Note that datagen can take a while on a lot of hardware; don't expect fast datagen on an old Mac, for instance. For training, you will need either a powerful machine of your own or a rented one (renting is done automatically for you if you so choose). †If you want data to generate faster, you can* use an open-source LLM API, and the quickstart encourages you to. Augmentoolkit is optimized for open-source LLMs like Deepseek or Llama.

Augmentoolkit, now that it is on its 3.0 version, has been refined and improved through over a year of professional application and experimentation. It is now the best way in the world to create domain expert LLMs, and it's MIT licensed.

If you use this project and like it, please consider starring the repo! It's also designed to be extremely customizable, so consider forking Augmentoolkit!

[!IMPORTANT]

The links below contain very useful information. A bit further down, there is a table of contents that links to extensive documentation pages for any conceivable part of the project.

Help Videos: I walk through how to do all the cool stuff in this project starting from scratch, including training LLMs with the data and configs you get (takes 10 minutes). Check out the help videos if you want further guidance!

Community: If you have questions, if you are training models, if you are building cool new pipelines or extensions on top of Augmentoolkit's code, or if you just want to hang out, I'd love to see you on the Augmentoolkit Discord! It's also a good place to reach me.

Newsletter: I write about model training and data generation over on a Substack. It's totally free; I just want to help people use the tool better.

Contact: I'm doing all kinds of things around this project. If you're interested in the mission and the business of bringing custom, personally-aligned AI to everyone, let's get in touch!

Build: Augmentoolkit is meant to be the go-to tool for people experimenting with training LLMs, whether they're hobbyists or professionals. To that end, building new pipelines is as simple as writing Python functions (while adhering to about 2 mostly optional conventions). Efficient explainers and pipeline templates are provided for you to build your own dataset generation pipelines, and by extension, your own datasets and your own completely custom LLMs.

All configs are fully annotated with comments and placeholders to help you understand them as you fill them out.
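To make "pipelines are Python functions" concrete, here is a minimal sketch of the general shape such a function could take. It is not Augmentoolkit's actual template or its conventions: `fake_generate`, `tiny_pipeline`, the `dataset.jsonl` output name, and the fixed-size character chunking are all hypothetical and chosen only for illustration. A real pipeline would call an LLM where the stand-in function is used.

```python
import asyncio
import json
from pathlib import Path


async def fake_generate(chunk: str) -> str:
    """Hypothetical stand-in for an LLM call (assumption, not a real API)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"Q: What does this passage say?\nA: {chunk[:40]}"


async def tiny_pipeline(input_dir: str, output_dir: str, chunk_size: int = 200) -> int:
    """Read .txt files, chunk them, and write one generated sample per chunk.

    Returns the number of samples written. All names here are illustrative.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Split every input document into fixed-size character chunks.
    chunks = []
    for path in sorted(Path(input_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        chunks += [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    # Generate all samples concurrently, preserving chunk order.
    samples = await asyncio.gather(*(fake_generate(c) for c in chunks))

    # Write the dataset as JSON Lines, one sample per line.
    lines = [json.dumps({"text": s}) for s in samples]
    (out / "dataset.jsonl").write_text("\n".join(lines), encoding="utf-8")
    return len(samples)
```

Under these assumptions, `asyncio.run(tiny_pipeline("inputs", "outputs"))` would process every `.txt` file in `inputs` and leave a `dataset.jsonl` in `outputs`. The real pipeline templates in the repo are the authoritative reference for structure and conventions.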

Documentation Pages

[!NOTE]

This documentation page (the main README) has an important note about model training for facts that you should read regardless of your experience level.

  1. Quickstart
  2. Video Help
  3. Vision
  4. Longstart (customization and development guide links)

If you're familiar with LLMs and want a more jargonful rundown of what Augmentoolkit is and what makes it cool, check out this section

Cite: DOI

Video Tutorials

Train a Model on your Own Data in 13 Minutes

Interface Deep Dive!

CLI and Code Structure Deep Dive!

^ This one is useful if you're going to make modifications to the code

Benefits

Augmentoolkit makes LLM data easy.

- Cheap: Augmentoolkit pipelines use open-source LLMs, and so can be run on consumer hardware for hardly any cost, or cheaply via APIs like Deepinfra (the "local" prompt sets should enable usage of most pipelines by reasoning models, too)
- Effortless: Any Augmentoolkit pipeline can be run with an intuitive interface that is started by running a start script. Alternatively, you can make data by putting some files in a folder, and then running a Python script. If that's too much, you can also use the graphical user interface, now a first-class citizen in Augmentoolkit 3 (and in fact, the recommended way to run Augmentoolkit). Previously-started runs are continued automatically, so you don't need to worry about interruptions costing you time and/or money.
- Fast: when using APIs, you can quickly generate millions of trainable tokens. Fully async code lets you get results quickly. Reading and chunking caches ensure that even large-scale workloads are quick to use. Models are automatically trained after the data is ready, and are even automatically downloaded and prepared for inference on your local machine. All the hard or annoying parts of the process have been automated and made efficient. In the past creating datasets and iterating and testing and learning could have taken a skilled person months; now, anyone can press a button, come back in a day, and chat with a newly-trained model.
- Innovative, Effective Approach to Factual Training: Augmentoolkit has a production-tested method of creating domain-expert LLMs that can understand entirely new subjects. Many separate pipelines are composed together to produce quality datasets that teach capabilities such as answering factual questions, acknowledging when something is not known by the model, correcting mistakes, etc. You can be confident in getting high-quality specialist models when you use Augmentoolkit.
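The combination of async generation, caching, and automatically continued runs described above can be illustrated with a small sketch. This is not how Augmentoolkit implements it (the README does not show that). It assumes a simple design where each chunk's output is saved to disk under a hash of its text, so a restarted run skips finished chunks. `fake_llm`, `run_resumable`, and the cache layout are hypothetical names for illustration.

```python
import asyncio
import hashlib
import json
from pathlib import Path


async def fake_llm(chunk: str) -> str:
    """Hypothetical stand-in for a model call (assumption, not a real API)."""
    await asyncio.sleep(0)
    return chunk.upper()


def _cache_key(chunk: str) -> str:
    # Identify a chunk by the hash of its content.
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


async def run_resumable(chunks, cache_dir, concurrency=4, generate=fake_llm):
    """Process chunks concurrently, skipping any already saved in cache_dir.

    Returns (outputs in input order, number of fresh generate() calls).
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    sem = asyncio.Semaphore(concurrency)  # cap the number of in-flight calls
    calls = 0

    async def process(chunk: str) -> str:
        nonlocal calls
        path = cache / f"{_cache_key(chunk)}.json"
        if path.exists():  # finished in an earlier run: reuse the saved output
            return json.loads(path.read_text(encoding="utf-8"))["output"]
        async with sem:
            output = await generate(chunk)
            calls += 1
        path.write_text(json.dumps({"output": output}), encoding="utf-8")
        return output

    outputs = await asyncio.gather(*(process(c) for c in chunks))
    return list(outputs), calls
```

Running the function twice against the same cache directory only pays for chunks that were not completed the first time, which is the property that makes an interrupted run cheap to continue.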

We've also done our best to facilitate the step after you generate your data -- training your LLM:

- Production-Scale: Datasets that are gigabytes-large have been generated with Augmentoolkit -- it is battle-hardened, it works at scale without annoying inefficiencies costing immense time, and it is ready for the stresses of production.
- Train an AI for the cost of a dinner: you can generate data on your own hardware for what is basically free. Augmentoolkit can then automatically perform a full finetune of an AI, on your own data, for a tiny sum of money (roughly $20 for the finetuning part of the process).
- Create your LLM in less than a day: with a fully automated process for turning documents into datasets, and only a single button-click needed to kick off training, making a subject matter expert LLM is fast (especially when you use an API for the dataset generation). Iterate quickly and cheaply.
- When you use the same recipe, you get the same bread: Augmentoolkit datasets have been used successfully for professional consulting projects. Video documentation is linked in this README that shows exactly how to use this tool to do the same. The code, settings, and prompts you need are all here. Examples, templates, comments, marked-out placeholders, and extensive documentation are all available.
- Train AI with confidence, especially if it's your first time: between the battle-tested process, extensive video docs, in-depth README, and Discord community, you can be confident you'll get a good LLM out of this.

Do it all locally: With a custom-trained 7B model built to run these pipelines specifically, Augmentoolkit can generate data on consumer hardware, and can do so at incredible scale, with incredible parallelism, when on higher-performance computers. Budget does not need to be a constraint -- just passion and time. Of course, if you want immediate results/speed, you can use an API too.

Finally, using the model you create should be easy and valuable:

- AI that understands your facts: For the professionals and the passionate: training an LLM with Augmentoolkit's Complete Factual Datagen "composition" pipeline creates an assistant that understands the big picture of the data you're training on. If RAG is like giving an LLM an open-book test on a textbook it hasn't read before, then training on Augmentoolkit data gives it some time to study before the test as well. This pipeline has been battle-tested in consulting projects across different industries. Compared to earlier versions of Augmentoolkit, Augmentoolkit's 3.0 version generates a wide variety of different domain data, and it even automatically balances this data with the generic data it uses.
- Individual Alignment: Use GRPO (the same algorithm that made Deepseek R1

Core symbols most depended-on inside this repo

set_progress · called by 105 · redis_config.py
_sample_tracked_item · called by 44 · generation/core_composition/meta_datagen/meta.py
run · called by 42 · augmentoolkit/generation_functions/one_to_many_step.py
count_tokens · called by 24 · generation/core_components/chunking.py
make_relative_to_self · called by 23 · generation/core_components/setup_components.py
getApiUrl · called by 23 · atk-interface/src/utils/apiUtils.js
random · called by 19 · atk-interface/src/components/ProgressBarParticles.jsx
create_input_token_counter · called by 18 · augmentoolkit/utils/observers.py

Shape

Function 587
Method 68
Route 36
Class 32

Languages

Python 77%
TypeScript 23%

Modules by API surface

api.py · 72 symbols
generation/core_pipelines/rptoolkit/rptoolkit_helpers.py · 39 symbols
atk-interface/src/api.js · 32 symbols
generation/core_components/chunking.py · 29 symbols
generation/core_pipelines/representation_variation/repvar.py · 20 symbols
generation/core_pipelines/recall_multiple_sources/multi_source_helpers.py · 18 symbols
generation/core_pipelines/factual_generation_individual/factual_generation_helpers.py · 17 symbols
generation/core_pipelines/do_grpo_rl_with_a_prompt/grpo.py · 17 symbols
atk-interface/src/pages/ConfigSelector.jsx · 17 symbols
augmentoolkit/generation_functions/pipeline_step_class.py · 16 symbols
generation/core_components/data_prep_operations.py · 15 symbols
generation/core_pipelines/do_grpo_rl_with_a_prompt/reward_functions.py · 14 symbols

Dependencies from manifests, versioned

@eslint/js 9.22.0 · 1×
@monaco-editor/react 4.7.0 · 1×
@types/react 19.0.10 · 1×
@types/react-dom 19.0.4 · 1×
@vitejs/plugin-react 4.3.4 · 1×
autoprefixer 10.4.21 · 1×
eslint 9.22.0 · 1×
eslint-plugin-react-hooks 5.2.0 · 1×
eslint-plugin-react-refresh 0.4.19 · 1×
file-saver 2.0.5 · 1×
framer-motion 12.7.4 · 1×
globals 16.0.0 · 1×

For agents

$ claude mcp add augmentoolkit \
  -- python -m otcore.mcp_server <graph>
