
github.com/mlc-ai/web-llm @v0.2.83 sqlite


754 symbols · 1,530 edges · 95 files · 74 documented (10%)

README

WebLLM

High-Performance In-Browser LLM Inference Engine.


Overview

WebLLM is a high-performance in-browser LLM inference engine that brings language model inference directly onto web browsers with hardware acceleration. Everything runs inside the browser with no server support and is accelerated with WebGPU.

WebLLM is fully compatible with the OpenAI API. That is, you can use the same OpenAI API with any open-source model locally, with functionality including streaming, JSON mode, and function calling (WIP).

This opens up many opportunities to build AI assistants for everyone while preserving privacy and benefiting from GPU acceleration.

You can use WebLLM as a base npm package and build your own web application on top of it by following the examples below. This project is a companion project of MLC LLM, which enables universal deployment of LLM across hardware environments.

Check out WebLLM Chat to try it out!

Key Features

In-Browser Inference: WebLLM is a high-performance, in-browser language model inference engine that leverages WebGPU for hardware acceleration, enabling powerful LLM operations directly within web browsers without server-side processing.
Full OpenAI API Compatibility: Seamlessly integrate your app with WebLLM using OpenAI API with functionalities such as streaming, JSON-mode, logit-level control, seeding, and more.
Structured JSON Generation: WebLLM supports state-of-the-art JSON mode structured generation, implemented in the WebAssembly portion of the model library for optimal performance. Check WebLLM JSON Playground on HuggingFace to try generating JSON output with custom JSON schema.
Extensive Model Support: WebLLM natively supports a range of models including Llama 3, Phi 3, Gemma, Mistral, Qwen (通义千问), and many others, making it versatile for various AI tasks. For the complete supported model list, check MLC Models.
Custom Model Integration: Easily integrate and deploy custom models in MLC format, allowing you to adapt WebLLM to specific needs and scenarios, enhancing flexibility in model deployment.
Plug-and-Play Integration: Easily integrate WebLLM into your projects using package managers like NPM and Yarn, or directly via CDN, complete with comprehensive examples and a modular design for connecting with UI components.
Streaming & Real-Time Interactions: Supports streaming chat completions, allowing real-time output generation which enhances interactive applications like chatbots and virtual assistants.
Web Worker & Service Worker Support: Optimize UI performance and manage the lifecycle of models efficiently by offloading computations to separate worker threads or service workers.
Chrome Extension Support: Extend the functionality of web browsers through custom Chrome extensions using WebLLM, with examples available for building both basic and advanced extensions.

Built-in Models

Check the complete list of available models on MLC Models. WebLLM supports a subset of these available models and the list can be accessed at prebuiltAppConfig.model_list.
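As a rough illustration of working with that list, the sketch below filters model IDs by a substring. `ModelRecordLike` and `findModelIds` are hypothetical names introduced here; the only assumption about the real entries is that each one carries a `model_id` string.

```typescript
// Hypothetical stand-in for the entries of prebuiltAppConfig.model_list.
// Assumption: each entry exposes its identifier as `model_id`.
interface ModelRecordLike {
  model_id: string;
}

// Return the IDs of all records whose model_id contains `needle`
// (case-insensitive), for example "llama" or "q4f32".
function findModelIds(
  modelList: readonly ModelRecordLike[],
  needle: string,
): string[] {
  const lowered = needle.toLowerCase();
  return modelList
    .map((record) => record.model_id)
    .filter((id) => id.toLowerCase().indexOf(lowered) !== -1);
}

// In an application this might be used as follows (not executed here):
//   import { prebuiltAppConfig } from "@mlc-ai/web-llm";
//   const llamaIds = findModelIds(prebuiltAppConfig.model_list, "llama");
```

A helper like this can populate a model picker without hard-coding IDs that may change between releases.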

Here are the primary families of models currently supported:

Llama: Llama 3, Llama 2, Hermes-2-Pro-Llama-3
Phi: Phi 3, Phi 2, Phi 1.5
Gemma: Gemma-2B
Mistral: Mistral-7B-v0.3, Hermes-2-Pro-Mistral-7B, NeuralHermes-2.5-Mistral-7B, OpenHermes-2.5-Mistral-7B
Qwen (通义千问): Qwen2 0.5B, 1.5B, 7B

If you need more models, request a new model by opening an issue, or check Custom Models for how to compile and use your own models with WebLLM.

Jumpstart with Examples

Learn how to use WebLLM to integrate large language models into your application and generate chat completions through the simple Chatbot example.

For an advanced example of a larger, more complicated project, check WebLLM Chat.

More examples for different use cases are available in the examples folder.

Get Started

WebLLM offers a minimalist interface to access the chatbot in the browser. The package is modular by design, so it can hook into any UI component.

Installation

Package Manager

# npm
npm install @mlc-ai/web-llm
# yarn
yarn add @mlc-ai/web-llm
# or pnpm
pnpm install @mlc-ai/web-llm

Then import the module in your code.

// Import everything
import * as webllm from "@mlc-ai/web-llm";
// Or only import what you need
import { CreateMLCEngine } from "@mlc-ai/web-llm";

CDN Delivery

Thanks to jsdelivr.com, WebLLM can be imported directly through a URL and works out of the box on cloud development platforms like jsfiddle.net, Codepen.io, and Scribbler:

import * as webllm from "https://esm.run/@mlc-ai/web-llm";

It can also be dynamically imported as:

const webllm = await import("https://esm.run/@mlc-ai/web-llm");

Create MLCEngine

Most operations in WebLLM are invoked through the MLCEngine interface. You can create an MLCEngine instance and load the model by calling the CreateMLCEngine() factory function.

(Note that loading a model requires downloading it, which can take a significant amount of time on the very first run, before anything is cached. You should handle this asynchronous call properly.)

import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Callback function to update model loading progress
const initProgressCallback = (initProgress) => {
  console.log(initProgress);
};
const selectedModel = "Llama-3.1-8B-Instruct-q4f32_1-MLC";

const engine = await CreateMLCEngine(
  selectedModel,
  { initProgressCallback: initProgressCallback }, // engineConfig
);

Under the hood, this factory function first creates an engine instance (synchronous) and then loads the model (asynchronous). You can also perform these two steps separately in your application.

import { MLCEngine } from "@mlc-ai/web-llm";

// This is a synchronous call that returns immediately
const engine = new MLCEngine({
  initProgressCallback: initProgressCallback,
});

// This is an asynchronous call and can take a long time to finish
await engine.reload(selectedModel);
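The initProgressCallback in the snippets above simply logs the raw report. A minimal sketch of turning that report into a status line for a UI might look like this. `InitProgressLike` and `formatInitProgress` are hypothetical names, and the sketch assumes the report carries a `progress` fraction between 0 and 1 and a `text` message; check the actual report type before relying on these fields.

```typescript
// Assumed shape of the report passed to initProgressCallback.
// Both fields are assumptions for this sketch.
interface InitProgressLike {
  progress: number; // fraction of loading completed, expected in [0, 1]
  text: string; // human-readable status message
}

// Format a report as "[NN%] message", clamping out-of-range progress values.
function formatInitProgress(report: InitProgressLike): string {
  const clamped = Math.min(1, Math.max(0, report.progress));
  const percent = Math.round(clamped * 100);
  return `[${percent}%] ${report.text}`;
}

// Possible use as a callback (not executed here; `statusElement` is hypothetical):
//   const initProgressCallback = (report) => {
//     statusElement.textContent = formatInitProgress(report);
//   };
```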

Cache Backend Policy

WebLLM supports three cache backends through AppConfig.cacheBackend:

"cache": browser Cache API (default).
"indexeddb": browser IndexedDB.
"cross-origin": experimental Chrome Cross-Origin Storage API extension backend. Install the Cross-Origin Storage extension to use it. (If the extension isn't installed, WebLLM falls back to the default cache automatically.)

Example:

import { CreateMLCEngine, prebuiltAppConfig } from "@mlc-ai/web-llm";

const appConfig = { ...prebuiltAppConfig, cacheBackend: "cross-origin" };
const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC", {
  appConfig,
});

Notes:

- The "cross-origin" backend requires installing and enabling a compatible browser extension.
- The "cross-origin" backend currently does not support programmatic tensor-cache deletion; clearing is extension-managed.
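If the backend name comes from user input, such as a settings form or a URL parameter, it may be worth validating it before building the config. The sketch below is one way to do that; `CacheBackend`, `resolveCacheBackend`, and `withCacheBackend` are hypothetical helpers, and only the three backend strings are taken from the list above.

```typescript
// The three backend names listed above.
type CacheBackend = "cache" | "indexeddb" | "cross-origin";

const CACHE_BACKENDS: readonly CacheBackend[] = [
  "cache",
  "indexeddb",
  "cross-origin",
];

// Map an untrusted string to a known backend, falling back to the default.
function resolveCacheBackend(value: string | null | undefined): CacheBackend {
  for (const backend of CACHE_BACKENDS) {
    if (backend === value) {
      return backend;
    }
  }
  return "cache";
}

// Return a copy of `baseConfig` with a validated cacheBackend field.
function withCacheBackend<T extends object>(
  baseConfig: T,
  value: string | null | undefined,
): T & { cacheBackend: CacheBackend } {
  return { ...baseConfig, cacheBackend: resolveCacheBackend(value) };
}

// Possible use (not executed here):
//   const appConfig = withCacheBackend(prebuiltAppConfig, "indexeddb");
```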

Chat Completion

After successfully initializing the engine, you can invoke chat completions using OpenAI-style chat APIs through the engine.chat.completions interface. For the full list of parameters and their descriptions, check the section below and the OpenAI API reference.

(Note: The model parameter is not supported and will be ignored here. Instead, call CreateMLCEngine(model) or engine.reload(model), as shown in Create MLCEngine above.)

const messages = [
  { role: "system", content: "You are a helpful AI assistant." },
  { role: "user", content: "Hello!" },
];

const reply = await engine.chat.completions.create({
  messages,
});
console.log(reply.choices[0].message);
console.log(reply.usage);
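For a multi-turn conversation, one common pattern with OpenAI-style APIs is to keep the message history and resend it on each turn. The sketch below assumes that pattern applies here. `ChatEngineLike` is a hypothetical minimal interface covering only what the helper needs, and a real engine object may need a cast or a thin adapter to satisfy it.

```typescript
interface ChatMessageLike {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical minimal view of the engine used by this sketch.
interface ChatEngineLike {
  chat: {
    completions: {
      create(request: { messages: ChatMessageLike[] }): Promise<{
        choices: { message: { content: string | null } }[];
      }>;
    };
  };
}

// Send one user turn with the full history, then record both the user
// message and the assistant reply. History is only updated on success.
async function askNextTurn(
  engine: ChatEngineLike,
  history: ChatMessageLike[],
  userText: string,
): Promise<string> {
  const userMessage: ChatMessageLike = { role: "user", content: userText };
  const reply = await engine.chat.completions.create({
    messages: [...history, userMessage],
  });
  const answer = reply.choices[0]?.message.content ?? "";
  history.push(userMessage, { role: "assistant", content: answer });
  return answer;
}
```

Updating the history only after the call resolves keeps a failed request from leaving a dangling user message behind.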

Streaming

WebLLM also supports streaming chat completion generation. To use it, pass stream: true to the engine.chat.completions.create call.

const messages = [
  { role: "system", content: "You are a helpful AI assistant." },
  { role: "user", content: "Hello!" },
];

// Chunks is an AsyncGenerator object
const chunks = await engine.chat.completions.create({
  messages,
  temperature: 1,
  stream: true, // <-- Enable streaming
  stream_options: { include_usage: true },
});

let reply = "";
for await (const chunk of chunks) {
  reply += chunk.choices[0]?.delta.content || "";
  console.log(reply);
  if (chunk.usage) {
    console.log(chunk.usage); // only last chunk has usage
  }
}

const fullReply = await engine.getMessage();
console.log(fullReply);
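The accumulation loop above can be factored into a reusable helper. The following is a sketch under the assumption that chunks have the shape used in that loop (an optional `delta.content` per choice, and `usage` only on the final chunk). `ChunkLike` and `collectStream` are hypothetical names.

```typescript
// Assumed minimal shape of a streamed chunk, mirroring the loop above.
interface ChunkLike {
  choices: { delta: { content?: string | null } }[];
  usage?: unknown;
}

// Consume a stream of chunks, invoking `onText` with the text so far after
// each non-empty piece, and return the full text plus any usage report.
async function collectStream(
  chunks: AsyncIterable<ChunkLike>,
  onText?: (textSoFar: string) => void,
): Promise<{ text: string; usage: unknown }> {
  let text = "";
  let usage: unknown = undefined;
  for await (const chunk of chunks) {
    const piece = chunk.choices[0]?.delta.content ?? "";
    if (piece) {
      text += piece;
      onText?.(text);
    }
    if (chunk.usage) {
      usage = chunk.usage; // expected only on the last chunk
    }
  }
  return { text, usage };
}

// Possible use (not executed here; `outputElement` is hypothetical):
//   const { text } = await collectStream(chunks, (t) => {
//     outputElement.textContent = t;
//   });
```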

Advanced Usage

Using Workers

You can put the heavy computation in a worker script to optimize your application performance. To do so, you need to:

Create a handler in the worker thread that communicates with the frontend while handling the requests.
Create a Worker Engine in your main application, which under the hood sends messages to the handler in the worker thread.

For detailed implementations of different kinds of Workers, check the following sections.
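To make the handler/engine split concrete, here is a conceptual sketch of the request/response pattern such a pair could use. It is not WebLLM's actual wire protocol: every name and message shape below is hypothetical, and the two sides are wired together in-process rather than through a real Worker.

```typescript
// Hypothetical message shapes; NOT WebLLM's real protocol.
interface SketchRequest {
  id: number;
  payload: string;
}
interface SketchResponse {
  id: number;
  result: string;
}

// "Worker side": receives requests and replies with results.
class SketchHandler {
  private readonly reply: (res: SketchResponse) => void;
  constructor(reply: (res: SketchResponse) => void) {
    this.reply = reply;
  }
  onmessage(req: SketchRequest): void {
    // Stand-in for the heavy computation done off the main thread.
    this.reply({ id: req.id, result: req.payload.toUpperCase() });
  }
}

// "Main side": sends requests and resolves the promise matching each reply id.
class SketchProxy {
  private nextId = 0;
  private readonly pending = new Map<number, (result: string) => void>();
  private readonly send: (req: SketchRequest) => void;
  constructor(send: (req: SketchRequest) => void) {
    this.send = send;
  }
  request(payload: string): Promise<string> {
    const id = this.nextId++;
    return new Promise<string>((resolve) => {
      this.pending.set(id, resolve);
      this.send({ id, payload });
    });
  }
  onmessage(res: SketchResponse): void {
    const resolve = this.pending.get(res.id);
    if (resolve) {
      this.pending.delete(res.id);
      resolve(res.result);
    }
  }
}

// Wire both sides directly; with a real Worker, postMessage would sit between.
function connectSketch(): SketchProxy {
  const handler = new SketchHandler((res) => proxy.onmessage(res));
  const proxy = new SketchProxy((req) => handler.onmessage(req));
  return proxy;
}
```

Tagging each request with an id lets several calls be in flight at once, which is why the main-side object can expose an ordinary promise-based interface.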

Dedicated Web Worker

WebLLM comes with API support for WebWorker, so you can move the generation process into a separate worker thread where its computation won't disrupt the UI.

We create a handler in the worker thread that communicates with the frontend while handling the requests.

// worker.ts
import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

// A handler that resides in the worker thread
const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg: MessageEvent) => {
  handler.onmessage(msg);
};

In the main logic, we create a WebWorkerMLCEngine that implements the same MLCEngineInterface. The rest of the logic remains the same.

// main.ts
import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";

async function main() {
  // Use a WebWorkerMLCEngine instead of MLCEngine here
  const engine = await CreateWebWorkerMLCEngine(
    new Worker(new URL("./worker.ts", import.meta.url), {
      type: "module",
    }),
    selectedModel,
    { initProgressCallback }, // engineConfig
  );

  // everything else remains the same
}

Use Service Worker

WebLLM comes with API support for ServiceWorker, so you can move the generation process into a service worker to avoid reloading the model on every page visit and to improve your application's offline experience.

(Note: a service worker's life cycle is managed by the browser, and it can be killed at any time without notifying the web app. ServiceWorkerMLCEngine will try to keep the service worker thread alive by periodically sending heartbeat events, but your application should also include proper error handling. Check keepAliveMs and missedHeatbeat in ServiceWorkerMLCEngine for more details.)
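One possible form of such error handling is to retry a failed call after running a recovery step, for example reloading the model. The helper below is a generic sketch of that idea; `callWithRecovery` is a hypothetical name and not part of WebLLM.

```typescript
// Run `action`; if it rejects, run `recover` and try again, up to
// `maxAttempts` attempts in total. Rethrows the last error on failure.
async function callWithRecovery<T>(
  action: () => Promise<T>,
  recover: () => Promise<void>,
  maxAttempts = 2,
): Promise<T> {
  const attempts = Math.max(1, maxAttempts);
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await action();
    } catch (error) {
      lastError = error;
      if (attempt < attempts) {
        await recover(); // e.g. reload the model before retrying
      }
    }
  }
  throw lastError;
}

// Possible use (not executed here):
//   const reply = await callWithRecovery(
//     () => engine.chat.completions.create({ messages }),
//     () => engine.reload(selectedModel),
//   );
```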

We create a handler in the worker thread that communicates with the frontend while handling the requests.

// sw.ts
import { ServiceWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

let handler: ServiceWorkerMLCEngineHandler;

self.addEventListener("activate", function (event) {
  handler = new ServiceWorkerMLCEngineHandler();
  console.log("Service Worker is ready");
});

Then, in the main logic, we register the service worker and create the engine using the CreateServiceWorkerMLCEngine function. The rest of the logic remains the same.

// main.ts
import {
  MLCEngineInterface,
  CreateServiceWorkerMLCEngine,
} from "@mlc-ai/web-llm";

if ("serviceWorker" in navigator) { navigator.se

Extension points (exported contracts: how you extend this code)

MLCEngineInterface (Interface)

(no doc) [4 implementers]

src/types.ts

ChatWorker (Interface)

(no doc) [2 implementers]

src/web_worker.ts

AppConfig (Interface)

(no doc)

utils/vram_requirements/src/vram_requirements.ts

ExtensionMLCEngineConfig (Interface)

(no doc)

src/extension_service_worker.ts

ConvTemplateConfig (Interface)

(no doc)

src/config.ts

ModelIntegrity (Interface)

(no doc)

src/integrity.ts

ReloadParams (Interface)

(no doc)

src/message.ts

CreateEmbeddingResponse (Interface)

(no doc)

src/openai_api_protocols/embedding.ts

Core symbols (most depended-on inside this repo)

create

called by 59

src/openai_api_protocols/embedding.ts

src/service_worker.ts

postInitAndCheckFields

called by 21

src/openai_api_protocols/chat_completion.ts

appendMessage

called by 19

src/conversation.ts

getConversationFromChatCompletionRequest

called by 19

src/conversation.ts

Shape

Method 304

Function 196

Class 188

Interface 63

Enum 3

Languages

TypeScript 100%

Modules by API surface

src/error.ts: 204 symbols

src/llm_chat.ts: 55 symbols

tests/engine_integration.test.ts: 37 symbols

src/openai_api_protocols/chat_completion.ts: 35 symbols

src/web_worker.ts: 34 symbols

src/engine.ts: 33 symbols

src/conversation.ts: 18 symbols

examples/simple-chat-upload/src/simple_chat.ts: 18 symbols

src/extension_service_worker.ts: 17 symbols

examples/simple-chat-ts/src/simple_chat.ts: 15 symbols

src/service_worker.ts: 14 symbols

src/config.ts: 14 symbols

Dependencies (from manifests, versioned)

@eslint/eslintrc 3.3.1 · 1×

@eslint/js 9.9.0 · 1×

@mlc-ai/web-llm 0.2.83 · 1×

@mlc-ai/web-runtime 0.24.0-dev3 · 1×

@mlc-ai/web-tokenizers 0.1.6 · 1×

@mlc-ai/web-xgrammar 0.1.27 · 1×

@next/eslint-plugin-next 16.0.0 · 1×

@parcel/config-webextension 2.9.3 · 1×

@rollup/plugin-commonjs 29.0.0 · 1×

@rollup/plugin-node-resolve 16.0.3 · 1×

@rollup/plugin-typescript 12.3.0 · 1×

@types/chrome 0.0.266 · 1×
