The Runtime for Edge AI in React

Run privacy-first small AI without the infrastructure overhead.
Zero Latency. Zero Cost. Easy To Use.

Llama-3.2-1B

Gemma-2-2B

Phi-3.5-Mini

Qwen-2.5-1.5B

Mistral-7B-v0.3

Llama-3.1-8B

JSON Extraction

PII Redaction

Semantic Search

Mod & Safety

Smart Form Fill

Community Feedback

Comment by u/red_it__ from discussion in reactjs

Victor Okefie, Founder @ Eptopia

The constraint you named honestly: "This is not for lightweight, general-purpose landing pages." Most libraries hide the 3GB download. You put it upfront. That's not a bug, it's a filter. The use cases that survive that constraint are the ones that actually need local inference: B2B dashboards, enterprise data privacy, structured extraction. Everything else falls away. That's good product design. You built for the problem, not the demo.

Knowband, eCommerce plugin development company

Really solid abstraction of a genuinely painful setup, wrapping WebGPU, workers, and caching into a simple hook is a big DX win. I especially like the honest positioning around when it actually makes sense, the B2B and privacy use cases are spot on.

mote, Embedded multi-modal database for edge AI

WebGPU for local inference is the right direction. We hit a similar problem building data infrastructure for edge AI — when your robot needs to make decisions in under 50ms, round-tripping to an API is not even an option. The 1.5-3GB model download concern you mentioned is real though. In our case we solved it by shipping a smaller quantized model (Q4) as part of the binary itself. The tradeoff is accuracy vs. startup time, but for many edge use cases that tradeoff makes sense. One question: how does react-brai handle tab-level coordination? If a user has 3 tabs open, do they each download their own model copy? That was a nasty issue we had with Web Workers — each worker loading its own model into memory and OOM-killing the browser.

React Brai Features

A complete runtime for distributed, privacy-first AI in a React app.

Slash Cloud Costs

Don't pay per-token for every simple task. Offload routine jobs like summarization and JSON extraction to the client, saving your API budget for complex reasoning.

Crash-Proof Swarm

Browsers limit WebGPU to one active context. Our Leader Election system synchronizes state across tabs to prevent crashes.

Total Data Privacy

Inputs never leave the client's device. Perfect for HIPAA-compliant healthcare apps, legal tech, and PII redaction.

Zero-Latency Inference

Remove network round-trips entirely. Once the model loads, text generation starts without waiting on a server and works fully offline.

Production-Grade DX

Built for serious development. Full TypeScript support, Zod schema validation, and strict state management out of the box.

Native Metal & Vulkan

We bypass standard WebGL to access raw compute shaders via WebGPU, unlocking near-native performance on Apple M-Series and NVIDIA cards.

Documentation

v2.1.0 Stable

Introduction

react-brai is a fault-tolerant WebGPU runtime for React. It creates a distributed "Swarm" across your user's open tabs, allowing them to share a single GPU context (VRAM) while executing requests in a persistent, recoverable queue.

The Swarm (v2)

Browsers limit WebGPU to a single active context. To prevent crashes when users open multiple tabs, react-brai uses a Leader Election system.

Leader Tab

The first tab to open. It holds the GPU Lock and the loaded Model in VRAM. It processes the Job Queue sequentially.

Follower Tabs

Subsequent tabs. They do not load the model. Instead, they send requests to the Leader and listen for token streams.

Persistent Recovery

If the Leader is closed or refreshed, the lock is released. The next Follower in line automatically promotes itself to Leader, reads the Persistent Queue from LocalStorage, reloads the model, and resumes processing pending jobs.
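The promotion and recovery flow described above can be modeled without a browser. The following is a minimal in-memory sketch, not react-brai's actual implementation: `KeyValueStore`, `MemoryStore`, `SwarmSim`, `Job`, and `QUEUE_KEY` are all hypothetical names, and "the oldest open tab is the Leader" is an assumed election rule. In a real page the store would be `localStorage`, and the lock could plausibly be held through something like the Web Locks API.

```typescript
// Hypothetical simulation of Leader election with a persistent queue.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Stand-in for window.localStorage so the sketch runs outside a browser.
class MemoryStore implements KeyValueStore {
  private data = new Map<string, string>();
  getItem(key: string): string | null {
    return this.data.get(key) ?? null;
  }
  setItem(key: string, value: string): void {
    this.data.set(key, value);
  }
}

interface Job {
  id: string;
  tabId: string;
  prompt: string;
}

const QUEUE_KEY = "swarm-queue"; // assumed storage key

class SwarmSim {
  private tabs: string[] = []; // open tabs, oldest first
  private store: KeyValueStore;

  constructor(store: KeyValueStore) {
    this.store = store;
  }

  // Assumed rule: the oldest open tab holds the lock and acts as Leader.
  roleOf(tabId: string): "LEADER" | "FOLLOWER" | "PENDING" {
    if (!this.tabs.includes(tabId)) return "PENDING";
    return this.tabs[0] === tabId ? "LEADER" : "FOLLOWER";
  }

  open(tabId: string): void {
    if (!this.tabs.includes(tabId)) this.tabs.push(tabId);
  }

  // Closing the Leader releases the lock; the next tab in line is promoted.
  close(tabId: string): void {
    this.tabs = this.tabs.filter((t) => t !== tabId);
  }

  // Any tab may enqueue. The queue lives in the shared store, not in a tab,
  // which is why it survives the Leader being closed.
  enqueue(job: Job): void {
    this.store.setItem(QUEUE_KEY, JSON.stringify([...this.pending(), job]));
  }

  pending(): Job[] {
    const raw = this.store.getItem(QUEUE_KEY);
    return raw === null ? [] : (JSON.parse(raw) as Job[]);
  }

  // Only the Leader takes jobs, one at a time, in FIFO order.
  processNext(tabId: string): Job | null {
    if (this.roleOf(tabId) !== "LEADER") return null;
    const [next, ...rest] = this.pending();
    if (next === undefined) return null;
    this.store.setItem(QUEUE_KEY, JSON.stringify(rest));
    return next;
  }
}
```

Opening tabs "A" and "B", enqueueing two jobs, and then closing "A" leaves "B" as Leader with both jobs still pending, which is the recovery behavior the section describes.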

Quick Start

Initialize the engine. The hook automatically handles Leader/Follower negotiation.

ChatComponent.tsx

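import { useEffect } from 'react';
import { useLocalAI } from 'react-brai';

export default function Chat() {
  const { loadModel, chat, isReady, tps } = useLocalAI();

  // Ask the Swarm to load the model once, when the component mounts.
  useEffect(() => {
    loadModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
  }, []);

  // tps is the live tokens-per-second reading from the hook.
  return <div>Speed: {tps} T/s</div>;
}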

Core Patterns

Structured Extraction

extract.ts

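// Describe the expected JSON shape in the system prompt.
const response = await chat([
  { role: "system", content: "Output JSON: { sentiment: 'pos' | 'neg' }" },
  { role: "user", content: "I love this library!" }
]);
// JSON.parse throws on malformed output, so guard this call in production.
const data = JSON.parse(response);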
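A small model will not always return clean JSON; it may wrap the object in extra words or emit an invalid value. The sketch below shows one defensive way to handle the reply. `Sentiment`, `isSentiment`, and `parseSentiment` are hypothetical helpers, not part of react-brai, and the hand-written type guard could be swapped for a schema library such as Zod.

```typescript
// Hypothetical helpers for validating the model's reply.
type Sentiment = { sentiment: "pos" | "neg" };

// Runtime type guard for the shape requested in the system prompt.
function isSentiment(value: unknown): value is Sentiment {
  if (typeof value !== "object" || value === null) return false;
  const s = (value as { sentiment?: unknown }).sentiment;
  return s === "pos" || s === "neg";
}

// Extract the outermost {...} span from the raw reply and validate it.
// Returns null instead of throwing when the reply is unusable.
function parseSentiment(raw: string): Sentiment | null {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;
  try {
    const parsed: unknown = JSON.parse(raw.slice(start, end + 1));
    return isSentiment(parsed) ? parsed : null;
  } catch {
    return null;
  }
}
```

Returning `null` rather than throwing lets the caller decide whether to retry the prompt or fall back to a default.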

API Reference

useLocalAI()

Swarm State

role (Synced): 'LEADER' | 'FOLLOWER' | 'PENDING'

Current tab status. The Leader performs inference; Followers delegate to the Leader.

queue (Synced): QueueItem[]

Real-time list of all pending jobs across all tabs. Synced via LocalStorage.

Engine State

isReady (Synced): boolean

True when the Swarm Leader has a model loaded in VRAM.

progress (Synced): { progress: number, text: string }

Download progress (0-1). Visible to all tabs during initialization.

isLoading (Local): boolean

True only if this specific tab initiated the request currently being processed.

response (Local): string

Streaming output buffer. Populated only in the tab that requested the chat.
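Because some fields are synced across the Swarm and others are local to one tab, a UI usually has to combine them. The sketch below shows one way to derive a status label from those fields. `EngineSnapshot` and `describeStatus` are hypothetical names; only the field names and types are taken from the reference above.

```typescript
// Hypothetical view-model helper; field names mirror the reference above.
interface EngineSnapshot {
  role: "LEADER" | "FOLLOWER" | "PENDING";
  isReady: boolean;
  isLoading: boolean;
  progress: { progress: number; text: string };
}

function describeStatus(s: EngineSnapshot): string {
  // PENDING: this tab has not yet been assigned a role.
  if (s.role === "PENDING") return "Joining swarm...";

  // isReady and progress are synced, so Followers see the Leader's download.
  if (!s.isReady) {
    const pct = Math.round(s.progress.progress * 100);
    return `Loading model: ${pct}%`;
  }

  // isLoading is local: true only while this tab's own request is running.
  return s.isLoading ? "Generating..." : "Ready";
}
```

A Follower tab and the Leader tab would show the same loading percentage, but only the tab that sent a request would show the generating state.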

Telemetry

tps (Real-time): number

Tokens per second. Instantaneous inference speed metric.

generatedTokens (Real-time): number

Total number of tokens generated for the current active response.
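To make the two telemetry values concrete, here is a minimal sketch of how a tokens-per-second reading could be computed over a trailing time window. This is an assumption for illustration: `TpsMeter` is a hypothetical class, and react-brai's own calculation may differ. Timestamps are passed in explicitly so the behavior is deterministic; in a browser they could come from `performance.now()`.

```typescript
// Hypothetical telemetry helper; the library's own calculation may differ.
class TpsMeter {
  private stamps: number[] = []; // arrival time of each recent token, in ms
  private windowMs: number;
  generatedTokens = 0; // total tokens for the current response

  constructor(windowMs: number = 1000) {
    this.windowMs = windowMs;
  }

  // Record one generated token that arrived at time nowMs.
  record(nowMs: number): void {
    this.generatedTokens += 1;
    this.stamps.push(nowMs);
  }

  // Tokens per second over the trailing window ending at nowMs.
  tps(nowMs: number): number {
    const cutoff = nowMs - this.windowMs;
    this.stamps = this.stamps.filter((t) => t > cutoff);
    return (this.stamps.length * 1000) / this.windowMs;
  }

  // Clear both counters when a new response starts.
  reset(): void {
    this.stamps = [];
    this.generatedTokens = 0;
  }
}
```

A trailing window keeps the reading responsive to the current speed, whereas dividing the total token count by the total elapsed time would give a slower-moving average.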

Methods

loadModel(modelId, config?) (Required): void

Asks the Leader to load a model. If the Leader is busy or already has a model loaded, the request is safely ignored.

unloadModel(): void

Forces the Leader to purge the model from VRAM and terminate the worker to free memory.

chat(messages) (Required): void

Pushes a job to the Swarm Queue. The promise resolves when tokens start streaming.

reset(): void

Aborts the current job (if it belongs to this tab) and kills the Leader worker if necessary.
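The "ignored safely" rule for `loadModel` and the effect of `unloadModel` can be pictured as a small state machine. The sketch below is a hypothetical model of that behavior, not react-brai's internals: `LeaderEngine`, `EngineState`, and `finishLoading` are invented for illustration, and the boolean return value exists only to make the outcome visible.

```typescript
// Hypothetical model of the Leader's load state.
type EngineState = "idle" | "loading" | "ready";

class LeaderEngine {
  state: EngineState = "idle";
  modelId: string | null = null;

  // Returns true if the request was accepted, false if it was ignored.
  loadModel(modelId: string): boolean {
    if (this.state !== "idle") return false; // busy or already loaded
    this.state = "loading";
    this.modelId = modelId;
    return true;
  }

  // Called when the download and GPU upload have finished.
  finishLoading(): void {
    if (this.state === "loading") this.state = "ready";
  }

  // Drop the model so that a later loadModel call is accepted again.
  unloadModel(): void {
    this.state = "idle";
    this.modelId = null;
  }
}
```

Under this model, switching models means calling `unloadModel()` first; a second `loadModel` call while one is loading or loaded changes nothing.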

Type Definitions

types.ts

export interface LocalAIHook {
  role: 'LEADER' | 'FOLLOWER' | 'PENDING';
  isReady: boolean;
  isLoading: boolean;
  response: string;
  tps: number;              // 🆕 Telemetry
  generatedTokens: number;  // 🆕 Telemetry
  
  loadModel: (modelId: string, config?: ModelConfig) => void;
  unloadModel: () => void;  // 🆕 Method
  chat: (messages: Message[]) => void;
  reset: () => void;
}

BROWSER NATIVE SWARM

React Brai Playground

Serverless AI, powered by your GPU.

Model Configuration

Select Model

Looking for more? Find compatible models on Hugging Face whose names end in q4f16_1-MLC.

VRAM Required: 2GB

Fastest alternative. Good for simple formatting.

Active Application

Telemetry

Speed

tokens per second

Job Queue: 0

Idle. Ready for requests.

Ready to Initialize

We need to download 1.2 GB of weights to your cache. This happens once.
