The Runtime for
Edge AI in React
Run privacy-first small AI without the infrastructure overhead.
Zero Latency. Zero Cost. Easy To Use.
CHECKING HARDWARE...
Analyzing Hardware...
Community Feedback
Comment
by u/red_it__ from discussion
inreactjs
Comment
by u/red_it__ from discussion
in reactjs
The constraint you named honestly: "This is not for lightweight, general-purpose landing pages." Most libraries hide the 3GB download. You put it upfront. That's not a bug, it's a filter. The use cases that survive that constraint are the ones that actually need local inference: B2B dashboards, enterprise data privacy, structured extraction. Everything else falls away. That's good product design. You built for the problem, not the demo.
Really solid abstraction of a genuinely painful setup, wrapping WebGPU, workers, and caching into a simple hook is a big DX win. I especially like the honest positioning around when it actually makes sense, the B2B and privacy use cases are spot on.
WebGPU for local inference is the right direction. We hit a similar problem building data infrastructure for edge AI — when your robot needs to make decisions in under 50ms, round-tripping to an API is not even an option. The 1.5-3GB model download concern you mentioned is real though. In our case we solved it by shipping a smaller quantized model (Q4) as part of the binary itself. The tradeoff is accuracy vs. startup time, but for many edge use cases that tradeoff makes sense. One question: how does react-brai handle tab-level coordination? If a user has 3 tabs open, do they each download their own model copy? That was a nasty issue we had with Web Workers — each worker loading its own model into memory and OOM-killing the browser.
React Brai Features
A complete runtime for distributed, privacy-first AI on a React app.
Slash Cloud Costs
Don't pay per-token for every simple task. Offload routine jobs like summarization and JSON extraction to the client, saving your API budget for complex reasoning.
Crash-Proof Swarm
Browsers limit WebGPU to one active context. Our Leader Election system synchronizes state across tabs to prevent crashes.
Total Data Privacy
Inputs never leave the client's device. Perfect for HIPAA-compliant healthcare apps, legal tech, and PII redaction.
Zero-Latency Inference
Remove network round-trips entirely. Once the model loads, text generation is instantaneous and works fully offline.
Production-Grade DX
Built for serious development. Full TypeScript support, Zod schema validation, and strict state management out of the box.
Native Metal & Vulkan
We bypass standard WebGL to access raw compute shaders via WebGPU, unlocking near-native performance on Apple M-Series and NVIDIA cards.
Documentation
Introduction
react-brai is a fault-tolerant WebGPU runtime for React. It creates a distributed "Swarm" across your user's open tabs, allowing them to share a single GPU context (VRAM) while executing requests in a persistent, recoverable queue.
The Swarm (v2)
Browsers limit WebGPU to a single active context. To prevent crashes when users open multiple tabs,react-brai uses a Leader Election system.
Leader Tab
The first tab to open. It holds the GPU Lock and the loaded Model in VRAM. It processes the Job Queue sequentially.
Follower Tabs
Subsequent tabs. They do not load the model. Instead, they send requests to the Leader and listen for token streams.
Persistent Recovery
If the Leader is closed or refreshed, the lock is released. The next Follower in line automatically promotes itself to Leader, reads the Persistent Queue from LocalStorage, reloads the model, and resumes processing pending jobs.
Quick Start
Initialize the engine. The hook automatically handles Leader/Follower negotiation.
import { useLocalAI } from 'react-brai';
export default function Chat() {
const { loadModel, chat, isReady, tps } = useLocalAI();
useEffect(() => {
loadModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
}, []);
return <div>Speed: {tps} T/s</div>
}Core Patterns
Structured Extraction
const response = await chat([
{ role: "system", content: "Output JSON: { sentiment: 'pos' | 'neg' }" },
{ role: "user", content: "I love this library!" }
]);
const data = JSON.parse(response);API Reference
useLocalAI()
Swarm State
roleSynced'LEADER' | 'FOLLOWER' | 'PENDING'Current tab status. LEADERS perform inference. FOLLOWERS delegate to Leader.
queueSyncedQueueItem[]Real-time list of all pending jobs across all tabs. Synced via LocalStorage.
Engine State
isReadySyncedbooleanTrue when the Swarm Leader has a model loaded in VRAM.
progressSynced{ progress: number, text: string }Download progress (0-1). Visible to all tabs during initialization.
isLoadingLocalbooleanTrue ONLY if *this specific tab* initiated the request currently being processed.
responseLocalstringStreaming output buffer. Only populates for the tab that requested the chat.
Telemetry
tpsReal-timenumberTokens Per Second. Instantaneous inference speed metric.
generatedTokensReal-timenumberTotal number of tokens generated for the current active response.
Methods
loadModel(modelId, config?)RequiredvoidRequests the Leader to load a model. If Leader is busy/loaded, request is ignored safely.
unloadModel()voidForces the Leader to purge the model from VRAM and terminate the worker to free memory.
chat(messages)RequiredvoidPush a job to the Swarm Queue. The promise resolves when tokens start streaming.
reset()voidAborts the current job (if it belongs to this tab) and kills the Leader worker if necessary.
Type Definitions
export interface LocalAIHook {
role: 'LEADER' | 'FOLLOWER' | 'PENDING';
isReady: boolean;
isLoading: boolean;
response: string;
tps: number; // 🆕 Telemetry
generatedTokens: number; // 🆕 Telemetry
loadModel: (modelId: string, config?: ModelConfig) => void;
unloadModel: () => void; // 🆕 Method
chat: (messages: Message[]) => void;
reset: () => void;
}React Brai Playground
Serverless AI, powered by your GPU.
Looking for more? Find compatible models on HuggingFace ending in q4f16_1-MLC
Fastest alternative. Good for simple formatting.
Ready to Initialize
We need to download 1.2 GB of weights to your cache. This happens once.


