On-Device AI: Running LLMs Directly in the Browser

Zero API cost, offline-capable, and data never leaves the device. WebLLM, WebGPU, Chrome's built-in AI APIs, and the hybrid architecture pattern for production apps.

Mini Bhati · 13 min read

TLDR: WebLLM + WebGPU now runs real LLMs in the browser — offline, zero API cost, data never leaves the device. The 2 GB download and GPU dependency make it a hybrid-fallback problem in practice, not a cloud replacement. Strong fit for privacy-sensitive features and high-frequency low-complexity tasks.

What if your AI feature worked offline, had zero API cost, and never sent user data to a server? That's what on-device AI in the browser promises — and in 2025, it's no longer a research curiosity. Teams are shipping it.

I've been experimenting with this for a while and the results are more useful than I expected for specific use cases — and clearly wrong for others. Here's the honest picture.


Why this matters for frontend engineers

The standard LLM architecture: user → your server → API provider → response. On-device flips this: the model runs in the browser, on the user's GPU.

What you gain:

  • Zero marginal cost per inference — no API fees per user interaction
  • Works offline after initial model download
  • Data never leaves the device — privacy by architecture, not policy
  • No latency from network round-trips
  • No rate limits and no server cold starts (the model still has to be loaded into GPU memory at the start of each session)

What you give up:

  • Large initial download (roughly 0.7 to 4.5 GB for the models covered below, and you're asking users to download this)
  • Completely dependent on user hardware — no dedicated GPU means slow inference
  • Smaller models = less capable (no Sonnet-quality reasoning on a phone)
  • WebGPU support is still incomplete across browsers
  • Smaller context windows than cloud models

The use cases where on-device wins are specific: offline PWAs, privacy-sensitive inputs, high-frequency low-complexity tasks where you'd otherwise be paying per request. The use cases where cloud is clearly better: anything requiring real reasoning, enterprise on older hardware, or when you need consistent quality across all users.
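To make the download cost concrete, here is a rough back-of-the-envelope estimate. It is a sketch only: `estimateDownloadSeconds` is a hypothetical helper, the connection speeds are illustrative rather than measured, and real throughput is usually below the nominal link speed.

```typescript
// Best-case download time for a model of a given size.
// Uses decimal units: 1 GB = 8000 megabits.
function estimateDownloadSeconds(sizeGB: number, linkMbps: number): number {
  if (sizeGB <= 0 || linkMbps <= 0) {
    throw new RangeError("sizeGB and linkMbps must be positive");
  }
  return (sizeGB * 8000) / linkMbps;
}

// Hypothetical connection speeds, for illustration only.
const sampleLinksMbps = [10, 50, 200];
for (const mbps of sampleLinksMbps) {
  const seconds = estimateDownloadSeconds(2, mbps);
  console.log(`2 GB at ${mbps} Mbps: about ${Math.round(seconds / 60)} min`);
}
```

On a slow connection a 2 GB model is a wait of tens of minutes, which is why the download should happen in the background and never block the first use of a feature.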


WebGPU: The foundation everything runs on

GPU-accelerated inference in the browser runs on WebGPU, the web standard for GPU compute. It ships by default in Chrome and Edge 113+ and in Safari 26+; earlier Safari releases only expose it behind a feature flag.

async function checkWebGPUSupport() {
  if (!navigator.gpu) {
    return { supported: false, reason: "WebGPU not available in this browser" };
  }

  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    return { supported: false, reason: "No GPU adapter found" };
  }

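  // adapter.info replaced requestAdapterInfo(), which was removed from the spec;
  // the optional call is a fallback for older browser builds.
  const info = adapter.info ?? (await adapter.requestAdapterInfo?.()) ?? {};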
  return {
    supported: true,
    vendor: info.vendor,
    architecture: info.architecture,
  };
}

Firefox support is still partial: WebGPU is enabled by default only on some platforms (Windows first, as of Firefox 141). Always check before attempting to load a model, and always have a cloud API fallback ready.
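Basic support is not the only thing worth checking. The `q4f16_1` model builds rely on 16-bit float shaders, which WebGPU exposes as the optional `shader-f16` feature, while `q4f32_1` builds run without it. A minimal sketch of choosing a quantization suffix from the adapter's feature set follows. `pickQuantization` and `modelIdFor` are hypothetical helper names, and you should confirm the resulting ID exists in WebLLM's prebuilt model list before relying on it.

```typescript
// Structural type so the helpers accept a real GPUAdapter's `features`
// (which is set-like) as well as a plain Set in tests.
interface FeatureSet {
  has(name: string): boolean;
}

type Quantization = "q4f16_1" | "q4f32_1";

// Prefer the f16 build when the GPU supports it; otherwise use f32.
function pickQuantization(features: FeatureSet): Quantization {
  return features.has("shader-f16") ? "q4f16_1" : "q4f32_1";
}

// Builds an ID in the same "<name>-<quantization>-MLC" pattern this post uses.
function modelIdFor(baseName: string, features: FeatureSet): string {
  return `${baseName}-${pickQuantization(features)}-MLC`;
}

// In the browser (assumption: adapter comes from navigator.gpu.requestAdapter()):
//   const modelId = modelIdFor("Llama-3.2-3B-Instruct", adapter.features);
```

The f16 builds are smaller in memory, so preferring them where available is a reasonable default, but the f32 builds are the safer choice when you want one model ID for every device.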


WebLLM: The most production-ready option

WebLLM from MLC AI compiles LLMs to WebGPU using Machine Learning Compilation. It's the most mature option for running LLMs in the browser today.

npm install @mlc-ai/web-llm

Basic usage

import * as webllm from "@mlc-ai/web-llm";

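// First run downloads roughly 2 GB for this model, so show a clear progress indicator.
// The progress callback is part of the engine config; reload() only takes chat options.
const engine = new webllm.MLCEngine({
  initProgressCallback: (progress) => {
    // progress.progress runs from 0 to 1; updateProgressBar is your own UI helper
    updateProgressBar(progress.progress);
  },
});

await engine.reload("Llama-3.2-3B-Instruct-q4f32_1-MLC");

// OpenAI-compatible API
const response = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Explain React hooks in one paragraph." }],
  temperature: 0.7,
  max_tokens: 512,
});

// The reply text lives in the first choice, as with the OpenAI SDK
const reply = response.choices[0].message.content;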
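The progress callback receives a report whose `progress` field runs from 0 to 1. A small sketch of turning that into a label for the progress indicator is below. `formatLoadProgress` and the label wording are my own, and the `ProgressReport` type only mirrors the one field used here.

```typescript
// Only the field this helper needs; the real report carries more.
interface ProgressReport {
  progress: number;
}

// Converts a 0..1 progress value into a user-facing label,
// clamping out-of-range or non-finite values defensively.
function formatLoadProgress(report: ProgressReport): string {
  const raw = Number.isFinite(report.progress) ? report.progress : 0;
  const clamped = Math.min(1, Math.max(0, raw));
  return `Loading model: ${Math.round(clamped * 100)}%`;
}
```

Clamping matters less for correctness than for trust: a bar that briefly shows 103% or NaN% during a multi-gigabyte download makes users assume something broke.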

Streaming

const stream = await engine.chat.completions.create({
  messages: [{ role: "user", content: userMessage }],
  stream: true,
});

let fullText = "";
for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content ?? "";
  fullText += delta;
  updateUI(fullText);
}
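The same loop can be wrapped in a reusable helper with a stop condition, which is useful when the user navigates away mid-answer. This is a sketch under assumptions: `collectStream` and `shouldStop` are hypothetical names, and `StreamChunk` models only the fields the loop reads.

```typescript
// Only the fields the loop reads from each streamed chunk.
interface StreamChunk {
  choices: { delta?: { content?: string | null } }[];
}

// Accumulates streamed deltas, reporting the text so far after each
// non-empty chunk. Stops early when shouldStop() returns true.
async function collectStream(
  stream: AsyncIterable<StreamChunk>,
  onUpdate: (textSoFar: string) => void,
  shouldStop: () => boolean = () => false,
): Promise<string> {
  let fullText = "";
  for await (const chunk of stream) {
    if (shouldStop()) break;
    const delta = chunk.choices[0]?.delta?.content ?? "";
    if (!delta) continue; // skip empty keep-alive or final chunks
    fullText += delta;
    onUpdate(fullText);
  }
  return fullText;
}
```

Breaking out of the loop stops your code from consuming tokens, but it may not stop the engine from generating them. As far as I know WebLLM exposes `interruptGenerate()` on the engine for that, so it is worth checking the current API and calling it alongside the early exit.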

Model choices

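| Model | Download size | Context (native) | Use case |
| --- | --- | --- | --- |
| Llama-3.2-1B-Instruct-q4f32_1 | ~700 MB | 128k | Ultra-fast, basic tasks |
| Llama-3.2-3B-Instruct-q4f32_1 | ~2 GB | 128k | Good balance |
| Llama-3.1-8B-Instruct-q4f32_1 | ~4.5 GB | 128k | Near-GPT-3.5 quality |
| Phi-3.5-mini-instruct-q4f16_1 | ~2.2 GB | 128k | Microsoft's compact model |
| Gemma-2-2b-it-q4f32_1 | ~1.5 GB | 8k | Google's lightweight model |

The context column is the model's native limit. WebLLM's prebuilt configurations typically load with a much smaller window (4k by default) to keep memory use down, so plan around that unless you override it.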

Starting recommendation: Llama-3.2-3B — 2 GB download, fast on mid-range GPUs, surprisingly capable for autocomplete, summarization, and simple Q&A. Don't start with the 8B model until you've validated that users' hardware handles it.
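Before starting a multi-gigabyte download it helps to check that the browser has room for it. A minimal sketch, assuming you feed it the result of `navigator.storage.estimate()`: `pickModelForStorage`, the 1.2 headroom factor, and the two-entry preference list are my own choices, and the sizes are the approximate figures from the table.

```typescript
interface ModelOption {
  id: string;
  approxBytes: number;
}

const GB = 1e9;

// Ordered by preference: the recommended 3B model first, then the 1B fallback.
// Sizes are the approximate download sizes quoted in this post.
const MODEL_OPTIONS: ModelOption[] = [
  { id: "Llama-3.2-3B-Instruct-q4f32_1-MLC", approxBytes: 2 * GB },
  { id: "Llama-3.2-1B-Instruct-q4f32_1-MLC", approxBytes: 0.7 * GB },
];

// Mirrors the fields returned by navigator.storage.estimate().
interface StorageEstimateLike {
  quota?: number;
  usage?: number;
}

// Returns the first preferred model that fits in the remaining quota
// with some headroom, or null when nothing fits.
function pickModelForStorage(
  estimate: StorageEstimateLike,
  headroom = 1.2,
  options: ModelOption[] = MODEL_OPTIONS,
): string | null {
  const free = (estimate.quota ?? 0) - (estimate.usage ?? 0);
  const fit = options.find((model) => model.approxBytes * headroom <= free);
  return fit ? fit.id : null;
}

// In the browser:
//   const modelId = pickModelForStorage(await navigator.storage.estimate());
```

A `null` result is the signal to skip on-device entirely and go straight to the cloud path, rather than starting a download that will fail partway through.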


Chrome's built-in AI APIs

Google is embedding AI directly into Chrome via its built-in AI APIs, backed by Gemini Nano. Your app ships no model of its own: Chrome downloads and manages Gemini Nano itself, so once the model is on the device, any site can use it without a per-site download.

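// Early experimental shape (window.ai). Newer Chrome builds expose a global
// `LanguageModel` instead, so feature-detect before relying on either.
if ("ai" in window && "languageModel" in window.ai) {
  const session = await window.ai.languageModel.create({
    systemPrompt: "You are a helpful writing assistant.",
  });

  const stream = session.promptStreaming("Improve this sentence: " + userText);

  for await (const chunk of stream) {
    // Assumes each chunk carries the full text so far, as in early builds.
    // If your build streams deltas, append with += instead.
    outputElement.textContent = chunk;
  }

  // Free the session's resources when done
  session.destroy();
}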

Current limitations as of mid-2025: requires Chrome Canary or Chrome 127+ with flags, Gemini Nano is small (great for autocomplete and summarization, not for complex reasoning), no user-controlled model selection. Not production-ready today, but worth testing early so you're ready when it stabilizes.
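Because the API surface has moved between Chrome releases, detection is worth isolating in one place. The sketch below checks for two shapes I'm aware of: the older `window.ai.languageModel` namespace used in this post and a newer global `LanguageModel`. Treat both names as a snapshot that may change again; `detectBuiltInAI` itself is a hypothetical helper.

```typescript
type BuiltInAIShape = "LanguageModel" | "window.ai.languageModel" | "none";

// Reports which built-in AI entry point (if any) a global object exposes.
// Takes the global as a parameter so it can be tested without a browser.
function detectBuiltInAI(globalObj: Record<string, unknown>): BuiltInAIShape {
  if ("LanguageModel" in globalObj) return "LanguageModel";

  const ai = globalObj["ai"];
  if (typeof ai === "object" && ai !== null && "languageModel" in ai) {
    return "window.ai.languageModel";
  }
  return "none";
}

// In the browser:
//   const shape = detectBuiltInAI(globalThis as unknown as Record<string, unknown>);
```

Returning a label instead of a boolean keeps the call sites honest about which API they are about to use, since the two shapes do not accept the same options.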


The hybrid architecture pattern

Pure on-device doesn't work for everyone's hardware. The pattern I'd ship is hybrid: try on-device first, fall back to cloud if the device can't handle it.

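import * as webllm from "@mlc-ai/web-llm";

type Message = webllm.ChatCompletionMessageParam;

// Your existing server-backed completion call (not shown here)
declare function callCloudAPI(messages: Message[]): Promise<string>;

class AIProvider {
  private webLLMEngine?: webllm.MLCEngine;
  private onDeviceAvailable = false;

  async initialize() {
    // checkWebGPUSupport() is the helper from the WebGPU section
    const gpuCheck = await checkWebGPUSupport();

    if (gpuCheck.supported) {
      try {
        this.webLLMEngine = new webllm.MLCEngine();
        await this.webLLMEngine.reload("Llama-3.2-3B-Instruct-q4f32_1-MLC");
        this.onDeviceAvailable = true;
      } catch (error) {
        // On-device load failed (download error, out of memory, ...): use cloud
        console.warn("On-device model unavailable, using cloud", error);
      }
    }
  }

  async complete(messages: Message[]): Promise<string> {
    if (this.onDeviceAvailable && this.webLLMEngine) {
      try {
        const result = await this.webLLMEngine.chat.completions.create({ messages });
        return result.choices[0].message.content ?? "";
      } catch (error) {
        // Inference can still fail at runtime (for example, a lost GPU device)
        console.warn("On-device inference failed, using cloud", error);
      }
    }
    return callCloudAPI(messages);
  }
}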

Users with capable hardware get free, private, offline inference. Everyone else gets a consistent cloud experience. Neither group needs to know which path they're on.
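A slow device can be as bad as a failing one, so the fallback decision often needs a time limit as well. Here is a minimal sketch of that routing, independent of any particular engine. `completeWithFallback`, `withTimeout`, the `Completer` type, and the 15-second default are all hypothetical choices for illustration.

```typescript
type Completer = (prompt: string) => Promise<string>;

interface RoutedResult {
  text: string;
  source: "on-device" | "cloud";
}

// Rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Tries the on-device completer first (when one exists) and falls back to
// the cloud completer on error or timeout.
async function completeWithFallback(
  prompt: string,
  onDevice: Completer | null,
  cloud: Completer,
  timeoutMs = 15000,
): Promise<RoutedResult> {
  if (onDevice) {
    try {
      const text = await withTimeout(onDevice(prompt), timeoutMs);
      return { text, source: "on-device" };
    } catch {
      // Error or timeout on the device path: fall through to cloud
    }
  }
  return { text: await cloud(prompt), source: "cloud" };
}
```

One caveat on the design: a silent fallback sends the prompt to your server. For the privacy-sensitive features where on-device matters most, gate the cloud path behind explicit consent or disable it, and use the returned `source` to tell the user which path answered.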


Angular + Web Workers: Keep the main thread free

Running inference on the main thread blocks the UI completely. You need a Web Worker — no exceptions.

// ai.worker.ts
import * as webllm from "@mlc-ai/web-llm";

const engine = new webllm.MLCEngine();

self.onmessage = async (event) => {
  const { type, payload } = event.data;

  try {
    if (type === "LOAD") {
      // The progress callback is set on the engine; reload() only takes chat options
      engine.setInitProgressCallback((p) =>
        self.postMessage({ type: "PROGRESS", progress: p.progress }),
      );
      await engine.reload("Llama-3.2-3B-Instruct-q4f32_1-MLC");
      self.postMessage({ type: "READY" });
    }

    if (type === "INFER") {
      const stream = await engine.chat.completions.create({
        messages: payload.messages,
        stream: true,
      });

      // Forward each non-empty token delta to the main thread
      for await (const chunk of stream) {
        const text = chunk.choices[0]?.delta?.content ?? "";
        if (text) self.postMessage({ type: "CHUNK", text });
      }

      self.postMessage({ type: "DONE" });
    }
  } catch (error) {
    // Report failures so the main thread is not left waiting forever
    self.postMessage({ type: "ERROR", message: String(error) });
  }
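};

// on-device-ai.service.ts
import { Injectable, signal } from "@angular/core";

@Injectable({ providedIn: "root" })
export class OnDeviceAIService {
  // Module worker: needed because ai.worker.ts uses ES imports
  private worker = new Worker(new URL("./ai.worker.ts", import.meta.url), {
    type: "module",
  });
  readonly status = signal<"idle" | "loading" | "ready" | "inferring" | "error">("idle");
  readonly progress = signal(0);
  readonly output = signal("");

  constructor() {
    this.worker.onmessage = ({ data }) => {
      if (data.type === "PROGRESS") this.progress.set(data.progress);
      if (data.type === "READY" || data.type === "DONE") this.status.set("ready");
      if (data.type === "CHUNK") this.output.update((text) => text + data.text);
      if (data.type === "ERROR") this.status.set("error");
    };
  }

  load() {
    this.status.set("loading");
    this.worker.postMessage({ type: "LOAD" });
  }

  // Starts a streamed completion; tokens accumulate in the `output` signal
  infer(messages: { role: string; content: string }[]) {
    this.output.set("");
    this.status.set("inferring");
    this.worker.postMessage({ type: "INFER", payload: { messages } });
  }
}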
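The worker and the service talk through untyped `{ type, ... }` messages, which is easy to get subtly wrong as the protocol grows. One way to tighten it is a discriminated union plus a pure reducer that the service (or a test) can apply to each message. This is a sketch: `WorkerToMain`, `AIState`, and `reduceWorkerMessage` are hypothetical names covering the four messages the worker sends for loading and streaming; extend the union if your worker reports anything else, such as errors.

```typescript
// Messages the worker posts to the main thread.
type WorkerToMain =
  | { type: "PROGRESS"; progress: number }
  | { type: "READY" }
  | { type: "CHUNK"; text: string }
  | { type: "DONE" };

interface AIState {
  status: "idle" | "loading" | "ready" | "inferring";
  progress: number;
  output: string;
}

const initialAIState: AIState = { status: "idle", progress: 0, output: "" };

// Pure state transition: given the current state and one worker message,
// returns the next state without mutating the input.
function reduceWorkerMessage(state: AIState, msg: WorkerToMain): AIState {
  switch (msg.type) {
    case "PROGRESS":
      return { ...state, status: "loading", progress: msg.progress };
    case "READY":
      return { ...state, status: "ready", progress: 1 };
    case "CHUNK":
      return { ...state, status: "inferring", output: state.output + msg.text };
    case "DONE":
      return { ...state, status: "ready" };
  }
}
```

Keeping the transition logic pure means it can be unit-tested without a worker or a GPU. WebLLM also ships its own worker wrapper (`WebWorkerMLCEngineHandler` on the worker side and `CreateWebWorkerMLCEngine` on the main thread, at the time of writing), which is worth evaluating before maintaining a hand-rolled protocol.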

When to actually use on-device

| Use case | On-device? | Why |
| --- | --- | --- |
| Offline-first PWA | Yes | No network needed |
| Privacy-sensitive input (journals, health) | Yes | Data never leaves device |
| High-frequency autocomplete | Yes | Sub-100ms possible, no API cost |
| Complex reasoning / long documents | No | Small models struggle here |
| Consistent quality across all users | No | Varies by hardware |
| Enterprise / older devices | No | Insufficient GPU |

On-device AI isn't replacing cloud LLMs. It fills the gap where cloud is overkill: offline, private, high-frequency, low-complexity tasks. The 2 GB download is the biggest UX barrier right now. As models get smaller and better (Llama 4 Scout is already impressive for its active-parameter count, although the full model is still far too large for a browser), and as Chrome's built-in AI stabilizes, the tradeoffs will continue to shift. Worth understanding now so you're not learning from scratch when it becomes the obvious choice.
