On-Device AI: Running LLMs Directly in the Browser

TLDR: WebLLM + WebGPU now runs real LLMs in the browser — offline, zero API cost, data never leaves the device. The 2 GB download and GPU dependency make it a hybrid-fallback problem in practice, not a cloud replacement. Strong fit for privacy-sensitive features and high-frequency low-complexity tasks.

What if your AI feature worked offline, had zero API cost, and never sent user data to a server? That's what on-device AI in the browser promises — and in 2025, it's no longer a research curiosity. Teams are shipping it.

I've been experimenting with this for a while and the results are more useful than I expected for specific use cases — and clearly wrong for others. Here's the honest picture.

Why this matters for frontend engineers

The standard LLM architecture: user → your server → API provider → response. On-device flips this: the model runs in the browser, on the user's GPU.

What you gain:

Zero marginal cost per inference — no API fees per user interaction
Works offline after initial model download
Data never leaves the device — privacy by architecture, not policy
No latency from network round-trips
No rate limits, no cold starts

What you give up:

Large initial download (roughly 0.7–4.5 GB for the models listed below; you're asking users to download this)
Completely dependent on user hardware — no dedicated GPU means slow inference
Smaller models = less capable (no Sonnet-quality reasoning on a phone)
WebGPU support is still incomplete across browsers
Smaller context windows than cloud models

The use cases where on-device wins are specific: offline PWAs, privacy-sensitive inputs, and high-frequency low-complexity tasks where you'd otherwise be paying per request. The use cases where cloud is clearly better: anything requiring real reasoning, enterprise deployments on older hardware, or anything that needs consistent quality across all users.

WebGPU: The foundation everything runs on

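Practical LLM inference in the browser runs on WebGPU, the web standard for GPU compute. It is enabled by default in Chrome 113+ and Edge 113+ on desktop. Safari 18 only exposes it behind a feature flag; it is on by default starting with Safari 26.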

async function checkWebGPUSupport() {
  if (!navigator.gpu) {
    return { supported: false, reason: "WebGPU not available in this browser" };
  }

  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    return { supported: false, reason: "No GPU adapter found" };
  }

  // adapter.info replaces the removed requestAdapterInfo() method
  const info = adapter.info;
  return {
    supported: true,
    vendor: info.vendor,
    architecture: info.architecture,
  };
}

Firefox support is still partial: WebGPU is on by default on Windows from Firefox 141, but not yet on every platform. Always check before attempting to load a model, and always have a cloud API fallback ready.
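
Before committing to a multi-gigabyte download, it can also help to confirm the browser will let you store it. Here's a minimal sketch, assuming the standard navigator.storage.estimate() API is available. The hasRoomForModel helper and the 20% safety margin are illustrative assumptions, not something WebLLM prescribes.

```typescript
// Shape of the object returned by navigator.storage.estimate()
interface StorageEstimateLike {
  quota?: number; // bytes the origin may use in total
  usage?: number; // bytes the origin already uses
}

// Hypothetical helper: true when the model plus a safety margin
// fits into the origin's remaining storage quota.
function hasRoomForModel(
  estimate: StorageEstimateLike,
  modelBytes: number,
  safetyMargin: number = 0.2 // assumed headroom, tune for your app
): boolean {
  if (estimate.quota === undefined) return false; // unknown quota: be conservative
  const free = estimate.quota - (estimate.usage ?? 0);
  return free >= modelBytes * (1 + safetyMargin);
}

// Browser usage (not executed here):
//   const estimate = await navigator.storage.estimate();
//   const twoGB = 2 * 1024 * 1024 * 1024;
//   if (!hasRoomForModel(estimate, twoGB)) { /* go straight to the cloud fallback */ }
```

Keeping the decision in a pure function means the quota rule can be unit-tested without a browser.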

WebLLM: The most production-ready option

WebLLM from MLC AI compiles LLMs to WebGPU using Machine Learning Compilation. It's the most mature option for running LLMs in the browser today.

npm install @mlc-ai/web-llm

Basic usage

import * as webllm from "@mlc-ai/web-llm";

// The progress callback belongs to the engine config, not to reload()
const engine = new webllm.MLCEngine({
  initProgressCallback: (report) => {
    updateProgressBar(report.progress); // report.progress runs from 0 to 1
  },
});

// First run downloads ~2 GB for this model, so show a clear progress indicator
await engine.reload("Llama-3.2-3B-Instruct-q4f32_1-MLC");

// OpenAI-compatible API
const response = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Explain React hooks in one paragraph." }],
  temperature: 0.7,
  max_tokens: 512,
});
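
The init progress callback receives a report object rather than a bare number. Here's a small sketch of turning it into a status line, assuming the report carries progress (0 to 1) and text fields, as WebLLM's InitProgressReport does. formatProgress is a hypothetical helper name.

```typescript
// Minimal subset of WebLLM's InitProgressReport used here
interface ProgressReportLike {
  progress: number; // fraction complete, 0 to 1
  text: string; // human-readable status from the engine
}

// Hypothetical helper: clamps the fraction and renders "<percent>% - <status>"
function formatProgress(report: ProgressReportLike): string {
  const clamped = Math.min(1, Math.max(0, report.progress));
  const percent = Math.round(clamped * 100);
  return `${percent}% - ${report.text}`;
}
```

Showing the engine's own status text next to the percentage tells users whether they are waiting on the network or on shader compilation.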

Streaming

const stream = await engine.chat.completions.create({
  messages: [{ role: "user", content: userMessage }],
  stream: true,
});

let fullText = "";
for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content ?? "";
  fullText += delta;
  updateUI(fullText);
}
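
Calling updateUI on every token can trigger more DOM work than the screen can show. One option, sketched here on the assumption that a fixed minimum interval between renders is acceptable, is to coalesce deltas and render at most once per interval. createThrottledRenderer and its parameters are hypothetical names. The clock is injected so the logic can be exercised without real timers.

```typescript
// Hypothetical helper: accumulates streamed deltas and calls render()
// at most once per minIntervalMs. flush() forces a final render.
function createThrottledRenderer(
  render: (fullText: string) => void,
  minIntervalMs: number,
  now: () => number = () => Date.now()
) {
  let fullText = "";
  let lastRender = -Infinity; // guarantees the first delta renders immediately
  let dirty = false;

  function flush(): void {
    if (!dirty) return; // nothing new since the last render
    render(fullText);
    lastRender = now();
    dirty = false;
  }

  function push(delta: string): void {
    if (!delta) return;
    fullText += delta;
    dirty = true;
    if (now() - lastRender >= minIntervalMs) flush();
  }

  return { push, flush };
}

// Usage inside the streaming loop (not executed here):
//   const ui = createThrottledRenderer(updateUI, 50);
//   for await (const chunk of stream) ui.push(chunk.choices[0]?.delta?.content ?? "");
//   ui.flush(); // make sure the last tokens are shown
```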

Model choices

Model	Download size	Native context	Use case
Llama-3.2-1B-Instruct-q4f32_1	~700 MB	128k	Ultra-fast, basic tasks
Llama-3.2-3B-Instruct-q4f32_1	~2 GB	128k	Good balance
Llama-3.1-8B-Instruct-q4f32_1	~4.5 GB	128k	Near-GPT-3.5 quality
Phi-3.5-mini-instruct-q4f16_1	~2.2 GB	128k	Microsoft's compact model
Gemma-2-2b-it-q4f32_1	~1.5 GB	8k	Google's lightweight model

Starting recommendation: Llama-3.2-3B — 2 GB download, fast on mid-range GPUs, surprisingly capable for autocomplete, summarization, and simple Q&A. Don't start with the 8B model until you've validated that users' hardware handles it.
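
The recommendation above can be turned into a simple selection rule. This is a sketch only. The download sizes come from the table, but the pickModel helper, the "-MLC" suffix on every ID (copied from the 3B ID used in the code earlier), and the idea of budgeting by maximum download size are assumptions for illustration. Real devices should still be validated with actual load tests.

```typescript
interface ModelOption {
  id: string;
  downloadMB: number; // approximate download size from the table above
}

// Ordered from smallest to largest
const MODELS: ModelOption[] = [
  { id: "Llama-3.2-1B-Instruct-q4f32_1-MLC", downloadMB: 700 },
  { id: "Llama-3.2-3B-Instruct-q4f32_1-MLC", downloadMB: 2000 },
  { id: "Llama-3.1-8B-Instruct-q4f32_1-MLC", downloadMB: 4500 },
];

// Hypothetical rule: pick the largest model within the download budget,
// or null when even the smallest does not fit (use the cloud instead).
function pickModel(maxDownloadMB: number): string | null {
  let choice: string | null = null;
  for (const model of MODELS) {
    if (model.downloadMB <= maxDownloadMB) choice = model.id;
  }
  return choice;
}
```

A null result is a signal to skip on-device entirely rather than to try the smallest model anyway.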

Chrome's built-in AI APIs

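Google is embedding AI directly into Chrome via its built-in AI APIs, backed by Gemini Nano. Your site ships no model of its own: Chrome downloads Gemini Nano once and shares it across origins, so if the user already has Chrome with AI features enabled and the model present, there is nothing more to fetch.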

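// Experimental API: this is the early-preview shape. Later Chrome builds expose
// a global LanguageModel object instead, so always feature-detect first.
if ("ai" in window && "languageModel" in window.ai) {
  const session = await window.ai.languageModel.create({
    systemPrompt: "You are a helpful writing assistant.",
  });

  const stream = session.promptStreaming("Improve this sentence: " + userText);

  // In the early preview each chunk holds the full text so far, hence "=".
  // Builds that stream deltas need "+=" instead.
  for await (const chunk of stream) {
    outputElement.textContent = chunk;
  }

  session.destroy();
}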

Current limitations as of mid-2025: requires Chrome Canary or Chrome 127+ with flags, Gemini Nano is small (great for autocomplete and summarization, not for complex reasoning), no user-controlled model selection. Not production-ready today, but worth testing early so you're ready when it stabilizes.

The hybrid architecture pattern

Pure on-device doesn't work for everyone's hardware. The pattern I'd ship is hybrid: try on-device first, fall back to cloud if the device can't handle it.

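import * as webllm from "@mlc-ai/web-llm";

type Message = webllm.ChatCompletionMessageParam;
// callCloudAPI(messages) is your own server-backed completion call (not shown)

class AIProvider {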
  private webLLMEngine?: webllm.MLCEngine;
  private onDeviceAvailable = false;

  async initialize() {
    const gpuCheck = await checkWebGPUSupport();

    if (gpuCheck.supported) {
      try {
        this.webLLMEngine = new webllm.MLCEngine();
        await this.webLLMEngine.reload("Llama-3.2-3B-Instruct-q4f32_1-MLC");
        this.onDeviceAvailable = true;
      } catch {
        // On-device load failed — fall through to cloud
      }
    }
  }

  async complete(messages: Message[]): Promise<string> {
    if (this.onDeviceAvailable && this.webLLMEngine) {
      try {
        const result = await this.webLLMEngine.chat.completions.create({ messages });
        return result.choices[0].message.content ?? "";
      } catch {
        // Runtime failure (e.g. lost GPU device): use the cloud for this and later calls
        this.onDeviceAvailable = false;
      }
    }
    return callCloudAPI(messages);
  }
}

Users with capable hardware get free, private, offline inference. Everyone else gets a consistent cloud experience. Neither group needs to know which path they're on.

Angular + Web Workers: Keep the main thread free

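Running inference on the main thread competes with rendering and input handling, so the UI stutters or freezes while tokens generate. Move the engine into a Web Worker. WebLLM also ships its own worker helpers (WebWorkerMLCEngineHandler and CreateWebWorkerMLCEngine); the hand-rolled worker below keeps the message protocol explicit.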

// ai.worker.ts
import * as webllm from "@mlc-ai/web-llm";

// The progress callback is part of the engine config, not a reload() option
const engine = new webllm.MLCEngine({
  initProgressCallback: (p) =>
    self.postMessage({ type: "PROGRESS", progress: p.progress }),
});

self.onmessage = async (event) => {
  const { type, payload } = event.data;

  if (type === "LOAD") {
    await engine.reload("Llama-3.2-3B-Instruct-q4f32_1-MLC");
    self.postMessage({ type: "READY" });
  }

  if (type === "INFER") {
    const stream = await engine.chat.completions.create({
      messages: payload.messages,
      stream: true,
    });

    for await (const chunk of stream) {
      const text = chunk.choices[0]?.delta?.content ?? "";
      if (text) self.postMessage({ type: "CHUNK", text });
    }

    self.postMessage({ type: "DONE" });
  }
};

// on-device-ai.service.ts
import { Injectable, signal } from "@angular/core";

@Injectable({ providedIn: "root" })
export class OnDeviceAIService {
  private worker = new Worker(new URL("./ai.worker.ts", import.meta.url));
  readonly status = signal<"idle" | "loading" | "ready" | "inferring">("idle");
  readonly progress = signal(0);

  constructor() {
    this.worker.onmessage = ({ data }) => {
      if (data.type === "PROGRESS") this.progress.set(data.progress);
      if (data.type === "READY") this.status.set("ready");
    };
  }

  load() {
    this.status.set("loading");
    this.worker.postMessage({ type: "LOAD" });
  }
}
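
The service above only reacts to PROGRESS and READY, while the worker also emits CHUNK and DONE. One way to cover the whole protocol is a typed message union plus a pure state reducer that the onmessage handler can delegate to. The WorkerMessage, AIState and reduce names are hypothetical. The message shapes mirror what the worker posts.

```typescript
// Messages posted by ai.worker.ts
type WorkerMessage =
  | { type: "PROGRESS"; progress: number }
  | { type: "READY" }
  | { type: "CHUNK"; text: string }
  | { type: "DONE" };

interface AIState {
  status: "idle" | "loading" | "ready" | "inferring";
  progress: number;
  output: string; // text streamed so far for the current request
}

const initialState: AIState = { status: "idle", progress: 0, output: "" };

// Pure reducer: returns the next state for each worker message
function reduce(state: AIState, msg: WorkerMessage): AIState {
  switch (msg.type) {
    case "PROGRESS":
      return { ...state, status: "loading", progress: msg.progress };
    case "READY":
      return { ...state, status: "ready", progress: 1 };
    case "CHUNK":
      return { ...state, status: "inferring", output: state.output + msg.text };
    case "DONE":
      return { ...state, status: "ready" };
    default:
      return state; // ignore anything outside the protocol
  }
}
```

In the Angular service this could back a single signal, with worker.onmessage calling reduce on each message. The output field would need resetting whenever a new INFER request is sent, which this sketch leaves out.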

When to actually use on-device

Use case	On-device?	Why
Offline-first PWA	Yes	No network needed
Privacy-sensitive input (journals, health)	Yes	Data never leaves device
High-frequency autocomplete	Yes	Sub-100ms possible, no API cost
Complex reasoning / long documents	No	Small models struggle here
Consistent quality across all users	No	Varies by hardware
Enterprise / older devices	No	Insufficient GPU
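
The table can be read as a small decision rule. Here's a sketch, assuming each feature can be described by a handful of boolean traits. The FeatureTraits fields, the "hybrid" outcome for conflicting traits, and the cloud default are an illustrative interpretation of the rows above, not a formal policy.

```typescript
interface FeatureTraits {
  needsOffline: boolean;
  privacySensitive: boolean;
  highFrequency: boolean;
  needsComplexReasoning: boolean;
  needsConsistentQuality: boolean;
  targetsOlderDevices: boolean;
}

type Recommendation = "on-device" | "cloud" | "hybrid";

// Hypothetical rule derived from the table rows
function recommend(f: FeatureTraits): Recommendation {
  const favorsDevice = f.needsOffline || f.privacySensitive || f.highFrequency;
  const favorsCloud =
    f.needsComplexReasoning || f.needsConsistentQuality || f.targetsOlderDevices;

  if (favorsDevice && favorsCloud) return "hybrid"; // conflicting needs: try on-device, fall back to cloud
  if (favorsDevice) return "on-device";
  return "cloud"; // assumed default when nothing argues for on-device
}
```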

On-device AI isn't replacing cloud LLMs. It's filling the gap where cloud is overkill: offline, private, high-frequency, low-complexity tasks. The 2 GB download is the biggest UX barrier right now. As models get smaller and better, and as Chrome's built-in AI stabilizes, the tradeoffs will continue to shift. Mixture-of-experts designs such as Llama 4 Scout, which activates only a fraction of its parameters per token, hint at the direction, although Scout itself is far too large for a browser. Worth understanding now so you're not learning from scratch when it becomes the obvious choice.
