webAI runs large language models directly on your hardware. There’s no cloud API, no network dependency, and no data leaving your machine for on-device inference. This page explains how the inference system works under the hood.
One exception: the built-in HR Persona processes queries through a remote webAI-hosted service, not on-device inference. See connected personas for details.

Inference backends

webAI supports three inference backends: WebGPU, MLX, and llama.cpp. The system automatically selects the best one based on your hardware, or you can choose manually.

WebGPU
Browser-native GPU inference. Works on any platform with a supported browser (Chrome 113+, Edge 113+). This is the default backend in browser mode.
  • Runs in the browser — no native app required
  • Uses your GPU via the WebGPU API
  • Best for: quick setup, cross-platform compatibility
WebGPU support varies by device. Older GPUs or browsers without WebGPU will fall back to llama.cpp if running in the desktop app.

MLX
Native inference on Apple Silicon, available through the desktop app.

llama.cpp
CPU-based inference, available through the desktop app. Serves as the fallback when WebGPU isn't supported.

Automatic backend routing

When you load a model, the system profiles your device and selects the best backend automatically:
  1. Device profiling — The system checks your available memory, GPU capabilities, and platform (macOS, browser, etc.).
  2. Backend selection — Based on the profile, it selects WebGPU, MLX, or llama.cpp, prioritizing speed and model compatibility.
  3. Model loading — The selected backend loads the model. If the first choice fails (e.g., not enough VRAM for WebGPU), the system falls back to the next available backend.
You can override this in Settings and manually select a backend if you prefer.
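The profiling-selection-fallback flow above can be sketched in a few lines. This is a minimal illustration only: the profile fields, backend names, and preference ordering are assumptions for the sake of example, not webAI's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class DeviceProfile:
    # Hypothetical profile fields; the real profiler's schema is not public.
    platform: str        # "macos", "browser", ...
    apple_silicon: bool  # MLX runs on Apple Silicon
    webgpu: bool         # browser/GPU supports the WebGPU API

def candidate_backends(p: DeviceProfile) -> list:
    """Order backends by preference; later entries are fallbacks."""
    order = []
    if p.apple_silicon and p.platform == "macos":
        order.append("mlx")
    if p.webgpu:
        order.append("webgpu")
    order.append("llama.cpp")  # CPU path is always a last resort
    return order

def select_backend(p: DeviceProfile, try_load) -> str:
    """Try each candidate in turn; fall back when loading fails
    (e.g., not enough VRAM for WebGPU)."""
    for backend in candidate_backends(p):
        if try_load(backend):
            return backend
    raise RuntimeError("no backend could load the model")
```

For example, a browser profile with WebGPU support would try `webgpu` first and fall back to `llama.cpp` only if the load fails.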

Model tiers

Models are organized into tiers based on size and the memory required to run them. The system recommends a tier based on your device’s capabilities.

MLX models (Apple Silicon)

| Tier | Model | Memory required |
| --- | --- | --- |
| 1 | Gemma 3n E2B | 4 GB |
| 2 | Gemma 3n E4B / Qwen3 4B | 8 GB |
| 3 | Qwen3 8B | 16 GB |
| 4 | Qwen3 32B | 32 GB |
| 5 | Qwen3 235B | 64+ GB |

llama.cpp models (CPU)

| Tier | Model | Memory required |
| --- | --- | --- |
| 1 | Qwen3 0.6B | 4 GB |
| 2 | Qwen3 4B | 6 GB |
| 3 | Qwen3 8B / Gemma 3 12B | 16 GB |
| 4 | Qwen3 32B | 32 GB |
| 5 | Llama 3.3 70B | 64+ GB |

WebGPU models (Browser)

| Tier | Model | GPU memory required |
| --- | --- | --- |
| 1 | Qwen3 1.7B | 4 GB |
| 2 | Qwen3 4B | 8 GB |
| 3 | Qwen3 8B | 16 GB |
| 4 | Qwen3 14B | 32 GB |
| 5 | Qwen3 32B | 32+ GB |
WebGPU model availability depends on your device’s GPU memory. The system automatically falls back to smaller models if a tier can’t be loaded. For the largest models, use the desktop app with MLX or llama.cpp.
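The tier recommendation and fall-back behavior can be sketched as a simple lookup. The memory figures below come from the WebGPU table above; the selection logic itself (and the function name) is an illustrative assumption, not webAI's actual code.

```python
# WebGPU tiers 1-4 from the table above (tier 5 needs 32+ GB and is omitted
# here since it has no fixed upper requirement).
WEBGPU_TIERS = [
    (1, "Qwen3 1.7B", 4),
    (2, "Qwen3 4B", 8),
    (3, "Qwen3 8B", 16),
    (4, "Qwen3 14B", 32),
]

def recommend_model(gpu_memory_gb: float) -> str:
    """Pick the largest model whose memory requirement fits the device,
    falling back to the smallest tier if nothing fits."""
    fitting = [model for _, model, req in WEBGPU_TIERS if req <= gpu_memory_gb]
    return fitting[-1] if fitting else WEBGPU_TIERS[0][1]
```

A device with 16 GB of GPU memory would be recommended the tier 3 model, while a device below 4 GB would fall back to the tier 1 model.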

LoRA adapters

LoRA (Low-Rank Adaptation) adapters let you customize a model’s behavior without downloading an entirely new model. Adapters are small files that layer on top of a base model to specialize it for a particular domain or style.

Available adapters

| Adapter | Base model | Purpose |
| --- | --- | --- |
| Chatbot LoRA | 0.5B | General conversational improvements |
| PubMedQA LoRA | 0.5B | Medical and biomedical Q&A |
| QwQ Creative LoRA | 0.5B | Creative writing and storytelling |
| Math LoRA | 0.6B | Mathematical reasoning |
| UltraChat SFT | 1.7B | Instruction following |
| SFT LoRA | 1.7B | Supervised fine-tuning |
| DPO LoRA | 1.7B | Alignment and preference optimization |
Adapters are attached through personas — each persona can specify which adapter to load alongside the base model.
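Conceptually, a persona is a small config that names a base model and, optionally, an adapter to layer on top. The sketch below illustrates that relationship; the `Persona` shape and `model_spec` helper are hypothetical, not webAI's actual persona schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Persona:
    # Hypothetical persona config; field names are assumptions.
    name: str
    base_model: str
    adapter: Optional[str] = None  # small LoRA file layered on the base model

def model_spec(persona: Persona) -> str:
    """Describe what gets loaded: the base model, plus the adapter if one
    is attached. Without an adapter, only the base model loads."""
    if persona.adapter is None:
        return persona.base_model
    return f"{persona.base_model} + {persona.adapter}"
```

For example, a medical Q&A persona could pair the 0.5B base model with the PubMedQA LoRA from the table above, while a default persona loads the base model alone.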

Idle management

To conserve resources, the AI runtime automatically manages model lifecycle:
  • Soft timeout (2 minutes) — If no requests are made, the runtime begins preparing to release resources.
  • Hard timeout (5 minutes) — The model is fully unloaded from memory.
  • Instant reload — When you send a new message, the model loads back automatically.
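The timeouts above amount to a small lifecycle state machine. The timings come from the text; the state names and the idea of mapping idle time to a state are illustrative assumptions.

```python
# Illustrative sketch of the idle lifecycle; timings from the text above.
SOFT_TIMEOUT_S = 2 * 60   # runtime begins preparing to release resources
HARD_TIMEOUT_S = 5 * 60   # model is fully unloaded from memory

def model_state(seconds_idle: float) -> str:
    """Map time since the last request to a lifecycle state.
    An "unloaded" model reloads automatically on the next message."""
    if seconds_idle >= HARD_TIMEOUT_S:
        return "unloaded"
    if seconds_idle >= SOFT_TIMEOUT_S:
        return "releasing"  # soft timeout reached, hard timeout pending
    return "loaded"
```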
