webAI runs large language models directly on your hardware. There’s no cloud API, no network dependency, and no data leaving your machine for on-device inference. This page explains how the inference system works under the hood.
One exception: the built-in HR Persona processes queries through a remote webAI-hosted service, not on-device inference. See connected personas for details.

Inference backends

webAI supports three inference backends: WebGPU, MLX, and llama.cpp. The system automatically selects the best one based on your hardware, or you can choose manually.

WebGPU
Browser-native GPU inference. Works on any platform with a supported browser (Chrome 113+, Edge 113+). This is the default backend in browser mode.
  • Runs in the browser — no native app required
  • Uses your GPU via the WebGPU API
  • Best for: quick setup, cross-platform compatibility
WebGPU support varies by device. Older GPUs or browsers without WebGPU will fall back to llama.cpp if running in the desktop app.

MLX
Native inference on Apple Silicon, available through the desktop app.

llama.cpp
CPU-based inference, available through the desktop app. Serves as the fallback when WebGPU isn't supported.

Automatic backend routing

When you load a model, the system profiles your device and selects the best backend automatically:
  1. Device profiling — The system checks your available memory, GPU capabilities, and platform (macOS, browser, etc.).
  2. Backend selection — Based on the profile, it selects WebGPU, MLX, or llama.cpp, prioritizing speed and model compatibility.
  3. Model loading — The selected backend loads the model. If the first choice fails (e.g., not enough VRAM for WebGPU), the system falls back to the next available backend.
You can override this in Settings and manually select a backend if you prefer.
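The profiling-selection-fallback flow above can be sketched in a few lines. This is a minimal illustration only: the profile fields, backend names, and preference ordering are assumptions for the sake of example, not webAI's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class DeviceProfile:
    # Hypothetical profile fields; the real profiler's schema is not public.
    platform: str        # "macos", "browser", ...
    apple_silicon: bool  # MLX runs on Apple Silicon
    webgpu: bool         # browser/GPU supports the WebGPU API

def candidate_backends(p: DeviceProfile) -> list:
    """Order backends by preference; later entries are fallbacks."""
    order = []
    if p.apple_silicon and p.platform == "macos":
        order.append("mlx")
    if p.webgpu:
        order.append("webgpu")
    order.append("llama.cpp")  # CPU path is always a last resort
    return order

def select_backend(p: DeviceProfile, try_load) -> str:
    """Try each candidate in turn; fall back when loading fails
    (e.g., not enough VRAM for WebGPU)."""
    for backend in candidate_backends(p):
        if try_load(backend):
            return backend
    raise RuntimeError("no backend could load the model")
```

For example, a browser profile with WebGPU support would try `webgpu` first and fall back to `llama.cpp` only if the load fails.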

Model tiers

Models are organized into tiers based on size and the memory required to run them. The system recommends a tier based on your device’s capabilities.

MLX models (Apple Silicon)

| Tier | Model | Memory required |
| --- | --- | --- |
| 1 | Gemma 3n E2B | 4 GB |
| 2 | Gemma 3n E4B / Qwen3 4B | 8 GB |
| 3 | Qwen3 8B | 16 GB |
| 4 | Qwen3 32B | 32 GB |
| 5 | Qwen3 235B | 64+ GB |

llama.cpp models (CPU)

| Tier | Model | Memory required |
| --- | --- | --- |
| 1 | Qwen3 0.6B | 4 GB |
| 2 | Qwen3 4B | 6 GB |
| 3 | Qwen3 8B / Gemma 3 12B | 16 GB |
| 4 | Qwen3 32B | 32 GB |
| 5 | Llama 3.3 70B | 64+ GB |

WebGPU models (Browser)

| Tier | Model | GPU memory required |
| --- | --- | --- |
| 1 | Qwen3 1.7B | 4 GB |
| 2 | Qwen3 4B | 8 GB |
| 3 | Qwen3 8B | 16 GB |
| 4 | Qwen3 14B | 32 GB |
| 5 | Qwen3 32B | 32+ GB |
WebGPU model availability depends on your device’s GPU memory. The system automatically falls back to smaller models if a tier can’t be loaded. For the largest models, use the desktop app with MLX or llama.cpp.
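The tier recommendation and fall-back behavior can be sketched as a simple lookup. The memory figures below come from the WebGPU table above; the selection logic itself (and the function name) is an illustrative assumption, not webAI's actual code.

```python
# WebGPU tiers 1-4 from the table above (tier 5 needs 32+ GB and is omitted
# here since it has no fixed upper requirement).
WEBGPU_TIERS = [
    (1, "Qwen3 1.7B", 4),
    (2, "Qwen3 4B", 8),
    (3, "Qwen3 8B", 16),
    (4, "Qwen3 14B", 32),
]

def recommend_model(gpu_memory_gb: float) -> str:
    """Pick the largest model whose memory requirement fits the device,
    falling back to the smallest tier if nothing fits."""
    fitting = [model for _, model, req in WEBGPU_TIERS if req <= gpu_memory_gb]
    return fitting[-1] if fitting else WEBGPU_TIERS[0][1]
```

A device with 16 GB of GPU memory would be recommended the tier 3 model, while a device below 4 GB would fall back to the tier 1 model.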

LoRA adapters

LoRA (Low-Rank Adaptation) adapters let you customize a model’s behavior without downloading an entirely new model. Adapters are small files that layer on top of a base model to specialize it for a particular domain or style.

Available adapters

| Adapter | Base model | Purpose |
| --- | --- | --- |
| Chatbot LoRA | 0.5B | General conversational improvements |
| PubMedQA LoRA | 0.5B | Medical and biomedical Q&A |
| QwQ Creative LoRA | 0.5B | Creative writing and storytelling |
| Math LoRA | 0.6B | Mathematical reasoning |
| UltraChat SFT | 1.7B | Instruction following |
| SFT LoRA | 1.7B | Supervised fine-tuning |
| DPO LoRA | 1.7B | Alignment and preference optimization |
Adapters are attached through personas — each persona can specify which adapter to load alongside the base model.
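Conceptually, a persona is a small config that names a base model and, optionally, an adapter to layer on top. The sketch below illustrates that relationship; the `Persona` shape and `model_spec` helper are hypothetical, not webAI's actual persona schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Persona:
    # Hypothetical persona config; field names are assumptions.
    name: str
    base_model: str
    adapter: Optional[str] = None  # small LoRA file layered on the base model

def model_spec(persona: Persona) -> str:
    """Describe what gets loaded: the base model, plus the adapter if one
    is attached. Without an adapter, only the base model loads."""
    if persona.adapter is None:
        return persona.base_model
    return f"{persona.base_model} + {persona.adapter}"
```

For example, a medical Q&A persona could pair the 0.5B base model with the PubMedQA LoRA from the table above, while a default persona loads the base model alone.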

Idle management

To conserve resources, the AI runtime automatically manages model lifecycle:
  • Soft timeout (2 minutes) — If no requests are made, the runtime begins preparing to release resources.
  • Hard timeout (5 minutes) — The model is fully unloaded from memory.
  • Instant reload — When you send a new message, the model loads back automatically.
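The timeouts above amount to a small lifecycle state machine. The timings come from the text; the state names and the idea of mapping idle time to a state are illustrative assumptions.

```python
# Illustrative sketch of the idle lifecycle; timings from the text above.
SOFT_TIMEOUT_S = 2 * 60   # runtime begins preparing to release resources
HARD_TIMEOUT_S = 5 * 60   # model is fully unloaded from memory

def model_state(seconds_idle: float) -> str:
    """Map time since the last request to a lifecycle state.
    An "unloaded" model reloads automatically on the next message."""
    if seconds_idle >= HARD_TIMEOUT_S:
        return "unloaded"
    if seconds_idle >= SOFT_TIMEOUT_S:
        return "releasing"  # soft timeout reached, hard timeout pending
    return "loaded"
```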
