One exception: the built-in HR Persona processes queries through a remote webAI-hosted service, not on-device inference. See connected personas for details.
Inference backends
webAI supports three inference backends. The system automatically selects the best one based on your hardware, or you can choose manually:

- WebGPU
- MLX
- llama.cpp
WebGPU
Browser-native GPU inference. Works on any platform with a supported browser (Chrome 113+, Edge 113+). This is the default backend in browser mode.
- Runs in the browser — no native app required
- Uses your GPU via the WebGPU API
- Best for: quick setup, cross-platform compatibility
WebGPU support varies by device. If WebGPU is unavailable (an older GPU or an unsupported browser), the desktop app falls back to llama.cpp.
Automatic backend routing
When you load a model, the system profiles your device and selects the best backend automatically:

Device profiling
The system checks your available memory, GPU capabilities, and platform (macOS, browser, etc.).
Backend selection
Based on the profile, it selects WebGPU, MLX, or llama.cpp — prioritizing speed and model compatibility.
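The two steps above can be sketched in Python. This is a minimal illustration, not the runtime's actual code: the `DeviceProfile` fields, the `select_backend` name, and the exact priority order (MLX on Apple Silicon, WebGPU in a capable browser, llama.cpp as the universal fallback) are assumptions consistent with the behavior described in this section.

```python
from dataclasses import dataclass

@dataclass
class DeviceProfile:
    """Hypothetical device profile; field names are illustrative."""
    platform: str          # e.g. "macos", "browser", "linux"
    apple_silicon: bool
    webgpu_available: bool
    memory_gb: int

def select_backend(profile: DeviceProfile) -> str:
    """Pick a backend, prioritizing speed and model compatibility:
    MLX on Apple Silicon, WebGPU in a browser that supports it,
    and llama.cpp (CPU) everywhere else."""
    if profile.platform == "macos" and profile.apple_silicon:
        return "mlx"
    if profile.platform == "browser" and profile.webgpu_available:
        return "webgpu"
    return "llama.cpp"
```

In this sketch, llama.cpp is the unconditional fallback because it runs on any CPU, which matches the fallback behavior described for devices without WebGPU.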
Model tiers
Models are organized into tiers based on size and the memory required to run them. The system recommends a tier based on your device's capabilities.

MLX models (Apple Silicon)
| Tier | Model | Memory required |
|---|---|---|
| 1 | Gemma 3n E2B | 4 GB |
| 2 | Gemma 3n E4B / Qwen3 4B | 8 GB |
| 3 | Qwen3 8B | 16 GB |
| 4 | Qwen3 32B | 32 GB |
| 5 | Qwen3 235B | 64+ GB |
llama.cpp models (CPU)
| Tier | Model | Memory required |
|---|---|---|
| 1 | Qwen3 0.6B | 4 GB |
| 2 | Qwen3 4B | 6 GB |
| 3 | Qwen3 8B / Gemma 3 12B | 16 GB |
| 4 | Qwen3 32B | 32 GB |
| 5 | Llama 3.3 70B | 64+ GB |
WebGPU models (Browser)
| Tier | Model | GPU memory required |
|---|---|---|
| 1 | Qwen3 1.7B | 4 GB |
| 2 | Qwen3 4B | 8 GB |
| 3 | Qwen3 8B | 16 GB |
| 4 | Qwen3 14B | 32 GB |
| 5 | Qwen3 32B | 32+ GB |
WebGPU model availability depends on your device’s GPU memory. The system automatically falls back to smaller models if a tier can’t be loaded. For the largest models, use the desktop app with MLX or llama.cpp.
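The tier recommendation and fall-back behavior can be sketched as a simple lookup: pick the highest tier whose memory requirement fits the device, stepping down to smaller tiers otherwise. The table data mirrors the MLX column above; the `recommend_tier` function and its signature are illustrative, not the runtime's API.

```python
# (tier, model, memory required in GB), mirroring the MLX table above.
MLX_TIERS = [
    (1, "Gemma 3n E2B", 4),
    (2, "Gemma 3n E4B / Qwen3 4B", 8),
    (3, "Qwen3 8B", 16),
    (4, "Qwen3 32B", 32),
    (5, "Qwen3 235B", 64),
]

def recommend_tier(available_gb, tiers=MLX_TIERS):
    """Return the highest (tier, model) whose memory requirement fits,
    or None if even tier 1 cannot be loaded."""
    best = None
    for tier, model, required_gb in tiers:
        if required_gb <= available_gb:
            best = (tier, model)
    return best
```

For example, a 16 GB Apple Silicon machine lands on tier 3 (Qwen3 8B), while a device under 4 GB gets no recommendation at all.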
LoRA adapters
LoRA (Low-Rank Adaptation) adapters let you customize a model's behavior without downloading an entirely new model. Adapters are small files that layer on top of a base model to specialize it for a particular domain or style.

Available adapters
| Adapter | Base model | Purpose |
|---|---|---|
| Chatbot LoRA | 0.5B | General conversational improvements |
| PubMedQA LoRA | 0.5B | Medical and biomedical Q&A |
| QwQ Creative LoRA | 0.5B | Creative writing and storytelling |
| Math LoRA | 0.6B | Mathematical reasoning |
| UltraChat SFT | 1.7B | Instruction following |
| SFT LoRA | 1.7B | Supervised fine-tuning |
| DPO LoRA | 1.7B | Alignment and preference optimization |
Idle management
To conserve resources, the AI runtime automatically manages the model lifecycle:

- Soft timeout (2 minutes) — If no requests are made, the runtime begins preparing to release resources.
- Hard timeout (5 minutes) — The model is fully unloaded from memory.
- Instant reload — When you send a new message, the model loads back automatically.
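The three stages above amount to a small state machine driven by time since the last request. This sketch is an assumption about how such a manager could be structured; the class and method names are hypothetical, though the 2- and 5-minute thresholds come from the list above.

```python
import time

SOFT_TIMEOUT_S = 2 * 60   # begin preparing to release resources
HARD_TIMEOUT_S = 5 * 60   # fully unload the model from memory

class IdleManager:
    """Hypothetical idle-management sketch for the lifecycle above."""

    def __init__(self, now=time.monotonic):
        self._now = now
        self._last_request = now()

    def touch(self):
        """Record a new request; an unloaded model reloads here."""
        self._last_request = self._now()

    def state(self):
        idle = self._now() - self._last_request
        if idle >= HARD_TIMEOUT_S:
            return "unloaded"
        if idle >= SOFT_TIMEOUT_S:
            return "releasing"
        return "loaded"
```

Injecting the clock (`now=`) keeps the timeout logic testable without actually waiting minutes, which is a common pattern for time-driven lifecycles like this one.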