Local providers run AI models on your own hardware, giving you free, private AI with no per-token API costs.

Hardware Requirements

Running local models requires significant RAM/VRAM:
  • Capable coding models (30B+ parameters) need 24GB+ VRAM or unified memory (Apple Silicon M2 Pro/Max and above)
  • Smaller models run on less hardware but may struggle with complex coding tasks
  • Systems with limited VRAM: Ollama’s cloud models are an excellent free alternative — they run on Ollama’s servers with no local GPU needed
| Hardware | Capability | Recommended Models |
| --- | --- | --- |
| Apple Silicon M2 Pro/Max+ (32GB+) | High | qwen3-coder (local), MLX models |
| NVIDIA 3090/4090 (24GB+) | High | qwen3-coder, gpt-oss:20b |
| Mid-range GPU (12-24GB) | Mid | gpt-oss:20b, qwen2.5-coder:14b |
| Low-end GPU (under 12GB) | Low | Use Ollama Cloud models |
| CPU only | Minimal | Use Ollama Cloud (recommended) |
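The tiers above amount to a simple VRAM-to-model mapping. As a rough sketch, here is a hypothetical helper (`suggest_model` is illustrative and not part of Andi AIRun; the thresholds follow the tables on this page):

```shell
# Hypothetical helper: map usable VRAM (in GB) to a suggested model.
# Thresholds mirror the hardware table above; not part of Andi AIRun.
suggest_model() {
  local vram_gb=$1
  if   [ "$vram_gb" -ge 24 ]; then echo "qwen3-coder"       # high-end GPU / Apple Silicon
  elif [ "$vram_gb" -ge 16 ]; then echo "gpt-oss:20b"
  elif [ "$vram_gb" -ge 12 ]; then echo "qwen2.5-coder:14b"
  elif [ "$vram_gb" -ge 8 ];  then echo "qwen2.5-coder:7b"
  else echo "minimax-m2.5:cloud"                            # too little VRAM: use cloud
  fi
}

suggest_model 32   # prints qwen3-coder
```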

Ollama

Ollama runs models locally (free) or on Ollama’s cloud (no GPU needed).

Installation

brew install ollama
Ollama 0.15+ can auto-configure Claude Code:
ollama launch claude          # Interactive setup, picks model, launches Claude
ollama launch claude --config # Configure only, don't launch

Cloud Models (No GPU Required)

Cloud models run on Ollama’s infrastructure — ideal if your system doesn’t have enough VRAM for local models. Pull the manifest first (tiny download, the model runs remotely):
ollama pull minimax-m2.5:cloud       # Tiny download, runs remotely
ai --ollama --model minimax-m2.5:cloud

Available Cloud Models

| Cloud Model | SWE-bench | Params (active) | Best For | License |
| --- | --- | --- | --- | --- |
| minimax-m2.5:cloud | 80.2% | 230B MoE (10B) | Coding, agentic workflows | MIT |
| glm-5:cloud | 77.8% | 744B MoE (40B) | Reasoning, math, knowledge | MIT |
See Ollama cloud models for the full list.

Local Models (Free, Private)

Local models require sufficient VRAM — 24GB+ recommended for capable coding models.
ollama pull qwen3-coder   # Coding optimized (needs 24GB+ VRAM)
ai --ollama
| Model | Size | VRAM Needed | Best For |
| --- | --- | --- | --- |
| qwen3-coder | 30B | ~28GB | Coding tasks, large context |
| gpt-oss:20b | 20B | ~16GB | Strong general-purpose |
| qwen2.5-coder:14b | 14B | ~12GB | Mid-range GPUs |
| qwen2.5-coder:7b | 7B | ~8GB | Limited VRAM |

Model Aliases

Create aliases for tools expecting Anthropic model names:
ollama cp qwen3-coder claude-sonnet-4-6
ai --ollama --model claude-sonnet-4-6

Configuration

Override defaults in ~/.ai-runner/secrets.sh:
export OLLAMA_MODEL_MID="qwen3-coder"        # Default model
export OLLAMA_SMALL_FAST_MODEL="qwen3-coder" # Background model (or leave empty to use same)
By default, Ollama uses the same model for both main and background operations to avoid VRAM swapping. Only set OLLAMA_SMALL_FAST_MODEL if you have 24GB+ VRAM.
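As a concrete example, a 24GB+ system might pair the main coding model with a lighter background model. This is a sketch (the model split is a suggestion, with names taken from the local-model table above):

```shell
# ~/.ai-runner/secrets.sh — example split for a 24GB+ VRAM system
export OLLAMA_MODEL_MID="qwen3-coder"              # main model
export OLLAMA_SMALL_FAST_MODEL="qwen2.5-coder:7b"  # lighter background model
```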

Auto-Download Feature

When you specify a model that isn’t installed locally, Andi AIRun offers a choice between local and cloud:
ai --ollama --model qwen3-coder
# Model 'qwen3-coder' not found locally.
#
# Your system has ~32GB usable VRAM.
#
# Options:
#   1) Pull local version (recommended) - qwen3-coder
#   2) Use cloud version - qwen3-coder:cloud
#
# Choice [1]: 1
# Pulling model: qwen3-coder
# [##################################################] 100%
# Model pulled successfully
On systems with limited VRAM (under 20GB), the cloud option is recommended and offered first.

Usage Examples

# Use default model
ai --ollama

# Use cloud model
ai --ollama --model glm-5:cloud

# Use specific local model
ai --ollama --model qwen3-coder

# Use with tier flags
ai --ollama --opus  # Uses OLLAMA_MODEL_HIGH if set
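The tier flags resolve through the environment variables above. Assuming the behavior implied here (prefer `OLLAMA_MODEL_HIGH`, fall back to `OLLAMA_MODEL_MID`, then a built-in default), the resolution for `--opus` can be sketched with standard shell parameter expansion. The function name and the `qwen3-coder` default are illustrative, not part of Andi AIRun:

```shell
# Sketch of --opus model resolution (assumed behavior): prefer
# OLLAMA_MODEL_HIGH, fall back to OLLAMA_MODEL_MID, then a default.
resolve_opus_model() {
  echo "${OLLAMA_MODEL_HIGH:-${OLLAMA_MODEL_MID:-qwen3-coder}}"
}

OLLAMA_MODEL_MID="gpt-oss:20b"
resolve_opus_model   # prints gpt-oss:20b (assuming OLLAMA_MODEL_HIGH is unset)
```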

Ollama Anthropic API Compatibility

Learn more about Ollama’s Anthropic API compatibility

LM Studio

LM Studio runs local models with Anthropic API compatibility. It is especially powerful on Apple Silicon with MLX models, and requires sufficient RAM/VRAM for the model you choose.

Advantages Over Ollama

  • MLX model support (significantly faster on Apple Silicon)
  • GGUF + MLX formats supported
  • Bring your own models from HuggingFace

Installation

Download from lmstudio.ai

Setup

  1. Download a model in LM Studio (e.g., from HuggingFace)
  2. Load the model in LM Studio UI
  3. Start the server:
    lms server start --port 1234
    
    Or start from the LM Studio app’s local server tab.
  4. Run Andi AIRun:
    ai --lmstudio
    # or
    ai --lm
    
For Claude Code, use models with:
  • 25K+ context window (required for Claude Code’s heavy context usage)
  • Function calling / tool use support
Examples:
  • openai/gpt-oss-20b - Strong general-purpose
  • ibm/granite-4-micro - Fast, efficient

Apple Silicon Optimization

LM Studio supports MLX models, which run significantly faster than GGUF on M1/M2/M3/M4 chips thanks to optimized Metal acceleration, often 2-3x faster. When downloading models, look for MLX versions for best performance.

Configuration

Override defaults in ~/.ai-runner/secrets.sh:
export LMSTUDIO_HOST="http://localhost:1234"     # Custom server URL
export LMSTUDIO_MODEL_MID="openai/gpt-oss-20b"   # Default model
export LMSTUDIO_MODEL_HIGH="openai/gpt-oss-20b"  # High tier model
export LMSTUDIO_MODEL_LOW="ibm/granite-4-micro"  # Low tier model
By default, LM Studio uses the same model for all tiers and background operations to avoid model swapping.

Context Window

Configure context size in LM Studio:
  • UI: Settings → Context Length
  • Minimum recommended: 25K tokens
  • Higher is better for complex coding tasks

Auto-Download Feature

When you specify a model that isn’t available, Andi AIRun will offer to download it:
ai --lm --model lmstudio-community/qwen3-8b-gguf
# Model 'lmstudio-community/qwen3-8b-gguf' not found in LM Studio.
# Download it? [Y/n]: y
# Downloading model: lmstudio-community/qwen3-8b-gguf
# Progress: 100.0%
# Model downloaded successfully
# Load it now? [Y/n]: y
# Model loaded

Usage Examples

# Use default model
ai --lmstudio

# Use short alias
ai --lm

# Use specific model
ai --lm --model openai/gpt-oss-20b

# Use with tier flags
ai --lm --opus  # Uses LMSTUDIO_MODEL_HIGH if set

LM Studio Claude Code Guide

Learn more about using LM Studio with Claude Code

Comparison: Ollama vs LM Studio

| Feature | Ollama | LM Studio |
| --- | --- | --- |
| Cloud models | ✅ Yes (free) | ❌ No |
| MLX support | ❌ No | ✅ Yes (faster on Apple Silicon) |
| Model formats | Ollama format | GGUF, MLX |
| Model library | Curated | HuggingFace, custom |
| Setup | Command-line focused | GUI-focused |
| Best for | Quick start, cloud fallback | Apple Silicon, custom models |

Next Steps