On-Device LLMs: Running AI Locally on Android and iOS

On-device vs Cloud AI — Quick Comparison

Criterion	On-Device	Cloud API
Privacy	✅ Data never leaves device	⚠️ Data sent to cloud
Latency	✅ ~50–200ms	⚠️ 300–2000ms
Cost	✅ No per-request fees after install	💰 Billed per token
Capability	⚠️ Smaller models (1–7B)	✅ GPT-4 / Gemini Ultra
Offline	✅ Works offline	❌ Needs internet
Model size	⚠️ 1–4GB download	✅ No download

Privacy-first AI isn't a niche feature anymore. Users are increasingly uncomfortable with their data leaving their device, and regulators in the EU and elsewhere are tightening restrictions on cloud AI processing. On-device LLMs are the answer — but they come with real trade-offs.

The current landscape

In 2025, you have three serious options for running LLMs on mobile:

1. MediaPipe LLM Inference API (Google)

The most production-ready option for Android. Supports Gemma 2B and 7B, runs on GPU or CPU, and integrates cleanly with both Android native and Flutter. Available on iOS too, though with some limitations.

2. llama.cpp / Llama 3 (Meta)

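The open-source powerhouse. llama.cpp is a community-built inference runtime, and Llama 3 is Meta's open-weight model family. Llama 3 8B (4-bit quantised) can run at roughly 15–25 tokens/second on a modern Android flagship, though throughput varies with chipset, quantisation format, and context length. Best for developers who want maximum flexibility and control. Integration requires more work.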

3. Core ML (Apple)

Apple's machine learning framework is tightly integrated with iOS hardware. The Neural Engine on Apple Silicon is exceptional for inference. Models in Core ML format can be integrated via Swift. Access from Flutter requires a platform channel.

Getting started with MediaPipe LLM in Flutter

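Google provides the google_mediapipe plugin. Here's the basic setup; treat the class and option names as illustrative and check them against the plugin version you install: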

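// Load the model. The path below looks like a development location (for
// example, a file pushed with adb); in production, point it at the
// downloaded model file in app storage.
final llmInference = await LlmInference.createFromOptions(
  LlmInferenceOptions(
    modelPath: '/data/local/tmp/gemma-2b-it-q4.bin',
    maxTokens: 1024,   // upper bound on prompt plus response tokens
    topK: 40,          // sample from the 40 most likely next tokens
    temperature: 0.8,  // higher values give more varied output
  ),
);

// Generate text; articleText is a String holding the text to summarise.
final result = await llmInference.generateResponse(
  'Summarise this article: $articleText',
);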

The model download challenge

The biggest UX challenge with on-device AI is the initial model download, typically 1–4GB. The right approach:

Download on first launch over Wi-Fi only (check connectivity before starting)
Show a clear progress indicator with size information
Store the model in the app's internal storage; avoid the cache directory, which the OS can clear under storage pressure and so force a multi-gigabyte re-download
Use resumable downloads, because mobile connections drop
Decide whether to bundle the model with the app (an option for enterprise distribution) or download it after install
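The checklist above can be sketched as a small decision function that runs before any bytes are requested. This is a minimal sketch in plain Dart, not code from any plugin: ConnectionType, DownloadPlan, planModelDownload and progressLabel are hypothetical names, and a real app would fill in the connection type and free-space figures from its own platform checks. The only protocol detail assumed is the standard HTTP Range request header for resuming.

```dart
// Sketch only: every name here is hypothetical and not part of any plugin.
enum ConnectionType { wifi, cellular, none }

class DownloadPlan {
  const DownloadPlan(this.shouldStart, this.reason, {this.rangeHeader});

  final bool shouldStart;
  final String reason;

  /// Value for the HTTP `Range` request header when resuming, else null.
  final String? rangeHeader;
}

/// Decides whether a model download should start now and, if a partial
/// file already exists, which byte range to request.
DownloadPlan planModelDownload({
  required ConnectionType connection,
  required int totalBytes,
  required int bytesOnDisk,
  required int freeBytes,
}) {
  if (bytesOnDisk >= totalBytes) {
    return const DownloadPlan(false, 'model already downloaded');
  }
  if (connection != ConnectionType.wifi) {
    return const DownloadPlan(false, 'waiting for Wi-Fi');
  }
  final remaining = totalBytes - bytesOnDisk;
  if (freeBytes < remaining) {
    return const DownloadPlan(false, 'not enough free storage');
  }
  if (bytesOnDisk > 0) {
    // Resume: ask the server for everything from the first missing byte.
    return DownloadPlan(true, 'resuming', rangeHeader: 'bytes=$bytesOnDisk-');
  }
  return const DownloadPlan(true, 'starting fresh');
}

/// Human-readable progress text that includes size information.
String progressLabel(int receivedBytes, int totalBytes) {
  String gb(int bytes) => (bytes / 1e9).toStringAsFixed(1);
  final percent = (100 * receivedBytes / totalBytes).round();
  return '${gb(receivedBytes)} GB of ${gb(totalBytes)} GB ($percent%)';
}

void main() {
  final plan = planModelDownload(
    connection: ConnectionType.wifi,
    totalBytes: 2400000000,
    bytesOnDisk: 1200000000,
    freeBytes: 5000000000,
  );
  print('${plan.reason}: ${plan.rangeHeader}');
  print(progressLabel(1200000000, 2400000000));
}
```

Resuming this way only works if the server hosting the model honours range requests; if it does not, the safe fallback is to discard the partial file and start again.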

Quantisation: the key to fitting LLMs on phones

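A full Llama 3 8B model stored at 16-bit precision is ~16GB (8 billion weights at 2 bytes each). A 4-bit quantised version is roughly 4–5GB depending on the format, and a Q2_K version is around 3GB. Quality degrades as quantisation gets more aggressive, but 4-bit quantisation holds up well: most users can't tell the difference for typical tasks.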
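The sizes above follow from simple arithmetic: parameter count times bits per weight, divided by eight to get bytes. The snippet below is a rough estimator under that assumption; estimateModelSizeGb is a hypothetical helper name, and 1 GB is taken as 10^9 bytes.

```dart
/// Approximate model size in gigabytes (1 GB = 1e9 bytes) for a model with
/// [parameters] weights stored at [bitsPerWeight] bits each.
/// Hypothetical helper for illustration; not part of any library.
double estimateModelSizeGb(int parameters, double bitsPerWeight) {
  final bytes = parameters * bitsPerWeight / 8;
  return bytes / 1e9;
}

void main() {
  const llama3Params = 8000000000; // roughly 8 billion weights

  // 16-bit weights: 8e9 * 16 / 8 bytes
  print(estimateModelSizeGb(llama3Params, 16).toStringAsFixed(1)); // 16.0

  // Nominal 4-bit weights: 8e9 * 4 / 8 bytes
  print(estimateModelSizeGb(llama3Params, 4).toStringAsFixed(1)); // 4.0
}
```

Real quantised files tend to come out larger than the nominal figure, because quantisation formats also store scaling factors and usually keep some tensors at higher precision. That is also why a nominally 2-bit format such as Q2_K lands above the 2GB this formula would predict.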

Hybrid patterns: the best of both worlds

The smartest apps don't choose — they combine. Use on-device AI for:

Privacy-sensitive tasks (personal notes, health data, messages)
Offline operation
Low-latency features (inline suggestions, real-time feedback)

And cloud AI for:

Complex reasoning requiring GPT-4 capability
Tasks requiring up-to-date knowledge
Multimodal analysis on low-spec devices
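One way to express the two lists above is a small routing function that picks a backend per request. The sketch below is an illustration under stated assumptions, not a prescribed design: Backend, AiRequest and chooseBackend are hypothetical names, the flags would be set by your own app logic, and the rule order (privacy and offline first, then capability) is one reasonable choice among several.

```dart
// Sketch only: all names are hypothetical.
enum Backend { onDevice, cloud }

class AiRequest {
  const AiRequest({
    this.privacySensitive = false,
    this.needsComplexReasoning = false,
    this.needsFreshKnowledge = false,
    this.multimodal = false,
  });

  final bool privacySensitive; // personal notes, health data, messages
  final bool needsComplexReasoning; // beyond what a small local model does well
  final bool needsFreshKnowledge; // depends on up-to-date information
  final bool multimodal; // image or audio analysis
}

Backend chooseBackend(
  AiRequest request, {
  required bool isOnline,
  required bool lowSpecDevice,
}) {
  // Privacy-sensitive data never leaves the device, and without a
  // connection the local model is the only option.
  if (request.privacySensitive || !isOnline) return Backend.onDevice;

  // Capability-driven cases go to the cloud.
  if (request.needsComplexReasoning || request.needsFreshKnowledge) {
    return Backend.cloud;
  }
  if (request.multimodal && lowSpecDevice) return Backend.cloud;

  // Default: stay local for low latency and zero per-token cost.
  return Backend.onDevice;
}

void main() {
  const healthNote = AiRequest(
    privacySensitive: true,
    needsComplexReasoning: true,
  );
  // Privacy wins over capability, so this stays on device.
  print(chooseBackend(healthNote, isOnline: true, lowSpecDevice: false));

  const newsQuestion = AiRequest(needsFreshKnowledge: true);
  print(chooseBackend(newsQuestion, isOnline: true, lowSpecDevice: false));
}
```

Putting the privacy check first means a sensitive request is answered locally even when the cloud model would do a better job; an app that prefers to ask the user for consent in that case would add a third outcome here.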

The future of AI apps isn't cloud vs. on-device. It's intelligently routing between them based on context, privacy requirements, and available connectivity.
