Roboto Systems
AI Agents & Apps · Est. 2012

On-Device LLMs: Running AI Locally on Android and iOS

Privacy-first AI is here. We look at Llama 3, MediaPipe LLM, and what it really takes to run language models on a mobile device.

April 10, 2026 · 12 min read

On-device vs Cloud AI — Quick Comparison

Aspect | On-Device | Cloud API
Privacy | ✅ Data never leaves device | ⚠️ Data sent to cloud
Latency | ✅ ~50–200 ms | ⚠️ 300–2000 ms
Cost | ✅ Free after install | 💰 Per token
Capability | ⚠️ Smaller models (1–8B parameters) | ✅ GPT-4 / Gemini Ultra
Offline | ✅ Works offline | ❌ Needs internet
Model size | ⚠️ 1–4 GB download | ✅ No download

Privacy-first AI isn't a niche feature anymore. Users are increasingly uncomfortable with their data leaving their device, and regulators in the EU and elsewhere are tightening restrictions on cloud AI processing. On-device LLMs are the answer — but they come with real trade-offs.

The current landscape

As of 2026, you have three serious options for running LLMs on mobile:

1. MediaPipe LLM Inference API (Google)

The most production-ready option for Android. Supports Gemma 2B and 7B, runs on GPU or CPU, and integrates cleanly with both Android native and Flutter. Available on iOS too, though with some limitations.

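2. llama.cpp with Llama 3 (open source, Meta weights)

The open-source powerhouse. llama.cpp is a community-maintained inference engine, and Llama 3 is Meta's open-weight model family. Llama 3 8B (4-bit quantised) runs at 15–25 tokens/second on a modern Android flagship, though throughput varies with chipset and context length. Best for developers who want maximum flexibility and control, at the cost of more integration work.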

3. Core ML (Apple)

Apple's machine learning framework is tightly integrated with iOS hardware. The Neural Engine on Apple Silicon is exceptional for inference. Models in Core ML format can be integrated via Swift. Access from Flutter requires a platform channel.

Getting started with MediaPipe LLM in Flutter

Google provides the google_mediapipe plugin. Here's the basic setup:

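// Sketch: class and parameter names follow this article's example; check them
// against the plugin version you install.
Future<String> summarise(String articleText) async {
  // Load the model. /data/local/tmp is a development location (adb push);
  // in production, point this at the file downloaded into app storage.
  final llmInference = await LlmInference.createFromOptions(
    LlmInferenceOptions(
      modelPath: '/data/local/tmp/gemma-2b-it-q4.bin',
      maxTokens: 1024, // budget shared by prompt and response tokens
      topK: 40, // sample from the 40 most likely tokens
      temperature: 0.8, // higher values give more varied output
    ),
  );

  // Generate text
  return llmInference.generateResponse(
    'Summarise this article: $articleText',
  );
}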
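
One practical wrinkle: in MediaPipe's LLM Inference API, maxTokens covers the prompt and the response together, so a long article can use up the whole budget before the model writes a word. Below is a minimal sketch of trimming the input first. The helper name buildSummaryPrompt, the 256-token output reserve, and the 4-characters-per-token ratio are all assumptions for illustration; a real app would count tokens with the model's own tokenizer.

```dart
/// Rough characters-per-token ratio for English text.
/// Assumption: a heuristic, not the model's real tokenizer.
const int assumedCharsPerToken = 4;

/// Builds a summarisation prompt whose estimated token count leaves
/// [reservedForOutput] tokens free within [maxTokens].
/// Hypothetical helper, not part of any MediaPipe plugin.
String buildSummaryPrompt(
  String articleText, {
  int maxTokens = 1024,
  int reservedForOutput = 256,
}) {
  const instruction = 'Summarise this article: ';
  final inputBudgetTokens = maxTokens - reservedForOutput;
  // Characters left for the article once the instruction is accounted for.
  final inputBudgetChars =
      inputBudgetTokens * assumedCharsPerToken - instruction.length;
  if (inputBudgetChars <= 0) {
    throw ArgumentError('maxTokens is too small for the instruction');
  }
  // Keep the start of the article and drop whatever does not fit.
  final body = articleText.length <= inputBudgetChars
      ? articleText
      : articleText.substring(0, inputBudgetChars);
  return '$instruction$body';
}
```

Pass the result to generateResponse in place of the raw interpolated string. Cutting at a character boundary is crude; splitting on paragraphs and summarising in chunks usually gives better results.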

The model download challenge

The biggest UX challenge with on-device AI is the initial model download — typically 1–4GB. The right approach:

  • Download on first launch over WiFi only (check Connectivity before starting)
  • Show a clear progress indicator with size information
  • Store in the app's internal storage (the cache directory also works, but the OS may clear it and force a re-download)
  • Use resumable downloads — mobile connections drop
  • Consider whether the model is bundled (for enterprise apps) or downloaded
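
The checklist above can be sketched as a few small, plugin-free helpers. Everything here is hypothetical naming for illustration (ConnectionType, canStartDownload, resumeRangeHeader, progressLabel); in a real app the connection type would come from a connectivity plugin and the free space from the platform. Resuming also assumes the server honours HTTP Range requests.

```dart
/// Simplified connection states; map your connectivity plugin's result to this.
enum ConnectionType { wifi, mobile, none }

/// Decides whether a model download may start (or resume) now:
/// WiFi only, and enough free space for the bytes still missing.
bool canStartDownload({
  required ConnectionType connection,
  required int freeBytes,
  required int modelBytes,
  int alreadyDownloaded = 0,
}) {
  final remaining = modelBytes - alreadyDownloaded;
  return connection == ConnectionType.wifi && freeBytes >= remaining;
}

/// HTTP Range header value for resuming after [alreadyDownloaded] bytes,
/// or null when starting from scratch.
String? resumeRangeHeader(int alreadyDownloaded) =>
    alreadyDownloaded > 0 ? 'bytes=$alreadyDownloaded-' : null;

/// Progress text with size information, e.g. for a progress bar caption.
/// [totalBytes] must be greater than zero.
String progressLabel(int downloadedBytes, int totalBytes) {
  const gb = 1000 * 1000 * 1000;
  final done = (downloadedBytes / gb).toStringAsFixed(1);
  final total = (totalBytes / gb).toStringAsFixed(1);
  final percent = downloadedBytes * 100 ~/ totalBytes;
  return '$done GB of $total GB ($percent%)';
}
```

Keeping the policy in pure functions like these makes it easy to unit test without a device; the actual HTTP client and file writes sit on top.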

Quantisation: the key to fitting LLMs on phones

A full Llama 3 8B model in 16-bit precision is ~16 GB. A 4-bit quantised version is ~4 GB, and a Q2_K version is ~2.5 GB. Quality degrades as precision drops, but 4-bit quantisation holds up well: most users can't tell the difference for typical tasks.
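
The arithmetic behind those figures is simply parameters × bits per weight ÷ 8. Here is a small sketch; the function name estimateWeightsGb is made up for illustration, and the 2.5 bits per weight used for Q2_K is an assumption chosen to match the figure above. Real model files come out somewhat larger because they also store quantisation scales and metadata, and often keep some layers at higher precision.

```dart
/// Approximate size of a model's weights in gigabytes (1 GB = 10^9 bytes).
///
/// [paramsBillions] is the parameter count in billions and [bitsPerWeight]
/// the average storage per parameter after quantisation.
double estimateWeightsGb(double paramsBillions, double bitsPerWeight) =>
    paramsBillions * bitsPerWeight / 8;
```

For an 8B model, estimateWeightsGb(8, 16) gives 16.0, estimateWeightsGb(8, 4) gives 4.0, and estimateWeightsGb(8, 2.5) gives 2.5. The same sum tells you quickly whether a candidate model has any chance of fitting in a target device's memory.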

Hybrid patterns: the best of both worlds

The smartest apps don't choose — they combine. Use on-device AI for:

  • Privacy-sensitive tasks (personal notes, health data, messages)
  • Offline operation
  • Low-latency features (inline suggestions, real-time feedback)

And cloud AI for:

  • Complex reasoning requiring GPT-4 capability
  • Tasks requiring up-to-date knowledge
  • Multimodal analysis on low-spec devices

The future of AI apps isn't cloud vs. on-device. It's intelligently routing between them based on context, privacy requirements, and available connectivity.
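
As a rough illustration of that routing, here is a minimal sketch. The names (Backend, AiTask, chooseBackend) and the exact rule order are assumptions, not a standard API: privacy and connectivity are treated as hard constraints, capability needs come second, and on-device is the default.

```dart
enum Backend { onDevice, cloud }

/// Describes what a single request needs. Hypothetical type for illustration.
class AiTask {
  const AiTask({
    this.privacySensitive = false,
    this.needsFreshKnowledge = false,
    this.needsComplexReasoning = false,
  });

  final bool privacySensitive;
  final bool needsFreshKnowledge;
  final bool needsComplexReasoning;
}

/// Picks where to run [task] given current connectivity.
Backend chooseBackend(AiTask task, {required bool online}) {
  // Hard constraints first: private data stays local, and without a
  // connection the cloud is not an option.
  if (task.privacySensitive || !online) return Backend.onDevice;
  // Capability needs that small local models handle poorly.
  if (task.needsComplexReasoning || task.needsFreshKnowledge) {
    return Backend.cloud;
  }
  // Default to local for low latency and zero per-token cost.
  return Backend.onDevice;
}
```

A production router would probably weigh more signals, such as device class, battery level, or user consent, but the shape stays the same: constraints first, capability second, and a cheap default.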
