On-device vs Cloud AI — Quick Comparison
| | On-Device | Cloud API |
|---|---|---|
| Privacy | ✅ Data never leaves device | ⚠️ Data sent to cloud |
| Latency | ✅ ~50–200ms | ⚠️ 300–2000ms |
| Cost | ✅ Free after install | 💰 Per token |
| Capability | ⚠️ Smaller models (1–8B) | ✅ GPT-4 / Gemini Ultra |
| Offline | ✅ Works offline | ❌ Needs internet |
| Model size | ⚠️ 1–4GB download | ✅ No download |
Privacy-first AI isn't a niche feature anymore. Users are increasingly uncomfortable with their data leaving their device, and regulators in the EU and elsewhere are tightening restrictions on cloud AI processing. On-device LLMs are the answer — but they come with real trade-offs.
The current landscape
In 2025, you have three serious options for running LLMs on mobile:
1. MediaPipe LLM Inference API (Google)
The most production-ready option for Android. Supports Gemma 2B and 7B, runs on GPU or CPU, and integrates cleanly with both Android native and Flutter. Available on iOS too, though with some limitations.
2. llama.cpp / Llama 3 (Meta)
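The open-source powerhouse. llama.cpp is a community-maintained C/C++ inference runtime, and Llama 3 is Meta's open-weight model family that it can run. Llama 3 8B (4-bit quantised) runs at 15–25 tokens/second on a modern Android flagship. Best for developers who want maximum flexibility and control, at the cost of more integration work.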
3. Core ML (Apple)
Apple's machine learning framework is tightly integrated with iOS hardware. The Neural Engine on Apple Silicon is exceptional for inference. Models in Core ML format can be integrated via Swift. Access from Flutter requires a platform channel.
Getting started with MediaPipe LLM in Flutter
Google provides the google_mediapipe plugin. Here's the basic setup:
    // Load the model
    final llmInference = await LlmInference.createFromOptions(
      LlmInferenceOptions(
        modelPath: '/data/local/tmp/gemma-2b-it-q4.bin',
        maxTokens: 1024,
        topK: 40,
        temperature: 0.8,
      ),
    );
    // Generate text
    final result = await llmInference.generateResponse(
      'Summarise this article: ${articleText}',
    );
The model download challenge
The biggest UX challenge with on-device AI is the initial model download — typically 1–4GB. The right approach:
- Download on first launch over WiFi only (check Connectivity before starting)
- Show a clear progress indicator with size information
- Store in the app's internal storage; the cache directory also works, but the OS may clear it and force a re-download
- Use resumable downloads — mobile connections drop
- Consider whether the model is bundled (for enterprise apps) or downloaded
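The resumable-download item in the checklist above can be sketched with nothing but `dart:io`. This is a minimal sketch, assuming the model host honours HTTP Range requests; `downloadResumable` and its `onProgress` callback are illustrative names, not part of any plugin. The WiFi check is left out because it depends on the connectivity plugin and version you use.

```dart
import 'dart:io';

/// Downloads [url] into [target], continuing from any partial file on disk.
/// Hypothetical helper: assumes the server supports HTTP Range requests.
Future<void> downloadResumable(
  Uri url,
  File target, {
  void Function(int received, int? total)? onProgress,
}) async {
  // Bytes already saved by an earlier, interrupted attempt.
  final existing = await target.exists() ? await target.length() : 0;
  final client = HttpClient();
  try {
    final request = await client.getUrl(url);
    if (existing > 0) {
      // Ask only for the part we do not have yet.
      request.headers.set(HttpHeaders.rangeHeader, 'bytes=$existing-');
    }
    final response = await request.close();

    if (response.statusCode == 416) {
      // Range not satisfiable: most likely the file on disk is already complete.
      await response.drain<void>();
      return;
    }
    // 206 means the server resumed; 200 means it is sending the whole file.
    final resumed = response.statusCode == HttpStatus.partialContent;
    if (!resumed && response.statusCode != HttpStatus.ok) {
      await response.drain<void>();
      throw HttpException('Unexpected status ${response.statusCode}', uri: url);
    }

    var received = resumed ? existing : 0;
    // contentLength is -1 when the server does not announce a size.
    final total = response.contentLength >= 0
        ? received + response.contentLength
        : null;

    final sink = target.openWrite(
      mode: resumed ? FileMode.append : FileMode.write,
    );
    try {
      // addStream waits for the disk, so chunks do not pile up in memory.
      await sink.addStream(response.map((chunk) {
        received += chunk.length;
        onProgress?.call(received, total);
        return chunk;
      }));
    } finally {
      await sink.close();
    }
  } finally {
    client.close();
  }
}
```

Calling the function again after a dropped connection picks up where the file left off, and `onProgress` gives you the received and total byte counts for the progress indicator.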
Quantisation: the key to fitting LLMs on phones
A full Llama 3 8B model is ~16GB at 16-bit precision (8 billion parameters × 2 bytes each). A 4-bit quantised version is ~4GB. A Q2_K version is ~2.5GB. Quality degrades as the bit width drops, but 4-bit quantisation holds up well: for typical tasks such as summarisation and short replies, most users can't tell the difference.
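The sizes above follow from simple arithmetic: parameters × bits per weight ÷ 8. A rough sketch, where `weightSizeGb` is an illustrative helper and the 2.5 bits per weight for Q2_K is an assumption back-derived from the ~2.5GB figure rather than a property of the format:

```dart
/// Approximate weight storage in GB (10^9 bytes).
/// Ignores per-block scales, higher-precision layers, and file metadata.
double weightSizeGb(double paramsBillions, double bitsPerWeight) =>
    paramsBillions * bitsPerWeight / 8;

void main() {
  // 16-bit baseline, 4-bit quantised, and an assumed ~2.5 bits for Q2_K.
  for (final bits in [16.0, 4.0, 2.5]) {
    final gb = weightSizeGb(8, bits).toStringAsFixed(1);
    print('$bits bits/weight: ~$gb GB');
  }
}
```

Real quantised files tend to come out somewhat larger than this estimate, because quantisation formats store scaling factors alongside the weights and often keep some tensors at higher precision.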
Hybrid patterns: the best of both worlds
The smartest apps don't choose — they combine. Use on-device AI for:
- Privacy-sensitive tasks (personal notes, health data, messages)
- Offline operation
- Low-latency features (inline suggestions, real-time feedback)
And cloud AI for:
- Complex reasoning requiring GPT-4 capability
- Tasks requiring up-to-date knowledge
- Multimodal analysis on low-spec devices
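One way to encode the two lists above is a small routing function. This is a sketch under stated assumptions: `Backend`, `AiTask`, `route`, and every flag name are hypothetical, and the order of the checks (privacy and connectivity first) is one reasonable policy, not the only one.

```dart
enum Backend { onDevice, cloud }

/// What a single request needs. All fields are illustrative assumptions.
class AiTask {
  const AiTask({
    this.privacySensitive = false,
    this.latencyCritical = false,
    this.needsComplexReasoning = false,
    this.needsFreshKnowledge = false,
    this.multimodal = false,
  });

  final bool privacySensitive;
  final bool latencyCritical;
  final bool needsComplexReasoning;
  final bool needsFreshKnowledge;
  final bool multimodal;
}

Backend route(AiTask task, {required bool online, bool lowSpecDevice = false}) {
  // Private data stays on the device, and offline leaves no other choice.
  if (task.privacySensitive || !online) return Backend.onDevice;
  // Inline suggestions and real-time feedback cannot wait for a round trip.
  if (task.latencyCritical) return Backend.onDevice;
  // Hand heavier or knowledge-dependent work to the cloud.
  if (task.needsComplexReasoning || task.needsFreshKnowledge) {
    return Backend.cloud;
  }
  if (task.multimodal && lowSpecDevice) return Backend.cloud;
  // Default to the cheaper, private path.
  return Backend.onDevice;
}

void main() {
  // Privacy wins even when the task would benefit from a larger model.
  print(route(
    const AiTask(privacySensitive: true, needsComplexReasoning: true),
    online: true,
  )); // Backend.onDevice

  print(route(const AiTask(needsFreshKnowledge: true), online: true)); // Backend.cloud
  print(route(const AiTask(needsFreshKnowledge: true), online: false)); // Backend.onDevice
}
```

Putting the privacy check first means a sensitive request never reaches the cloud, even if it would get a better answer there; an app with different priorities could reorder the checks or ask the user for consent instead.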
The future of AI apps isn't cloud vs. on-device. It's intelligently routing between them based on context, privacy requirements, and available connectivity.