Generative AI · June 10, 2025

Apple Foundation Models gives iOS developers on-device AI without cloud APIs

Apple just gave iOS developers a local LLM API, and that changes the economics of app AI

Apple's new Foundation Models framework was one of the most important WWDC announcements for developers, even if it didn't get the usual AI stage treatment.

The idea is straightforward. Third-party apps can call Apple's built-in foundation models directly on-device across iOS, iPadOS, and macOS. No trip out to OpenAI, Anthropic, or your own inference backend. No per-token bill. No user data leaving the phone unless the app sends it somewhere else.

That matters for three familiar reasons: cost, latency, and privacy. All three have made AI features harder to ship at scale. Apple is trying to turn that whole problem into a platform feature.

For developers, that's a real shift. For users, it means more AI features that work on a plane, in the subway, or inside a locked-down enterprise setup. For Apple, it's a clean way to turn its chip advantage into an ecosystem advantage.

What Apple is shipping

The Foundation Models framework sits inside the broader Apple Intelligence push. Developers get a Swift-native API for running Apple's own packaged models locally. Apple says the framework supports text generation, classification, embeddings, and tool calling. The API is intentionally small. Apple's pitch is that you can stand up a basic text-generation flow in a few lines of Swift.

The abstractions that matter are simple:

  • LanguageModelSession runs prompts against the system model and carries conversation context
  • The Tool protocol lets the model call app-defined functions during generation

That second part matters. Tool calling gives an app controlled access to local logic, internal data stores, or external APIs. That's how you get from "write me a summary" to "summarize this meeting, pull the related customer record, and draft follow-up actions."
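A minimal sketch of that flow, following Apple's published Tool protocol shape. The tool name, arguments, and the customer record returned here are invented for illustration, and exact protocol requirements may differ slightly across OS versions:

```swift
import FoundationModels

// Hypothetical tool: fetches a customer record the model can cite.
struct CustomerLookupTool: Tool {
    let name = "lookupCustomer"
    let description = "Fetches the stored record for a customer by name."

    @Generable
    struct Arguments {
        @Guide(description: "The customer's full name")
        var customerName: String
    }

    func call(arguments: Arguments) async throws -> ToolOutput {
        // A real app would query its local store here.
        ToolOutput("Customer \(arguments.customerName): Pro plan, 2 open tickets")
    }
}

// The model decides during generation whether to invoke the tool.
let session = LanguageModelSession(tools: [CustomerLookupTool()])
let answer = try await session.respond(
    to: "Summarize this meeting and pull the related customer record."
)
```

The important property is that the app, not the model, owns the tool implementation: the model only sees the declared name, description, and argument schema, and whatever the `call` method chooses to return.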

Apple is also handling the hardware dispatch layer. The framework can route work across CPU, GPU, and the Neural Engine depending on the task and device. Underneath, it builds on the same stack developers already know from Core ML and Metal Performance Shaders.

That's the right design. It keeps AI features inside Apple's existing performance and power model instead of pushing every app team into building its own inference stack.

Why this matters more than another model SDK

Local inference on Apple hardware isn't new. Developers have shipped Core ML models, quantized transformers, and custom pipelines for years. The problem was product friction.

You had to source or train a model, compress it enough to run on consumer hardware, ship it without blowing up app size, build fallbacks for weaker devices, and accept that the local model usually trailed cloud APIs in quality.

Apple is removing a lot of that friction at once.

If the platform ships a general model that's good enough to trust, the default question changes. Teams stop asking which API provider to wire up first and start asking whether they can keep the feature on-device and avoid a billing problem.

That's a meaningful shift for notes apps, journaling tools, field service software, education products, healthcare apps, and internal enterprise tools. In those categories, privacy and predictable cost usually matter more than squeezing out the last few points of model quality.

Quality is still the catch.

The trade-offs are real

Apple's models are reportedly distributed in quantized formats, roughly in the 8-bit to 16-bit range, with footprints around 50 MB to 200 MB. That tells you what kind of models these are.

They're compressed models built to fit inside phone and laptop constraints while staying useful. Apple is leaning on quantization, pruning, and runtime scheduling to make local inference work on A17 Pro and later A-series chips, plus M-series iPads and Macs.
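Those reported footprints imply the model scale directly. A back-of-envelope check, using illustrative numbers from the ranges above rather than any published spec:

```swift
// Rough sizing: parameter count ≈ bytes / bytes-per-weight.
// All numbers are illustrative, taken from the reported ranges above.
let modelBytes = 200.0 * 1_000_000   // 200 MB footprint
let bytesPerWeight = 1.0             // 8-bit quantization = 1 byte per weight
let approxParams = modelBytes / bytesPerWeight
// on the order of 200 million parameters
```

Even at the top of the reported range, that is orders of magnitude smaller than frontier cloud models.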

That's solid engineering. It also means local AI still lives under tight budgets:

  • Memory is limited
  • Thermal headroom is limited
  • Battery is limited
  • Token throughput is limited
  • Older devices will be uneven at best

Apple's local model probably won't beat the best cloud model on every benchmark, especially for long-context reasoning or complex tool use. The practical question is narrower: is it good enough for a lot of app features people actually use? Summaries, extraction, rewrites, tagging, semantic search, lightweight assistants, offline transcription workflows. For those, it may well be.
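For the extraction-style tasks on that list, the framework's guided generation is the natural fit: annotate a Swift type with @Generable and ask the model to fill it. The schema fields below are hypothetical:

```swift
import FoundationModels

// Hypothetical schema for pulling structured fields out of free text.
@Generable
struct MeetingSummary {
    @Guide(description: "One-sentence summary of the meeting")
    var summary: String
    @Guide(description: "Action items mentioned in the notes")
    var actionItems: [String]
}

let session = LanguageModelSession()
let response = try await session.respond(
    to: "Summarize these meeting notes: …",
    generating: MeetingSummary.self
)
// response.content is a typed MeetingSummary, not raw text to parse
```

Constraining output to a schema is exactly the kind of task where a small local model can be reliable even when it would lose a free-form writing benchmark.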

That's where this framework starts to matter.

Privacy actually means something here

Apple talks about privacy so often that the message usually wears thin. Here, the technical setup backs it up.

If inference stays on-device, sensitive inputs don't have to leave the user's hardware. That lowers exposure for apps handling medical notes, legal records, financial data, employee communications, or customer PII. It also makes compliance conversations around GDPR, CCPA, and cross-border data transfer less painful.

That doesn't make every AI workflow compliant. Tool calling can still trigger network requests. Developers can still log prompts badly. Data can still end up where it shouldn't. But the default architecture is safer.

That's a strong argument for teams that have avoided LLM features because legal or security review kept shutting them down.

There's also a basic trust issue. People are more likely to accept AI suggestions in a diary app or CRM note editor if they know the raw content stays local.

The limits developers need to watch

Apple's framework lowers the barrier. It also narrows the lane.

You're using Apple's models, Apple's APIs, Apple's hardware assumptions, and Apple's OS support matrix. If you want cross-platform parity, deep model customization, or one inference stack shared across Android, web, and backend, this won't solve that. It may make it messier.

A few issues stand out.

Device fragmentation still matters

An M3 MacBook and an older iPhone are very different targets. Even with dynamic dispatch, speed and output quality will vary. Teams will still need hardware-aware fallbacks and serious performance testing.
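In practice that starts with gating on model availability before routing work locally. A sketch using the framework's availability check; what the fallback does is up to the app:

```swift
import FoundationModels

switch SystemLanguageModel.default.availability {
case .available:
    // Safe to create a LanguageModelSession and run on-device.
    break
case .unavailable(let reason):
    // e.g. ineligible device or Apple Intelligence disabled:
    // fall back to a cloud call, or hide the feature entirely.
    print("On-device model unavailable: \(reason)")
}
```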

App size can get ugly

Even a 50 MB model isn't free. If your app ships multiple local models or model assets, download size and storage pressure become product problems. Apple can hide some of this with system-level packaging, but the trade-off doesn't disappear.

You don't get frontier-model flexibility

If your product depends on top-tier reasoning, code generation, or long-context behavior, you'll still want cloud access somewhere in the stack. Apple's local models are likely to be practical tools, not universal replacements.

Safety is only partly abstracted

Apple says it includes token-level moderation and content filtering. Good. That helps. But app-level safety is still on you, especially if the model can call tools that touch user data, trigger actions, or hit external systems.

The best use cases are the boring ones

That's a compliment.

The strongest fits here are features developers couldn't really justify sending to the cloud anyway. Think:

  • summarizing notes or meetings locally
  • turning text into flashcards in an education app
  • tagging and classifying documents on-device
  • extracting structured fields from scanned forms
  • powering semantic search over private app content
  • giving field workers offline assistance in poor-connectivity environments
  • translation or transcription with no network

These are useful, repeatable, and expensive to run through paid APIs at scale. They also benefit from low latency. Nobody wants a spinner for "summarize my note" if the work can happen on the same device in a second or two.

There's a less glamorous point here too. Local AI opens up product ideas that never worked financially under a cloud billing model. If a feature runs thousands or millions of times a day, avoiding metered inference costs can change the roadmap.
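The arithmetic is blunt. A sketch with hypothetical volumes and a hypothetical metered rate, not any provider's real pricing:

```swift
// Hypothetical feature volume and cloud pricing, for illustration only.
let requestsPerDay = 500_000.0
let tokensPerRequest = 700.0        // prompt + completion
let usdPerMillionTokens = 0.60      // illustrative metered rate
let monthlyTokens = requestsPerDay * 30 * tokensPerRequest
let monthlyCost = monthlyTokens / 1_000_000 * usdPerMillionTokens
// roughly $6,300 every month for one lightweight feature
```

On-device, that line item rounds to zero, which is the kind of difference that revives shelved feature ideas.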

The API is simple. The architecture still isn't.

The sample usage Apple showed is straightforward:

import FoundationModels

let session = LanguageModelSession()
let response = try await session.respond(
    to: "Summarize these meeting notes",
    options: GenerationOptions(maximumResponseTokens: 150)
)
// handle response.content

That's the easy part. The hard part is everything around it.

Senior teams still need to think through:

  • caching and prompt design for small local models
  • memory pressure under multitasking
  • latency budgets on lower-end hardware
  • fallback behavior when a device can't support the intended model
  • how tool-calling permissions are constrained and audited
  • whether local inference should be the default or the fallback

Apple made the API simpler. It didn't remove the need for systems thinking.
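One concrete piece of that systems thinking: the session object carries both conversation context and model-load cost, so it pays to reuse it and warm it up before the user reaches the feature. The instructions text here is hypothetical:

```swift
import FoundationModels

// Keep one session per feature rather than creating one per request:
// it holds multi-turn context and avoids repeated setup cost.
let session = LanguageModelSession(
    instructions: "Summarize user notes in two sentences."
)

// Hint the system to load the model ahead of the first request,
// so the user doesn't pay model-load latency on first tap.
session.prewarm()
```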

Where this leaves the market

Apple is pushing AI as native app infrastructure instead of a generic chatbot pipe. That fits Apple's strengths: custom silicon, tight OS integration, and a developer base that will use a platform-native API if it's good enough.

It also puts pressure on Google and the rest to offer the same thing with a cleaner developer story. Android already has local AI pieces in Gemini Nano and AICore, and Qualcomm and Meta are pushing their own on-device stacks. What Apple has now is a more coherent package, at least inside its own ecosystem.

That alone is enough to matter.

If you build for Apple platforms, the question has changed. On-device AI is no longer the experimental path. For a lot of features, it now looks like the default choice, with the cloud reserved for tasks that genuinely need it.

A year ago, that was still an edge case. Apple just dragged it into the mainstream.
