on-device-inference pytorch-conversion apple-silicon model-compression llm-deployment

Apple releases Core AI framework for on-device LLMs

Core AI, Apple's successor to Core ML, converts PyTorch models and deploys LLMs of up to 70B parameters on-device. It provides unified hardware access across the CPU, GPU, and Neural Engine, with quantization and palettization built in.
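Palettization stores each weight as a short index into a small lookup table (LUT) of shared values. The sketch below shows the general idea using plain 1-D k-means in pure Python. It is not Core AI's implementation. The single per-tensor LUT, the fp16 LUT entries, and the names `palettize` and `palettized_bytes` are assumptions made for illustration.

```python
import random
from typing import List, Tuple


def palettize(weights: List[float], n_bits: int, iters: int = 20) -> Tuple[List[float], List[int]]:
    """Cluster weights into a 2**n_bits-entry lookup table via 1-D k-means.

    Returns (lut, indices): each weight is stored as an n_bits index into lut.
    Generic illustration only; not Core AI's palettization algorithm.
    """
    if n_bits < 1:
        raise ValueError("n_bits must be >= 1")
    k = 2 ** n_bits
    lo, hi = min(weights), max(weights)
    # Start with centroids spread evenly over the weight range.
    lut = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    def assign() -> List[int]:
        # Nearest-centroid index for every weight, using the current lut.
        return [min(range(k), key=lambda c: abs(w - lut[c])) for w in weights]

    for _ in range(iters):
        indices = assign()
        sums, counts = [0.0] * k, [0] * k
        for w, c in zip(weights, indices):
            sums[c] += w
            counts[c] += 1
        # Move each centroid to the mean of its members (empty ones stay put).
        lut = [sums[c] / counts[c] if counts[c] else lut[c] for c in range(k)]
    # Final assignment against the converged lut.
    return lut, assign()


def palettized_bytes(n_weights: int, n_bits: int, lut_entry_bytes: int = 2) -> float:
    """Bytes for n_bits per-weight indices plus one fp16 LUT (per-tensor LUT assumed)."""
    return n_weights * n_bits / 8 + (2 ** n_bits) * lut_entry_bytes


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
lut, indices = palettize(weights, n_bits=2)
reconstructed = [lut[i] for i in indices]
fp16_bytes = len(weights) * 2                               # 8192 bytes
packed_bytes = palettized_bytes(len(weights), n_bits=2)     # 1024 + 8 = 1032.0 bytes
```

At the scale in the headline, index storage dominates. For weights alone, before the KV cache and activations, 70B weights at 4 bits take 70e9 × 4 / 8 = 35 GB, compared with 140 GB at fp16.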

June 23, 2026

Summary

Core AI eliminates per-token cloud costs, removes server dependencies, and keeps user data on the device. Together, these simplify inference pipelines for developers targeting Apple Silicon. Models are specialized on first load, so the first run can take significantly longer, while later runs reuse the cached result. That one-time delay changes how you should design cold-start handling.
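The first load pays a one-time specialization cost. One way to hide it is to start the load in the background at launch, so that every later request reuses the cached handle. The sketch below shows that pattern in Python. `PrewarmedModel` and `load_fn` are hypothetical names. `load_fn` simply stands in for whatever Core AI call triggers first-load specialization, which the brief does not name. A shipping app would use the same structure in Swift.

```python
import threading
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class PrewarmedModel(Generic[T]):
    """Runs an expensive one-time load in the background and caches the result."""

    def __init__(self, load_fn: Callable[[], T]) -> None:
        # Hypothetical: load_fn stands in for the Core AI first-load call.
        self._load_fn = load_fn
        self._model: Optional[T] = None
        self._error: Optional[BaseException] = None
        self._ready = threading.Event()
        self._lock = threading.Lock()
        self._started = False
        self.load_seconds: Optional[float] = None

    def prewarm(self) -> None:
        """Start the load once (e.g. at app launch); later calls are no-ops."""
        with self._lock:
            if self._started:
                return
            self._started = True
        threading.Thread(target=self._load, daemon=True).start()

    def _load(self) -> None:
        start = time.perf_counter()
        try:
            self._model = self._load_fn()
        except BaseException as exc:  # surface load failures to callers of get()
            self._error = exc
        finally:
            self.load_seconds = time.perf_counter() - start
            self._ready.set()

    def get(self, timeout: Optional[float] = None) -> T:
        """Return the cached model, waiting for the first load if it is still running."""
        self.prewarm()
        if not self._ready.wait(timeout):
            raise TimeoutError("model is still loading; show a loading state instead")
        if self._error is not None:
            raise self._error
        return self._model  # type: ignore[return-value]


calls = []


def fake_specialize() -> str:
    calls.append(1)
    time.sleep(0.05)  # simulated one-time specialization cost
    return "model-handle"


model = PrewarmedModel(fake_specialize)
model.prewarm()       # kick off at launch
handle = model.get()  # first request waits only for the remaining load time
```

Passing a `timeout` to `get()` lets the UI fall back to a loading state rather than blocking while specialization is still running.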

Implementation verdict

Core AI replaces Core ML for neural networks and transformers. PyTorch models must first be exported as a torch.export.ExportedProgram and then converted with TorchConverter().to_coreai(). The framework is available now with the OS release, but ecosystem maturity will depend on community adoption. If you are targeting only iPhone, iPad, and Mac, start with vision or reasoning models.

Sources

1. Core AI framework, the official successor to Core ML
2. Support for both custom-converted PyTorch models and pre-optimized open-source models
3. Unified architecture for deploying models ranging from compact 3B-parameter vision models to large-scale LLMs, including reasoning models with up to 70B parameters
4. Ensures user data privacy, zero server dependencies, and zero per-token cloud costs
5. Unified hardware access, allowing workloads to run seamlessly across the CPU, GPU, and Neural Engine under one API
6. Ahead-of-time (AOT) compilation, which shifts work off the user's device, yielding near-instant load times
7. The first attempt to use a model may take significantly longer than subsequent runs, which use the already cached model

Dev Signal
