
Performance Optimization

Last Updated: 2026-02-13

VaulType runs two ML models simultaneously — whisper.cpp for speech recognition and llama.cpp for text refinement — entirely on-device with zero network dependency. This document defines every optimization strategy, tuning parameter, and monitoring technique required to deliver real-time performance across all supported Apple hardware.

Performance Philosophy

VaulType’s performance strategy rests on three pillars:

| Pillar | Meaning |
| --- | --- |
| Latency | The user must perceive transcription and refinement as near-instantaneous — under 500 ms total for typical utterances |
| Efficiency | Dual-model inference must coexist with normal system operation; VaulType should never make the Mac feel sluggish |
| Adaptability | Performance tuning reacts to hardware capability, power source, thermal state, and memory pressure in real time |

💡 Design Principle: VaulType always degrades gracefully. When resources are constrained, it reduces quality (smaller models, fewer GPU layers) rather than increasing latency or dropping audio.

Metal GPU Acceleration

Apple Silicon’s unified memory architecture is the foundation of VaulType’s performance story. Both whisper.cpp and llama.cpp support Metal acceleration, which offloads matrix multiplications and attention computations to the GPU cores.

Metal Performance Shaders (MPS) provide optimized GPU kernels for common ML operations. Both whisper.cpp and llama.cpp use Metal compute shaders for:

  • Matrix multiplication (GEMM/GEMV) — the dominant operation in transformer inference
  • Softmax — attention score normalization
  • Layer normalization — pre/post-attention normalization
  • Element-wise operations — GELU, SiLU activations

On Apple Silicon, these operations run on the GPU cores while the CPU handles orchestration, memory management, and non-parallelizable work.

┌──────────────────────────────────────────────────────────────────┐
│ Apple Silicon SoC │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ CPU Cores │ │ GPU Cores │ │ Neural Engine │ │
│ │ (E + P) │ │ (Metal) │ │ (ANE) │ │
│ │ │ │ │ │ [Not used by │ │
│ │ Orchestrate │ │ GEMM/GEMV │ │ whisper.cpp/ │ │
│ │ Audio I/O │ │ Attention │ │ llama.cpp] │ │
│ │ Token decode│ │ LayerNorm │ │ │ │
│ └──────┬───────┘ └──────┬───────┘ └──────────────────┘ │
│ │ │ │
│ └─────────┬─────────┘ │
│ ▼ │
│ ┌──────────────────────────────────┐ │
│ │ Unified Memory (LPDDR) │ │
│ │ │ │
│ │ Model weights, KV cache, │ │
│ │ audio buffers, compute buffers │ │
│ │ — all shared, zero-copy │ │
│ └──────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘

🍎 Apple Silicon Advantage: Unlike discrete GPU systems, there is no PCIe bus transfer between CPU and GPU memory. Model weights loaded once are accessible to both CPU and GPU compute — zero-copy.

Both whisper.cpp and llama.cpp allow specifying how many transformer layers run on the GPU versus the CPU. More GPU layers mean faster inference but higher GPU memory pressure.

whisper.cpp GPU layers:

| Whisper Model | Total Layers | Recommended GPU Layers | Notes |
| --- | --- | --- | --- |
| tiny | 4 | 4 (all) | Fits entirely on GPU for all hardware |
| base | 6 | 6 (all) | Fits entirely on GPU for all hardware |
| small | 12 | 12 (all) | Fits entirely on GPU for 16 GB+ |
| medium | 24 | 24 (all) | Fits entirely on GPU for 16 GB+ |
| large-v3 | 32 | 32 (all) | Requires 16 GB+; consider 24 on 8 GB |

llama.cpp GPU layers (for ~7B parameter LLM):

| Quantization | Total Layers | 8 GB RAM | 16 GB RAM | 24 GB+ RAM |
| --- | --- | --- | --- | --- |
| Q4_0 | 32 | 20-24 | 32 (all) | 32 (all) |
| Q4_K_M | 32 | 18-22 | 32 (all) | 32 (all) |
| Q5_K_M | 32 | 14-18 | 32 (all) | 32 (all) |
| Q8_0 | 32 | 8-12 | 28-32 | 32 (all) |
| F16 | 32 | N/A | 16-20 | 32 (all) |

⚠️ Warning: These layer counts assume dual-model operation (Whisper + LLM loaded simultaneously). If only one model is active, more GPU layers can be allocated.

The optimal split depends on available memory after accounting for system overhead and the other model. The general rule:

Available GPU Memory = Total RAM - System Overhead - Other Model - KV Cache Reserve

System Overhead:
  macOS base             ~3-4 GB
  VaulType app overhead  ~200 MB
  Audio pipeline         ~50 MB

Example (16 GB M2 MacBook Air):
  16 GB - 4 GB (system) - 1.5 GB (Whisper medium) - 0.2 GB (app) = ~10.3 GB for LLM
  A Q4_K_M 7B model needs ~4.1 GB → all 32 layers fit on GPU

💡 Tip: Always leave at least 2 GB of headroom beyond calculated needs. macOS memory compression helps, but sustained pressure causes jank in the entire system.

| Feature | M1 | M2 | M3 | M4 |
| --- | --- | --- | --- | --- |
| GPU Cores (base) | 7-8 | 8-10 | 8-10 | 8-10 |
| GPU Cores (Pro) | 14-16 | 16-19 | 14-18 | 16-20 |
| GPU Cores (Max) | 24-32 | 30-38 | 30-40 | 32-40 |
| Memory Bandwidth (base) | 68.25 GB/s | 100 GB/s | 100 GB/s | 120 GB/s |
| Memory Bandwidth (Pro) | 200 GB/s | 200 GB/s | 150 GB/s | 273 GB/s |
| Max Unified Memory (base) | 16 GB | 24 GB | 24 GB | 32 GB |
| Max Unified Memory (Max/Ultra) | 64/128 GB | 96/192 GB | 128/192 GB | 128/256 GB |
| Metal Feature Set | Metal 3 | Metal 3 | Metal 3+ | Metal 3+ |
| Dynamic Caching | No | No | Yes | Yes |
| Mesh Shading | No | No | Yes | Yes |
| Hardware Ray Tracing | No | No | Yes | Yes |
| Relative Perf (base, 7B Q4) | 1.0x | 1.3x | 1.4x | 1.6x |

Key observations for VaulType:

  • M3/M4 Dynamic Caching reduces GPU memory waste from shader register allocation, leaving more memory for model weights
  • Memory bandwidth is the primary bottleneck for LLM inference (memory-bound workload). M4 Pro’s 273 GB/s provides the biggest leap
  • M1 base (8 GB) is the minimum viable target — use Whisper small + Q4_0 3B LLM

ℹ️ Intel Support: VaulType supports Intel Macs but without Metal acceleration. CPU-only inference uses Accelerate.framework (BLAS). Expect 3-5x slower inference compared to equivalent-era Apple Silicon.

┌─────────────────────────────────────────────────────────┐
│ Traditional Discrete GPU │
│ │
│ ┌──────────┐ PCIe Bus ┌──────────┐ │
│ │ CPU RAM │ ────────────► │ GPU VRAM │ │
│ │ │ (slow copy) │ │ │
│ │ Model │ │ Model │ │
│ │ (copy 1) │ │ (copy 2) │ │
│ └──────────┘ └──────────┘ │
│ │
│ Total memory used: 2x model size │
│ Transfer latency: milliseconds per layer │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Apple Silicon Unified Memory │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ Unified Memory Pool │ │
│ │ │ │
│ │ ┌──────────┐ │ │
│ │ │ Model │◄──── CPU reads here │ │
│ │ │ (1 copy) │◄──── GPU reads here │ │
│ │ └──────────┘ │ │
│ │ │ │
│ │ Zero-copy, zero-latency switching │ │
│ └──────────────────────────────────────┘ │
│ │
│ Total memory used: 1x model size │
│ Transfer latency: zero │
└─────────────────────────────────────────────────────────┘

This is why Apple Silicon can run larger models than discrete GPUs with the same nominal memory: no duplication, no transfer overhead.

The following shows how VaulType configures GPU layer counts for both inference engines:

import Foundation
import Metal
// MARK: - Metal GPU Configuration
/// Configuration for Metal GPU layer allocation across both ML models.
/// Determines how many transformer layers run on GPU vs CPU.
struct MetalGPUConfiguration {
let whisperGPULayers: Int32
let llmGPULayers: Int32
let useMetalAcceleration: Bool
/// Total system RAM in bytes
static var totalSystemMemory: UInt64 {
ProcessInfo.processInfo.physicalMemory
}
/// Total system RAM in gigabytes
static var totalSystemMemoryGB: Double {
Double(totalSystemMemory) / (1024 * 1024 * 1024)
}
/// Estimated available memory after system overhead (in GB)
static var estimatedAvailableMemoryGB: Double {
let systemOverhead: Double = 4.0 // macOS + background apps
let appOverhead: Double = 0.25 // VaulType base footprint
return max(totalSystemMemoryGB - systemOverhead - appOverhead, 1.0)
}
/// Determines if Metal acceleration is available on this system
static var isMetalAvailable: Bool {
#if arch(arm64)
return true
#else
// Intel Macs: check for Metal-capable GPU
guard let device = MTLCreateSystemDefaultDevice() else { return false }
return device.supportsFamily(.apple1)
#endif
}
/// Creates an optimal configuration for the current hardware
/// - Parameters:
/// - whisperModelSize: Size category of the Whisper model
/// - llmQuantization: Quantization format of the LLM
/// - llmParameterCount: Approximate parameter count in billions
/// - Returns: Optimal Metal GPU configuration
static func optimal(
whisperModelSize: WhisperModelSize,
llmQuantization: LLMQuantization,
llmParameterCount: Double
) -> MetalGPUConfiguration {
guard isMetalAvailable else {
return MetalGPUConfiguration(
whisperGPULayers: 0,
llmGPULayers: 0,
useMetalAcceleration: false
)
}
let availableGB = estimatedAvailableMemoryGB
let whisperMemoryGB = whisperModelSize.estimatedMemoryGB
let remainingForLLM = availableGB - whisperMemoryGB - 2.0 // 2 GB headroom
// Whisper: always put all layers on GPU if memory allows
let whisperLayers: Int32 = if availableGB >= whisperMemoryGB + 1.0 {
whisperModelSize.totalLayers
} else {
Int32(Double(whisperModelSize.totalLayers) * 0.5)
}
// LLM: calculate how many layers fit in remaining memory
let llmTotalLayers: Int32 = 32 // typical for 7B models
let memoryPerLayer = llmQuantization.estimatedMemoryPerLayerGB(
parameterCount: llmParameterCount
)
let maxLLMLayers = Int32(remainingForLLM / memoryPerLayer)
let llmLayers = min(max(maxLLMLayers, 0), llmTotalLayers)
return MetalGPUConfiguration(
whisperGPULayers: whisperLayers,
llmGPULayers: llmLayers,
useMetalAcceleration: true
)
}
}
// MARK: - Supporting Types
enum WhisperModelSize: String, CaseIterable {
case tiny, base, small, medium, large
var totalLayers: Int32 {
switch self {
case .tiny: return 4
case .base: return 6
case .small: return 12
case .medium: return 24
case .large: return 32
}
}
var estimatedMemoryGB: Double {
switch self {
case .tiny: return 0.08
case .base: return 0.15
case .small: return 0.50
case .medium: return 1.50
case .large: return 3.00
}
}
}
enum LLMQuantization: String, CaseIterable {
case q4_0 = "Q4_0"
case q4_k_m = "Q4_K_M"
case q5_k_m = "Q5_K_M"
case q8_0 = "Q8_0"
case f16 = "F16"
/// Bits per parameter for this quantization format
var bitsPerParameter: Double {
switch self {
case .q4_0: return 4.5 // 4-bit with some overhead
case .q4_k_m: return 4.8 // k-quant mixed precision
case .q5_k_m: return 5.5 // k-quant mixed precision
case .q8_0: return 8.5 // 8-bit with scale factors
case .f16: return 16.0 // half precision
}
}
/// Estimated memory per transformer layer in GB for a given parameter count
func estimatedMemoryPerLayerGB(parameterCount: Double) -> Double {
let totalBits = parameterCount * 1_000_000_000 * bitsPerParameter
let totalBytes = totalBits / 8.0
let totalGB = totalBytes / (1024 * 1024 * 1024)
return totalGB / 32.0 // assume 32 layers
}
}

Best Practice: Always call MetalGPUConfiguration.optimal(...) at launch and again whenever power source or thermal state changes. Store the configuration in PerformanceManager and propagate to both inference engines.
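
As a minimal sketch of that launch-time call (the .medium/.q4_k_m choices here are illustrative, not defaults for every machine):

// Sketch: computing the Metal configuration at launch. Real code would take
// the model choices from user settings rather than hard-coding them.
let config = MetalGPUConfiguration.optimal(
    whisperModelSize: .medium,
    llmQuantization: .q4_k_m,
    llmParameterCount: 7.0
)
print("Whisper GPU layers: \(config.whisperGPULayers)")
print("LLM GPU layers: \(config.llmGPULayers)")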

Quantization Strategies

Quantization reduces model size and inference time by representing weights with fewer bits. The tradeoff is always quality vs speed vs memory.

| Format | Bits/Weight | Description | Quality Impact |
| --- | --- | --- | --- |
| F16 | 16 | Half-precision float. Baseline quality. | None (reference) |
| Q8_0 | 8.5 | 8-bit with block scaling. Near-lossless. | Negligible (<1% perplexity increase) |
| Q5_K_M | 5.5 | 5-bit k-quant, mixed-precision attention layers. | Minor (1-2% perplexity increase) |
| Q4_K_M | 4.8 | 4-bit k-quant, mixed precision. Best quality/size ratio. | Moderate (2-4% perplexity increase) |
| Q4_0 | 4.5 | Basic 4-bit. Fast but lower quality. | Notable (4-6% perplexity increase) |

💡 VaulType Default: Q4_K_M is the default for LLM text refinement. For VaulType’s use case (grammar correction, punctuation, formatting), the quality difference between Q4_K_M and F16 is imperceptible in practice.

| Model | Parameters | Disk Size | Memory (loaded) | English WER | Multilingual WER |
| --- | --- | --- | --- | --- | --- |
| tiny | 39 M | 75 MB | ~80 MB | 7.7% | 12.0% |
| base | 74 M | 142 MB | ~150 MB | 5.8% | 9.8% |
| small | 244 M | 466 MB | ~500 MB | 4.2% | 7.6% |
| medium | 769 M | 1.5 GB | ~1.5 GB | 3.5% | 6.5% |
| large-v3 | 1550 M | 3.1 GB | ~3.0 GB | 2.9% | 5.2% |

ℹ️ WER = Word Error Rate: Lower is better. These are approximate values on LibriSpeech/Common Voice benchmarks. Real-world performance varies with accent, background noise, and microphone quality.

7B LLM — Text Refinement Speed (tokens/sec on Apple Silicon base chips):

| Quantization | M1 (8 GB) | M2 (8 GB) | M3 (8 GB) | M4 (16 GB) | Memory |
| --- | --- | --- | --- | --- | --- |
| Q4_0 | 18 t/s | 24 t/s | 26 t/s | 32 t/s | 3.8 GB |
| Q4_K_M | 16 t/s | 22 t/s | 24 t/s | 30 t/s | 4.1 GB |
| Q5_K_M | 13 t/s | 18 t/s | 20 t/s | 26 t/s | 4.8 GB |
| Q8_0 | 8 t/s | 12 t/s | 14 t/s | 20 t/s | 7.2 GB |
| F16 | N/A | N/A | N/A | 12 t/s | 13.5 GB |

Whisper — Real-Time Factor (RTF, lower is faster):

| Model | M1 | M2 | M3 | M4 | Memory |
| --- | --- | --- | --- | --- | --- |
| tiny | 0.05 | 0.04 | 0.03 | 0.02 | 80 MB |
| base | 0.08 | 0.06 | 0.05 | 0.04 | 150 MB |
| small | 0.15 | 0.11 | 0.09 | 0.07 | 500 MB |
| medium | 0.35 | 0.25 | 0.20 | 0.15 | 1.5 GB |
| large-v3 | 0.70 | 0.50 | 0.40 | 0.30 | 3.0 GB |

ℹ️ Real-Time Factor (RTF): An RTF of 0.25 means 1 second of audio is processed in 0.25 seconds. Values below 1.0 are faster than real time.
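
For reference, RTF is just processing time divided by audio duration. A small illustrative helper (not a VaulType API):

// Sketch: real-time factor from measured times.
func realTimeFactor(processingSeconds: Double, audioSeconds: Double) -> Double {
    processingSeconds / audioSeconds // < 1.0 is faster than real time
}
// Example: Whisper medium on M2 at RTF 0.25 transcribes 5 s of audio in ~1.25 s:
// realTimeFactor(processingSeconds: 1.25, audioSeconds: 5.0) == 0.25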

import Foundation
/// Selects the best quantization level based on available memory and power state
struct QuantizationSelector {
static func recommendedLLMQuantization(
availableMemoryGB: Double,
isOnBattery: Bool,
thermalState: ProcessInfo.ThermalState
) -> LLMQuantization {
// Under thermal pressure or battery: prefer smaller models
if thermalState == .critical || thermalState == .serious {
return .q4_0
}
if isOnBattery {
// On battery, favor speed over quality
if availableMemoryGB >= 5.0 {
return .q4_k_m
} else {
return .q4_0
}
}
// On power: use highest quality that fits
switch availableMemoryGB {
case 14.0...: return .f16
case 8.0...: return .q8_0
case 5.5...: return .q5_k_m
case 4.5...: return .q4_k_m
default: return .q4_0
}
}
static func recommendedWhisperModel(
availableMemoryGB: Double,
isOnBattery: Bool,
thermalState: ProcessInfo.ThermalState
) -> WhisperModelSize {
if thermalState == .critical {
return .tiny
}
if thermalState == .serious {
return .base
}
if isOnBattery {
if availableMemoryGB >= 2.0 {
return .small
} else {
return .base
}
}
// On power
switch availableMemoryGB {
case 5.0...: return .large
case 3.0...: return .medium
case 1.0...: return .small
case 0.5...: return .base
default: return .tiny
}
}
}

Memory Management for Dual-Model Operation


Running Whisper and an LLM simultaneously is VaulType’s most demanding resource requirement. This section covers how to manage memory for both models.

┌──────────────────────────────────────────────────────────────┐
│ System Unified Memory (16 GB example) │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ macOS + System Services ~3.5 GB │ │
│ ├──────────────────────────────────────────────────────┤ │
│ │ VaulType App │ │
│ │ ┌─────────────────────────────────────────────┐ │ │
│ │ │ App Binary + Runtime ~50 MB │ │ │
│ │ ├─────────────────────────────────────────────┤ │ │
│ │ │ SwiftUI View Hierarchy ~30 MB │ │ │
│ │ ├─────────────────────────────────────────────┤ │ │
│ │ │ Audio Pipeline (buffers) ~20 MB │ │ │
│ │ ├─────────────────────────────────────────────┤ │ │
│ │ │ Whisper Model (medium) ~1.5 GB │ │ │
│ │ │ ├── Encoder weights │ │ │
│ │ │ ├── Decoder weights │ │ │
│ │ │ └── Mel filterbank + vocab │ │ │
│ │ ├─────────────────────────────────────────────┤ │ │
│ │ │ Whisper Compute Buffers ~200 MB │ │ │
│ │ │ ├── Mel spectrogram │ │ │
│ │ │ ├── Encoder output │ │ │
│ │ │ └── Decoder KV cache │ │ │
│ │ ├─────────────────────────────────────────────┤ │ │
│ │ │ LLM Model (Q4_K_M 7B) ~4.1 GB │ │ │
│ │ │ ├── Embedding layer │ │ │
│ │ │ ├── 32 transformer layers │ │ │
│ │ │ └── Output head │ │ │
│ │ ├─────────────────────────────────────────────┤ │ │
│ │ │ LLM KV Cache ~500 MB │ │ │
│ │ │ (context-dependent) │ │ │
│ │ ├─────────────────────────────────────────────┤ │ │
│ │ │ Metal Compute Buffers ~200 MB │ │ │
│ │ └─────────────────────────────────────────────┘ │ │
│ │ Total VaulType: ~6.6 GB │ │
│ ├──────────────────────────────────────────────────────┤ │
│ │ Free / Available ~5.9 GB │ │
│ └──────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘

Calculate peak memory for any hardware/model combination:

/// Calculates peak memory usage for a given model configuration
struct MemoryCalculator {
struct MemoryBudget {
let whisperModelMB: Double
let whisperComputeMB: Double
let llmModelMB: Double
let llmKVCacheMB: Double
let metalBuffersMB: Double
let appOverheadMB: Double
var totalMB: Double {
whisperModelMB + whisperComputeMB +
llmModelMB + llmKVCacheMB +
metalBuffersMB + appOverheadMB
}
var totalGB: Double { totalMB / 1024.0 }
/// Whether this configuration fits in the given memory with headroom
func fits(inGB availableGB: Double, headroomGB: Double = 2.0) -> Bool {
totalGB + headroomGB <= availableGB
}
}
static func calculateBudget(
whisperModel: WhisperModelSize,
llmQuantization: LLMQuantization,
llmParameterBillions: Double,
contextLength: Int = 2048
) -> MemoryBudget {
let whisperModelMB = whisperModel.estimatedMemoryGB * 1024
// Whisper compute buffers: mel spectrogram + encoder output + KV cache
let whisperComputeMB: Double = switch whisperModel {
case .tiny: 50.0
case .base: 80.0
case .small: 120.0
case .medium: 200.0
case .large: 350.0
}
// LLM model size
let llmBits = llmParameterBillions * 1e9 * llmQuantization.bitsPerParameter
let llmModelMB = llmBits / 8.0 / (1024 * 1024)
// LLM KV cache: 2 (K+V) * layers * heads * head_dim * context * 2 bytes(fp16)
// Simplified: roughly 1 MB per 100 context tokens per billion parameters at fp16
let llmKVCacheMB = llmParameterBillions * Double(contextLength) / 100.0
// Metal compute buffers (scratch space for GPU operations)
let metalBuffersMB: Double = 200.0
let appOverheadMB: Double = 100.0
return MemoryBudget(
whisperModelMB: whisperModelMB,
whisperComputeMB: whisperComputeMB,
llmModelMB: llmModelMB,
llmKVCacheMB: llmKVCacheMB,
metalBuffersMB: metalBuffersMB,
appOverheadMB: appOverheadMB
)
}
}

Reference budgets for common configurations:

| Configuration | Whisper | LLM | Peak Memory | Min RAM |
| --- | --- | --- | --- | --- |
| Minimal | tiny (80 MB) | Q4_0 3B (1.7 GB) | ~2.3 GB | 8 GB |
| Balanced | small (500 MB) | Q4_K_M 7B (4.1 GB) | ~5.2 GB | 8 GB |
| Quality | medium (1.5 GB) | Q4_K_M 7B (4.1 GB) | ~6.6 GB | 16 GB |
| Maximum | large-v3 (3.0 GB) | Q5_K_M 7B (4.8 GB) | ~8.8 GB | 16 GB |
| Ultra | large-v3 (3.0 GB) | Q8_0 13B (13.5 GB) | ~17.5 GB | 32 GB |
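
A quick sanity check of the "Quality" row using the calculator above (the 4 GB system-overhead figure is the earlier assumption, not a measured value):

let quality = MemoryCalculator.calculateBudget(
    whisperModel: .medium,
    llmQuantization: .q4_k_m,
    llmParameterBillions: 7.0
)
print(String(format: "Peak: %.1f GB", quality.totalGB)) // ~6 GB with the default 2048-token KV cache
print("Fits on 16 GB:", quality.fits(inGB: 16.0 - 4.0)) // true, keeping 2 GB headroom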

Both whisper.cpp and llama.cpp support mmap for loading model files. This is critical for VaulType:

/// Memory mapping strategy for model loading
enum ModelMappingStrategy {
/// mmap the model file. Pages are loaded on demand by the kernel.
/// Pro: Fast initial load, memory is shared with page cache
/// Con: First inference may stall as pages fault in
case memoryMapped
/// Read entire model into allocated memory.
/// Pro: Predictable performance after load completes
/// Con: Slower initial load, higher peak memory
case fullyLoaded
/// mmap + mlock to prevent paging
/// Pro: Fast load + guaranteed in-memory after first pass
/// Con: Requires elevated memory; system may refuse mlock
case memoryMappedLocked
var llmLoadFlag: Bool {
switch self {
case .memoryMapped: return true // use_mmap = true
case .fullyLoaded: return false // use_mmap = false
case .memoryMappedLocked: return true // use_mmap = true, use_mlock = true
}
}
/// Recommended strategy for current conditions
static func recommended(
availableMemoryGB: Double,
modelSizeGB: Double,
isFirstLaunch: Bool
) -> ModelMappingStrategy {
// If plenty of memory, use mmap + mlock for best performance
if availableMemoryGB > modelSizeGB * 2.5 {
return .memoryMappedLocked
}
// If memory is adequate, plain mmap is fine
if availableMemoryGB > modelSizeGB * 1.5 {
return .memoryMapped
}
// Tight on memory: fully loaded gives more predictable behavior
// (system can reclaim mmap pages under pressure, causing stalls)
return .fullyLoaded
}
}

⚠️ mmap Pitfall: Under memory pressure, macOS can evict mmap’d pages. The next access then triggers a page fault and disk read, causing inference stalls. For real-time transcription, prefer fullyLoaded or memoryMappedLocked when memory allows.

When memory is insufficient for both models simultaneously, VaulType can swap models:

/// Manages loading and unloading of models to fit within memory constraints
actor ModelSwapManager {
enum ActiveModel {
case whisperOnly
case llmOnly
case both
case none
}
private(set) var activeState: ActiveModel = .none
private var whisperContext: OpaquePointer? // whisper_context*
private var llamaModel: OpaquePointer? // llama_model*
/// Transition to a new active model state, unloading as needed
func transition(to target: ActiveModel) async throws {
guard target != activeState else { return }
switch (activeState, target) {
case (_, .none):
unloadWhisper()
unloadLLM()
case (_, .whisperOnly):
// Covers (.both, .whisperOnly) too: always shed the other model first
unloadLLM()
if whisperContext == nil { try await loadWhisper() }
case (_, .llmOnly):
unloadWhisper()
if llamaModel == nil { try await loadLLM() }
case (_, .both):
if whisperContext == nil { try await loadWhisper() }
if llamaModel == nil { try await loadLLM() }
}
activeState = target
}
private func loadWhisper() async throws {
// Implementation calls whisper_init_from_file_with_params()
// with Metal-enabled parameters
}
private func loadLLM() async throws {
// Implementation calls llama_load_model_from_file()
// with Metal GPU layer configuration
}
private func unloadWhisper() {
guard let ctx = whisperContext else { return }
// whisper_free(ctx)
whisperContext = nil
}
private func unloadLLM() {
guard let model = llamaModel else { return }
// llama_free_model(model)
llamaModel = nil
}
}

| Condition | Action | Rationale |
| --- | --- | --- |
| Memory pressure warning (.warning) | Unload LLM if idle > 30 s | LLM is larger and less immediately needed |
| Memory pressure critical (.critical) | Unload both models | System stability takes priority |
| App enters background | Unload LLM after 60 s | Background apps should minimize footprint |
| No transcription for 5 min | Unload Whisper | Can reload in ~1-2 seconds when needed |
| No LLM use for 5 min | Unload LLM | Can reload in ~2-4 seconds when needed |
| Thermal state .critical | Unload LLM | Reduce thermal generation |
| Battery below 10% | Unload LLM, use Whisper only | Preserve remaining battery |

import Foundation
/// Monitors system memory pressure and triggers adaptive responses
final class MemoryPressureMonitor: @unchecked Sendable {
static let shared = MemoryPressureMonitor()
enum PressureLevel: Int, Comparable {
case nominal = 0
case warning = 1
case critical = 2
static func < (lhs: PressureLevel, rhs: PressureLevel) -> Bool {
lhs.rawValue < rhs.rawValue
}
}
private let source: DispatchSourceMemoryPressure
private var currentLevel: PressureLevel = .nominal
private var observers: [(PressureLevel) -> Void] = []
private init() {
source = DispatchSource.makeMemoryPressureSource(
eventMask: [.warning, .critical, .normal],
queue: .global(qos: .utility)
)
source.setEventHandler { [weak self] in
guard let self else { return }
let event = self.source.data
let newLevel: PressureLevel
if event.contains(.critical) {
newLevel = .critical
} else if event.contains(.warning) {
newLevel = .warning
} else {
newLevel = .nominal
}
if newLevel != self.currentLevel {
self.currentLevel = newLevel
self.notifyObservers(newLevel)
}
}
source.resume()
}
deinit {
source.cancel()
}
/// Register a callback for memory pressure changes
func observe(_ handler: @escaping (PressureLevel) -> Void) {
observers.append(handler)
}
/// Current memory usage of this process in bytes
static var currentProcessMemory: UInt64 {
var info = mach_task_basic_info()
var count = mach_msg_type_number_t(
MemoryLayout<mach_task_basic_info>.size / MemoryLayout<natural_t>.size
)
let result = withUnsafeMutablePointer(to: &info) { infoPtr in
infoPtr.withMemoryRebound(
to: integer_t.self,
capacity: Int(count)
) { rawPtr in
task_info(
mach_task_self_,
task_flavor_t(MACH_TASK_BASIC_INFO),
rawPtr,
&count
)
}
}
return result == KERN_SUCCESS ? info.resident_size : 0
}
/// Current process memory in megabytes
static var currentProcessMemoryMB: Double {
Double(currentProcessMemory) / (1024 * 1024)
}
/// Available system memory in bytes (approximate)
static var availableSystemMemory: UInt64 {
var stats = vm_statistics64()
var count = mach_msg_type_number_t(
MemoryLayout<vm_statistics64>.size / MemoryLayout<integer_t>.size
)
let result = withUnsafeMutablePointer(to: &stats) { statsPtr in
statsPtr.withMemoryRebound(
to: integer_t.self,
capacity: Int(count)
) { rawPtr in
host_statistics64(
mach_host_self(),
HOST_VM_INFO64,
rawPtr,
&count
)
}
}
guard result == KERN_SUCCESS else { return 0 }
let pageSize = UInt64(vm_kernel_page_size)
let free = UInt64(stats.free_count) * pageSize
let inactive = UInt64(stats.inactive_count) * pageSize
// macOS "available" ≈ free + inactive (purgeable/compressor pages can be reclaimed)
return free + inactive
}
private func notifyObservers(_ level: PressureLevel) {
for observer in observers {
observer(level)
}
}
}

Integration: The PerformanceManager subscribes to MemoryPressureMonitor and calls ModelSwapManager.transition(to:) when pressure levels change. See Architecture for the full dependency graph.
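
A minimal sketch of that subscription, assuming the PerformanceManager shape from the Architecture doc (the class name and exact policy mapping here are illustrative):

import Foundation

/// Sketch only: wires memory-pressure events to model swapping per the policy table above.
final class PerformanceManagerSketch {
    private let swapManager = ModelSwapManager()
    init() {
        MemoryPressureMonitor.shared.observe { [weak self] level in
            guard let self else { return }
            Task {
                switch level {
                case .warning:
                    // Shed the LLM first; keep Whisper available
                    try? await self.swapManager.transition(to: .whisperOnly)
                case .critical:
                    try? await self.swapManager.transition(to: .none)
                case .nominal:
                    break // models reload lazily on next use
                }
            }
        }
    }
}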

Audio Pipeline Optimization

The audio pipeline is the first stage of VaulType’s transcription flow. Buffer management here directly affects both latency and reliability.

| Buffer Size (frames) | Duration @ 16 kHz | Latency | CPU Overhead | Use Case |
| --- | --- | --- | --- | --- |
| 256 | 16 ms | Very low | Very high | Not recommended |
| 512 | 32 ms | Low | High | Real-time monitoring |
| 1024 | 64 ms | Medium | Medium | Default for VaulType |
| 2048 | 128 ms | Higher | Low | Battery-saving mode |
| 4096 | 256 ms | High | Very low | Background processing |

VaulType uses 1024 frames as the default buffer size. This provides 64 ms latency at 16 kHz — fast enough that users perceive no delay between speaking and seeing the waveform indicator, while keeping CPU wake-ups reasonable.

import AVFoundation
/// Audio buffer size configuration
enum AudioBufferConfig {
case lowLatency // 512 frames
case balanced // 1024 frames (default)
case batterySaving // 2048 frames
var frameCount: AVAudioFrameCount {
switch self {
case .lowLatency: return 512
case .balanced: return 1024
case .batterySaving: return 2048
}
}
/// Recommended config based on current power and thermal state
static func recommended(
isOnBattery: Bool,
thermalState: ProcessInfo.ThermalState
) -> AudioBufferConfig {
if thermalState >= .serious || isOnBattery {
return .batterySaving
}
return .balanced
}
}
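
Several snippets in this document compare ProcessInfo.ThermalState values directly (e.g. thermalState >= .serious). The enum is not Comparable out of the box, so the examples assume a small conformance like the following; the raw values already order nominal < fair < serious < critical:

import Foundation

// Assumed by the comparison-based snippets in this document.
extension ProcessInfo.ThermalState: Comparable {
    public static func < (lhs: Self, rhs: Self) -> Bool {
        lhs.rawValue < rhs.rawValue
    }
}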

VaulType captures audio at the system’s native sample rate (typically 48 kHz) and converts to 16 kHz for Whisper. The conversion pipeline:

┌───────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ AVAudioEngine │ │ Format Converter │ │ Ring Buffer │
│ Input Node │────►│ 48 kHz → 16 kHz │────►│ (16 kHz mono) │
│ (48 kHz) │ │ Stereo → Mono │ │ │
│ │ │ Float32 │ │ Whisper reads │
└───────────────┘ └──────────────────┘ │ from here │
└─────────────────┘
import AVFoundation
/// Configures the audio format conversion pipeline
struct AudioFormatConverter {
/// Target format for Whisper: 16 kHz, mono, Float32
static let whisperFormat = AVAudioFormat(
commonFormat: .pcmFormatFloat32,
sampleRate: 16000.0,
channels: 1,
interleaved: false
)!
/// Creates an AVAudioConverter for the given input format
static func makeConverter(
from inputFormat: AVAudioFormat
) -> AVAudioConverter? {
guard let converter = AVAudioConverter(
from: inputFormat,
to: whisperFormat
) else {
return nil
}
// Use high-quality sample rate conversion for accuracy.
// sampleRateConverterQuality takes AVAudioQuality raw values:
// .low — fastest, lowest quality
// .medium — balanced (VaulType default on battery)
// .max — highest quality (VaulType default on power)
converter.sampleRateConverterQuality = AVAudioQuality.max.rawValue
return converter
}
}

💡 Performance Note: Sample rate conversion quality has minimal impact on transcription accuracy for speech. .medium quality is sufficient and uses significantly less CPU than .max. Reserve .max for when connected to power.

A lock-free ring buffer connects the audio capture thread (real-time priority) to the Whisper inference thread. This avoids priority inversion from locks.

import Atomics
/// Lock-free single-producer, single-consumer ring buffer for audio samples
final class AudioRingBuffer: @unchecked Sendable {
private let buffer: UnsafeMutableBufferPointer<Float>
private let capacity: Int
private let writeIndex = ManagedAtomic<Int>(0)
private let readIndex = ManagedAtomic<Int>(0)
/// Creates a ring buffer with the given capacity in samples
/// - Parameter capacity: Number of Float samples. Should be power of 2.
init(capacity: Int) {
precondition(capacity > 0 && capacity & (capacity - 1) == 0,
"Capacity must be a power of 2")
self.capacity = capacity
let ptr = UnsafeMutablePointer<Float>.allocate(capacity: capacity)
ptr.initialize(repeating: 0.0, count: capacity)
self.buffer = UnsafeMutableBufferPointer(start: ptr, count: capacity)
}
deinit {
buffer.baseAddress?.deinitialize(count: capacity)
buffer.baseAddress?.deallocate()
}
/// Number of samples available to read
var availableSamples: Int {
let write = writeIndex.load(ordering: .acquiring)
let read = readIndex.load(ordering: .acquiring)
return (write - read + capacity) & (capacity - 1)
}
/// Number of free slots available for writing
var freeSlots: Int {
capacity - 1 - availableSamples
}
/// Write samples from the audio capture callback (real-time safe)
/// - Returns: Number of samples actually written
@discardableResult
func write(_ samples: UnsafeBufferPointer<Float>) -> Int {
let count = min(samples.count, freeSlots)
guard count > 0 else { return 0 }
let writePos = writeIndex.load(ordering: .relaxed)
for i in 0..<count {
buffer[(writePos + i) & (capacity - 1)] = samples[i]
}
writeIndex.store(
(writePos + count) & (capacity - 1),
ordering: .releasing
)
return count
}
/// Read samples for Whisper processing (non-real-time thread)
/// - Parameter into: Destination buffer
/// - Returns: Number of samples actually read
@discardableResult
func read(into destination: UnsafeMutableBufferPointer<Float>) -> Int {
let count = min(destination.count, availableSamples)
guard count > 0 else { return 0 }
let readPos = readIndex.load(ordering: .relaxed)
for i in 0..<count {
destination[i] = buffer[(readPos + i) & (capacity - 1)]
}
readIndex.store(
(readPos + count) & (capacity - 1),
ordering: .releasing
)
return count
}
/// Peek at samples without advancing the read pointer
func peek(count: Int) -> [Float] {
let available = min(count, availableSamples)
guard available > 0 else { return [] }
let readPos = readIndex.load(ordering: .acquiring)
var result = [Float](repeating: 0, count: available)
for i in 0..<available {
result[i] = buffer[(readPos + i) & (capacity - 1)]
}
return result
}
/// Discard all buffered samples
func reset() {
readIndex.store(
writeIndex.load(ordering: .acquiring),
ordering: .releasing
)
}
}

⚠️ Real-Time Safety: The write method is called from the audio render callback, which runs on a real-time thread. It must never allocate memory, acquire locks, or call Objective-C methods. The implementation above uses only atomic operations and direct pointer access.

Low Latency ◄────────────────────────► High Reliability
┌─────┬─────┬──────┬──────┬──────┐
Buffer Size: │ 256 │ 512 │ 1024 │ 2048 │ 4096 │
└──┬──┴──┬──┴──┬───┴──┬───┴──┬───┘
│ │ │ │ │
Latency (ms): 16 32 64 128 256
Drop risk: High Med Low V.Low None
CPU wakeups: 62/s 31/s 16/s 8/s 4/s
Battery: Poor Fair Good Great Best

VaulType’s default of 1024 frames provides the best balance. The ring buffer’s capacity should be at least 8x the buffer size (8192 samples = 0.5 seconds) to absorb processing jitter.
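
Putting those two numbers together, a construction sketch (the 8x multiplier is the guidance above, not a hard requirement):

// Sketch: sizing the ring buffer at 8x the audio buffer.
let config = AudioBufferConfig.balanced                  // 1024 frames
let ringCapacity = Int(config.frameCount) * 8            // 8192 samples ≈ 0.5 s at 16 kHz
let ringBuffer = AudioRingBuffer(capacity: ringCapacity) // power of 2, as required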

Model Loading and Preloading

Loading ML models is expensive. A 4 GB LLM takes 2-4 seconds to load from SSD. VaulType uses lazy loading to keep launch time under 1 second.

The simplest strategy: do not load models at launch. Wait until the user triggers their first transcription.

/// Lazy-loading wrapper for ML model contexts
actor LazyModelLoader<Context> {
enum State {
case unloaded
case loading(Task<Context, Error>)
case loaded(Context)
case failed(Error)
}
private var state: State = .unloaded
private let loadFunction: () async throws -> Context
private let unloadFunction: (Context) -> Void
init(
load: @escaping () async throws -> Context,
unload: @escaping (Context) -> Void
) {
self.loadFunction = load
self.unloadFunction = unload
}
/// Get the model context, loading if necessary
func get() async throws -> Context {
switch state {
case .loaded(let context):
return context
case .loading(let task):
return try await task.value
case .unloaded, .failed:
let task = Task {
try await loadFunction()
}
state = .loading(task)
do {
let context = try await task.value
state = .loaded(context)
return context
} catch {
state = .failed(error)
throw error
}
}
}
/// Unload the model to free memory
func unload() {
if case .loaded(let context) = state {
unloadFunction(context)
}
state = .unloaded
}
/// Whether the model is currently loaded and ready
var isLoaded: Bool {
if case .loaded = state { return true }
return false
}
}

After launch, preload models in the background if conditions allow (on power, not thermal throttled, sufficient memory):

/// Preloads models in the background when conditions are favorable
final class BackgroundPreloader {
private var preloadTask: Task<Void, Never>?
func startPreloadingIfAppropriate(
whisperLoader: LazyModelLoader<OpaquePointer>,
llmLoader: LazyModelLoader<OpaquePointer>,
performanceState: PerformanceState
) {
// Don't preload if conditions are unfavorable
guard !performanceState.isOnBattery,
performanceState.thermalState <= .fair,
performanceState.memoryPressure == .nominal else {
return
}
preloadTask = Task(priority: .background) {
// Small delay so launch UI is fully responsive first
try? await Task.sleep(for: .seconds(2))
guard !Task.isCancelled else { return }
// Load Whisper first (smaller, more immediately needed)
_ = try? await whisperLoader.get()
guard !Task.isCancelled else { return }
// Then LLM
_ = try? await llmLoader.get()
}
}
func cancelPreloading() {
preloadTask?.cancel()
preloadTask = nil
}
}

After loading, run a single dummy inference to warm up GPU shaders and fill caches:

/// Runs warm-up inference passes to prime GPU shader caches
struct ModelWarmup {
/// Warm up Whisper with a short silence buffer
static func warmUpWhisper(context: OpaquePointer) async {
// Create 1 second of silence at 16 kHz
let silenceBuffer = [Float](repeating: 0.0, count: 16000)
await Task.detached(priority: .background) {
// whisper_full() with the silence buffer
// This compiles Metal shaders on first run and caches them
silenceBuffer.withUnsafeBufferPointer { ptr in
// whisper_full(context, params, ptr.baseAddress, Int32(ptr.count))
_ = ptr // Placeholder: actual whisper_full call
}
}.value
}
/// Warm up LLM with a minimal prompt
static func warmUpLLM(model: OpaquePointer) async {
await Task.detached(priority: .background) {
// Run a single-token generation with a minimal prompt
// This warms up:
// 1. Metal compute pipeline state objects
// 2. GPU shader compilation cache
// 3. Memory allocation pools
// Actual llama_decode call with a single "hello" token
}.value
}
}

ℹ️ Shader Compilation: Metal shaders are compiled just-in-time on first use. The first inference pass is typically 2-5x slower than subsequent passes. Warm-up inference ensures the user never sees this penalty.

When multiple models need loading, prioritize based on likely user action:

/// Priority queue for model loading requests
actor ModelPriorityQueue {
struct LoadRequest: Comparable {
let modelId: String
let priority: Priority
let loadAction: () async throws -> Void
enum Priority: Int, Comparable {
case critical = 0 // User is waiting right now
case high = 1 // User likely to need soon
case background = 2 // Speculative preload
static func < (lhs: Priority, rhs: Priority) -> Bool {
lhs.rawValue < rhs.rawValue
}
}
static func < (lhs: LoadRequest, rhs: LoadRequest) -> Bool {
lhs.priority < rhs.priority
}
}
private var queue: [LoadRequest] = []
private var isProcessing = false
func enqueue(_ request: LoadRequest) {
queue.append(request)
queue.sort()
processNext()
}
private func processNext() {
guard !isProcessing, let request = queue.first else { return }
queue.removeFirst()
isProcessing = true
Task {
do {
try await request.loadAction()
} catch {
// Log error, model will be loaded on next attempt
}
isProcessing = false
processNext()
}
}
}
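
Usage sketch (inside an async context; the model IDs and elided loader closures are illustrative):

let loadQueue = ModelPriorityQueue()
// User is actively waiting on a transcription:
await loadQueue.enqueue(.init(modelId: "whisper-medium", priority: .critical) {
    // try await whisperLoader.get()
})
// Speculative preload; drains only after higher-priority work:
await loadQueue.enqueue(.init(modelId: "llm-7b-q4km", priority: .background) {
    // try await llmLoader.get()
})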

The complete preloading service that ties all loading strategies together:

import Combine
import Foundation
/// Orchestrates model lifecycle: lazy loading, preloading, warm-up, and unloading
@MainActor
final class ModelPreloadingService: ObservableObject {
// MARK: - Published State
@Published private(set) var whisperState: ModelState = .unloaded
@Published private(set) var llmState: ModelState = .unloaded
enum ModelState: Equatable {
case unloaded
case loading(progress: Double)
case loaded
case warmingUp
case ready
case error(String)
static func == (lhs: ModelState, rhs: ModelState) -> Bool {
switch (lhs, rhs) {
case (.unloaded, .unloaded), (.loaded, .loaded),
(.warmingUp, .warmingUp), (.ready, .ready):
return true
case (.loading(let a), .loading(let b)):
return a == b
case (.error(let a), .error(let b)):
return a == b
default:
return false
}
}
}
// MARK: - Dependencies
private let memoryMonitor = MemoryPressureMonitor.shared
private let priorityQueue = ModelPriorityQueue()
private var cancellables = Set<AnyCancellable>()
private var idleTimers: [String: Task<Void, Never>] = [:]
// Configuration
private let whisperIdleTimeout: Duration = .seconds(300) // 5 minutes
private let llmIdleTimeout: Duration = .seconds(300) // 5 minutes
// MARK: - Initialization
init() {
setupMemoryPressureHandling()
setupThermalStateHandling()
}
// MARK: - Public API
/// Ensure Whisper is loaded and ready for transcription
func ensureWhisperReady() async throws {
resetIdleTimer(for: "whisper")
guard whisperState != .ready else { return }
whisperState = .loading(progress: 0.0)
// Load model from disk
// In actual implementation: whisper_init_from_file_with_params()
whisperState = .loading(progress: 0.5)
// Simulated progress for Metal shader compilation
whisperState = .loaded
// Warm up
whisperState = .warmingUp
// await ModelWarmup.warmUpWhisper(context: whisperContext)
whisperState = .ready
startIdleTimer(for: "whisper", timeout: whisperIdleTimeout)
}
/// Ensure LLM is loaded and ready for text refinement
func ensureLLMReady() async throws {
resetIdleTimer(for: "llm")
guard llmState != .ready else { return }
llmState = .loading(progress: 0.0)
// Load model
llmState = .loading(progress: 0.5)
llmState = .loaded
// Warm up
llmState = .warmingUp
// await ModelWarmup.warmUpLLM(model: llamaModel)
llmState = .ready
startIdleTimer(for: "llm", timeout: llmIdleTimeout)
}
/// Preload both models in the background if conditions are favorable
func preloadIfFavorable(
isOnBattery: Bool,
thermalState: ProcessInfo.ThermalState
) {
guard !isOnBattery,
thermalState <= .fair else {
return
}
Task(priority: .background) {
try? await Task.sleep(for: .seconds(2))
try? await ensureWhisperReady()
try? await Task.sleep(for: .seconds(1))
try? await ensureLLMReady()
}
}
/// Unload a specific model
func unload(_ model: String) {
switch model {
case "whisper":
// whisper_free(context)
whisperState = .unloaded
case "llm":
// llama_free_model(model)
llmState = .unloaded
default:
break
}
idleTimers[model]?.cancel()
idleTimers[model] = nil
}
/// Unload all models
func unloadAll() {
unload("whisper")
unload("llm")
}
// MARK: - Private
private func setupMemoryPressureHandling() {
memoryMonitor.observe { [weak self] level in
Task { @MainActor in
guard let self else { return }
switch level {
case .warning:
// Unload LLM if it's idle (Whisper is more critical)
if self.llmState == .ready {
self.unload("llm")
}
case .critical:
self.unloadAll()
case .nominal:
break
}
}
}
}
private func setupThermalStateHandling() {
NotificationCenter.default
.publisher(for: ProcessInfo.thermalStateDidChangeNotification)
.receive(on: DispatchQueue.main)
.sink { [weak self] _ in
guard let self else { return }
let state = ProcessInfo.processInfo.thermalState
if state == .critical {
self.unloadAll()
} else if state == .serious {
self.unload("llm")
}
}
.store(in: &cancellables)
}
private func startIdleTimer(for model: String, timeout: Duration) {
idleTimers[model]?.cancel()
idleTimers[model] = Task {
try? await Task.sleep(for: timeout)
guard !Task.isCancelled else { return }
await MainActor.run {
self.unload(model)
}
}
}
private func resetIdleTimer(for model: String) {
idleTimers[model]?.cancel()
idleTimers[model] = nil
}
}

Architecture Note: ModelPreloadingService is owned by the PerformanceManager and exposed to SwiftUI views via @EnvironmentObject. See Architecture for the full object graph.
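
A minimal sketch of that exposure (the app structure and root view are assumed, not VaulType's actual scene setup):

import SwiftUI

@main
struct VaulTypeAppSketch: App {
    @StateObject private var preloading = ModelPreloadingService()
    var body: some Scene {
        WindowGroup {
            ContentView() // hypothetical root view
                .environmentObject(preloading) // read via @EnvironmentObject downstream
        }
    }
}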

Battery-Aware Performance

VaulType adapts its performance profile based on whether the Mac is connected to power or running on battery.

import Foundation
import IOKit.ps
/// Monitors power source changes and provides current battery state
final class PowerSourceMonitor: @unchecked Sendable {
static let shared = PowerSourceMonitor()
struct PowerState: Equatable {
let isOnBattery: Bool
let batteryLevel: Double? // 0.0 - 1.0, nil if no battery
let isCharging: Bool
let timeRemaining: TimeInterval? // seconds, nil if unknown
/// Whether we should use battery-saving mode
var shouldThrottle: Bool {
isOnBattery && (batteryLevel ?? 1.0) < 0.5
}
/// Whether we should use minimal mode (critical battery)
var isCritical: Bool {
isOnBattery && (batteryLevel ?? 1.0) < 0.1
}
}
private var observers: [(PowerState) -> Void] = []
private var runLoopSource: CFRunLoopSource?
private(set) var currentState: PowerState
private init() {
self.currentState = PowerSourceMonitor.readCurrentState()
setupMonitoring()
}
/// Read current power source info from IOKit
private static func readCurrentState() -> PowerState {
guard let snapshot = IOPSCopyPowerSourcesInfo()?.takeRetainedValue(),
let sources = IOPSCopyPowerSourcesList(snapshot)?.takeRetainedValue()
as? [CFTypeRef] else {
// Desktop Mac with no battery
return PowerState(
isOnBattery: false,
batteryLevel: nil,
isCharging: false,
timeRemaining: nil
)
}
for source in sources {
guard let info = IOPSGetPowerSourceDescription(snapshot, source)?
.takeUnretainedValue() as? [String: Any] else {
continue
}
let powerSource = info[kIOPSPowerSourceStateKey as String] as? String
let isOnBattery = powerSource == kIOPSBatteryPowerValue
let currentCapacity = info[kIOPSCurrentCapacityKey as String] as? Int ?? 0
let maxCapacity = info[kIOPSMaxCapacityKey as String] as? Int ?? 100
let isCharging = info[kIOPSIsChargingKey as String] as? Bool ?? false
let timeRemaining = info[kIOPSTimeToEmptyKey as String] as? Int
return PowerState(
isOnBattery: isOnBattery,
batteryLevel: Double(currentCapacity) / Double(maxCapacity),
isCharging: isCharging,
timeRemaining: timeRemaining.map { TimeInterval($0 * 60) }
)
}
return PowerState(
isOnBattery: false,
batteryLevel: nil,
isCharging: false,
timeRemaining: nil
)
}
private func setupMonitoring() {
let context = UnsafeMutableRawPointer(
Unmanaged.passUnretained(self).toOpaque()
)
runLoopSource = IOPSNotificationCreateRunLoopSource(
{ context in
guard let context else { return }
let monitor = Unmanaged<PowerSourceMonitor>
.fromOpaque(context)
.takeUnretainedValue()
let newState = PowerSourceMonitor.readCurrentState()
if newState != monitor.currentState {
monitor.currentState = newState
for observer in monitor.observers {
observer(newState)
}
}
},
context
)?.takeRetainedValue()
if let source = runLoopSource {
CFRunLoopAddSource(CFRunLoopGetMain(), source, .defaultMode)
}
}
func observe(_ handler: @escaping (PowerState) -> Void) {
observers.append(handler)
}
}
/// Selects optimal model configuration based on power state
struct BatteryAwareModelSelector {
struct ModelSelection {
let whisperModel: WhisperModelSize
let llmQuantization: LLMQuantization
let llmParameterBillions: Double
let gpuLayerReduction: Int32 // reduce GPU layers by this amount
let preloadingEnabled: Bool
let audioBufferConfig: AudioBufferConfig
}
/// Select the best model configuration for current conditions
static func select(
powerState: PowerSourceMonitor.PowerState,
thermalState: ProcessInfo.ThermalState,
availableMemoryGB: Double,
preferredWhisperModel: WhisperModelSize,
preferredLLMQuantization: LLMQuantization,
preferredLLMParamBillions: Double
) -> ModelSelection {
// On power, no thermal issues: use preferred configuration
if !powerState.isOnBattery && thermalState <= .fair {
return ModelSelection(
whisperModel: preferredWhisperModel,
llmQuantization: preferredLLMQuantization,
llmParameterBillions: preferredLLMParamBillions,
gpuLayerReduction: 0,
preloadingEnabled: true,
audioBufferConfig: .balanced
)
}
// Critical battery: minimal configuration
if powerState.isCritical {
return ModelSelection(
whisperModel: .tiny,
llmQuantization: .q4_0,
llmParameterBillions: min(preferredLLMParamBillions, 3.0),
gpuLayerReduction: 16,
preloadingEnabled: false,
audioBufferConfig: .batterySaving
)
}
// On battery with < 50%: reduced configuration
if powerState.shouldThrottle {
let whisper: WhisperModelSize = switch preferredWhisperModel {
case .large: .medium
case .medium: .small
default: preferredWhisperModel
}
return ModelSelection(
whisperModel: whisper,
llmQuantization: .q4_0,
llmParameterBillions: preferredLLMParamBillions,
gpuLayerReduction: 8,
preloadingEnabled: false,
audioBufferConfig: .batterySaving
)
}
// On battery, > 50%: slightly reduced configuration
return ModelSelection(
whisperModel: preferredWhisperModel,
llmQuantization: min(preferredLLMQuantization, .q4_k_m),
llmParameterBillions: preferredLLMParamBillions,
gpuLayerReduction: 4,
preloadingEnabled: false,
audioBufferConfig: .balanced
)
}
}
// Make LLMQuantization Comparable for min() usage
extension LLMQuantization: Comparable {
static func < (lhs: LLMQuantization, rhs: LLMQuantization) -> Bool {
lhs.bitsPerParameter < rhs.bitsPerParameter
}
}

| Battery Level | Preloading Policy |
| --- | --- |
| 100%-80% (charging or just unplugged) | Allow preloading, use balanced models |
| 80%-50% | No preloading; load on first use only |
| 50%-20% | No preloading; reduce to smaller models |
| 20%-10% | No preloading; reduce GPU layers; unload idle models after 60 s |
| Below 10% | Unload LLM entirely; Whisper tiny only; minimal GPU |

/// Power profiles that bundle all performance settings
enum PowerProfile: String, CaseIterable {
case maximum // Maximum quality, all features
case balanced // Good quality, reasonable power use
case efficiency // Reduced quality, extended battery
case minimal // Minimum viable, emergency battery
struct Settings {
let whisperModel: WhisperModelSize
let llmQuantization: LLMQuantization
let maxGPULayers: Int32
let preloadingEnabled: Bool
let warmupEnabled: Bool
let audioBufferConfig: AudioBufferConfig
let idleUnloadTimeout: Duration
}
var settings: Settings {
switch self {
case .maximum:
return Settings(
whisperModel: .large,
llmQuantization: .q8_0,
maxGPULayers: 32,
preloadingEnabled: true,
warmupEnabled: true,
audioBufferConfig: .lowLatency,
idleUnloadTimeout: .seconds(600)
)
case .balanced:
return Settings(
whisperModel: .medium,
llmQuantization: .q4_k_m,
maxGPULayers: 32,
preloadingEnabled: true,
warmupEnabled: true,
audioBufferConfig: .balanced,
idleUnloadTimeout: .seconds(300)
)
case .efficiency:
return Settings(
whisperModel: .small,
llmQuantization: .q4_0,
maxGPULayers: 20,
preloadingEnabled: false,
warmupEnabled: false,
audioBufferConfig: .batterySaving,
idleUnloadTimeout: .seconds(120)
)
case .minimal:
return Settings(
whisperModel: .tiny,
llmQuantization: .q4_0,
maxGPULayers: 8,
preloadingEnabled: false,
warmupEnabled: false,
audioBufferConfig: .batterySaving,
idleUnloadTimeout: .seconds(30)
)
}
}
/// Automatically select profile based on current conditions
static func automatic(
powerState: PowerSourceMonitor.PowerState,
thermalState: ProcessInfo.ThermalState
) -> PowerProfile {
if thermalState >= .critical { return .minimal }
if thermalState >= .serious { return .efficiency }
if powerState.isCritical { return .minimal }
if powerState.shouldThrottle { return .efficiency }
if powerState.isOnBattery { return .balanced }
return .maximum
}
}

🔒 Privacy Note: Power state detection uses IOKit APIs and does not require any special permissions. No power data leaves the device.
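
Resolving the active profile at runtime is a one-liner over the monitors defined above (shown as a sketch; error/edge handling omitted):

// Pick a profile from live power + thermal state.
let profile = PowerProfile.automatic(
    powerState: PowerSourceMonitor.shared.currentState,
    thermalState: ProcessInfo.processInfo.thermalState
)
let settings = profile.settings // whisperModel, quantization, GPU layers, ...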

Thermal Management

Apple Silicon throttles CPU and GPU frequency under thermal pressure. VaulType proactively adapts before the system forces throttling.

macOS provides four thermal states:

| State | Meaning | VaulType Response |
| --- | --- | --- |
| .nominal | Normal operating temperature | Full performance |
| .fair | Slightly elevated temperature | Reduce preloading |
| .serious | High temperature, system may throttle | Reduce GPU layers, smaller models |
| .critical | System is actively throttling | Minimal mode, unload LLM |

Temperature ──────────────────────────────────────────────►
┌───────────┬───────────┬───────────┬───────────┐
│ nominal │ fair │ serious │ critical │
│ │ │ │ │
│ Full GPU │ No │ -8 GPU │ Unload │
│ Preload │ preload │ layers │ LLM │
│ All │ Reduce │ Smaller │ Whisper │
│ features │ warm-up │ models │ tiny only │
└───────────┴───────────┴───────────┴───────────┘
import Combine
import Foundation
/// Monitors thermal state and adapts performance parameters
@MainActor
final class ThermalThrottleManager: ObservableObject {
@Published private(set) var currentThermalState: ProcessInfo.ThermalState = .nominal
@Published private(set) var gpuLayerReduction: Int32 = 0
@Published private(set) var shouldReduceModelSize: Bool = false
@Published private(set) var shouldDisablePreloading: Bool = false
private var cancellable: AnyCancellable?
init() {
currentThermalState = ProcessInfo.processInfo.thermalState
updateThrottling(for: currentThermalState)
cancellable = NotificationCenter.default
.publisher(for: ProcessInfo.thermalStateDidChangeNotification)
.receive(on: DispatchQueue.main)
.sink { [weak self] _ in
guard let self else { return }
let newState = ProcessInfo.processInfo.thermalState
self.currentThermalState = newState
self.updateThrottling(for: newState)
}
}
private func updateThrottling(for state: ProcessInfo.ThermalState) {
switch state {
case .nominal:
gpuLayerReduction = 0
shouldReduceModelSize = false
shouldDisablePreloading = false
case .fair:
gpuLayerReduction = 0
shouldReduceModelSize = false
shouldDisablePreloading = true // Stop speculative loads
case .serious:
gpuLayerReduction = 8 // Move 8 layers from GPU to CPU
shouldReduceModelSize = true // Step down model sizes
shouldDisablePreloading = true
case .critical:
gpuLayerReduction = 16
shouldReduceModelSize = true
shouldDisablePreloading = true
@unknown default:
gpuLayerReduction = 0
shouldReduceModelSize = false
shouldDisablePreloading = false
}
}
}

Reducing GPU Layers Under Thermal Pressure


When thermal state escalates, reducing GPU layers shifts computation from GPU to CPU. This reduces heat generation because:

  1. CPU cores can individually clock down more granularly
  2. CPU work can be spread across efficiency cores (P/E core scheduling)
  3. GPU power draw is typically higher per FLOP than CPU for these workloads at reduced clocks
/// Calculates effective GPU layer count under thermal constraints
struct ThermalAwareGPULayers {
/// Calculate effective GPU layers for Whisper
static func whisperGPULayers(
baseConfig: MetalGPUConfiguration,
thermalReduction: Int32
) -> Int32 {
max(baseConfig.whisperGPULayers - thermalReduction, 0)
}
/// Calculate effective GPU layers for LLM
static func llmGPULayers(
baseConfig: MetalGPUConfiguration,
thermalReduction: Int32
) -> Int32 {
max(baseConfig.llmGPULayers - thermalReduction, 0)
}
/// Estimated performance impact of thermal reduction
static func estimatedSlowdown(
totalLayers: Int32,
gpuLayers: Int32,
thermalReduction: Int32
) -> Double {
let originalGPU = Double(gpuLayers)
let reducedGPU = Double(max(gpuLayers - thermalReduction, 0))
let cpuLayers = Double(totalLayers) - reducedGPU
let originalCPULayers = Double(totalLayers) - originalGPU
// Very rough: GPU layers are ~3x faster than CPU layers
let originalTime = originalCPULayers * 3.0 + originalGPU * 1.0
let reducedTime = cpuLayers * 3.0 + reducedGPU * 1.0
return reducedTime / originalTime
}
}
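
A quick worked example of the slowdown estimate under the stated ~3x CPU-vs-GPU assumption:

// Moving 8 of 32 GPU layers to CPU:
// original time ∝ 32 GPU x 1.0              = 32
// reduced time  ∝ 24 GPU x 1.0 + 8 CPU x 3.0 = 48
let slowdown = ThermalAwareGPULayers.estimatedSlowdown(
    totalLayers: 32, gpuLayers: 32, thermalReduction: 8
) // ≈ 1.5x slower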

Complete thermal adaptation that integrates with the model pipeline:

import Combine
import Foundation
/// Coordinates thermal management across all VaulType subsystems
@MainActor
final class ThermalCoordinator: ObservableObject {
@Published private(set) var effectiveProfile: PowerProfile = .balanced
private let thermalManager = ThermalThrottleManager()
private let powerMonitor = PowerSourceMonitor.shared
private let preloadingService: ModelPreloadingService
private var cancellables = Set<AnyCancellable>()
init(preloadingService: ModelPreloadingService) {
self.preloadingService = preloadingService
setupBindings()
}
private func setupBindings() {
// React to thermal state changes
thermalManager.$currentThermalState
.combineLatest(
thermalManager.$gpuLayerReduction,
thermalManager.$shouldReduceModelSize
)
.debounce(for: .seconds(1), scheduler: DispatchQueue.main)
.sink { [weak self] thermalState, _, _ in
guard let self else { return }
self.recalculateProfile()
self.applyThermalMitigation(thermalState)
}
.store(in: &cancellables)
}
private func recalculateProfile() {
let powerState = powerMonitor.currentState
let thermalState = ProcessInfo.processInfo.thermalState
effectiveProfile = PowerProfile.automatic(
powerState: powerState,
thermalState: thermalState
)
}
private func applyThermalMitigation(_ state: ProcessInfo.ThermalState) {
switch state {
case .nominal:
// Restore full performance if memory allows
break
case .fair:
// Stop speculative preloads only; keep already-loaded models resident
break
case .serious:
// Unload LLM, keep Whisper with reduced GPU layers
preloadingService.unload("llm")
// Reconfigure Whisper with reduced GPU layers on next inference
case .critical:
// Unload everything
preloadingService.unloadAll()
// Next transcription will use .minimal profile
@unknown default:
break
}
}
/// Log thermal event for diagnostics
private func logThermalEvent(_ state: ProcessInfo.ThermalState) {
let stateNames: [ProcessInfo.ThermalState: String] = [
.nominal: "nominal",
.fair: "fair",
.serious: "serious",
.critical: "critical"
]
let name = stateNames[state] ?? "unknown"
// Logger.performance.info("Thermal state changed to: \(name)")
_ = name
}
}

⚠️ Hysteresis: Thermal state transitions can oscillate. The debounce(for: .seconds(1)) prevents rapid reconfiguration. When transitioning from .serious back to .nominal, wait at least 30 seconds before restoring full GPU layers to avoid thermal cycling.
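
One way to enforce that cool-down, sketched as a small gate (the 30-second figure comes from the warning above; the type itself is illustrative):

actor ThermalRecoveryGate {
    private let clock = ContinuousClock()
    private let cooldown: Duration = .seconds(30)
    private var lastEscalation: ContinuousClock.Instant?
    /// Call whenever thermal state worsens.
    func noteEscalation() { lastEscalation = clock.now }
    /// Whether full GPU layers may be restored yet.
    func mayRestoreFullPerformance() -> Bool {
        guard let last = lastEscalation else { return true }
        return last.duration(to: clock.now) >= cooldown
    }
}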

Benchmarking and Profiling

Systematic benchmarking ensures VaulType’s performance improves (or at least does not regress) with every release.

Use these Instruments templates for VaulType performance analysis:

| Template | What to Look For |
| --- | --- |
| Time Profiler | Hot functions in whisper.cpp/llama.cpp wrappers, main-thread stalls, Swift overhead |
| Allocations | Memory growth during transcription, leaked model buffers, KV cache growth |
| Metal System Trace | GPU utilization, shader compilation stalls, GPU/CPU sync points |
| System Trace | Thread scheduling, priority inversions, real-time thread preemption |
| Energy Log | CPU/GPU/ANE energy impact, background activity, wake-ups |
| Thermal State | Thermal ramp during sustained transcription, throttle points |

Key areas to profile:

  1. Model loading — Time from load() call to first inference readiness
  2. First inference — Time including shader compilation
  3. Steady-state inference — Time for subsequent inferences (cached shaders)
  4. Audio pipeline latency — Time from microphone capture to ring buffer availability
  5. End-to-end latency — Time from end of speech to text appearing in target app
  6. Memory high-water mark — Peak memory during dual-model operation

The harness below measures each of these with ContinuousClock and wraps every iteration in an os_signpost interval, so runs can be correlated with Instruments traces:

import Foundation
import os.signpost

/// Benchmarking harness for VaulType performance measurements
final class PerformanceBenchmark {

    // MARK: - Signpost Integration

    // Points-of-interest log for marking regions of interest in Instruments
    private static let log = OSLog(
        subsystem: "com.vaultype.benchmark",
        category: .pointsOfInterest
    )

    private static let signpostLog = OSLog(
        subsystem: "com.vaultype.benchmark",
        category: "Performance"
    )

    // MARK: - Measurement Types

    struct Measurement {
        let name: String
        let duration: Duration
        let metadata: [String: String]
        let timestamp: Date

        var durationMilliseconds: Double {
            let components = duration.components
            return Double(components.seconds) * 1000.0 +
                Double(components.attoseconds) / 1_000_000_000_000_000.0
        }
    }

    struct BenchmarkResult {
        let name: String
        let measurements: [Measurement]

        var count: Int { measurements.count }

        var meanDuration: Double {
            guard !measurements.isEmpty else { return 0 }
            let total = measurements.reduce(0.0) { $0 + $1.durationMilliseconds }
            return total / Double(measurements.count)
        }

        var medianDuration: Double {
            guard !measurements.isEmpty else { return 0 }
            let sorted = measurements.map(\.durationMilliseconds).sorted()
            let mid = sorted.count / 2
            if sorted.count.isMultiple(of: 2) {
                return (sorted[mid - 1] + sorted[mid]) / 2.0
            }
            return sorted[mid]
        }

        var p95Duration: Double {
            guard !measurements.isEmpty else { return 0 }
            let sorted = measurements.map(\.durationMilliseconds).sorted()
            let index = Int(Double(sorted.count) * 0.95)
            return sorted[min(index, sorted.count - 1)]
        }

        var p99Duration: Double {
            guard !measurements.isEmpty else { return 0 }
            let sorted = measurements.map(\.durationMilliseconds).sorted()
            let index = Int(Double(sorted.count) * 0.99)
            return sorted[min(index, sorted.count - 1)]
        }

        var standardDeviation: Double {
            guard measurements.count > 1 else { return 0 }
            let mean = meanDuration
            // Sample variance (n - 1 denominator)
            let variance = measurements.reduce(0.0) { sum, m in
                let diff = m.durationMilliseconds - mean
                return sum + diff * diff
            } / Double(measurements.count - 1)
            return variance.squareRoot()
        }

        var summary: String {
            """
            Benchmark: \(name)
            Iterations: \(count)
            Mean: \(String(format: "%.2f", meanDuration)) ms
            Median: \(String(format: "%.2f", medianDuration)) ms
            P95: \(String(format: "%.2f", p95Duration)) ms
            P99: \(String(format: "%.2f", p99Duration)) ms
            StdDev: \(String(format: "%.2f", standardDeviation)) ms
            """
        }
    }

    // MARK: - Benchmark Execution

    private var results: [String: [Measurement]] = [:]

    /// Run a benchmark with the given name and iteration count
    func benchmark(
        name: String,
        iterations: Int = 100,
        warmupIterations: Int = 5,
        metadata: [String: String] = [:],
        operation: () async throws -> Void
    ) async throws -> BenchmarkResult {
        // Warm-up phase (results discarded)
        for _ in 0..<warmupIterations {
            try await operation()
        }

        // Measurement phase
        var measurements: [Measurement] = []
        measurements.reserveCapacity(iterations)

        for _ in 0..<iterations {
            let signpostID = OSSignpostID(log: Self.signpostLog)
            os_signpost(
                .begin,
                log: Self.signpostLog,
                name: "Benchmark",
                signpostID: signpostID,
                "%{public}s",
                name
            )
            let clock = ContinuousClock()
            let duration = try await clock.measure {
                try await operation()
            }
            os_signpost(
                .end,
                log: Self.signpostLog,
                name: "Benchmark",
                signpostID: signpostID
            )
            measurements.append(Measurement(
                name: name,
                duration: duration,
                metadata: metadata,
                timestamp: Date()
            ))
        }

        results[name] = measurements
        return BenchmarkResult(name: name, measurements: measurements)
    }

    /// Run a synchronous benchmark
    func benchmarkSync(
        name: String,
        iterations: Int = 100,
        warmupIterations: Int = 5,
        metadata: [String: String] = [:],
        operation: () throws -> Void
    ) throws -> BenchmarkResult {
        // Warm-up
        for _ in 0..<warmupIterations {
            try operation()
        }

        var measurements: [Measurement] = []
        measurements.reserveCapacity(iterations)
        let clock = ContinuousClock()

        for _ in 0..<iterations {
            let duration = try clock.measure {
                try operation()
            }
            measurements.append(Measurement(
                name: name,
                duration: duration,
                metadata: metadata,
                timestamp: Date()
            ))
        }

        results[name] = measurements
        return BenchmarkResult(name: name, measurements: measurements)
    }

    /// Export all results as JSON for trend analysis
    func exportJSON() throws -> Data {
        struct ExportEntry: Codable {
            let name: String
            let durationMs: Double
            let metadata: [String: String]
            let timestamp: String
        }
        let formatter = ISO8601DateFormatter()
        let entries = results.flatMap { (name, measurements) in
            measurements.map { m in
                ExportEntry(
                    name: name,
                    durationMs: m.durationMilliseconds,
                    metadata: m.metadata,
                    timestamp: formatter.string(from: m.timestamp)
                )
            }
        }
        return try JSONEncoder().encode(entries)
    }

    /// Print a formatted summary of all results
    func printSummary() {
        for (name, measurements) in results.sorted(by: { $0.key < $1.key }) {
            let result = BenchmarkResult(name: name, measurements: measurements)
            print(result.summary)
            print("---")
        }
    }
}

Example usage:

// Run benchmarks for all key operations
func runPerformanceSuite() async throws {
    let bench = PerformanceBenchmark()

    // Benchmark Whisper inference
    let whisperResult = try await bench.benchmark(
        name: "whisper_inference_5s_audio",
        iterations: 50,
        warmupIterations: 3,
        metadata: [
            "model": "medium",
            "audio_duration": "5.0",
            "hardware": ProcessInfo.processInfo.machineHardwareName
        ]
    ) {
        // Run whisper_full() on a 5-second test audio buffer
        try await transcribeTestAudio()
    }

    // Benchmark LLM inference
    let llmResult = try await bench.benchmark(
        name: "llm_refinement_50_tokens",
        iterations: 50,
        warmupIterations: 3,
        metadata: [
            "quantization": "Q4_K_M",
            "input_tokens": "50",
            "hardware": ProcessInfo.processInfo.machineHardwareName
        ]
    ) {
        // Run llama_decode() on a test prompt
        try await refineTestText()
    }

    // Benchmark end-to-end pipeline
    let e2eResult = try await bench.benchmark(
        name: "end_to_end_pipeline",
        iterations: 20,
        warmupIterations: 2,
        metadata: [
            "whisper_model": "medium",
            "llm_quantization": "Q4_K_M"
        ]
    ) {
        try await runFullPipeline()
    }

    bench.printSummary()

    // Export for CI comparison
    let jsonData = try bench.exportJSON()
    try jsonData.write(to: URL(fileURLWithPath: "/tmp/vaultype_benchmarks.json"))
}

// Helper to get the hardware name via uname(3)
extension ProcessInfo {
    var machineHardwareName: String {
        var sysinfo = utsname()
        uname(&sysinfo)
        return withUnsafePointer(to: &sysinfo.machine) {
            $0.withMemoryRebound(to: CChar.self, capacity: 1) {
                String(validatingUTF8: $0) ?? "unknown"
            }
        }
    }
}

Performance targets and alert thresholds:

| Metric | Target | Alert Threshold | Measurement Method |
| --- | --- | --- | --- |
| Model load time (Whisper medium) | < 1.5 s | > 3.0 s | ContinuousClock around init |
| Model load time (LLM Q4_K_M 7B) | < 3.0 s | > 6.0 s | ContinuousClock around init |
| First inference (Whisper) | < 500 ms | > 1000 ms | Includes shader compilation |
| Steady inference (Whisper, 5 s audio) | < 300 ms | > 600 ms | After warm-up |
| LLM tokens/sec (Q4_K_M, M2) | > 20 t/s | < 10 t/s | Token generation loop |
| End-to-end latency | < 500 ms | > 1000 ms | Speech end → text output |
| Peak memory (dual model) | < 7 GB | > 10 GB | mach_task_basic_info |
| Audio dropout rate | < 0.1% | > 1% | Ring buffer overflow counter |
| Battery drain (1 hr active) | < 15% | > 25% | IOPowerSources sampling |
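
The peak-memory row is sampled with mach_task_basic_info. A minimal reader is shown below; resident_size is the current footprint, so a monitor would sample it periodically and record the high-water mark:

import Darwin

/// Current resident memory of this process via mach_task_basic_info.
/// Returns nil if the task_info call fails.
func residentMemoryBytes() -> UInt64? {
    var info = mach_task_basic_info()
    var count = mach_msg_type_number_t(
        MemoryLayout<mach_task_basic_info>.size / MemoryLayout<natural_t>.size
    )
    let kr = withUnsafeMutablePointer(to: &info) {
        $0.withMemoryRebound(to: integer_t.self, capacity: Int(count)) {
            task_info(mach_task_self_, task_flavor_t(MACH_TASK_BASIC_INFO), $0, &count)
        }
    }
    return kr == KERN_SUCCESS ? info.resident_size : nil
}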

Integrate benchmarks into CI to catch regressions:

import Foundation

/// Compares benchmark results against stored baselines
struct RegressionDetector {

    struct Baseline: Codable {
        let name: String
        let meanDurationMs: Double
        let p95DurationMs: Double
        let hardware: String
        let date: String
    }

    /// Load baselines from disk
    static func loadBaselines(
        from url: URL
    ) throws -> [String: Baseline] {
        let data = try Data(contentsOf: url)
        let baselines = try JSONDecoder().decode([Baseline].self, from: data)
        return Dictionary(uniqueKeysWithValues: baselines.map { ($0.name, $0) })
    }

    /// Check for regressions against baselines
    static func checkForRegressions(
        results: [PerformanceBenchmark.BenchmarkResult],
        baselines: [String: Baseline],
        regressionThreshold: Double = 0.10 // 10% slower = regression
    ) -> [RegressionReport] {
        var reports: [RegressionReport] = []
        for result in results {
            guard let baseline = baselines[result.name] else {
                reports.append(RegressionReport(
                    name: result.name,
                    status: .noBaseline,
                    currentMean: result.meanDuration,
                    baselineMean: nil,
                    changePercent: nil
                ))
                continue
            }
            let changePercent = (result.meanDuration - baseline.meanDurationMs)
                / baseline.meanDurationMs
            let status: RegressionStatus
            if changePercent > regressionThreshold {
                status = .regression
            } else if changePercent < -regressionThreshold {
                status = .improvement
            } else {
                status = .stable
            }
            reports.append(RegressionReport(
                name: result.name,
                status: status,
                currentMean: result.meanDuration,
                baselineMean: baseline.meanDurationMs,
                changePercent: changePercent
            ))
        }
        return reports
    }

    enum RegressionStatus: String {
        case regression
        case stable
        case improvement
        case noBaseline
    }

    struct RegressionReport {
        let name: String
        let status: RegressionStatus
        let currentMean: Double
        let baselineMean: Double?
        let changePercent: Double?

        var description: String {
            let changeStr: String
            if let change = changePercent {
                let sign = change >= 0 ? "+" : ""
                changeStr = "\(sign)\(String(format: "%.1f", change * 100))%"
            } else {
                changeStr = "N/A"
            }
            let icon: String = switch status {
            case .regression: "FAIL"
            case .stable: "PASS"
            case .improvement: "IMPROVED"
            case .noBaseline: "NEW"
            }
            return "[\(icon)] \(name): \(String(format: "%.2f", currentMean)) ms (\(changeStr))"
        }
    }
}
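
A sketch of wiring this into a CI step; the baseline path and the fail-the-job exit convention are assumptions, not part of the detector itself:

import Foundation

/// Hypothetical CI gate: print every report and fail the job on any regression.
func enforceBaselines(results: [PerformanceBenchmark.BenchmarkResult]) throws {
    let baselineURL = URL(fileURLWithPath: "/tmp/vaultype_baselines.json") // assumed path
    let baselines = try RegressionDetector.loadBaselines(from: baselineURL)
    let reports = RegressionDetector.checkForRegressions(
        results: results,
        baselines: baselines
    )
    reports.forEach { print($0.description) }
    if reports.contains(where: { $0.status == .regression }) {
        exit(EXIT_FAILURE) // non-zero exit fails the CI step
    }
}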

End-to-End Latency (5-second utterance, speech-end to text-output):

| Configuration | M1 (8 GB) | M2 (8 GB) | M3 (16 GB) | M4 Pro (24 GB) |
| --- | --- | --- | --- | --- |
| tiny + Q4_0 3B | 320 ms | 250 ms | 210 ms | 150 ms |
| small + Q4_K_M 7B | 480 ms | 380 ms | 310 ms | 220 ms |
| medium + Q4_K_M 7B | 650 ms | 500 ms | 400 ms | 280 ms |
| large-v3 + Q4_K_M 7B | 1100 ms | 850 ms | 680 ms | 450 ms |
| large-v3 + Q8_0 7B | N/A | N/A | 900 ms | 550 ms |

ℹ️ Reading the table: These represent total latency from the moment the user stops speaking to the moment refined text is ready for insertion. The pipeline runs Whisper and LLM sequentially (Whisper output feeds LLM input).
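
For illustration, a minimal sequential measurement matching the table's semantics; transcribe and refine are assumed completion-style wrappers passed in by the caller, not actual VaulType API:

/// Times one sequential pass: Whisper runs to completion, then the LLM
/// refines its output, mirroring the pipeline the table describes.
func measureEndToEnd(
    audio: [Float],
    transcribe: ([Float]) async throws -> String,
    refine: (String) async throws -> String
) async throws -> Duration {
    let clock = ContinuousClock()
    return try await clock.measure {
        let raw = try await transcribe(audio)  // speech end → raw transcript
        _ = try await refine(raw)              // raw transcript → refined text
    }
}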

Model Loading Time (cold start, from SSD):

| Model | Size | M1 | M2 | M3 | M4 |
| --- | --- | --- | --- | --- | --- |
| Whisper tiny | 75 MB | 0.1 s | 0.08 s | 0.07 s | 0.05 s |
| Whisper medium | 1.5 GB | 0.8 s | 0.6 s | 0.5 s | 0.4 s |
| Whisper large-v3 | 3.1 GB | 1.5 s | 1.2 s | 1.0 s | 0.8 s |
| LLM Q4_K_M 7B | 4.1 GB | 2.5 s | 2.0 s | 1.7 s | 1.3 s |
| LLM Q8_0 7B | 7.2 GB | 4.0 s | 3.2 s | 2.8 s | 2.1 s |

Peak Memory During Dual-Model Operation:

| Configuration | Model Memory | Compute Overhead | Total Peak |
| --- | --- | --- | --- |
| tiny + Q4_0 3B | 1.8 GB | 0.5 GB | 2.3 GB |
| base + Q4_K_M 7B | 4.3 GB | 0.7 GB | 5.0 GB |
| small + Q4_K_M 7B | 4.6 GB | 0.7 GB | 5.3 GB |
| medium + Q4_K_M 7B | 5.6 GB | 0.9 GB | 6.5 GB |
| large-v3 + Q4_K_M 7B | 7.1 GB | 1.1 GB | 8.2 GB |
| large-v3 + Q8_0 7B | 10.2 GB | 1.3 GB | 11.5 GB |
| large-v3 + Q8_0 13B | 16.5 GB | 1.8 GB | 18.3 GB |
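
At runtime, the Total Peak column can be checked against installed RAM before selecting a configuration. A minimal sketch; the 70% budget fraction is an illustrative assumption that leaves headroom for the OS and other apps, not a documented VaulType constant:

import Foundation

/// True when a configuration's expected peak fits within the memory budget.
func configurationFits(totalPeakGB: Double) -> Bool {
    let installedGB = Double(ProcessInfo.processInfo.physicalMemory) / 1_073_741_824
    return totalPeakGB <= installedGB * 0.7
}

// Example: tiny + Q4_0 3B (2.3 GB peak) fits on an 8 GB machine
// because 2.3 ≤ 8 × 0.7 = 5.6.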

The full VaulType pipeline from microphone to text insertion:

┌─────────────────────────────────────────────────────────────────────┐
│ VaulType Processing Pipeline │
│ │
│ ┌─────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ Audio │ │ Format │ │ Ring │ │ Voice Activity │ │
│ │ Capture │──►│ Convert │──►│ Buffer │──►│ Detection (VAD) │ │
│ │ 48 kHz │ │ → 16 kHz │ │ 0.5 s │ │ │ │
│ └─────────┘ └──────────┘ └──────────┘ └────────┬─────────┘ │
│ │ │
│ Speech ended? │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Whisper Inference │ │
│ │ │ │
│ │ Audio Chunk ──► Mel Spectrogram ──► Encoder ──► Decoder │ │
│ │ │ │ │
│ │ Raw transcript │ │
│ └─────────────────────────────────────────────┬───────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ LLM Refinement │ │
│ │ │ │
│ │ System Prompt + Raw Transcript ──► Token Generation │ │
│ │ │ │ │
│ │ Refined text │ │
│ └─────────────────────────────────────────────┬───────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Text Insertion │ │
│ │ │ │
│ │ Refined Text ──► CGEvent / Accessibility API ──► Target App│ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Timeline (medium + Q4_K_M 7B on M2):
├─ Audio capture ──────────────────────── continuous ─────────────────┤
├─ VAD detection ────────────────────────── ~50 ms ──┤
├─ Whisper inference ─────────────────────────────── ~250 ms ────────┤
├─ LLM refinement ─────────────────────────────────── ~130 ms ──────┤
├─ Text insertion ──────────────────────────────────────── ~5 ms ───┤
│ │
│ Total end-to-end: ~435 ms │

While Whisper and LLM run sequentially per utterance (LLM needs Whisper’s output), multiple optimizations enable concurrency:

  1. Streaming Whisper output: Begin LLM processing as soon as the first sentence is decoded, while Whisper continues on remaining audio
  2. Pipeline overlap: While LLM refines utterance N, Whisper can begin processing utterance N+1
  3. Parallel model prep: Load and warm up both models concurrently during preloading (see the sketch below)
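
Item 3 maps directly onto structured concurrency. A minimal sketch, with loadWhisperModel() and loadLLMModel() standing in for the real load paths:

/// Parallel model prep (item 3): load both models concurrently during preload.
func preloadModels() async throws {
    try await withThrowingTaskGroup(of: Void.self) { group in
        group.addTask { try await loadWhisperModel() } // whisper.cpp init wrapper (assumed)
        group.addTask { try await loadLLMModel() }     // llama.cpp init wrapper (assumed)
        try await group.waitForAll()
    }
}

// Stand-in load functions so the sketch compiles
func loadWhisperModel() async throws { /* whisper.cpp model init */ }
func loadLLMModel() async throws { /* llama.cpp model init */ }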

Items 1 and 2 are sketched by the orchestrator below, which refines each Whisper segment with the LLM as soon as the segment is decoded:

/// Orchestrates overlapped pipeline execution
actor PipelineOrchestrator {
    private let whisperEngine: any WhisperEngine
    private let llmEngine: any LLMEngine

    init(whisperEngine: any WhisperEngine, llmEngine: any LLMEngine) {
        self.whisperEngine = whisperEngine
        self.llmEngine = llmEngine
    }

    /// Process audio with pipeline overlap
    func processAudio(_ audioBuffer: [Float]) -> AsyncStream<String> {
        AsyncStream { continuation in
            Task {
                // Start Whisper transcription
                let rawSegments = await whisperEngine.transcribeStreaming(audioBuffer)
                // Feed each segment through the LLM as it becomes available
                for await segment in rawSegments {
                    let refined = try? await llmEngine.refine(segment)
                    continuation.yield(refined ?? segment)
                }
                continuation.finish()
            }
        }
    }
}

// Protocol stubs for the example
protocol WhisperEngine {
    func transcribeStreaming(_ audio: [Float]) async -> AsyncStream<String>
}

protocol LLMEngine {
    func refine(_ text: String) async throws -> String
}

💡 Streaming Optimization: For short utterances (under 5 seconds), the overhead of streaming is not worth it — just run Whisper to completion then LLM. For longer dictation sessions (30+ seconds), streaming reduces perceived latency significantly because the user sees refined text appearing while still speaking.
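
A compact routing rule matching that guidance; the 5-second threshold comes from the note above, and the names are illustrative:

enum RefinementPath { case sequential, streaming }

/// Short utterances take the simple sequential path; long dictation streams.
func refinementPath(utteranceSeconds: Double) -> RefinementPath {
    utteranceSeconds < 5.0 ? .sequential : .streaming
}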


| Document | Relevance |
| --- | --- |
| Architecture Overview | System component graph, dependency injection, manager lifecycle |
| Speech Recognition | Whisper integration details, audio pipeline, VAD configuration |
| LLM Processing | llama.cpp integration, prompt engineering, token generation |
| Model Management | Model download, storage, versioning, user-facing model picker |
| Tech Stack | Full dependency list, version requirements, build configuration |

This document should be updated whenever new Apple Silicon generations are released, when whisper.cpp or llama.cpp make significant performance changes, or when new quantization formats become available.