Performance Optimization
Last Updated: 2026-02-13
VaulType runs two ML models simultaneously — whisper.cpp for speech recognition and llama.cpp for text refinement — entirely on-device with zero network dependency. This document defines every optimization strategy, tuning parameter, and monitoring technique required to deliver real-time performance across all supported Apple hardware.
Table of Contents
- Performance Philosophy
- Apple Silicon Metal Acceleration
- Model Quantization
- Memory Management for Dual-Model Operation
- Audio Buffer Optimization
- Lazy Model Loading Strategies
- Battery-Aware Performance Throttling
- Thermal Management
- Benchmarking Methodology and Tools
- Pipeline Optimization
- Related Documentation
Performance Philosophy
VaulType’s performance strategy rests on three pillars:
| Pillar | Meaning |
|---|---|
| Latency | The user must perceive transcription and refinement as near-instantaneous — under 500 ms total for typical utterances |
| Efficiency | Dual-model inference must coexist with normal system operation; VaulType should never make the Mac feel sluggish |
| Adaptability | Performance tuning reacts to hardware capability, power source, thermal state, and memory pressure in real time |
💡 Design Principle: VaulType always degrades gracefully. When resources are constrained, it reduces quality (smaller models, fewer GPU layers) rather than increasing latency or dropping audio.
Apple Silicon Metal Acceleration
Apple Silicon’s unified memory architecture is the foundation of VaulType’s performance story. Both whisper.cpp and llama.cpp support Metal acceleration, which offloads matrix multiplications and attention computations to the GPU cores.
Metal Performance Shaders Overview
Metal Performance Shaders (MPS) provide optimized GPU kernels for common ML operations. Both whisper.cpp and llama.cpp use Metal compute shaders for:
- Matrix multiplication (GEMM/GEMV) — the dominant operation in transformer inference
- Softmax — attention score normalization
- Layer normalization — pre/post-attention normalization
- Element-wise operations — GELU, SiLU activations
On Apple Silicon, these operations run on the GPU cores while the CPU handles orchestration, memory management, and non-parallelizable work.
Conceptually, the SoC divides the work like this:

- CPU cores (E + P): orchestration, audio I/O, token decoding
- GPU cores (Metal): GEMM/GEMV, attention, layer normalization
- Neural Engine (ANE): not used by whisper.cpp or llama.cpp
- Unified memory (LPDDR): model weights, KV cache, audio buffers, and compute buffers, all shared with zero copies

🍎 Apple Silicon Advantage: Unlike discrete GPU systems, there is no PCIe bus transfer between CPU and GPU memory. Model weights loaded once are accessible to both CPU and GPU compute — zero-copy.
GPU Layer Allocation
Both whisper.cpp and llama.cpp allow specifying how many transformer layers run on the GPU versus the CPU. More GPU layers mean faster inference but higher GPU memory pressure.
whisper.cpp GPU layers:
| Whisper Model | Total Layers | Recommended GPU Layers | Notes |
|---|---|---|---|
| tiny | 4 | 4 (all) | Fits entirely on GPU for all hardware |
| base | 6 | 6 (all) | Fits entirely on GPU for all hardware |
| small | 12 | 12 (all) | Fits entirely on GPU for 16 GB+ |
| medium | 24 | 24 (all) | Fits entirely on GPU for 16 GB+ |
| large-v3 | 32 | 32 (all) | Requires 16 GB+; consider 24 on 8 GB |
llama.cpp GPU layers (for ~7B parameter LLM):
| Quantization | Total Layers | 8 GB RAM | 16 GB RAM | 24 GB+ RAM |
|---|---|---|---|---|
| Q4_0 | 32 | 20-24 | 32 (all) | 32 (all) |
| Q4_K_M | 32 | 18-22 | 32 (all) | 32 (all) |
| Q5_K_M | 32 | 14-18 | 32 (all) | 32 (all) |
| Q8_0 | 32 | 8-12 | 28-32 | 32 (all) |
| F16 | 32 | N/A | 16-20 | 32 (all) |
⚠️ Warning: These layer counts assume dual-model operation (Whisper + LLM loaded simultaneously). If only one model is active, more GPU layers can be allocated.
Optimal GPU/CPU Split
The optimal split depends on available memory after accounting for system overhead and the other model. The general rule:
Available GPU Memory = Total RAM - System Overhead - Other Model - KV Cache Reserve
System overhead breaks down roughly as:

- macOS base: ~3-4 GB
- VaulType app overhead: ~200 MB
- Audio pipeline: ~50 MB

Example (16 GB M2 MacBook Air): 16 GB - 4 GB (system) - 1.5 GB (Whisper medium) - 0.2 GB (app) = ~10.3 GB for the LLM. A Q4_K_M 7B model needs ~4.1 GB → all 32 layers fit on GPU.

💡 Tip: Always leave at least 2 GB of headroom beyond calculated needs. macOS memory compression helps, but sustained pressure causes jank in the entire system.
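To make the bookkeeping concrete, here is a minimal sketch of the same arithmetic. The function name and default values are illustrative, not part of the VaulType API; the production logic lives in MetalGPUConfiguration.optimal(...) shown later in this document.

```swift
/// Illustrative helper mirroring the formula above (not VaulType API).
func memoryAvailableForLLMGB(
    totalRAMGB: Double,
    systemOverheadGB: Double = 4.0,   // macOS base + background apps
    whisperModelGB: Double,
    appOverheadGB: Double = 0.2
) -> Double {
    max(totalRAMGB - systemOverheadGB - whisperModelGB - appOverheadGB, 0)
}

// 16 GB M2 MacBook Air with Whisper medium:
let available = memoryAvailableForLLMGB(totalRAMGB: 16, whisperModelGB: 1.5)  // ≈ 10.3 GB
let fitsWithHeadroom = available - 2.0 >= 4.1   // Q4_K_M 7B (~4.1 GB) → true
```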
Chip Generation Differences
Section titled “Chip Generation Differences”| Feature | M1 | M2 | M3 | M4 |
|---|---|---|---|---|
| GPU Cores (base) | 7-8 | 8-10 | 8-10 | 10 |
| GPU Cores (Pro) | 14-16 | 16-19 | 14-18 | 16-20 |
| GPU Cores (Max) | 24-32 | 30-38 | 30-40 | 32-40 |
| Memory Bandwidth | 68.25 GB/s | 100 GB/s | 100 GB/s | 120 GB/s |
| Memory Bandwidth (Pro) | 200 GB/s | 200 GB/s | 150-200 GB/s | 273 GB/s |
| Max Unified Memory (base) | 16 GB | 24 GB | 24 GB | 32 GB |
| Max Unified Memory (Max/Ultra) | 64/128 GB | 96/192 GB | 128/192 GB | 128/256 GB |
| Metal Feature Set | Metal 3 | Metal 3 | Metal 3+ | Metal 3+ |
| Dynamic Caching | No | No | Yes | Yes |
| Mesh Shading | No | No | Yes | Yes |
| Hardware Ray Tracing | No | No | Yes | Yes |
| Relative Perf (base, 7B Q4) | 1.0x | 1.3x | 1.4x | 1.6x |
Key observations for VaulType:
- M3/M4 Dynamic Caching reduces GPU memory waste from shader register allocation, leaving more memory for model weights
- Memory bandwidth is the primary bottleneck for LLM inference (memory-bound workload). M4 Pro’s 273 GB/s provides the biggest leap
- M1 base (8 GB) is the minimum viable target — use Whisper small + Q4_0 3B LLM
ℹ️ Intel Support: VaulType supports Intel Macs but without Metal acceleration. CPU-only inference uses Accelerate.framework (BLAS). Expect 3-5x slower inference compared to equivalent-era Apple Silicon.
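A rough sanity check on the bandwidth point: during token generation the full set of weights is streamed from memory for (roughly) every token, so memory bandwidth divided by model size gives an approximate ceiling on decode speed. The helper below is illustrative, not VaulType API.

```swift
/// Approximate upper bound on decode speed for a memory-bandwidth-bound LLM:
/// each generated token reads roughly every weight once.
func decodeSpeedCeiling(memoryBandwidthGBps: Double, modelSizeGB: Double) -> Double {
    memoryBandwidthGBps / modelSizeGB
}

// M2 base (100 GB/s) with a ~4.1 GB Q4_K_M 7B model:
let ceiling = decodeSpeedCeiling(memoryBandwidthGBps: 100, modelSizeGB: 4.1)  // ≈ 24 t/s
// The ~22 t/s measured in the quantization benchmarks below sits just under this ceiling.
```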
Unified Memory Advantages
Compared with a traditional discrete-GPU system:

| | Traditional discrete GPU | Apple Silicon unified memory |
|---|---|---|
| Model copies in memory | 2 (CPU RAM + GPU VRAM) | 1 (shared unified pool) |
| CPU ↔ GPU transfer | PCIe copy, milliseconds per layer | None; CPU and GPU read the same pages |
| Total memory used | 2x model size | 1x model size |
| Switching cost | Slow copies | Zero-copy, zero-latency switching |

This is why Apple Silicon can run larger models than discrete GPUs with the same nominal memory: no duplication, no transfer overhead.
Metal Configuration in Code
The following shows how VaulType configures GPU layer counts for both inference engines:
import Foundation
// MARK: - Metal GPU Configuration
/// Configuration for Metal GPU layer allocation across both ML models./// Determines how many transformer layers run on GPU vs CPU.struct MetalGPUConfiguration { let whisperGPULayers: Int32 let llmGPULayers: Int32 let useMetalAcceleration: Bool
/// Total system RAM in bytes static var totalSystemMemory: UInt64 { ProcessInfo.processInfo.physicalMemory }
/// Total system RAM in gigabytes static var totalSystemMemoryGB: Double { Double(totalSystemMemory) / (1024 * 1024 * 1024) }
/// Estimated available memory after system overhead (in GB) static var estimatedAvailableMemoryGB: Double { let systemOverhead: Double = 4.0 // macOS + background apps let appOverhead: Double = 0.25 // VaulType base footprint return max(totalSystemMemoryGB - systemOverhead - appOverhead, 1.0) }
/// Determines if Metal acceleration is available on this system static var isMetalAvailable: Bool { #if arch(arm64) return true #else // Intel Macs: check for Metal-capable GPU guard let device = MTLCreateSystemDefaultDevice() else { return false } return device.supportsFamily(.apple1) #endif }
/// Creates an optimal configuration for the current hardware /// - Parameters: /// - whisperModelSize: Size category of the Whisper model /// - llmQuantization: Quantization format of the LLM /// - llmParameterCount: Approximate parameter count in billions /// - Returns: Optimal Metal GPU configuration static func optimal( whisperModelSize: WhisperModelSize, llmQuantization: LLMQuantization, llmParameterCount: Double ) -> MetalGPUConfiguration { guard isMetalAvailable else { return MetalGPUConfiguration( whisperGPULayers: 0, llmGPULayers: 0, useMetalAcceleration: false ) }
let availableGB = estimatedAvailableMemoryGB let whisperMemoryGB = whisperModelSize.estimatedMemoryGB let remainingForLLM = availableGB - whisperMemoryGB - 2.0 // 2 GB headroom
// Whisper: always put all layers on GPU if memory allows let whisperLayers: Int32 = if availableGB >= whisperMemoryGB + 1.0 { whisperModelSize.totalLayers } else { Int32(Double(whisperModelSize.totalLayers) * 0.5) }
// LLM: calculate how many layers fit in remaining memory let llmTotalLayers: Int32 = 32 // typical for 7B models let memoryPerLayer = llmQuantization.estimatedMemoryPerLayerGB( parameterCount: llmParameterCount ) let maxLLMLayers = Int32(remainingForLLM / memoryPerLayer) let llmLayers = min(max(maxLLMLayers, 0), llmTotalLayers)
return MetalGPUConfiguration( whisperGPULayers: whisperLayers, llmGPULayers: llmLayers, useMetalAcceleration: true ) }}
// MARK: - Supporting Types
enum WhisperModelSize: String, CaseIterable { case tiny, base, small, medium, large
var totalLayers: Int32 { switch self { case .tiny: return 4 case .base: return 6 case .small: return 12 case .medium: return 24 case .large: return 32 } }
var estimatedMemoryGB: Double { switch self { case .tiny: return 0.08 case .base: return 0.15 case .small: return 0.50 case .medium: return 1.50 case .large: return 3.00 } }}
enum LLMQuantization: String, CaseIterable { case q4_0 = "Q4_0" case q4_k_m = "Q4_K_M" case q5_k_m = "Q5_K_M" case q8_0 = "Q8_0" case f16 = "F16"
/// Bits per parameter for this quantization format var bitsPerParameter: Double { switch self { case .q4_0: return 4.5 // 4-bit with some overhead case .q4_k_m: return 4.8 // k-quant mixed precision case .q5_k_m: return 5.5 // k-quant mixed precision case .q8_0: return 8.5 // 8-bit with scale factors case .f16: return 16.0 // half precision } }
/// Estimated memory per transformer layer in GB for a given parameter count func estimatedMemoryPerLayerGB(parameterCount: Double) -> Double { let totalBits = parameterCount * 1_000_000_000 * bitsPerParameter let totalBytes = totalBits / 8.0 let totalGB = totalBytes / (1024 * 1024 * 1024) return totalGB / 32.0 // assume 32 layers }}✅ Best Practice: Always call
MetalGPUConfiguration.optimal(...) at launch and again whenever power source or thermal state changes. Store the configuration in PerformanceManager and propagate it to both inference engines.
Model Quantization
Quantization reduces model size and inference time by representing weights with fewer bits. The tradeoff is always quality vs speed vs memory.
LLM Quantization Formats
Section titled “LLM Quantization Formats”| Format | Bits/Weight | Description | Quality Impact |
|---|---|---|---|
| F16 | 16 | Half-precision float. Baseline quality. | None (reference) |
| Q8_0 | 8.5 | 8-bit with block scaling. Near-lossless. | Negligible (<1% perplexity increase) |
| Q5_K_M | 5.5 | 5-bit k-quant, mixed precision attention layers. | Minor (1-2% perplexity increase) |
| Q4_K_M | 4.8 | 4-bit k-quant, mixed precision. Best quality/size ratio. | Moderate (2-4% perplexity increase) |
| Q4_0 | 4.5 | Basic 4-bit. Fast but lower quality. | Notable (4-6% perplexity increase) |
💡 VaulType Default: Q4_K_M is the default for LLM text refinement. For VaulType’s use case (grammar correction, punctuation, formatting), the quality difference between Q4_K_M and F16 is imperceptible in practice.
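As a quick size check, model memory is roughly parameters × bits-per-weight ÷ 8: a 7B model at Q4_K_M works out to about 7 × 10⁹ × 4.8 ÷ 8 ≈ 4.2 GB, which lines up with the ~4.1 GB planning figure used for Q4_K_M throughout this document.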
Whisper Model Sizes
Section titled “Whisper Model Sizes”| Model | Parameters | Disk Size | Memory (loaded) | English WER | Multilingual WER |
|---|---|---|---|---|---|
| tiny | 39 M | 75 MB | ~80 MB | 7.7% | 12.0% |
| base | 74 M | 142 MB | ~150 MB | 5.8% | 9.8% |
| small | 244 M | 466 MB | ~500 MB | 4.2% | 7.6% |
| medium | 769 M | 1.5 GB | ~1.5 GB | 3.5% | 6.5% |
| large-v3 | 1550 M | 3.1 GB | ~3.0 GB | 2.9% | 5.2% |
ℹ️ WER = Word Error Rate: Lower is better. These are approximate values on LibriSpeech/Common Voice benchmarks. Real-world performance varies with accent, background noise, and microphone quality.
Quality vs Speed vs Memory Tradeoffs
7B LLM — Text Refinement Speed (tokens/sec on Apple Silicon base chips):
| Quantization | M1 (8 GB) | M2 (8 GB) | M3 (8 GB) | M4 (16 GB) | Memory |
|---|---|---|---|---|---|
| Q4_0 | 18 t/s | 24 t/s | 26 t/s | 32 t/s | 3.8 GB |
| Q4_K_M | 16 t/s | 22 t/s | 24 t/s | 30 t/s | 4.1 GB |
| Q5_K_M | 13 t/s | 18 t/s | 20 t/s | 26 t/s | 4.8 GB |
| Q8_0 | 8 t/s | 12 t/s | 14 t/s | 20 t/s | 7.2 GB |
| F16 | N/A | N/A | N/A | 12 t/s | 13.5 GB |
Whisper — Real-Time Factor (RTF, lower is faster):
| Model | M1 | M2 | M3 | M4 | Memory |
|---|---|---|---|---|---|
| tiny | 0.05 | 0.04 | 0.03 | 0.02 | 80 MB |
| base | 0.08 | 0.06 | 0.05 | 0.04 | 150 MB |
| small | 0.15 | 0.11 | 0.09 | 0.07 | 500 MB |
| medium | 0.35 | 0.25 | 0.20 | 0.15 | 1.5 GB |
| large-v3 | 0.70 | 0.50 | 0.40 | 0.30 | 3.0 GB |
ℹ️ Real-Time Factor (RTF): An RTF of 0.25 means 1 second of audio is processed in 0.25 seconds. Values below 1.0 are faster than real time.
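In practice the RTF translates directly into perceived transcription delay, as in this small illustrative helper (not VaulType API):

```swift
/// Expected Whisper processing time for a clip of a given length.
func estimatedTranscriptionSeconds(audioSeconds: Double, rtf: Double) -> Double {
    audioSeconds * rtf
}

// A 4-second utterance with Whisper medium on an M2 (RTF ≈ 0.25):
let delay = estimatedTranscriptionSeconds(audioSeconds: 4, rtf: 0.25)  // ≈ 1.0 s
```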
Quantization Selection Strategy
Section titled “Quantization Selection Strategy”/// Selects the best quantization level based on available memory and power statestruct QuantizationSelector { static func recommendedLLMQuantization( availableMemoryGB: Double, isOnBattery: Bool, thermalState: ProcessInfo.ThermalState ) -> LLMQuantization { // Under thermal pressure or battery: prefer smaller models if thermalState == .critical || thermalState == .serious { return .q4_0 }
if isOnBattery { // On battery, favor speed over quality if availableMemoryGB >= 5.0 { return .q4_k_m } else { return .q4_0 } }
// On power: use highest quality that fits switch availableMemoryGB { case 14.0...: return .f16 case 8.0...: return .q8_0 case 5.5...: return .q5_k_m case 4.5...: return .q4_k_m default: return .q4_0 } }
static func recommendedWhisperModel( availableMemoryGB: Double, isOnBattery: Bool, thermalState: ProcessInfo.ThermalState ) -> WhisperModelSize { if thermalState == .critical { return .tiny } if thermalState == .serious { return .base }
if isOnBattery { if availableMemoryGB >= 2.0 { return .small } else { return .base } }
// On power switch availableMemoryGB { case 5.0...: return .large case 3.0...: return .medium case 1.0...: return .small case 0.5...: return .base default: return .tiny } }}Memory Management for Dual-Model Operation
Running Whisper and an LLM simultaneously is VaulType’s most demanding resource requirement. This section covers how to manage memory for both models.
Memory Layout Architecture
A worked memory layout for a 16 GB machine running the Quality configuration (Whisper medium + Q4_K_M 7B LLM):

| Component | Approximate Memory |
|---|---|
| macOS + system services | ~3.5 GB |
| App binary + runtime | ~50 MB |
| SwiftUI view hierarchy | ~30 MB |
| Audio pipeline buffers | ~20 MB |
| Whisper model (medium): encoder weights, decoder weights, mel filterbank + vocab | ~1.5 GB |
| Whisper compute buffers: mel spectrogram, encoder output, decoder KV cache | ~200 MB |
| LLM model (Q4_K_M 7B): embedding layer, 32 transformer layers, output head | ~4.1 GB |
| LLM KV cache (context-dependent) | ~500 MB |
| Metal compute buffers | ~200 MB |
| Total VaulType footprint | ~6.6 GB |
| Free / available | ~5.9 GB |

Peak Memory Calculations
Calculate peak memory for any hardware/model combination:
/// Calculates peak memory usage for a given model configurationstruct MemoryCalculator { struct MemoryBudget { let whisperModelMB: Double let whisperComputeMB: Double let llmModelMB: Double let llmKVCacheMB: Double let metalBuffersMB: Double let appOverheadMB: Double
var totalMB: Double { whisperModelMB + whisperComputeMB + llmModelMB + llmKVCacheMB + metalBuffersMB + appOverheadMB }
var totalGB: Double { totalMB / 1024.0 }
/// Whether this configuration fits in the given memory with headroom func fits(inGB availableGB: Double, headroomGB: Double = 2.0) -> Bool { totalGB + headroomGB <= availableGB } }
static func calculateBudget( whisperModel: WhisperModelSize, llmQuantization: LLMQuantization, llmParameterBillions: Double, contextLength: Int = 2048 ) -> MemoryBudget { let whisperModelMB = whisperModel.estimatedMemoryGB * 1024
// Whisper compute buffers: mel spectrogram + encoder output + KV cache let whisperComputeMB: Double = switch whisperModel { case .tiny: 50.0 case .base: 80.0 case .small: 120.0 case .medium: 200.0 case .large: 350.0 }
// LLM model size let llmBits = llmParameterBillions * 1e9 * llmQuantization.bitsPerParameter let llmModelMB = llmBits / 8.0 / (1024 * 1024)
// LLM KV cache: 2 (K+V) * layers * heads * head_dim * context * 2 bytes(fp16) // Simplified: roughly 1 MB per 100 context tokens per billion parameters at fp16 let llmKVCacheMB = llmParameterBillions * Double(contextLength) / 100.0
// Metal compute buffers (scratch space for GPU operations) let metalBuffersMB: Double = 200.0
let appOverheadMB: Double = 100.0
return MemoryBudget( whisperModelMB: whisperModelMB, whisperComputeMB: whisperComputeMB, llmModelMB: llmModelMB, llmKVCacheMB: llmKVCacheMB, metalBuffersMB: metalBuffersMB, appOverheadMB: appOverheadMB ) }}Reference budgets for common configurations:
| Configuration | Whisper | LLM | Peak Memory | Min RAM |
|---|---|---|---|---|
| Minimal | tiny (80 MB) | Q4_0 3B (1.7 GB) | ~2.3 GB | 8 GB |
| Balanced | small (500 MB) | Q4_K_M 7B (4.1 GB) | ~5.2 GB | 8 GB |
| Quality | medium (1.5 GB) | Q4_K_M 7B (4.1 GB) | ~6.6 GB | 16 GB |
| Maximum | large-v3 (3.0 GB) | Q5_K_M 7B (4.8 GB) | ~8.8 GB | 16 GB |
| Ultra | large-v3 (3.0 GB) | Q8_0 13B (13.5 GB) | ~17.5 GB | 32 GB |
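Assuming the MemoryCalculator types above compile as shown, checking the Quality row against a 16 GB machine looks like this (the printed numbers are the calculator's own rough estimates, not measurements):

```swift
// "Quality" configuration: Whisper medium + Q4_K_M 7B, 2048-token context
let budget = MemoryCalculator.calculateBudget(
    whisperModel: .medium,
    llmQuantization: .q4_k_m,
    llmParameterBillions: 7.0,
    contextLength: 2048
)

print(String(format: "Peak: %.1f GB", budget.totalGB))   // ~6 GB by this estimate
print(budget.fits(inGB: 16.0 - 4.0))                     // true: fits in ~12 GB with 2 GB headroom
```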
Memory Mapping Strategies
Both whisper.cpp and llama.cpp support mmap for loading model files. This is critical for VaulType:
/// Memory mapping strategy for model loadingenum ModelMappingStrategy { /// mmap the model file. Pages are loaded on demand by the kernel. /// Pro: Fast initial load, memory is shared with page cache /// Con: First inference may stall as pages fault in case memoryMapped
/// Read entire model into allocated memory. /// Pro: Predictable performance after load completes /// Con: Slower initial load, higher peak memory case fullyLoaded
/// mmap + mlock to prevent paging /// Pro: Fast load + guaranteed in-memory after first pass /// Con: Requires elevated memory; system may refuse mlock case memoryMappedLocked
var llmLoadFlag: Bool { switch self { case .memoryMapped: return true // use_mmap = true case .fullyLoaded: return false // use_mmap = false case .memoryMappedLocked: return true // use_mmap = true, use_mlock = true } }
/// Recommended strategy for current conditions static func recommended( availableMemoryGB: Double, modelSizeGB: Double, isFirstLaunch: Bool ) -> ModelMappingStrategy { // If plenty of memory, use mmap + mlock for best performance if availableMemoryGB > modelSizeGB * 2.5 { return .memoryMappedLocked }
// If memory is adequate, plain mmap is fine if availableMemoryGB > modelSizeGB * 1.5 { return .memoryMapped }
// Tight on memory: fully loaded gives more predictable behavior // (system can reclaim mmap pages under pressure, causing stalls) return .fullyLoaded }}⚠️ mmap Pitfall: Under memory pressure, macOS can evict mmap’d pages. The next access then triggers a page fault and disk read, causing inference stalls. For real-time transcription, prefer
fullyLoaded or memoryMappedLocked when memory allows.
Model Swapping
When memory is insufficient for both models simultaneously, VaulType can swap models:
/// Manages loading and unloading of models to fit within memory constraintsactor ModelSwapManager { enum ActiveModel { case whisperOnly case llmOnly case both case none }
private(set) var activeState: ActiveModel = .none private var whisperContext: OpaquePointer? // whisper_context* private var llamaModel: OpaquePointer? // llama_model*
/// Transition to a new active model state, unloading as needed func transition(to target: ActiveModel) async throws { guard target != activeState else { return }
switch (activeState, target) { case (_, .none): unloadWhisper() unloadLLM()
case (.none, .whisperOnly), (.llmOnly, .whisperOnly): unloadLLM() try await loadWhisper()
case (.none, .llmOnly), (.whisperOnly, .llmOnly): unloadWhisper() try await loadLLM()
case (_, .both): if whisperContext == nil { try await loadWhisper() } if llamaModel == nil { try await loadLLM() }
default: break }
activeState = target }
private func loadWhisper() async throws { // Implementation calls whisper_init_from_file_with_params() // with Metal-enabled parameters }
private func loadLLM() async throws { // Implementation calls llama_load_model_from_file() // with Metal GPU layer configuration }
private func unloadWhisper() { guard let ctx = whisperContext else { return } // whisper_free(ctx) whisperContext = nil }
private func unloadLLM() { guard let model = llamaModel else { return } // llama_free_model(model) llamaModel = nil }}When to Unload Models
Section titled “When to Unload Models”| Condition | Action | Rationale |
|---|---|---|
| Memory pressure warning (.warning) | Unload LLM if idle > 30s | LLM is larger and less immediately needed |
| Memory pressure critical (.critical) | Unload both models | System stability takes priority |
| App enters background | Unload LLM after 60s | Background apps should minimize footprint |
| No transcription for 5 min | Unload Whisper | Can reload in ~1-2 seconds when needed |
| No LLM use for 5 min | Unload LLM | Can reload in ~2-4 seconds when needed |
| Thermal state .critical | Unload LLM | Reduce thermal generation |
| Battery below 10% | Unload LLM, use Whisper only | Preserve remaining battery |
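The table reads as a priority-ordered decision list. A compact sketch follows; the UnloadAction enum and parameter names are illustrative, not VaulType API (the real logic is split across ModelPreloadingService and ThermalCoordinator later in this document), and the background-app row is omitted for brevity:

```swift
/// Illustrative condensation of the unload policy table above.
enum UnloadAction { case none, unloadLLM, unloadWhisper, unloadBoth }

func unloadAction(
    memoryPressureCritical: Bool,
    memoryPressureWarning: Bool,
    thermalCritical: Bool,
    batteryLevel: Double?,          // 0.0-1.0, nil on AC power
    llmIdleSeconds: Double,
    whisperIdleSeconds: Double
) -> UnloadAction {
    if memoryPressureCritical { return .unloadBoth }
    if thermalCritical { return .unloadLLM }
    if let level = batteryLevel, level < 0.10 { return .unloadLLM }
    if memoryPressureWarning, llmIdleSeconds > 30 { return .unloadLLM }
    if llmIdleSeconds > 300 { return .unloadLLM }          // 5 minutes of LLM inactivity
    if whisperIdleSeconds > 300 { return .unloadWhisper }  // 5 minutes without transcription
    return .none
}
```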
Memory Pressure Monitoring
Section titled “Memory Pressure Monitoring”import Foundation
/// Monitors system memory pressure and triggers adaptive responsesfinal class MemoryPressureMonitor: @unchecked Sendable { static let shared = MemoryPressureMonitor()
enum PressureLevel: Int, Comparable { case nominal = 0 case warning = 1 case critical = 2
static func < (lhs: PressureLevel, rhs: PressureLevel) -> Bool { lhs.rawValue < rhs.rawValue } }
private let source: DispatchSourceMemoryPressure private var currentLevel: PressureLevel = .nominal private var observers: [(PressureLevel) -> Void] = []
private init() { source = DispatchSource.makeMemoryPressureSource( eventMask: [.warning, .critical, .normal], queue: .global(qos: .utility) )
source.setEventHandler { [weak self] in guard let self else { return } let event = self.source.data
let newLevel: PressureLevel if event.contains(.critical) { newLevel = .critical } else if event.contains(.warning) { newLevel = .warning } else { newLevel = .nominal }
if newLevel != self.currentLevel { self.currentLevel = newLevel self.notifyObservers(newLevel) } }
source.resume() }
deinit { source.cancel() }
/// Register a callback for memory pressure changes func observe(_ handler: @escaping (PressureLevel) -> Void) { observers.append(handler) }
/// Current memory usage of this process in bytes static var currentProcessMemory: UInt64 { var info = mach_task_basic_info() var count = mach_msg_type_number_t( MemoryLayout<mach_task_basic_info>.size / MemoryLayout<natural_t>.size ) let result = withUnsafeMutablePointer(to: &info) { infoPtr in infoPtr.withMemoryRebound( to: integer_t.self, capacity: Int(count) ) { rawPtr in task_info( mach_task_self_, task_flavor_t(MACH_TASK_BASIC_INFO), rawPtr, &count ) } } return result == KERN_SUCCESS ? info.resident_size : 0 }
/// Current process memory in megabytes static var currentProcessMemoryMB: Double { Double(currentProcessMemory) / (1024 * 1024) }
/// Available system memory in bytes (approximate) static var availableSystemMemory: UInt64 { var stats = vm_statistics64() var count = mach_msg_type_number_t( MemoryLayout<vm_statistics64>.size / MemoryLayout<integer_t>.size ) let result = withUnsafeMutablePointer(to: &stats) { statsPtr in statsPtr.withMemoryRebound( to: integer_t.self, capacity: Int(count) ) { rawPtr in host_statistics64( mach_host_self(), HOST_VM_INFO64, rawPtr, &count ) } }
guard result == KERN_SUCCESS else { return 0 }
let pageSize = UInt64(vm_kernel_page_size) let free = UInt64(stats.free_count) * pageSize let inactive = UInt64(stats.inactive_count) * pageSize // macOS "available" ≈ free + inactive (purgeable/compressor pages can be reclaimed) return free + inactive }
private func notifyObservers(_ level: PressureLevel) { for observer in observers { observer(level) } }}✅ Integration: The
PerformanceManager subscribes to MemoryPressureMonitor and calls ModelSwapManager.transition(to:) when pressure levels change. See Architecture for the full dependency graph.
Audio Buffer Optimization
The audio pipeline is the first stage of VaulType’s transcription flow. Buffer management here directly affects both latency and reliability.
Buffer Size Selection
Section titled “Buffer Size Selection”| Buffer Size (frames) | Duration @ 16 kHz | Latency | CPU Overhead | Use Case |
|---|---|---|---|---|
| 256 | 16 ms | Very low | Very high | Not recommended |
| 512 | 32 ms | Low | High | Real-time monitoring |
| 1024 | 64 ms | Medium | Medium | Default for VaulType |
| 2048 | 128 ms | Higher | Low | Battery-saving mode |
| 4096 | 256 ms | High | Very low | Background processing |
VaulType uses 1024 frames as the default buffer size. This provides 64 ms latency at 16 kHz — fast enough that users perceive no delay between speaking and seeing the waveform indicator, while keeping CPU wake-ups reasonable.
/// Audio buffer size configuration
enum AudioBufferConfig {
    case lowLatency     // 512 frames
    case balanced       // 1024 frames (default)
    case batterySaving  // 2048 frames

    var frameCount: AVAudioFrameCount {
        switch self {
        case .lowLatency: return 512
        case .balanced: return 1024
        case .batterySaving: return 2048
        }
    }

    /// Recommended config based on current power and thermal state
    static func recommended(
        isOnBattery: Bool,
        thermalState: ProcessInfo.ThermalState
    ) -> AudioBufferConfig {
        if thermalState >= .serious || isOnBattery {
            return .batterySaving
        }
        return .balanced
    }
}

Sample Rate Conversion
VaulType captures audio at the system’s native sample rate (typically 48 kHz) and converts to 16 kHz for Whisper. The conversion pipeline:
AVAudioEngine input node (48 kHz) → format converter (48 kHz → 16 kHz, stereo → mono, Float32) → ring buffer (16 kHz mono) that Whisper reads from.

import AVFoundation
/// Configures the audio format conversion pipelinestruct AudioFormatConverter { /// Target format for Whisper: 16 kHz, mono, Float32 static let whisperFormat = AVAudioFormat( commonFormat: .pcmFormatFloat32, sampleRate: 16000.0, channels: 1, interleaved: false )!
/// Creates an AVAudioConverter for the given input format static func makeConverter( from inputFormat: AVAudioFormat ) -> AVAudioConverter? { guard let converter = AVAudioConverter( from: inputFormat, to: whisperFormat ) else { return nil }
// Use highest quality SRC for accuracy // SampleRateConverterComplexity: // .linear — fastest, lowest quality // .normal — balanced (VaulType default on battery) // .mastering — highest quality (VaulType default on power) converter.sampleRateConverterQuality = .max
return converter }}💡 Performance Note: Sample rate conversion quality has minimal impact on transcription accuracy for speech.
.normal quality is sufficient and uses significantly less CPU than .max. Reserve .max for when connected to power.
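A sketch of that power-aware policy, assuming the converter built by AudioFormatConverter.makeConverter(from:) above; the function name is illustrative, not VaulType API:

```swift
import AVFoundation

/// Illustrative: relax sample-rate-conversion cost on battery, per the note above.
func tuneSampleRateConversion(_ converter: AVAudioConverter, isOnBattery: Bool) {
    converter.sampleRateConverterAlgorithm = isOnBattery
        ? AVSampleRateConverterAlgorithm_Normal      // cheaper; fine for speech
        : AVSampleRateConverterAlgorithm_Mastering   // highest quality on AC power
    converter.sampleRateConverterQuality = isOnBattery
        ? AVAudioQuality.medium.rawValue
        : AVAudioQuality.max.rawValue
}
```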
Ring Buffer Implementation
A lock-free ring buffer connects the audio capture thread (real-time priority) to the Whisper inference thread. This avoids priority inversion from locks.
import Atomics
/// Lock-free single-producer, single-consumer ring buffer for audio samplesfinal class AudioRingBuffer: @unchecked Sendable { private let buffer: UnsafeMutableBufferPointer<Float> private let capacity: Int private let writeIndex = ManagedAtomic<Int>(0) private let readIndex = ManagedAtomic<Int>(0)
/// Creates a ring buffer with the given capacity in samples /// - Parameter capacity: Number of Float samples. Should be power of 2. init(capacity: Int) { precondition(capacity > 0 && capacity & (capacity - 1) == 0, "Capacity must be a power of 2") self.capacity = capacity let ptr = UnsafeMutablePointer<Float>.allocate(capacity: capacity) ptr.initialize(repeating: 0.0, count: capacity) self.buffer = UnsafeMutableBufferPointer(start: ptr, count: capacity) }
deinit { buffer.baseAddress?.deinitialize(count: capacity) buffer.baseAddress?.deallocate() }
/// Number of samples available to read var availableSamples: Int { let write = writeIndex.load(ordering: .acquiring) let read = readIndex.load(ordering: .acquiring) return (write - read + capacity) & (capacity - 1) }
/// Number of free slots available for writing var freeSlots: Int { capacity - 1 - availableSamples }
/// Write samples from the audio capture callback (real-time safe) /// - Returns: Number of samples actually written @discardableResult func write(_ samples: UnsafeBufferPointer<Float>) -> Int { let count = min(samples.count, freeSlots) guard count > 0 else { return 0 }
let writePos = writeIndex.load(ordering: .relaxed)
for i in 0..<count { buffer[(writePos + i) & (capacity - 1)] = samples[i] }
writeIndex.store( (writePos + count) & (capacity - 1), ordering: .releasing ) return count }
/// Read samples for Whisper processing (non-real-time thread) /// - Parameter into: Destination buffer /// - Returns: Number of samples actually read @discardableResult func read(into destination: UnsafeMutableBufferPointer<Float>) -> Int { let count = min(destination.count, availableSamples) guard count > 0 else { return 0 }
let readPos = readIndex.load(ordering: .relaxed)
for i in 0..<count { destination[i] = buffer[(readPos + i) & (capacity - 1)] }
readIndex.store( (readPos + count) & (capacity - 1), ordering: .releasing ) return count }
/// Peek at samples without advancing the read pointer func peek(count: Int) -> [Float] { let available = min(count, availableSamples) guard available > 0 else { return [] }
let readPos = readIndex.load(ordering: .acquiring) var result = [Float](repeating: 0, count: available)
for i in 0..<available { result[i] = buffer[(readPos + i) & (capacity - 1)] } return result }
/// Discard all buffered samples func reset() { readIndex.store( writeIndex.load(ordering: .acquiring), ordering: .releasing ) }}⚠️ Real-Time Safety: The
write method is called from the audio render callback, which runs on a real-time thread. It must never allocate memory, acquire locks, or call Objective-C methods. The implementation above uses only atomic operations and direct pointer access.
Latency vs Reliability Tradeoffs
| Buffer Size | Latency | Drop Risk | CPU Wake-ups | Battery Impact |
|---|---|---|---|---|
| 256 | 16 ms | High | ~62/s | Poor |
| 512 | 32 ms | Medium | ~31/s | Fair |
| 1024 | 64 ms | Low | ~16/s | Good |
| 2048 | 128 ms | Very low | ~8/s | Great |
| 4096 | 256 ms | None | ~4/s | Best |

Smaller buffers trade reliability and battery life for latency; larger buffers do the opposite. VaulType’s default of 1024 frames provides the best balance. The ring buffer’s capacity should be at least 8x the buffer size (8192 samples = 0.5 seconds) to absorb processing jitter.
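A small sizing helper, assuming the AudioBufferConfig and AudioRingBuffer types defined earlier; the function itself is illustrative:

```swift
/// Round the 8x rule up to the power-of-two capacity AudioRingBuffer requires.
func ringBufferCapacity(for config: AudioBufferConfig, multiplier: Int = 8) -> Int {
    let minimumSamples = Int(config.frameCount) * multiplier
    var capacity = 1
    while capacity < minimumSamples { capacity <<= 1 }
    return capacity
}

// Default 1024-frame buffer → 8192 samples ≈ 0.5 s of 16 kHz audio
let ringBuffer = AudioRingBuffer(capacity: ringBufferCapacity(for: .balanced))
```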
Lazy Model Loading Strategies
Loading ML models is expensive. A 4 GB LLM takes 2-4 seconds to load from SSD. VaulType uses lazy loading to keep launch time under 1 second.
Load on First Use
The simplest strategy: do not load models at launch. Wait until the user triggers their first transcription.
/// Lazy-loading wrapper for ML model contextsactor LazyModelLoader<Context> { enum State { case unloaded case loading(Task<Context, Error>) case loaded(Context) case failed(Error) }
private var state: State = .unloaded private let loadFunction: () async throws -> Context private let unloadFunction: (Context) -> Void
init( load: @escaping () async throws -> Context, unload: @escaping (Context) -> Void ) { self.loadFunction = load self.unloadFunction = unload }
/// Get the model context, loading if necessary func get() async throws -> Context { switch state { case .loaded(let context): return context
case .loading(let task): return try await task.value
case .unloaded, .failed: let task = Task { try await loadFunction() } state = .loading(task) do { let context = try await task.value state = .loaded(context) return context } catch { state = .failed(error) throw error } } }
/// Unload the model to free memory func unload() { if case .loaded(let context) = state { unloadFunction(context) } state = .unloaded }
/// Whether the model is currently loaded and ready var isLoaded: Bool { if case .loaded = state { return true } return false }}Background Preloading
After launch, preload models in the background if conditions allow (on power, not thermal throttled, sufficient memory):
/// Preloads models in the background when conditions are favorablefinal class BackgroundPreloader { private var preloadTask: Task<Void, Never>?
func startPreloadingIfAppropriate( whisperLoader: LazyModelLoader<OpaquePointer>, llmLoader: LazyModelLoader<OpaquePointer>, performanceState: PerformanceState ) { // Don't preload if conditions are unfavorable guard !performanceState.isOnBattery, performanceState.thermalState <= .fair, performanceState.memoryPressure == .nominal else { return }
preloadTask = Task(priority: .background) { // Small delay so launch UI is fully responsive first try? await Task.sleep(for: .seconds(2))
guard !Task.isCancelled else { return }
// Load Whisper first (smaller, more immediately needed) _ = try? await whisperLoader.get()
guard !Task.isCancelled else { return }
// Then LLM _ = try? await llmLoader.get() } }
func cancelPreloading() { preloadTask?.cancel() preloadTask = nil }}Warm-Up Inference
After loading, run a single dummy inference to warm up GPU shaders and fill caches:
/// Runs warm-up inference passes to prime GPU shader cachesstruct ModelWarmup { /// Warm up Whisper with a short silence buffer static func warmUpWhisper(context: OpaquePointer) async { // Create 1 second of silence at 16 kHz let silenceBuffer = [Float](repeating: 0.0, count: 16000)
await Task.detached(priority: .background) { // whisper_full() with the silence buffer // This compiles Metal shaders on first run and caches them silenceBuffer.withUnsafeBufferPointer { ptr in // whisper_full(context, params, ptr.baseAddress, Int32(ptr.count)) _ = ptr // Placeholder: actual whisper_full call } }.value }
/// Warm up LLM with a minimal prompt static func warmUpLLM(model: OpaquePointer) async { await Task.detached(priority: .background) { // Run a single-token generation with a minimal prompt // This warms up: // 1. Metal compute pipeline state objects // 2. GPU shader compilation cache // 3. Memory allocation pools // Actual llama_decode call with a single "hello" token }.value }}ℹ️ Shader Compilation: Metal shaders are compiled just-in-time on first use. The first inference pass is typically 2-5x slower than subsequent passes. Warm-up inference ensures the user never sees this penalty.
Model Priority Queue
When multiple models need loading, prioritize based on likely user action:
/// Priority queue for model loading requestsactor ModelPriorityQueue { struct LoadRequest: Comparable { let modelId: String let priority: Priority let loadAction: () async throws -> Void
enum Priority: Int, Comparable { case critical = 0 // User is waiting right now case high = 1 // User likely to need soon case background = 2 // Speculative preload
static func < (lhs: Priority, rhs: Priority) -> Bool { lhs.rawValue < rhs.rawValue } }
static func < (lhs: LoadRequest, rhs: LoadRequest) -> Bool { lhs.priority < rhs.priority } }
private var queue: [LoadRequest] = [] private var isProcessing = false
func enqueue(_ request: LoadRequest) { queue.append(request) queue.sort() processNext() }
private func processNext() { guard !isProcessing, let request = queue.first else { return } queue.removeFirst() isProcessing = true
Task { do { try await request.loadAction() } catch { // Log error, model will be loaded on next attempt } isProcessing = false processNext() } }}Model Preloading Service
The complete preloading service that ties all loading strategies together:
import Combine
import Foundation
/// Orchestrates model lifecycle: lazy loading, preloading, warm-up, and unloading@MainActorfinal class ModelPreloadingService: ObservableObject { // MARK: - Published State
@Published private(set) var whisperState: ModelState = .unloaded @Published private(set) var llmState: ModelState = .unloaded
enum ModelState: Equatable { case unloaded case loading(progress: Double) case loaded case warmingUp case ready case error(String)
static func == (lhs: ModelState, rhs: ModelState) -> Bool { switch (lhs, rhs) { case (.unloaded, .unloaded), (.loaded, .loaded), (.warmingUp, .warmingUp), (.ready, .ready): return true case (.loading(let a), .loading(let b)): return a == b case (.error(let a), .error(let b)): return a == b default: return false } } }
// MARK: - Dependencies
private let memoryMonitor = MemoryPressureMonitor.shared private let priorityQueue = ModelPriorityQueue() private var cancellables = Set<AnyCancellable>() private var idleTimers: [String: Task<Void, Never>] = [:]
// Configuration private let whisperIdleTimeout: Duration = .seconds(300) // 5 minutes private let llmIdleTimeout: Duration = .seconds(300) // 5 minutes
// MARK: - Initialization
init() { setupMemoryPressureHandling() setupThermalStateHandling() }
// MARK: - Public API
/// Ensure Whisper is loaded and ready for transcription func ensureWhisperReady() async throws { resetIdleTimer(for: "whisper")
guard whisperState != .ready else { return }
whisperState = .loading(progress: 0.0)
// Load model from disk // In actual implementation: whisper_init_from_file_with_params() whisperState = .loading(progress: 0.5)
// Simulated progress for Metal shader compilation whisperState = .loaded
// Warm up whisperState = .warmingUp // await ModelWarmup.warmUpWhisper(context: whisperContext)
whisperState = .ready startIdleTimer(for: "whisper", timeout: whisperIdleTimeout) }
/// Ensure LLM is loaded and ready for text refinement func ensureLLMReady() async throws { resetIdleTimer(for: "llm")
guard llmState != .ready else { return }
llmState = .loading(progress: 0.0)
// Load model llmState = .loading(progress: 0.5) llmState = .loaded
// Warm up llmState = .warmingUp // await ModelWarmup.warmUpLLM(model: llamaModel)
llmState = .ready startIdleTimer(for: "llm", timeout: llmIdleTimeout) }
/// Preload both models in the background if conditions are favorable func preloadIfFavorable( isOnBattery: Bool, thermalState: ProcessInfo.ThermalState ) { guard !isOnBattery, thermalState <= .fair else { return }
Task(priority: .background) { try? await Task.sleep(for: .seconds(2)) try? await ensureWhisperReady() try? await Task.sleep(for: .seconds(1)) try? await ensureLLMReady() } }
/// Unload a specific model func unload(_ model: String) { switch model { case "whisper": // whisper_free(context) whisperState = .unloaded case "llm": // llama_free_model(model) llmState = .unloaded default: break } idleTimers[model]?.cancel() idleTimers[model] = nil }
/// Unload all models func unloadAll() { unload("whisper") unload("llm") }
// MARK: - Private
private func setupMemoryPressureHandling() { memoryMonitor.observe { [weak self] level in Task { @MainActor in guard let self else { return } switch level { case .warning: // Unload LLM if it's idle (Whisper is more critical) if self.llmState == .ready { self.unload("llm") } case .critical: self.unloadAll() case .nominal: break } } } }
private func setupThermalStateHandling() { NotificationCenter.default .publisher(for: ProcessInfo.thermalStateDidChangeNotification) .receive(on: DispatchQueue.main) .sink { [weak self] _ in guard let self else { return } let state = ProcessInfo.processInfo.thermalState if state == .critical { self.unloadAll() } else if state == .serious { self.unload("llm") } } .store(in: &cancellables) }
private func startIdleTimer(for model: String, timeout: Duration) { idleTimers[model]?.cancel() idleTimers[model] = Task { try? await Task.sleep(for: timeout) guard !Task.isCancelled else { return } await MainActor.run { self.unload(model) } } }
private func resetIdleTimer(for model: String) { idleTimers[model]?.cancel() idleTimers[model] = nil }}✅ Architecture Note:
ModelPreloadingService is owned by the PerformanceManager and exposed to SwiftUI views via @EnvironmentObject. See Architecture for the full object graph.
Battery-Aware Performance Throttling
VaulType adapts its performance profile based on whether the Mac is connected to power or running on battery.
Power Source Detection
Section titled “Power Source Detection”import IOKit.ps
/// Monitors power source changes and provides current battery statefinal class PowerSourceMonitor: @unchecked Sendable { static let shared = PowerSourceMonitor()
struct PowerState: Equatable { let isOnBattery: Bool let batteryLevel: Double? // 0.0 - 1.0, nil if no battery let isCharging: Bool let timeRemaining: TimeInterval? // seconds, nil if unknown
/// Whether we should use battery-saving mode var shouldThrottle: Bool { isOnBattery && (batteryLevel ?? 1.0) < 0.5 }
/// Whether we should use minimal mode (critical battery) var isCritical: Bool { isOnBattery && (batteryLevel ?? 1.0) < 0.1 } }
private var observers: [(PowerState) -> Void] = [] private var runLoopSource: CFRunLoopSource?
private(set) var currentState: PowerState
private init() { self.currentState = PowerSourceMonitor.readCurrentState() setupMonitoring() }
/// Read current power source info from IOKit private static func readCurrentState() -> PowerState { guard let snapshot = IOPSCopyPowerSourcesInfo()?.takeRetainedValue(), let sources = IOPSCopyPowerSourcesList(snapshot)?.takeRetainedValue() as? [CFTypeRef] else { // Desktop Mac with no battery return PowerState( isOnBattery: false, batteryLevel: nil, isCharging: false, timeRemaining: nil ) }
for source in sources { guard let info = IOPSGetPowerSourceDescription(snapshot, source)? .takeUnretainedValue() as? [String: Any] else { continue }
let powerSource = info[kIOPSPowerSourceStateKey as String] as? String let isOnBattery = powerSource == kIOPSBatteryPowerValue let currentCapacity = info[kIOPSCurrentCapacityKey as String] as? Int ?? 0 let maxCapacity = info[kIOPSMaxCapacityKey as String] as? Int ?? 100 let isCharging = info[kIOPSIsChargingKey as String] as? Bool ?? false let timeRemaining = info[kIOPSTimeToEmptyKey as String] as? Int
return PowerState( isOnBattery: isOnBattery, batteryLevel: Double(currentCapacity) / Double(maxCapacity), isCharging: isCharging, timeRemaining: timeRemaining.map { TimeInterval($0 * 60) } ) }
return PowerState( isOnBattery: false, batteryLevel: nil, isCharging: false, timeRemaining: nil ) }
private func setupMonitoring() { let context = UnsafeMutableRawPointer( Unmanaged.passUnretained(self).toOpaque() )
runLoopSource = IOPSNotificationCreateRunLoopSource( { context in guard let context else { return } let monitor = Unmanaged<PowerSourceMonitor> .fromOpaque(context) .takeUnretainedValue() let newState = PowerSourceMonitor.readCurrentState() if newState != monitor.currentState { monitor.currentState = newState for observer in monitor.observers { observer(newState) } } }, context )?.takeRetainedValue()
if let source = runLoopSource { CFRunLoopAddSource(CFRunLoopGetMain(), source, .defaultMode) } }
func observe(_ handler: @escaping (PowerState) -> Void) { observers.append(handler) }}Battery-Aware Model Selection
Section titled “Battery-Aware Model Selection”/// Selects optimal model configuration based on power statestruct BatteryAwareModelSelector {
struct ModelSelection { let whisperModel: WhisperModelSize let llmQuantization: LLMQuantization let llmParameterBillions: Double let gpuLayerReduction: Int32 // reduce GPU layers by this amount let preloadingEnabled: Bool let audioBufferConfig: AudioBufferConfig }
/// Select the best model configuration for current conditions static func select( powerState: PowerSourceMonitor.PowerState, thermalState: ProcessInfo.ThermalState, availableMemoryGB: Double, preferredWhisperModel: WhisperModelSize, preferredLLMQuantization: LLMQuantization, preferredLLMParamBillions: Double ) -> ModelSelection {
// On power, no thermal issues: use preferred configuration if !powerState.isOnBattery && thermalState <= .fair { return ModelSelection( whisperModel: preferredWhisperModel, llmQuantization: preferredLLMQuantization, llmParameterBillions: preferredLLMParamBillions, gpuLayerReduction: 0, preloadingEnabled: true, audioBufferConfig: .balanced ) }
// Critical battery: minimal configuration if powerState.isCritical { return ModelSelection( whisperModel: .tiny, llmQuantization: .q4_0, llmParameterBillions: min(preferredLLMParamBillions, 3.0), gpuLayerReduction: 16, preloadingEnabled: false, audioBufferConfig: .batterySaving ) }
// On battery with < 50%: reduced configuration if powerState.shouldThrottle { let whisper: WhisperModelSize = switch preferredWhisperModel { case .large: .medium case .medium: .small default: preferredWhisperModel }
return ModelSelection( whisperModel: whisper, llmQuantization: .q4_0, llmParameterBillions: preferredLLMParamBillions, gpuLayerReduction: 8, preloadingEnabled: false, audioBufferConfig: .batterySaving ) }
// On battery, > 50%: slightly reduced configuration return ModelSelection( whisperModel: preferredWhisperModel, llmQuantization: min(preferredLLMQuantization, .q4_k_m), llmParameterBillions: preferredLLMParamBillions, gpuLayerReduction: 4, preloadingEnabled: false, audioBufferConfig: .balanced ) }}
// Make LLMQuantization Comparable for min() usageextension LLMQuantization: Comparable { static func < (lhs: LLMQuantization, rhs: LLMQuantization) -> Bool { lhs.bitsPerParameter < rhs.bitsPerParameter }}Preloading on Battery
Section titled “Preloading on Battery”| Battery Level | Preloading Policy |
|---|---|
| 100% - 80% (charging or just unplugged) | Allow preloading, use balanced models |
| 80% - 50% | No preloading; load on first use only |
| 50% - 20% | No preloading; reduce to smaller models |
| 20% - 10% | No preloading; reduce GPU layers; unload idle models after 60s |
| Below 10% | Unload LLM entirely; Whisper tiny only; minimal GPU |
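As a sketch, the table condenses to a single policy function over PowerSourceMonitor.PowerState defined above; the PreloadPolicy enum is illustrative, not a VaulType type:

```swift
/// Illustrative mapping of battery level to the preloading policy above.
enum PreloadPolicy {
    case preloadBalanced     // >= 80% or charging
    case loadOnFirstUse      // 80-50%
    case smallerModels       // 50-20%
    case reducedGPULayers    // 20-10%; unload idle models after 60 s
    case whisperTinyOnly     // < 10%; LLM stays unloaded
}

func preloadPolicy(for state: PowerSourceMonitor.PowerState) -> PreloadPolicy {
    guard state.isOnBattery, let level = state.batteryLevel else { return .preloadBalanced }
    if state.isCharging || level >= 0.80 { return .preloadBalanced }
    if level >= 0.50 { return .loadOnFirstUse }
    if level >= 0.20 { return .smallerModels }
    if level >= 0.10 { return .reducedGPULayers }
    return .whisperTinyOnly
}
```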
Power Profile Configuration
Section titled “Power Profile Configuration”/// Power profiles that bundle all performance settingsenum PowerProfile: String, CaseIterable { case maximum // Maximum quality, all features case balanced // Good quality, reasonable power use case efficiency // Reduced quality, extended battery case minimal // Minimum viable, emergency battery
struct Settings { let whisperModel: WhisperModelSize let llmQuantization: LLMQuantization let maxGPULayers: Int32 let preloadingEnabled: Bool let warmupEnabled: Bool let audioBufferConfig: AudioBufferConfig let idleUnloadTimeout: Duration }
var settings: Settings { switch self { case .maximum: return Settings( whisperModel: .large, llmQuantization: .q8_0, maxGPULayers: 32, preloadingEnabled: true, warmupEnabled: true, audioBufferConfig: .lowLatency, idleUnloadTimeout: .seconds(600) ) case .balanced: return Settings( whisperModel: .medium, llmQuantization: .q4_k_m, maxGPULayers: 32, preloadingEnabled: true, warmupEnabled: true, audioBufferConfig: .balanced, idleUnloadTimeout: .seconds(300) ) case .efficiency: return Settings( whisperModel: .small, llmQuantization: .q4_0, maxGPULayers: 20, preloadingEnabled: false, warmupEnabled: false, audioBufferConfig: .batterySaving, idleUnloadTimeout: .seconds(120) ) case .minimal: return Settings( whisperModel: .tiny, llmQuantization: .q4_0, maxGPULayers: 8, preloadingEnabled: false, warmupEnabled: false, audioBufferConfig: .batterySaving, idleUnloadTimeout: .seconds(30) ) } }
/// Automatically select profile based on current conditions static func automatic( powerState: PowerSourceMonitor.PowerState, thermalState: ProcessInfo.ThermalState ) -> PowerProfile { if thermalState >= .critical { return .minimal } if thermalState >= .serious { return .efficiency }
if powerState.isCritical { return .minimal } if powerState.shouldThrottle { return .efficiency } if powerState.isOnBattery { return .balanced }
return .maximum }}🔒 Privacy Note: Power state detection uses IOKit APIs and does not require any special permissions. No power data leaves the device.
Thermal Management
Apple Silicon throttles CPU and GPU frequency under thermal pressure. VaulType proactively adapts before the system forces throttling.
Thermal State Monitoring
macOS provides four thermal states:
| State | Meaning | VaulType Response |
|---|---|---|
| .nominal | Normal operating temperature | Full performance |
| .fair | Slightly elevated temperature | Reduce preloading |
| .serious | High temperature, system may throttle | Reduce GPU layers, smaller models |
| .critical | System is actively throttling | Minimal mode, unload LLM |
As temperature climbs through these states, VaulType progressively sheds work: at .nominal everything is enabled (full GPU layers, preloading, all features); at .fair speculative preloading and warm-up are reduced; at .serious 8 GPU layers move to the CPU and smaller models are selected; at .critical the LLM is unloaded and transcription falls back to Whisper tiny.

Adaptive Throttling
Section titled “Adaptive Throttling”import Combine
/// Monitors thermal state and adapts performance parameters@MainActorfinal class ThermalThrottleManager: ObservableObject { @Published private(set) var currentThermalState: ProcessInfo.ThermalState = .nominal @Published private(set) var gpuLayerReduction: Int32 = 0 @Published private(set) var shouldReduceModelSize: Bool = false @Published private(set) var shouldDisablePreloading: Bool = false
private var cancellable: AnyCancellable?
init() { currentThermalState = ProcessInfo.processInfo.thermalState updateThrottling(for: currentThermalState)
cancellable = NotificationCenter.default .publisher(for: ProcessInfo.thermalStateDidChangeNotification) .receive(on: DispatchQueue.main) .sink { [weak self] _ in guard let self else { return } let newState = ProcessInfo.processInfo.thermalState self.currentThermalState = newState self.updateThrottling(for: newState) } }
private func updateThrottling(for state: ProcessInfo.ThermalState) { switch state { case .nominal: gpuLayerReduction = 0 shouldReduceModelSize = false shouldDisablePreloading = false
case .fair: gpuLayerReduction = 0 shouldReduceModelSize = false shouldDisablePreloading = true // Stop speculative loads
case .serious: gpuLayerReduction = 8 // Move 8 layers from GPU to CPU shouldReduceModelSize = true // Step down model sizes shouldDisablePreloading = true
case .critical: gpuLayerReduction = 16 shouldReduceModelSize = true shouldDisablePreloading = true
@unknown default: gpuLayerReduction = 0 shouldReduceModelSize = false shouldDisablePreloading = false } }}Reducing GPU Layers Under Thermal Pressure
When thermal state escalates, reducing GPU layers shifts computation from GPU to CPU. This reduces heat generation because:
- CPU cores can individually clock down more granularly
- CPU work can be spread across efficiency cores (P/E core scheduling)
- GPU power draw is typically higher per FLOP than CPU for these workloads at reduced clocks
/// Calculates effective GPU layer count under thermal constraintsstruct ThermalAwareGPULayers { /// Calculate effective GPU layers for Whisper static func whisperGPULayers( baseConfig: MetalGPUConfiguration, thermalReduction: Int32 ) -> Int32 { max(baseConfig.whisperGPULayers - thermalReduction, 0) }
/// Calculate effective GPU layers for LLM static func llmGPULayers( baseConfig: MetalGPUConfiguration, thermalReduction: Int32 ) -> Int32 { max(baseConfig.llmGPULayers - thermalReduction, 0) }
/// Estimated performance impact of thermal reduction static func estimatedSlowdown( totalLayers: Int32, gpuLayers: Int32, thermalReduction: Int32 ) -> Double { let originalGPU = Double(gpuLayers) let reducedGPU = Double(max(gpuLayers - thermalReduction, 0)) let cpuLayers = Double(totalLayers) - reducedGPU let originalCPULayers = Double(totalLayers) - originalGPU
// Very rough: GPU layers are ~3x faster than CPU layers let originalTime = originalCPULayers * 3.0 + originalGPU * 1.0 let reducedTime = cpuLayers * 3.0 + reducedGPU * 1.0
return reducedTime / originalTime }}Thermal Throttling in Code
Complete thermal adaptation that integrates with the model pipeline:
import Combine
import Foundation
/// Coordinates thermal management across all VaulType subsystems@MainActorfinal class ThermalCoordinator: ObservableObject { @Published private(set) var effectiveProfile: PowerProfile = .balanced
private let thermalManager = ThermalThrottleManager() private let powerMonitor = PowerSourceMonitor.shared private let preloadingService: ModelPreloadingService
private var cancellables = Set<AnyCancellable>()
init(preloadingService: ModelPreloadingService) { self.preloadingService = preloadingService setupBindings() }
private func setupBindings() { // React to thermal state changes thermalManager.$currentThermalState .combineLatest( thermalManager.$gpuLayerReduction, thermalManager.$shouldReduceModelSize ) .debounce(for: .seconds(1), scheduler: DispatchQueue.main) .sink { [weak self] thermalState, _, _ in guard let self else { return } self.recalculateProfile() self.applyThermalMitigation(thermalState) } .store(in: &cancellables) }
private func recalculateProfile() { let powerState = powerMonitor.currentState let thermalState = ProcessInfo.processInfo.thermalState effectiveProfile = PowerProfile.automatic( powerState: powerState, thermalState: thermalState ) }
private func applyThermalMitigation(_ state: ProcessInfo.ThermalState) { switch state { case .nominal: // Restore full performance if memory allows break
case .fair: // Cancel any pending preloads preloadingService.unload("llm") // Keep Whisper only if both loaded
case .serious: // Unload LLM, keep Whisper with reduced GPU layers preloadingService.unload("llm") // Reconfigure Whisper with reduced GPU layers on next inference
case .critical: // Unload everything preloadingService.unloadAll() // Next transcription will use .minimal profile
@unknown default: break } }
/// Log thermal event for diagnostics private func logThermalEvent(_ state: ProcessInfo.ThermalState) { let stateNames: [ProcessInfo.ThermalState: String] = [ .nominal: "nominal", .fair: "fair", .serious: "serious", .critical: "critical" ] let name = stateNames[state] ?? "unknown" // Logger.performance.info("Thermal state changed to: \(name)") _ = name }}
⚠️ Hysteresis: Thermal state transitions can oscillate. The debounce(for: .seconds(1)) operator prevents rapid reconfiguration. When transitioning from .serious back to .nominal, wait at least 30 seconds before restoring full GPU layers to avoid thermal cycling.
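One way to implement that cool-down is to restore full GPU layers only after the system has stayed in .nominal for the full window. A minimal sketch, assuming it is fed from the same debounced thermal publisher as above; the type and property names here are illustrative, not part of the existing managers:

import Foundation

/// Hypothetical helper: defers restoring full GPU layers until the system
/// has remained in the .nominal thermal state for a cool-down period.
final class ThermalHysteresis {
    private let cooldown: TimeInterval
    private var nominalSince: Date?

    init(cooldown: TimeInterval = 30) {
        self.cooldown = cooldown
    }

    /// Call on every thermal-state change (e.g. from the debounced sink above).
    func record(_ state: ProcessInfo.ThermalState) {
        if state == .nominal {
            // Start the cool-down clock the first time we return to nominal.
            if nominalSince == nil { nominalSince = Date() }
        } else {
            nominalSince = nil
        }
    }

    /// True once the state has stayed nominal for the full cool-down window.
    var mayRestoreFullGPULayers: Bool {
        guard let since = nominalSince else { return false }
        return Date().timeIntervalSince(since) >= cooldown
    }
}

The coordinator would consult mayRestoreFullGPULayers before zeroing gpuLayerReduction instead of reacting to the first .nominal notification.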
Benchmarking Methodology and Tools
Section titled “Benchmarking Methodology and Tools”
Systematic benchmarking ensures VaulType’s performance improves (or at least does not regress) with every release.
Instruments Profiling
Section titled “Instruments Profiling”
Use these Instruments templates for VaulType performance analysis:
| Template | What to Look For |
|---|---|
| Time Profiler | Hot functions in whisper.cpp/llama.cpp wrappers, main thread stalls, Swift overhead |
| Allocations | Memory growth during transcription, leaked model buffers, KV cache growth |
| Metal System Trace | GPU utilization, shader compilation stalls, GPU/CPU sync points |
| System Trace | Thread scheduling, priority inversions, real-time thread preemption |
| Energy Log | CPU/GPU/ANE energy impact, background activity, wake-ups |
| Thermal State | Thermal ramp during sustained transcription, throttle points |
Key areas to profile (a signpost sketch for marking these intervals follows the list):
- Model loading — Time from the load() call to first inference readiness
- First inference — Time including shader compilation
- Steady-state inference — Time for subsequent inferences (cached shaders)
- Audio pipeline latency — Time from microphone capture to ring buffer availability
- End-to-end latency — Time from end of speech to text appearing in target app
- Memory high-water mark — Peak memory during dual-model operation
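Each of these areas can be wrapped in signpost intervals so they appear alongside the Instruments templates listed above (requires macOS 12 or later for OSSignposter). A minimal sketch; the work closures stand in for the real engine calls, which are not shown here:

import os

/// Signpost intervals for the profiling areas listed above.
/// The wrapped closures are placeholders, not VaulType's real API.
enum PipelineSignposts {
    static let signposter = OSSignposter(
        subsystem: "com.vaultype.profiling",
        category: "Pipeline"
    )

    /// Marks a Whisper inference pass as an interval visible in Instruments.
    static func measureWhisper<T>(_ work: () async throws -> T) async rethrows -> T {
        let state = signposter.beginInterval("Whisper Inference")
        defer { signposter.endInterval("Whisper Inference", state) }
        return try await work()
    }

    /// Marks the full speech-end-to-insertion path.
    static func measureEndToEnd<T>(_ work: () async throws -> T) async rethrows -> T {
        let state = signposter.beginInterval("End-to-End")
        defer { signposter.endInterval("End-to-End", state) }
        return try await work()
    }
}

The end-to-end interval would wrap everything from VAD speech-end detection through text insertion, so the same region shows up in Time Profiler, System Trace, and the signpost track.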
Custom Benchmarking Harness
Section titled “Custom Benchmarking Harness”
import Foundation
import os.signpost
/// Benchmarking harness for VaulType performance measurements
final class PerformanceBenchmark { // MARK: - Signpost Integration
private static let log = OSLog( subsystem: "com.vaultype.benchmark", category: .pointsOfInterest )
private static let signpostLog = OSLog( subsystem: "com.vaultype.benchmark", category: "Performance" )
// MARK: - Measurement Types
struct Measurement { let name: String let duration: Duration let metadata: [String: String] let timestamp: Date
var durationMilliseconds: Double { let components = duration.components return Double(components.seconds) * 1000.0 + Double(components.attoseconds) / 1_000_000_000_000_000.0 } }
struct BenchmarkResult { let name: String let measurements: [Measurement]
var count: Int { measurements.count }
var meanDuration: Double { guard !measurements.isEmpty else { return 0 } let total = measurements.reduce(0.0) { $0 + $1.durationMilliseconds } return total / Double(measurements.count) }
var medianDuration: Double { guard !measurements.isEmpty else { return 0 } let sorted = measurements.map(\.durationMilliseconds).sorted() let mid = sorted.count / 2 if sorted.count.isMultiple(of: 2) { return (sorted[mid - 1] + sorted[mid]) / 2.0 } return sorted[mid] }
var p95Duration: Double { guard !measurements.isEmpty else { return 0 } let sorted = measurements.map(\.durationMilliseconds).sorted() let index = Int(Double(sorted.count) * 0.95) return sorted[min(index, sorted.count - 1)] }
var p99Duration: Double { guard !measurements.isEmpty else { return 0 } let sorted = measurements.map(\.durationMilliseconds).sorted() let index = Int(Double(sorted.count) * 0.99) return sorted[min(index, sorted.count - 1)] }
var standardDeviation: Double { guard measurements.count > 1 else { return 0 } let mean = meanDuration let variance = measurements.reduce(0.0) { sum, m in let diff = m.durationMilliseconds - mean return sum + diff * diff } / Double(measurements.count - 1) return variance.squareRoot() }
var summary: String { """ Benchmark: \(name) Iterations: \(count) Mean: \(String(format: "%.2f", meanDuration)) ms Median: \(String(format: "%.2f", medianDuration)) ms P95: \(String(format: "%.2f", p95Duration)) ms P99: \(String(format: "%.2f", p99Duration)) ms StdDev: \(String(format: "%.2f", standardDeviation)) ms """ } }
// MARK: - Benchmark Execution
private var results: [String: [Measurement]] = [:]
/// Run a benchmark with the given name and iteration count func benchmark( name: String, iterations: Int = 100, warmupIterations: Int = 5, metadata: [String: String] = [:], operation: () async throws -> Void ) async throws -> BenchmarkResult { // Warm-up phase (results discarded) for _ in 0..<warmupIterations { try await operation() }
// Measurement phase var measurements: [Measurement] = [] measurements.reserveCapacity(iterations)
for _ in 0..<iterations { let signpostID = OSSignpostID(log: Self.signpostLog) os_signpost( .begin, log: Self.signpostLog, name: "Benchmark", signpostID: signpostID, "%{public}s", name )
let clock = ContinuousClock() let duration = try await clock.measure { try await operation() }
os_signpost( .end, log: Self.signpostLog, name: "Benchmark", signpostID: signpostID )
measurements.append(Measurement( name: name, duration: duration, metadata: metadata, timestamp: Date() )) }
results[name] = measurements return BenchmarkResult(name: name, measurements: measurements) }
/// Run a synchronous benchmark func benchmarkSync( name: String, iterations: Int = 100, warmupIterations: Int = 5, metadata: [String: String] = [:], operation: () throws -> Void ) throws -> BenchmarkResult { // Warm-up for _ in 0..<warmupIterations { try operation() }
var measurements: [Measurement] = [] measurements.reserveCapacity(iterations)
let clock = ContinuousClock()
for _ in 0..<iterations { let duration = try clock.measure { try operation() }
measurements.append(Measurement( name: name, duration: duration, metadata: metadata, timestamp: Date() )) }
results[name] = measurements return BenchmarkResult(name: name, measurements: measurements) }
/// Export all results as JSON for trend analysis func exportJSON() throws -> Data { struct ExportEntry: Codable { let name: String let durationMs: Double let metadata: [String: String] let timestamp: String }
let formatter = ISO8601DateFormatter() let entries = results.flatMap { (name, measurements) in measurements.map { m in ExportEntry( name: name, durationMs: m.durationMilliseconds, metadata: m.metadata, timestamp: formatter.string(from: m.timestamp) ) } }
return try JSONEncoder().encode(entries) }
/// Print a formatted summary of all results func printSummary() { for (name, measurements) in results.sorted(by: { $0.key < $1.key }) { let result = BenchmarkResult(name: name, measurements: measurements) print(result.summary) print("---") } }}
Example usage:
// Run benchmarks for all key operationsfunc runPerformanceSuite() async throws { let bench = PerformanceBenchmark()
// Benchmark Whisper inference let whisperResult = try await bench.benchmark( name: "whisper_inference_5s_audio", iterations: 50, warmupIterations: 3, metadata: [ "model": "medium", "audio_duration": "5.0", "hardware": ProcessInfo.processInfo.machineHardwareName ] ) { // Run whisper_full() on a 5-second test audio buffer try await transcribeTestAudio() }
// Benchmark LLM inference let llmResult = try await bench.benchmark( name: "llm_refinement_50_tokens", iterations: 50, warmupIterations: 3, metadata: [ "quantization": "Q4_K_M", "input_tokens": "50", "hardware": ProcessInfo.processInfo.machineHardwareName ] ) { // Run llama_decode() on a test prompt try await refineTestText() }
// Benchmark end-to-end pipeline let e2eResult = try await bench.benchmark( name: "end_to_end_pipeline", iterations: 20, warmupIterations: 2, metadata: [ "whisper_model": "medium", "llm_quantization": "Q4_K_M" ] ) { try await runFullPipeline() }
bench.printSummary()
// Export for CI comparison let jsonData = try bench.exportJSON() try jsonData.write(to: URL(fileURLWithPath: "/tmp/vaultype_benchmarks.json"))}
// Helper to get hardware name
extension ProcessInfo {
    var machineHardwareName: String {
        var sysinfo = utsname()
        uname(&sysinfo)
        return withUnsafePointer(to: &sysinfo.machine) {
            $0.withMemoryRebound(to: CChar.self, capacity: 1) {
                String(validatingUTF8: $0) ?? "unknown"
            }
        }
    }
}
Key Metrics to Track
Section titled “Key Metrics to Track”
| Metric | Target | Alert Threshold | Measurement Method |
|---|---|---|---|
| Model load time (Whisper medium) | < 1.5 s | > 3.0 s | ContinuousClock around init |
| Model load time (LLM Q4_K_M 7B) | < 3.0 s | > 6.0 s | ContinuousClock around init |
| First inference (Whisper) | < 500 ms | > 1000 ms | Includes shader compilation |
| Steady inference (Whisper, 5s audio) | < 300 ms | > 600 ms | After warm-up |
| LLM tokens/sec (Q4_K_M, M2) | > 20 t/s | < 10 t/s | Token generation loop |
| End-to-end latency | < 500 ms | > 1000 ms | Speech end → text output |
| Peak memory (dual model) | < 7 GB | > 10 GB | mach_task_basic_info |
| Audio dropout rate | < 0.1% | > 1% | Ring buffer overflow counter |
| Battery drain (1 hr active) | < 15% | > 25% | IOPowerSources sampling |
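The peak-memory metric can be sampled in-process. A minimal sketch of reading the resident size and its high-water mark via task_info with MACH_TASK_BASIC_INFO; note that Xcode's memory gauge reports the physical footprint (task_vm_info), which can differ from these numbers:

import Darwin

/// Returns (current resident size, resident high-water mark) in bytes,
/// or nil if the Mach call fails.
func residentMemoryBytes() -> (current: UInt64, peak: UInt64)? {
    var info = mach_task_basic_info()
    var count = mach_msg_type_number_t(
        MemoryLayout<mach_task_basic_info>.size / MemoryLayout<natural_t>.size
    )
    let kr = withUnsafeMutablePointer(to: &info) { infoPtr in
        infoPtr.withMemoryRebound(to: integer_t.self, capacity: Int(count)) { rawPtr in
            task_info(mach_task_self_, task_flavor_t(MACH_TASK_BASIC_INFO), rawPtr, &count)
        }
    }
    guard kr == KERN_SUCCESS else { return nil }
    return (info.resident_size, info.resident_size_max)
}

Sampling this once per second during a dual-model session gives the high-water mark for the table above without attaching Instruments.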
Regression Detection
Section titled “Regression Detection”
Integrate benchmarks into CI to catch regressions:
/// Compares benchmark results against stored baselinesstruct RegressionDetector { struct Baseline: Codable { let name: String let meanDurationMs: Double let p95DurationMs: Double let hardware: String let date: String }
/// Load baselines from disk static func loadBaselines( from url: URL ) throws -> [String: Baseline] { let data = try Data(contentsOf: url) let baselines = try JSONDecoder().decode([Baseline].self, from: data) return Dictionary(uniqueKeysWithValues: baselines.map { ($0.name, $0) }) }
/// Check for regressions against baselines static func checkForRegressions( results: [PerformanceBenchmark.BenchmarkResult], baselines: [String: Baseline], regressionThreshold: Double = 0.10 // 10% slower = regression ) -> [RegressionReport] { var reports: [RegressionReport] = []
for result in results { guard let baseline = baselines[result.name] else { reports.append(RegressionReport( name: result.name, status: .noBaseline, currentMean: result.meanDuration, baselineMean: nil, changePercent: nil )) continue }
let changePercent = (result.meanDuration - baseline.meanDurationMs) / baseline.meanDurationMs
let status: RegressionStatus if changePercent > regressionThreshold { status = .regression } else if changePercent < -regressionThreshold { status = .improvement } else { status = .stable }
reports.append(RegressionReport( name: result.name, status: status, currentMean: result.meanDuration, baselineMean: baseline.meanDurationMs, changePercent: changePercent )) }
return reports }
enum RegressionStatus: String { case regression case stable case improvement case noBaseline }
struct RegressionReport { let name: String let status: RegressionStatus let currentMean: Double let baselineMean: Double? let changePercent: Double?
var description: String { let changeStr: String if let change = changePercent { let sign = change >= 0 ? "+" : "" changeStr = "\(sign)\(String(format: "%.1f", change * 100))%" } else { changeStr = "N/A" }
let icon: String = switch status { case .regression: "FAIL" case .stable: "PASS" case .improvement: "IMPROVED" case .noBaseline: "NEW" }
return "[\(icon)] \(name): \(String(format: "%.2f", currentMean)) ms (\(changeStr))" } }}Performance Benchmark Tables
Performance Benchmark Tables
Section titled “Performance Benchmark Tables”
End-to-End Latency (5-second utterance, speech-end to text-output):
| Configuration | M1 (8 GB) | M2 (8 GB) | M3 (16 GB) | M4 Pro (24 GB) |
|---|---|---|---|---|
| tiny + Q4_0 3B | 320 ms | 250 ms | 210 ms | 150 ms |
| small + Q4_K_M 7B | 480 ms | 380 ms | 310 ms | 220 ms |
| medium + Q4_K_M 7B | 650 ms | 500 ms | 400 ms | 280 ms |
| large-v3 + Q4_K_M 7B | 1100 ms | 850 ms | 680 ms | 450 ms |
| large-v3 + Q8_0 7B | N/A | N/A | 900 ms | 550 ms |
ℹ️ Reading the table: These represent total latency from the moment the user stops speaking to the moment refined text is ready for insertion. The pipeline runs Whisper and LLM sequentially (Whisper output feeds LLM input).
Model Loading Time (cold start, from SSD):
| Model | Size | M1 | M2 | M3 | M4 |
|---|---|---|---|---|---|
| Whisper tiny | 75 MB | 0.1 s | 0.08 s | 0.07 s | 0.05 s |
| Whisper medium | 1.5 GB | 0.8 s | 0.6 s | 0.5 s | 0.4 s |
| Whisper large-v3 | 3.1 GB | 1.5 s | 1.2 s | 1.0 s | 0.8 s |
| LLM Q4_K_M 7B | 4.1 GB | 2.5 s | 2.0 s | 1.7 s | 1.3 s |
| LLM Q8_0 7B | 7.2 GB | 4.0 s | 3.2 s | 2.8 s | 2.1 s |
Peak Memory During Dual-Model Operation:
| Configuration | Model Memory | Compute Overhead | Total Peak |
|---|---|---|---|
| tiny + Q4_0 3B | 1.8 GB | 0.5 GB | 2.3 GB |
| base + Q4_K_M 7B | 4.3 GB | 0.7 GB | 5.0 GB |
| small + Q4_K_M 7B | 4.6 GB | 0.7 GB | 5.3 GB |
| medium + Q4_K_M 7B | 5.6 GB | 0.9 GB | 6.5 GB |
| large-v3 + Q4_K_M 7B | 7.1 GB | 1.1 GB | 8.2 GB |
| large-v3 + Q8_0 7B | 10.2 GB | 1.3 GB | 11.5 GB |
| large-v3 + Q8_0 13B | 16.5 GB | 1.8 GB | 18.3 GB |
Pipeline Optimization
Section titled “Pipeline Optimization”
End-to-End Pipeline
Section titled “End-to-End Pipeline”
The full VaulType pipeline from microphone to text insertion:
┌─────────────────────────────────────────────────────────────────────┐│ VaulType Processing Pipeline ││ ││ ┌─────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ ││ │ Audio │ │ Format │ │ Ring │ │ Voice Activity │ ││ │ Capture │──►│ Convert │──►│ Buffer │──►│ Detection (VAD) │ ││ │ 48 kHz │ │ → 16 kHz │ │ 0.5 s │ │ │ ││ └─────────┘ └──────────┘ └──────────┘ └────────┬─────────┘ ││ │ ││ Speech ended? │ ││ ▼ ││ ┌─────────────────────────────────────────────────────────────┐ ││ │ Whisper Inference │ ││ │ │ ││ │ Audio Chunk ──► Mel Spectrogram ──► Encoder ──► Decoder │ ││ │ │ │ ││ │ Raw transcript │ ││ └─────────────────────────────────────────────┬───────────────┘ ││ │ ││ ▼ ││ ┌─────────────────────────────────────────────────────────────┐ ││ │ LLM Refinement │ ││ │ │ ││ │ System Prompt + Raw Transcript ──► Token Generation │ ││ │ │ │ ││ │ Refined text │ ││ └─────────────────────────────────────────────┬───────────────┘ ││ │ ││ ▼ ││ ┌─────────────────────────────────────────────────────────────┐ ││ │ Text Insertion │ ││ │ │ ││ │ Refined Text ──► CGEvent / Accessibility API ──► Target App│ ││ └─────────────────────────────────────────────────────────────┘ │└─────────────────────────────────────────────────────────────────────┘
Timeline (medium + Q4_K_M 7B on M2):
├─ Audio capture ──────────────────────── continuous ─────────────────┤
├─ VAD detection ────────────────────────── ~50 ms ──┤
├─ Whisper inference ─────────────────────────────── ~250 ms ────────┤
├─ LLM refinement ─────────────────────────────────── ~130 ms ──────┤
├─ Text insertion ──────────────────────────────────────── ~5 ms ───┤
│
│ Total end-to-end: ~435 ms │
Concurrent Execution Strategy
Section titled “Concurrent Execution Strategy”
While Whisper and the LLM run sequentially per utterance (the LLM needs Whisper’s output), multiple optimizations enable concurrency (the orchestrator below illustrates streaming; a parallel warm-up sketch follows it):
- Streaming Whisper output: Begin LLM processing as soon as the first sentence is decoded, while Whisper continues on remaining audio
- Pipeline overlap: While LLM refines utterance N, Whisper can begin processing utterance N+1
- Parallel model prep: Load/warm-up both models concurrently during preloading
/// Orchestrates overlapped pipeline execution
actor PipelineOrchestrator {
    private let whisperEngine: WhisperEngine
    private let llmEngine: LLMEngine

    init(whisperEngine: WhisperEngine, llmEngine: LLMEngine) {
        self.whisperEngine = whisperEngine
        self.llmEngine = llmEngine
    }

    /// Process audio with pipeline overlap
    func processAudio(_ audioBuffer: [Float]) -> AsyncStream<String> {
        AsyncStream { continuation in
            Task {
                // Start Whisper transcription
                let rawSegments = await whisperEngine.transcribeStreaming(audioBuffer)

                // Process each segment through LLM as it becomes available
                for await segment in rawSegments {
                    let refined = try? await llmEngine.refine(segment)
                    continuation.yield(refined ?? segment)
                }

                continuation.finish()
            }
        }
    }
}

// Protocol stubs for the example
protocol WhisperEngine {
    func transcribeStreaming(_ audio: [Float]) async -> AsyncStream<String>
}

protocol LLMEngine {
    func refine(_ text: String) async throws -> String
}

💡 Streaming Optimization: For short utterances (under 5 seconds), the overhead of streaming is not worth it — just run Whisper to completion then LLM. For longer dictation sessions (30+ seconds), streaming reduces perceived latency significantly because the user sees refined text appearing while still speaking.
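For the parallel model prep point above, both engines can be warmed concurrently with structured concurrency, so preloading costs roughly the slower of the two loads rather than their sum. A minimal sketch; loadWhisper and loadLLM are placeholders for whatever the preloading service actually calls:

/// Hypothetical warm-up helper: loads both models concurrently so that
/// preloading takes roughly max(whisper, llm) time instead of their sum.
func preloadModelsConcurrently(
    loadWhisper: @escaping @Sendable () async throws -> Void,
    loadLLM: @escaping @Sendable () async throws -> Void
) async throws {
    try await withThrowingTaskGroup(of: Void.self) { group in
        group.addTask { try await loadWhisper() }
        group.addTask { try await loadLLM() }
        // Propagate the first error (if any) and wait for both loads.
        try await group.waitForAll()
    }
}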
Related Documentation
Section titled “Related Documentation”
| Document | Relevance |
|---|---|
| Architecture Overview | System component graph, dependency injection, manager lifecycle |
| Speech Recognition | Whisper integration details, audio pipeline, VAD configuration |
| LLM Processing | llama.cpp integration, prompt engineering, token generation |
| Model Management | Model download, storage, versioning, user-facing model picker |
| Tech Stack | Full dependency list, version requirements, build configuration |
This document should be updated whenever new Apple Silicon generations are released, when whisper.cpp or llama.cpp make significant performance changes, or when new quantization formats become available.