
System Architecture

Last Updated: 2026-02-13

VaulType — Privacy-first, macOS-native speech-to-text with local LLM post-processing. This document is the definitive reference for VaulType’s internal architecture, data flows, threading model, memory management, and extensibility design.



Layered Architecture

VaulType follows a strict layered architecture with four tiers. Dependencies flow downward only — upper layers depend on lower layers, but never the reverse. Each layer communicates through well-defined Swift protocols, enabling testability and future extensibility.

┌─────────────────────────────────────────────────────────────────────────────┐
│ │
│ PRESENTATION LAYER │
│ │
│ ┌──────────────┐ ┌──────────────────┐ ┌──────────────┐ ┌───────────┐ │
│ │ MenuBarView │ │ SettingsView │ │ OverlayView │ │ Onboarding│ │
│ │ (SwiftUI │ │ (SwiftUI │ │ (SwiftUI │ │ View │ │
│ │ MenuBar │ │ Settings │ │ NSPanel │ │ (SwiftUI) │ │
│ │ Extra) │ │ Scene) │ │ overlay) │ │ │ │
│ └──────┬───────┘ └────────┬─────────┘ └──────┬───────┘ └─────┬─────┘ │
│ │ │ │ │ │
├─────────┼───────────────────┼────────────────────┼────────────────┼─────────┤
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ APPLICATION SERVICES │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TranscriptionCoordinator │ │
│ │ Orchestrates the full pipeline: record → transcribe → process → │ │
│ │ inject. Single entry point for the entire dictation lifecycle. │ │
│ └──────────────────────────────┬──────────────────────────────────────┘ │
│ │ │
│ ┌──────────────┐ ┌───────────┴──────┐ ┌──────────────┐ ┌───────────┐ │
│ │ HotkeyManager│ │ ModeManager │ │PermissionMgr │ │ AppState │ │
│ │ │ │ │ │ │ │(Observable│ │
│ │ Global key │ │ Tracks active │ │ Accessibility│ │ Object) │ │
│ │ event mon- │ │ processing mode │ │ + Microphone │ │ │ │
│ │ itoring │ │ and app profile │ │ permission │ │ Central │ │
│ │ │ │ resolution │ │ requests │ │ published │ │
│ │ │ │ │ │ │ │ state │ │
│ └──────┬───────┘ └──────┬──────────┘ └──────┬───────┘ └─────┬─────┘ │
│ │ │ │ │ │
├─────────┼─────────────────┼─────────────────────┼────────────────┼─────────┤
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ DOMAIN LAYER │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌───────────────┐ ┌──────────────┐ │
│ │WhisperService│ │ LLMService │ │ CommandParser │ │AudioCapture │ │
│ │ │ │ │ │ │ │ Service │ │
│ │ Whisper ctx │ │ LLM ctx │ │ Voice cmd │ │ │ │
│ │ management, │ │ management, │ │ detection + │ │ AVAudioEngine│ │
│ │ inference │ │ prompt exec, │ │ regex/LLM │ │ tap, format │ │
│ │ execution, │ │ mode routing │ │ parsing │ │ conversion, │ │
│ │ language │ │ │ │ │ │ ring buffer │ │
│ │ detection │ │ │ │ │ │ │ │
│ └──────┬───────┘ └──────┬───────┘ └───────┬───────┘ └──────┬───────┘ │
│ │ │ │ │ │
│ ┌──────┴──────┐ ┌──────┴──────┐ ┌─────────┴─────┐ ┌───────┴────────┐ │
│ │TextInjection│ │ Vocabulary │ │PromptTemplate │ │ VAD │ │
│ │ Service │ │ Service │ │ Engine │ │ (Voice │ │
│ │ │ │ │ │ │ │ Activity │ │
│ │ CGEvent + │ │ Word │ │ Template │ │ Detection) │ │
│ │ Clipboard │ │ replacement │ │ variable │ │ │ │
│ │ injection │ │ pipeline │ │ substitution │ │ Energy-based │ │
│ │ │ │ │ │ │ │ speech detect │ │
│ └──────┬───────┘ └──────┬───────┘ └───────┬───────┘ └──────┬───────┘ │
│ │ │ │ │ │
├─────────┼─────────────────┼───────────────────┼─────────────────┼──────────┤
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ INFRASTRUCTURE LAYER │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌───────────────┐ ┌──────────────┐ │
│ │ whisper.cpp │ │ llama.cpp │ │ AVAudio │ │ CGEvent │ │
│ │ Bridge │ │ Bridge │ │ Engine │ │ Bridge │ │
│ │ │ │ │ │ │ │ │ │
│ │ C bridging │ │ C bridging │ │ System audio │ │ Quartz event │ │
│ │ header, │ │ header, │ │ capture │ │ services, │ │
│ │ OpaquePtr │ │ OpaquePtr │ │ hardware │ │ keystroke │ │
│ │ lifecycle │ │ lifecycle │ │ │ │ simulation │ │
│ └──────────────┘ └──────────────┘ └───────────────┘ └──────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌───────────────┐ ┌──────────────┐ │
│ │ SwiftData │ │ Model File │ │ NSWorkspace │ │ NSPaste │ │
│ │ Store │ │ Manager │ │ Bridge │ │ board │ │
│ │ │ │ │ │ │ │ Bridge │ │
│ │ Persistence, │ │ GGUF/bin │ │ App detection,│ │ │ │
│ │ migration, │ │ download, │ │ launch, │ │ Clipboard │ │
│ │ queries │ │ validation, │ │ activation │ │ read/write │ │
│ │ │ │ storage │ │ │ │ + restore │ │
│ └──────────────┘ └──────────────┘ └───────────────┘ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
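
Each boundary in the diagram corresponds to a Swift protocol. As a hypothetical illustration of the convention (the protocol name and members below are illustrative, not VaulType's actual declarations):

/// Illustrative layer-boundary protocol: upper layers depend on this
/// abstraction, so a mock conforming type can stand in for the
/// whisper.cpp-backed implementation in tests.
protocol SpeechToTextEngine: Sendable {
    func loadModel() async throws
    func transcribe(samples: [Float]) async throws -> String
}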

End-to-End Data Flow

The following diagram shows the primary runtime data flow when a user performs a dictation — from pressing the hotkey through to text appearing in their focused application.

User presses TranscriptionCoordinator
global hotkey ────▶ receives start signal
│ │
│ ▼
│ AudioCaptureService
│ .startCapture()
│ │
│ ┌─────────┴──────────┐
│ │ AVAudioEngine │
│ │ installTap(onBus:) │
│ │ 48kHz stereo ──────┼──▶ Format Converter
│ └────────────────────┘ (48kHz→16kHz, stereo→mono)
│ │
│ ▼
│ Ring Buffer
│ (30s @ 16kHz mono)
│ │
User releases │
global hotkey ────▶ TranscriptionCoordinator │
│ receives stop signal │
│ │ │
│ ▼ │
│ AudioCaptureService │
│ .stopCapture() │
│ │ │
│ ▼ ▼
│ WhisperService.transcribe(samples:)
│ │
│ ▼
│ ┌───────────────┐
│ │ whisper.cpp │
│ │ inference │
│ │ (Metal GPU) │
│ └───────┬───────┘
│ │
│ Raw Text
│ │
│ ┌────────┴────────┐
│ │ │
│ ▼ ▼
│ CommandParser ModeManager
│ .isCommand()? .resolveMode()
│ │ │
│ │ (if voice cmd) │ (if regular text)
│ ▼ ▼
│ ActionExecutor LLMService.process()
│ .execute(cmd) │
│ │ ▼
│ │ ┌───────────────┐
│ │ │ llama.cpp │
│ │ │ inference │
│ │ │ (Metal GPU) │
│ │ └───────┬───────┘
│ │ │
│ │ Processed Text
│ │ │
│ │ VocabularyService
│ │ .applyReplacements()
│ │ │
│ ▼ ▼
│ System Action TextInjectionService
│ (NSWorkspace, .inject(text:)
│ AppleScript) │
│ ┌──────┴───────┐
│ │ │
│ ▼ ▼
│ CGEvent Clipboard
│ (< 50 ch) + Cmd+V
│ │ (>= 50 ch)
│ │ │
│ └──────┬───────┘
│ │
│ ▼
└────────────────────▶ Text appears in
focused application

ℹ️ Info: The entire pipeline — from audio capture stop to text injection — typically completes in under 2 seconds on Apple Silicon with the recommended model configuration (whisper-small + Qwen2.5-1.5B).


Audio Pipeline

The audio pipeline is responsible for capturing microphone input, converting it to the format whisper.cpp expects (16kHz mono Float32 PCM), buffering it efficiently, and detecting voice activity to optimize inference quality.

┌─────────────────────────────────────────────────────────────────────────┐
│ AUDIO PIPELINE │
│ │
│ ┌─────────┐ ┌────────────────┐ ┌─────────────────────────┐ │
│ │ macOS │ │ AVAudioEngine │ │ AVAudioConverter │ │
│ │ Micro- │─────▶│ Input Node │─────▶│ │ │
│ │ phone │ │ │ │ Source: Device native │ │
│ │ │ │ Tap installed │ │ - 48kHz (typical) │ │
│ │ (User- │ │ on bus 0 │ │ - Stereo (2ch) │ │
│ │ selected│ │ │ │ - Float32 │ │
│ │ or │ │ Buffer: 1024 │ │ │ │
│ │ default)│ │ frames │ │ Target: whisper.cpp │ │
│ └─────────┘ │ (~21ms @48kHz)│ │ - 16kHz │ │
│ └────────────────┘ │ - Mono (1ch) │ │
│ │ - Float32 │ │
│ │ - Range: [-1.0, 1.0] │ │
│ └────────────┬────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────┐ │
│ │ CircularAudioBuffer │ │
│ │ │ │
│ │ Capacity: 30 seconds @ 16kHz = 480,000 │ │
│ │ samples (1.83 MB) │ │
│ │ │ │
│ │ ┌─────────────────────────────────────┐ │ │
│ │ │ Write Head ──▶ [samples...] ◀── Read│ │ │
│ │ │ (lock-free SPSC) │ │ │
│ │ └─────────────────────────────────────┘ │ │
│ │ │ │
│ │ Thread safety: Single-producer (audio │ │
│ │ callback thread), single-consumer │ │
│ │ (inference thread). Lock-free via atomic │ │
│ │ read/write indices. │ │
│ └──────────────────────┬──────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────┐ │
│ │ Voice Activity Detection (VAD) │ │
│ │ │ │
│ │ Algorithm: Energy-based with adaptive │ │
│ │ threshold │ │
│ │ │ │
│ │ 1. Compute RMS energy per 30ms frame │ │
│ │ 2. Compare against adaptive noise floor │ │
│ │ 3. Apply hangover timer (300ms) to avoid │ │
│ │ cutting off trailing syllables │ │
│ │ 4. Trim leading/trailing silence before │ │
│ │ sending to whisper.cpp │ │
│ │ │ │
│ │ Purpose: Reduces inference time by │ │
│ │ excluding silence. A 10s recording with │ │
│ │ 6s of speech + 4s of silence processes │ │
│ │ ~40% faster with VAD trimming. │ │
│ └──────────────────────┬──────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────┐ │
│ │ whisper.cpp Inference │ │
│ │ │ │
│ │ Input: [Float] — 16kHz mono PCM samples │ │
│ │ Params: whisper_full_params (beam size, │ │
│ │ language, thread count, etc.) │ │
│ │ Output: String — raw transcription │ │
│ │ │ │
│ │ Execution: Dedicated inference thread │ │
│ │ GPU: Metal acceleration (encoder + decoder) │ │
│ │ CPU: N threads for non-Metal operations │ │
│ └─────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
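
To make the VAD stage concrete, here is a minimal sketch of the energy-based detector described above. The 30ms frame, 300ms hangover, and adaptive noise floor come from the diagram; the threshold multiplier and smoothing factor are illustrative assumptions.

/// Minimal sketch of the energy-based VAD. Constants marked "illustrative"
/// are assumptions, not VaulType's tuned values.
struct EnergyVAD {
    let frameLength = 480        // 30ms @ 16kHz
    let hangoverFrames = 10      // 300ms hangover / 30ms frames
    private(set) var noiseFloor: Float = 0.01  // adaptive; illustrative seed

    /// Per-frame speech/silence flags; the caller trims leading/trailing
    /// silence frames before handing samples to whisper.cpp.
    mutating func speechFlags(for samples: [Float]) -> [Bool] {
        var flags: [Bool] = []
        var hangover = 0
        for start in stride(from: 0, through: samples.count - frameLength, by: frameLength) {
            let frame = samples[start..<start + frameLength]
            let rms = (frame.reduce(0) { $0 + $1 * $1 } / Float(frameLength)).squareRoot()
            if rms > noiseFloor * 3 {                        // illustrative multiplier
                hangover = hangoverFrames                    // reset hangover timer
                flags.append(true)
            } else {
                noiseFloor = 0.95 * noiseFloor + 0.05 * rms  // adapt floor on silence
                flags.append(hangover > 0)                   // keep trailing syllables
                hangover = max(0, hangover - 1)
            }
        }
        return flags
    }
}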

The ring buffer decouples the real-time audio callback thread from the inference thread. The audio callback fires at hardware-determined intervals (typically every ~21ms at 48kHz with a 1024-frame buffer) and must return quickly to avoid audio glitches.

import Atomics

/// Lock-free single-producer single-consumer circular buffer for audio samples.
/// The audio callback thread writes; the inference thread reads.
final class CircularAudioBuffer: @unchecked Sendable {
    private var buffer: [Float]
    private let capacity: Int
    private let writeIndex = UnsafeAtomic<Int>.create(0)
    private let readIndex = UnsafeAtomic<Int>.create(0)

    init(capacity: Int) {
        self.capacity = capacity
        self.buffer = [Float](repeating: 0, count: capacity)
    }

    deinit {
        // UnsafeAtomic storage is manually managed and must be destroyed.
        writeIndex.destroy()
        readIndex.destroy()
    }

    /// Called from the audio callback thread (producer).
    func append(_ samples: [Float]) {
        let currentWrite = writeIndex.load(ordering: .relaxed)
        for (i, sample) in samples.enumerated() {
            buffer[(currentWrite + i) % capacity] = sample
        }
        // Publish the new write position only after the samples are in place.
        writeIndex.store(
            (currentWrite + samples.count) % capacity,
            ordering: .releasing
        )
    }

    /// Called from the inference thread (consumer).
    func drain() -> [Float] {
        let currentRead = readIndex.load(ordering: .relaxed)
        // The acquiring load pairs with the producer's releasing store above.
        let currentWrite = writeIndex.load(ordering: .acquiring)
        let count = currentWrite >= currentRead
            ? currentWrite - currentRead
            : capacity - currentRead + currentWrite
        guard count > 0 else { return [] }
        var result = [Float](repeating: 0, count: count)
        for i in 0..<count {
            result[i] = buffer[(currentRead + i) % capacity]
        }
        readIndex.store(
            (currentRead + count) % capacity,
            ordering: .releasing
        )
        return result
    }
}

The WhisperService wraps the whisper.cpp C API and manages the model lifecycle:

/// Manages whisper.cpp context lifecycle and executes speech-to-text inference.
actor WhisperService {
    private var context: OpaquePointer? // whisper_context*
    private let modelPath: URL

    var isLoaded: Bool { context != nil }
    var detectedLanguage: String = "en"
    var averageConfidence: Double = 0.0

    init(modelPath: URL) {
        self.modelPath = modelPath
    }

    func loadModel() throws {
        var params = whisper_context_default_params()
        params.use_gpu = true    // Metal acceleration
        params.flash_attn = true // Flash attention on supported hardware
        context = whisper_init_from_file_with_params(modelPath.path, params)
        guard context != nil else {
            throw WhisperError.modelLoadFailed(path: modelPath)
        }
    }

    func transcribe(
        samples: [Float],
        params: whisper_full_params
    ) throws -> String {
        guard let ctx = context else {
            throw WhisperError.contextNotLoaded
        }
        let result = samples.withUnsafeBufferPointer { ptr in
            whisper_full(ctx, params, ptr.baseAddress, Int32(samples.count))
        }
        guard result == 0 else {
            throw WhisperError.inferenceFailed(code: result)
        }
        let segmentCount = whisper_full_n_segments(ctx)
        var transcription = ""
        var totalProb: Float = 0
        var totalTokens: Int32 = 0
        for i in 0..<segmentCount {
            if let text = whisper_full_get_segment_text(ctx, i) {
                transcription += String(cString: text)
            }
            let nTokens = whisper_full_n_tokens(ctx, i)
            for j in 0..<nTokens {
                totalProb += whisper_full_get_token_p(ctx, i, j)
            }
            totalTokens += nTokens
        }
        // Average token probability across all segments, not just the last one.
        averageConfidence = Double(totalProb / Float(max(1, totalTokens)))
        // Record the language whisper settled on during decoding.
        if segmentCount > 0 {
            let langId = whisper_full_lang_id(ctx)
            if let langStr = whisper_lang_str(langId) {
                detectedLanguage = String(cString: langStr)
            }
        }
        return transcription.trimmingCharacters(in: .whitespacesAndNewlines)
    }

    func unloadModel() {
        if let ctx = context {
            whisper_free(ctx)
            context = nil
        }
    }
}

⚠️ Warning: whisper_full() is a blocking call that can take several seconds for longer audio clips. It must never be called on the main thread. The WhisperService is an actor, and all inference calls should be awaited from a non-main-actor context.
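
A typical call site therefore runs the whole exchange inside a detached task; a minimal sketch, assuming surrounding `whisperService`, `samples`, `params`, and `appState` values:

// Hypothetical call site: the detached task keeps the blocking
// whisper_full() call off the main thread; only the UI update hops back.
Task.detached(priority: .userInitiated) {
    let text = try await whisperService.transcribe(samples: samples, params: params)
    await MainActor.run { appState.lastTranscription = text }  // assumed property
}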


LLM Post-Processing Pipeline

The LLM pipeline takes raw transcription text from whisper.cpp and applies contextual post-processing based on the active processing mode. Each mode maps to a different prompt template that instructs the LLM on how to transform the text.

┌─────────────────────────────────────────────────────────────────────────┐
│ LLM PIPELINE │
│ │
│ Raw Text from │
│ WhisperService ──────▶ ModeManager.resolveMode() │
│ │ │
│ ┌──────────────┼──────────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Raw │ │ Clean │ │Structure │ │ Prompt │ │
│ │ │ │ │ │ │ │ │ │
│ │ No LLM │ │ Fix │ │ Organize │ │ User- │ │
│ │ processing│ │ punct, │ │ into │ │ defined │ │
│ │ — pass │ │ grammar, │ │ headings,│ │ template │ │
│ │ through │ │ filler │ │ bullets, │ │ with │ │
│ │ │ │ words │ │ sections │ │ variables│ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ ┌──────────┐ ┌──────────┐ │
│ │ Code │ │ Custom │ │
│ │ │ │ │ │
│ │ Convert │ │ User- │ │
│ │ spoken │ │ defined │ │
│ │ code to │ │ pre/post │ │
│ │ syntax │ │ pipeline │ │
│ └────┬─────┘ └────┬─────┘ │
│ │ │ │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ PromptTemplateEngine │
│ .render(transcription:, mode:) │
│ │ │
│ ┌────────┴────────┐ │
│ │ System Prompt │ Role definition, behavioral │
│ │ (from template)│ constraints for the LLM │
│ ├─────────────────┤ │
│ │ User Prompt │ Raw text + mode-specific │
│ │ (rendered with │ instructions with {{variables}} │
│ │ variables) │ substituted │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ LLMService.complete(prompt:) │
│ │ │
│ ┌────────┴────────┐ │
│ │ llama.cpp │ │
│ │ Inference │ │
│ │ │ │
│ │ Context: 2048 │ │
│ │ Temperature: │ │
│ │ 0.1 (low for │ │
│ │ determinism) │ │
│ │ Top-P: 0.9 │ │
│ │ Max tokens: │ │
│ │ 512 │ │
│ │ │ │
│ │ Metal GPU │ │
│ │ acceleration │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ Processed Text │
│ │ │
│ ▼ │
│ VocabularyService │
│ .applyReplacements() │
│ │ │
│ ▼ │
│ Final Output │
│ │
└─────────────────────────────────────────────────────────────────────────┘

The PromptTemplateEngine resolves the active prompt template for the current mode, substitutes variables, and constructs the final prompt payload for LLM inference.

import SwiftData

/// Resolves and renders prompt templates for LLM post-processing.
struct PromptTemplateEngine {
    private let modelContext: ModelContext

    init(modelContext: ModelContext) {
        self.modelContext = modelContext
    }

    /// Render the prompt for the given mode and transcription.
    func renderPrompt(
        mode: ProcessingMode,
        transcription: String,
        variables: [String: String] = [:]
    ) throws -> RenderedPrompt {
        guard mode.requiresLLM else {
            // Raw mode bypasses the LLM entirely
            return RenderedPrompt(
                systemPrompt: "",
                userPrompt: transcription,
                skipInference: true
            )
        }
        // Fetch the default template for this mode
        let descriptor = FetchDescriptor<PromptTemplate>(
            predicate: #Predicate {
                $0.mode == mode && $0.isDefault == true
            }
        )
        guard let template = try modelContext.fetch(descriptor).first else {
            throw PromptError.noTemplateForMode(mode)
        }
        let renderedUserPrompt = template.render(
            transcription: transcription,
            values: variables
        )
        return RenderedPrompt(
            systemPrompt: template.systemPrompt,
            userPrompt: renderedUserPrompt,
            skipInference: false
        )
    }
}

struct RenderedPrompt {
    let systemPrompt: String
    let userPrompt: String
    let skipInference: Bool
}
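
The template's render(transcription:values:) step is plain {{variable}} substitution. A minimal sketch, assuming the template stores its user-prompt text in a `body` property (the property name is an assumption):

/// Illustrative sketch of {{variable}} substitution; `body` is an assumed
/// stored property holding the template text.
extension PromptTemplate {
    func render(transcription: String, values: [String: String]) -> String {
        var output = body.replacingOccurrences(
            of: "{{transcription}}",
            with: transcription
        )
        for (name, value) in values {
            output = output.replacingOccurrences(of: "{{\(name)}}", with: value)
        }
        return output
    }
}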

The LLMService manages the llama.cpp context and executes inference:

/// Manages llama.cpp model lifecycle and executes LLM inference.
actor LLMService {
    private var model: OpaquePointer?   // llama_model*
    private var context: OpaquePointer? // llama_context*
    private let provider: LLMProvider

    var isModelLoaded: Bool { model != nil && context != nil }

    func process(
        rawText: String,
        mode: ProcessingMode,
        templateEngine: PromptTemplateEngine
    ) async throws -> String {
        let rendered = try templateEngine.renderPrompt(
            mode: mode,
            transcription: rawText
        )
        // Raw mode — skip the LLM entirely
        if rendered.skipInference {
            return rawText
        }
        // Construct the chat-format prompt
        let fullPrompt = """
        <|system|>
        \(rendered.systemPrompt)
        <|user|>
        \(rendered.userPrompt)
        <|assistant|>
        """
        let result = try await provider.complete(
            prompt: fullPrompt,
            parameters: LLMInferenceParameters(
                maxTokens: 512,
                temperature: 0.1,
                topP: 0.9,
                repeatPenalty: 1.1
            )
        )
        return result.trimmingCharacters(in: .whitespacesAndNewlines)
    }
}

💡 Tip: The prompt format (<|system|>, <|user|>, <|assistant|>) varies by LLM model family. VaulType maintains a prompt format registry that maps model filenames to their expected chat template format (ChatML, Llama, Phi, etc.).
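
A minimal sketch of such a registry (the filename heuristics and enum cases are illustrative, not VaulType's actual mapping):

/// Hypothetical sketch of the prompt-format registry: map a model filename
/// to the chat template its family expects.
enum ChatTemplateFormat {
    case chatML  // <|im_start|>-style (Qwen and others)
    case llama   // [INST]-style
    case phi     // <|system|>/<|user|>/<|assistant|> tags
}

func chatTemplateFormat(forModelFile filename: String) -> ChatTemplateFormat {
    let name = filename.lowercased()
    if name.contains("llama") { return .llama }
    if name.contains("phi") { return .phi }
    return .chatML
}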


Text Injection

After post-processing, the final text must be injected into whatever application the user was focused on when they triggered dictation. VaulType uses a dual-strategy approach: CGEvent keystroke simulation for short text, and clipboard paste for longer text.

┌─────────────────────────────────────────────────────────────────────────┐
│ TEXT INJECTION PIPELINE │
│ │
│ Processed Text ──────▶ TextInjectionService │
│ │ │
│ ▼ │
│ ┌────────────────────────┐ │
│ │ Resolve injection │ │
│ │ method: │ │
│ │ │ │
│ │ 1. Check AppProfile │ │
│ │ for target app │ │
│ │ │ │
│ │ 2. If .auto: │ │
│ │ text.count < 50 │ │
│ │ → CGEvent │ │
│ │ text.count >= 50 │ │
│ │ → Clipboard │ │
│ │ │ │
│ │ 3. If explicit: │ │
│ │ Use configured │ │
│ │ method │ │
│ └───────────┬────────────┘ │
│ │ │
│ ┌────────────┴────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌───────────────────────┐ ┌─────────────────────────────────┐ │
│ │ CGEvent Strategy │ │ Clipboard Strategy │ │
│ │ │ │ │ │
│ │ For each character: │ │ 1. Save current clipboard │ │
│ │ │ │ contents (NSPasteboard) │ │
│ │ 1. Create CGEvent │ │ │ │
│ │ keyDown event │ │ 2. Set processed text to │ │
│ │ │ │ clipboard │ │
│ │ 2. Set Unicode │ │ │ │
│ │ string on event │ │ 3. Simulate Cmd+V via CGEvent │ │
│ │ │ │ keyDown: Cmd flag + 'v' │ │
│ │ 3. Post keyDown to │ │ keyUp: release both │ │
│ │ cghidEventTap │ │ │ │
│ │ │ │ 4. Wait 150ms for paste to │ │
│ │ 4. Create + post │ │ complete │ │
│ │ keyUp event │ │ │ │
│ │ │ │ 5. Restore previous clipboard │ │
│ │ 5. Sleep 1-5ms │ │ contents │ │
│ │ between chars │ │ │ │
│ │ (configurable) │ │ Time: ~200ms total │ │
│ │ │ │ (independent of text length) │ │
│ │ Time: ~N ms │ │ │ │
│ │ (N = char count * │ │ │ │
│ │ keystroke delay) │ │ │ │
│ └───────────┬───────────┘ └───────────────┬─────────────────┘ │
│ │ │ │
│ └──────────────┬───────────────┘ │
│ │ │
│ ▼ │
│ Text appears in │
│ focused application │
│ │
└─────────────────────────────────────────────────────────────────────────┘

import AppKit

/// Preserves and restores the system clipboard around a paste operation.
final class ClipboardPreserver {
    private let pasteboard = NSPasteboard.general
    private var savedStringContent: String?

    /// Capture the current clipboard state.
    func save() {
        savedStringContent = pasteboard.string(forType: .string)
        // Note: Full multi-type preservation would also save
        // .rtf, .html, .tiff etc. for rich content.
    }

    /// Restore the previously captured clipboard state.
    func restore() {
        pasteboard.clearContents()
        if let content = savedStringContent {
            pasteboard.setString(content, forType: .string)
        }
        savedStringContent = nil
    }
}
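
For comparison with the clipboard path, a minimal sketch of the CGEvent strategy from the diagram above. It is simplified in two ways: it posts one event per UTF-16 code unit, and it uses a fixed delay rather than the configurable 1-5ms range.

import CoreGraphics
import Foundation

/// Minimal sketch of the CGEvent strategy: each character is delivered as
/// a synthetic Unicode keyboard event with an inter-keystroke delay.
func injectViaCGEvents(_ text: String, delayMicroseconds: UInt32 = 2_000) {
    let source = CGEventSource(stateID: .hidSystemState)
    for unit in Array(text.utf16) {
        var chars: [UniChar] = [unit]
        let keyDown = CGEvent(keyboardEventSource: source, virtualKey: 0, keyDown: true)
        keyDown?.keyboardSetUnicodeString(stringLength: 1, unicodeString: &chars)
        keyDown?.post(tap: .cghidEventTap)
        let keyUp = CGEvent(keyboardEventSource: source, virtualKey: 0, keyDown: false)
        keyUp?.post(tap: .cghidEventTap)
        usleep(delayMicroseconds) // configurable in the real service
    }
}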

🔒 Security: The clipboard contains the transcribed text for approximately 150ms during the paste operation. VaulType immediately restores the previous clipboard contents. Applications that poll the clipboard rapidly (clipboard managers, password managers) may capture this transient content. Users who are concerned about this can configure CGEvent-only injection in their AppProfile, accepting slower injection for longer texts.


Voice Commands

VaulType supports voice commands that trigger system actions instead of injecting text. Voice commands are detected by a configurable prefix (default: “hey hush”) and parsed into structured actions.

┌─────────────────────────────────────────────────────────────────────────┐
│ VOICE COMMAND PIPELINE │
│ │
│ Raw Text from │
│ WhisperService ──────▶ CommandParser.parse(text:) │
│ │ │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ Prefix Detection │ │
│ │ │ │
│ │ Does text start with │ │
│ │ command prefix? │ │
│ │ │ │
│ │ Default: "hey hush" │ │
│ │ Configurable in settings │ │
│ │ │ │
│ │ Case-insensitive match │ │
│ │ with fuzzy tolerance │ │
│ │ ("hey hush", "a hush", │ │
│ │ "hey hash" → all match) │ │
│ └──────────┬───────────────┘ │
│ │ │
│ ┌──────────┴───────────┐ │
│ │ No prefix detected │──────▶ Return to normal │
│ │ │ text pipeline │
│ └──────────────────────┘ │
│ │ │
│ (Prefix detected) │
│ │ │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ Command Body Extraction │ │
│ │ │ │
│ │ Strip prefix, normalize │ │
│ │ whitespace, lowercase │ │
│ │ │ │
│ │ "hey hush open Safari" │ │
│ │ → "open safari" │ │
│ └──────────┬───────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ Regex Pattern Matching │ │
│ │ (first pass — fast) │ │
│ │ │ │
│ │ Built-in patterns: │ │
│ │ • "open (.+)" │ │
│ │ • "switch to (.+)" │ │
│ │ • "type (.+)" │ │
│ │ • "search (for )?(.+)" │ │
│ │ • "mode (raw|clean|...)" │ │
│ │ • "undo" │ │
│ │ • "select all" │ │
│ │ • "copy that" │ │
│ │ • "paste" │ │
│ │ • "new line" │ │
│ │ • "new paragraph" │ │
│ │ • "delete that" │ │
│ └──────────┬───────────────┘ │
│ │ │
│ ┌──────────┴───────────┐ │
│ │ No regex match │ │
│ └──────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ LLM Command Parser │ │
│ │ (second pass — smart) │ │
│ │ │ │
│ │ Send command body to LLM │ │
│ │ with structured output │ │
│ │ prompt: │ │
│ │ │ │
│ │ "Classify this voice │ │
│ │ command into an action │ │
│ │ type and parameters. │ │
│ │ Output JSON." │ │
│ │ │ │
│ │ Handles natural language: │ │
│ │ "can you open my browser" │ │
│ │ → { action: "open_app", │ │
│ │ target: "Safari" } │ │
│ └──────────┬───────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ Action Executor │ │
│ │ │ │
│ │ Dispatch parsed command │ │
│ │ to appropriate system API│ │
│ │ │ │
│ │ open_app → NSWorkspace │ │
│ │ keystroke → CGEvent │ │
│ │ system → AppleScript │ │
│ │ mode → ModeManager │ │
│ │ text_edit → CGEvent seq │ │
│ └──────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘

/// Parsed voice command with action type and parameters.
enum VoiceCommand {
    case openApp(name: String)
    case switchToApp(name: String)
    case typeText(text: String)
    case searchFor(query: String)
    case changeMode(ProcessingMode)
    case keystroke(KeystrokeAction)
    case textEdit(TextEditAction)
    case unknown(rawText: String)
}

enum KeystrokeAction {
    case undo, redo, copy, paste, cut, selectAll, newLine, newParagraph
}

enum TextEditAction {
    case deleteLastWord, deleteLastSentence, deleteLine
}

/// Executes parsed voice commands against macOS system APIs.
actor ActionExecutor {
    func execute(_ command: VoiceCommand) async throws {
        switch command {
        case .openApp(let name):
            let config = NSWorkspace.OpenConfiguration()
            // resolveAppBundleId maps a spoken name ("safari") to a
            // bundle identifier ("com.apple.Safari").
            if let appURL = NSWorkspace.shared.urlForApplication(
                withBundleIdentifier: resolveAppBundleId(name)
            ) {
                try await NSWorkspace.shared.openApplication(
                    at: appURL,
                    configuration: config
                )
            }
        case .keystroke(let action):
            let source = CGEventSource(stateID: .hidSystemState)
            switch action {
            case .undo:
                postKeystroke(key: 6, flags: .maskCommand, source: source) // Cmd+Z
            case .copy:
                postKeystroke(key: 8, flags: .maskCommand, source: source) // Cmd+C
            case .selectAll:
                postKeystroke(key: 0, flags: .maskCommand, source: source) // Cmd+A
            // ... other keystroke actions
            default:
                break
            }
        case .changeMode(let mode):
            await ModeManager.shared.setActiveMode(mode)
        case .unknown(let rawText):
            throw CommandError.unrecognizedCommand(rawText)
        default:
            break
        }
    }
}

ℹ️ Info: The two-pass command parsing strategy (regex first, LLM second) ensures that common commands execute instantly (~1ms for regex) while still supporting natural language variations through the LLM (~200-500ms). If the LLM is not loaded, unrecognized commands fall through to the text injection pipeline as regular transcription.
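
A condensed sketch of that two-pass parse over the extracted command body (the pattern list is abbreviated, and `llmParse` stands in for the structured-output LLM call):

/// Sketch of the two-pass strategy: built-in patterns first (fast path),
/// LLM classification second (slow path).
func parseCommandBody(
    _ body: String,
    llmParse: (String) async throws -> VoiceCommand?
) async throws -> VoiceCommand {
    // First pass: built-in patterns (~1ms)
    if body.hasPrefix("open ") {
        return .openApp(name: String(body.dropFirst("open ".count)))
    }
    if body.hasPrefix("switch to ") {
        return .switchToApp(name: String(body.dropFirst("switch to ".count)))
    }
    switch body {
    case "undo":       return .keystroke(.undo)
    case "select all": return .keystroke(.selectAll)
    case "copy that":  return .keystroke(.copy)
    default:           break
    }
    // ... remaining built-in patterns elided ...
    // Second pass: LLM classification (~200-500ms)
    return try await llmParse(body) ?? .unknown(rawText: body)
}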


Component Responsibilities

Presentation Layer

| Component | Responsibility | Dependencies | Thread Affinity |
|---|---|---|---|
| MenuBarView | SwiftUI menu bar interface, recording state indicator, quick mode switching | AppState, TranscriptionCoordinator | @MainActor |
| SettingsView | Multi-tab settings window (General, Models, Audio, Text, History, Advanced) | UserSettings, ModelInfo, AppProfile | @MainActor |
| OverlayView | Floating transparent panel showing recording/processing state indicator | AppState | @MainActor |
| OnboardingView | First-launch setup wizard (permissions, model download, hotkey config) | PermissionManager, ModelFileManager | @MainActor |
| HistoryView | Searchable, filterable list of past dictation entries | DictationEntry, SwiftData queries | @MainActor |
| ModelManagerView | Model download/delete interface, storage usage display | ModelInfo, ModelFileManager | @MainActor |

Application Services

| Component | Responsibility | Dependencies | Thread Affinity |
|---|---|---|---|
| TranscriptionCoordinator | Orchestrates complete dictation lifecycle: start recording, stop, transcribe, post-process, inject | AudioCaptureService, WhisperService, LLMService, TextInjectionService, CommandParser | actor (own executor) |
| HotkeyManager | Registers and monitors global keyboard shortcuts via CGEvent tap | CGEvent, TranscriptionCoordinator | Main thread (event tap) |
| ModeManager | Resolves active processing mode by checking AppProfile for focused app, falling back to global default | AppProfile, UserSettings, NSWorkspace | @MainActor |
| PermissionManager | Requests and monitors Accessibility and Microphone permissions | AXIsProcessTrusted, AVCaptureDevice | @MainActor |
| AppState | Central @Observable object publishing recording state, current mode, active model info to all UI | None (pure state) | @MainActor |

Domain Layer

| Component | Responsibility | Dependencies | Thread Affinity |
|---|---|---|---|
| WhisperService | whisper.cpp context management, model loading/unloading, inference execution, language detection | whisper.cpp bridge | actor (inference thread) |
| LLMService | llama.cpp context management, prompt execution, token sampling | llama.cpp bridge, PromptTemplateEngine | actor (inference thread) |
| AudioCaptureService | AVAudioEngine lifecycle, tap installation, format conversion (48kHz → 16kHz), ring buffer management | AVAudioEngine, CircularAudioBuffer | Audio thread (callback) |
| TextInjectionService | Dual-mode text injection (CGEvent keystrokes or clipboard paste), strategy selection | CGEvent, NSPasteboard, ClipboardPreserver | Background thread |
| CommandParser | Voice command prefix detection, regex pattern matching, LLM-based natural language parsing | LLMService (optional), regex patterns | actor |
| VocabularyService | Post-inference word replacement pipeline, applies global and app-specific vocabulary entries | VocabularyEntry, AppProfile | Any (stateless) |
| PromptTemplateEngine | Resolves prompt templates by mode, renders variable substitutions | PromptTemplate, SwiftData | Any (stateless) |
| VADProcessor | Voice activity detection using energy-based thresholding, silence trimming | None (pure computation) | Audio thread |
| ActionExecutor | Executes parsed voice commands against macOS system APIs | NSWorkspace, CGEvent, AppleScript bridge | actor |

Infrastructure Layer

| Component | Responsibility | Dependencies | Thread Affinity |
|---|---|---|---|
| whisper.cpp Bridge | C bridging header exposing whisper.h functions to Swift, OpaquePointer lifecycle | whisper.cpp static library, Metal framework | N/A (C library) |
| llama.cpp Bridge | C bridging header exposing llama.h functions to Swift, OpaquePointer lifecycle | llama.cpp static library, Metal framework | N/A (C library) |
| AVAudioEngine (system) | macOS system audio capture, device selection, format negotiation | macOS Audio subsystem | Audio thread |
| CGEvent Bridge | Quartz Event Services for keystroke simulation, global event tapping | macOS Accessibility framework | HID event thread |
| SwiftDataStore | ModelContainer and ModelContext factory, migration plan, background context creation | SwiftData, SQLite | Per-context |
| ModelFileManager | GGUF/bin model file download (URLSession), validation (file integrity), storage path management | URLSession, FileManager | Background thread |
| NSWorkspace Bridge | Frontmost application detection, app launching, bundle ID resolution | AppKit | Main thread |
| NSPasteboard Bridge | System clipboard read/write, content type handling, preservation/restore | AppKit | Main thread |

Threading Model

VaulType uses a combination of Swift Concurrency (actor, async/await, Task) and explicit GCD dispatch for components that interact with C libraries or system callbacks.

┌─────────────────────────────────────────────────────────────────────────┐
│ THREAD ARCHITECTURE │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ MAIN THREAD (@MainActor) │ │
│ │ │ │
│ │ • All SwiftUI views and state updates │ │
│ │ • AppState (@Observable) property mutations │ │
│ │ • PermissionManager (AXIsProcessTrusted checks) │ │
│ │ • ModeManager (NSWorkspace.frontmostApplication) │ │
│ │ • HotkeyManager (CGEvent tap registration) │ │
│ │ • NSPasteboard read/write │ │
│ │ • UserDefaults access │ │
│ │ │ │
│ │ Rule: No blocking operations. No inference calls. │ │
│ │ Maximum blocking time: < 16ms (one frame @ 60fps) │ │
│ └────────────────────────┬──────────────────────────────────────┘ │
│ │ │
│ ┌───────────┼───────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────────────┐ ┌──────────────┐ ┌───────────────────────────┐ │
│ │ AUDIO THREAD │ │ INFERENCE │ │ BACKGROUND THREAD(S) │ │
│ │ │ │ THREAD(S) │ │ │ │
│ │ AVAudioEngine │ │ │ │ Model file downloads │ │
│ │ installTap │ │ WhisperSvc │ │ (URLSession background) │ │
│ │ callback. │ │ .transcribe()│ │ │ │
│ │ │ │ │ │ SwiftData background │ │
│ │ Runs on Apple's│ │ LLMService │ │ ModelActor operations │ │
│ │ audio IO │ │ .process() │ │ (history cleanup, export) │ │
│ │ thread. │ │ │ │ │ │
│ │ │ │ CommandParser│ │ Model validation + │ │
│ │ MUST return │ │ .parse() │ │ integrity checks │ │
│ │ quickly │ │ │ │ │ │
│ │ (< 10ms). │ │ Each is a │ │ Vocabulary reloading │ │
│ │ │ │ Swift actor │ │ │ │
│ │ Only writes to │ │ with its own │ │ Clipboard restoration │ │
│ │ ring buffer. │ │ serial │ │ (delayed dispatch) │ │
│ │ │ │ executor. │ │ │ │
│ │ Lock-free │ │ │ │ CGEvent keystroke │ │
│ │ SPSC pattern. │ │ Can run │ │ simulation (with delays) │ │
│ │ │ │ concurrently │ │ │ │
│ │ │ │ with audio │ │ │ │
│ │ │ │ capture. │ │ │ │
│ └───────┬────────┘ └──────┬───────┘ └────────────┬──────────────┘ │
│ │ │ │ │
│ │ Sync Points │ │ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ SYNCHRONIZATION LAYER │ │
│ │ │ │
│ │ 1. Ring Buffer: Atomic read/write indices (lock-free SPSC) │ │
│ │ Audio thread → writes samples │ │
│ │ Inference thread → reads/drains samples │ │
│ │ │ │
│ │ 2. Actor isolation: WhisperService, LLMService, Command- │ │
│ │ Parser all use Swift actor isolation — mutual exclusion │ │
│ │ guaranteed by the Swift runtime │ │
│ │ │ │
│ │ 3. @MainActor: All UI state transitions dispatched via │ │
│ │ MainActor.run {} or @MainActor-annotated methods │ │
│ │ │ │
│ │ 4. SwiftData ModelContext: One context per thread/actor. │ │
│ │ Main context for UI reads. Background ModelActor for │ │
│ │ writes (cleanup, import). │ │
│ │ │ │
│ │ 5. Combine: @Published properties on @MainActor ensure │ │
│ │ UI updates are delivered on the main thread │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘

| Sync Point | Mechanism | Producer | Consumer | Data |
|---|---|---|---|---|
| Audio samples | Lock-free ring buffer (atomic indices) | Audio callback thread | Inference thread | [Float] PCM samples |
| Transcription result | Swift actor isolation (await) | WhisperService actor | TranscriptionCoordinator actor | String raw text |
| LLM result | Swift actor isolation (await) | LLMService actor | TranscriptionCoordinator actor | String processed text |
| UI state updates | @MainActor + @Observable | Any actor (via MainActor.run) | SwiftUI views | AppState properties |
| SwiftData writes | ModelActor (background context) | Background cleanup service | Main context (auto-refresh) | DictationEntry inserts |
| Pipeline state | Combine @Published | TranscriptionCoordinator | MenuBarView, OverlayView | PipelineState enum |

/// Pipeline states published to the UI via @MainActor.
enum PipelineState: String, Sendable {
    case idle
    case recording
    case transcribing
    case postProcessing
    case injecting
    case error
}

/// The TranscriptionCoordinator is the central orchestrator.
/// It is an actor to serialize pipeline operations and prevent
/// concurrent transcription attempts.
actor TranscriptionCoordinator {
    private let audioService: AudioCaptureService
    private let whisperService: WhisperService
    private let llmService: LLMService
    private let textInjector: TextInjectionService
    private let commandParser: CommandParser
    private let modeManager: ModeManager

    /// UI-facing state lives on the @MainActor AppState object; an actor's
    /// stored properties cannot themselves be @MainActor-isolated, so the
    /// coordinator mirrors its progress there explicitly.
    /// (AppState.pipelineState is the assumed property name.)
    private let appState: AppState

    private func publish(_ newState: PipelineState) async {
        await MainActor.run { appState.pipelineState = newState }
    }

    func startRecording() async throws {
        await publish(.recording)
        try audioService.startCapture()
    }

    func stopAndProcess() async throws {
        audioService.stopCapture()
        await publish(.transcribing)
        let samples = audioService.getAccumulatedSamples()
        let rawText = try await whisperService.transcribe(
            samples: samples,
            params: currentWhisperParams()
        )
        // Check for voice commands first
        if let command = try await commandParser.parse(rawText) {
            try await ActionExecutor().execute(command)
            await publish(.idle)
            return
        }
        // Normal text pipeline
        await publish(.postProcessing)
        let mode = await modeManager.resolveMode()
        let processed: String
        do {
            processed = try await llmService.process(
                rawText: rawText,
                mode: mode,
                templateEngine: PromptTemplateEngine(
                    modelContext: backgroundModelContext // background SwiftData context
                )
            )
        } catch {
            // Fallback: inject raw text if the LLM fails
            processed = rawText
        }
        await publish(.injecting)
        try await textInjector.inject(processed)
        await publish(.idle)
    }
}

⚠️ Warning: The AudioCaptureService is intentionally not an actor because its installTap callback runs on Apple’s internal audio I/O thread. Making it an actor would cause the callback to hop to the actor’s executor, introducing unacceptable latency. Instead, the audio callback writes to a lock-free ring buffer, and the service exposes @unchecked Sendable conformance with carefully documented thread-safety invariants.
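
A sketch of that callback path, with the conversion step passed in as a closure (`convertTo16kHzMono` stands in for the AVAudioConverter wrapper):

import AVFoundation

/// Sketch of the capture wiring: the tap closure runs on the audio I/O
/// thread and only converts and writes into the ring buffer.
func startCapture(
    engine: AVAudioEngine,
    ringBuffer: CircularAudioBuffer,
    convertTo16kHzMono: @escaping (AVAudioPCMBuffer) -> [Float]
) throws {
    let input = engine.inputNode
    let format = input.outputFormat(forBus: 0)
    input.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
        // Audio I/O thread: must return quickly; no locks, no blocking calls.
        ringBuffer.append(convertTo16kHzMono(buffer))
    }
    try engine.start()
}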


Model Memory Management

ML model memory management is critical for VaulType. A typical configuration loads 0.5-3 GB of model weights into memory. This section describes how models are loaded, retained, unloaded, and how the app responds to system memory pressure.

┌─────────────────────────────────────────────────────────────────────────┐
│ MODEL LIFECYCLE │
│ │
│ ┌─────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ COLD │────▶│ LOADING │────▶│ WARM │────▶│ INFERENCE │ │
│ │ (on disk │ │ │ │ (in RAM, │ │ (actively │ │
│ │ only) │ │ mmap + │ │ ready │ │ running │ │
│ │ │ │ context │ │ for │ │ whisper_ │ │
│ │ │ │ creation │ │ calls) │ │ full or │ │
│ │ │ │ │ │ │ │ llama │ │
│ │ │ │ Time: │ │ │ │ _decode) │ │
│ │ │ │ 100ms- │ │ │ │ │ │
│ │ │ │ 800ms │ │ │ │ │ │
│ └─────────┘ └──────────┘ └────┬─────┘ └──────┬───────┘ │
│ ▲ │ │ │
│ │ │ │ │
│ │ ┌──────────┐ │ │ │
│ │ │ UNLOADING│◀─────────┘ │ │
│ └──────────│ │◀──────────────────────────────┘ │
│ │ whisper_ │ │
│ │ free() / │ Triggers: │
│ │ llama_ │ • User switches model in Settings │
│ │ free() │ • Memory pressure notification │
│ │ │ • App enters background (optional) │
│ │ Time: │ • App termination (cleanup) │
│ │ < 10ms │ │
│ └──────────┘ │
│ │
│ PRELOADING STRATEGY: │
│ │
│ On app launch: │
│ 1. Load Whisper model immediately (required for core function) │
│ 2. Load LLM model in background after Whisper is ready │
│ 3. If both models exceed 60% of system RAM, show warning │
│ │
│ On model switch: │
│ 1. Unload current model of that type │
│ 2. Load new model │
│ 3. Warm up with a short test inference (optional, configurable) │
│ │
└─────────────────────────────────────────────────────────────────────────┘

Both whisper.cpp and llama.cpp support mmap (memory-mapped I/O) for loading model weight files. This is critical for memory efficiency:

/// Model loading configuration emphasizing mmap for memory efficiency.
struct ModelLoadConfiguration {
    /// Enable memory-mapped I/O for model weights.
    /// When true, the OS maps the model file directly into the process
    /// address space. Only pages that are actively needed for inference
    /// are loaded into physical RAM. The OS can evict pages under memory
    /// pressure and reload them transparently from disk.
    var useMmap: Bool = true

    /// Number of GPU layers to offload to Metal.
    /// -1 means offload all layers. 0 means CPU only.
    /// Values in between split layers between CPU and GPU.
    var gpuLayers: Int32 = -1

    /// Lock model weights in RAM (prevent paging to disk).
    /// Use only when real-time latency is critical and sufficient
    /// RAM is available. Increases memory pressure.
    var lockMemory: Bool = false
}
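
A sketch of how this configuration maps onto llama.cpp's C API; use_mmap, use_mlock, and n_gpu_layers are llama.h's actual parameter fields, while the mapping function itself is illustrative:

/// Illustrative mapping from the app-level configuration to llama.cpp's
/// model parameters.
func makeLlamaModelParams(from config: ModelLoadConfiguration) -> llama_model_params {
    var params = llama_model_default_params()
    params.use_mmap = config.useMmap       // mmap the weights (default)
    params.use_mlock = config.lockMemory   // pin weights in RAM (rarely needed)
    params.n_gpu_layers = config.gpuLayers // -1 = offload all layers to Metal
    return params
}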

How mmap affects memory reporting:

┌───────────────────────────────────────────────────────────────┐
│ Memory Reporting for a 2 GB model with mmap enabled │
│ │
│ Activity Monitor "Memory" column: ~2.5 GB │
│ (Includes mmap'd pages — misleading!) │
│ │
│ Actual physical RAM usage: ~800 MB - 1.5 GB │
│ (Only actively-used pages) │
│ │
│ Memory Pressure gauge: Green/Yellow │
│ (OS can reclaim mmap pages freely) │
│ │
│ ┌─────────────────────────────────────────┐ │
│ │ Model file on disk (2 GB) │ │
│ │ ████████████████████████████████████████│ │
│ └─────────────────────────────────────────┘ │
│ ▲ ▲ ▲ │
│ │ │ │ mmap: OS loads pages │
│ │ │ │ on demand │
│ ┌────────┴───────────┴──────────┴─────────┐ │
│ │ Physical RAM (pages loaded on access) │ │
│ │ ████████░░░░████████░░░░░░████████░░░░ │ │
│ │ ^used^ ^not^ ^used^ ^used^ │ │
│ │ loaded │ │
│ └─────────────────────────────────────────┘ │
│ │
└───────────────────────────────────────────────────────────────┘

VaulType responds to macOS memory pressure notifications to prevent the system from becoming unresponsive:

import Foundation

/// Monitors system memory pressure and triggers model unloading
/// when the system is under stress.
final class MemoryPressureMonitor {
    private var source: DispatchSourceMemoryPressure?
    private let whisperService: WhisperService
    private let llmService: LLMService

    init(whisperService: WhisperService, llmService: LLMService) {
        self.whisperService = whisperService
        self.llmService = llmService
    }

    func startMonitoring() {
        source = DispatchSource.makeMemoryPressureSource(
            eventMask: [.warning, .critical],
            queue: .global(qos: .utility)
        )
        source?.setEventHandler { [weak self] in
            guard let self else { return }
            let event = self.source?.data ?? []
            Task {
                if event.contains(.critical) {
                    // Critical: Unload both models immediately
                    await self.llmService.unloadModel()
                    await self.whisperService.unloadModel()
                    await MainActor.run {
                        NotificationCenter.default.post(
                            name: .modelsUnloadedDueToMemoryPressure,
                            object: nil
                        )
                    }
                } else if event.contains(.warning) {
                    // Warning: Unload the LLM only (less essential);
                    // Whisper is needed for core transcription
                    await self.llmService.unloadModel()
                }
            }
        }
        source?.resume()
    }

    func stopMonitoring() {
        source?.cancel()
        source = nil
    }
}

Memory management decision matrix:

| System RAM | Recommended Whisper | Recommended LLM | mmap | GPU Layers |
|---|---|---|---|---|
| 8 GB | tiny or base | Qwen2.5-0.5B Q4 | Required | All (-1) |
| 8 GB | small | Qwen2.5-1.5B Q4 | Required | All (-1) |
| 16 GB | small or medium | Qwen2.5-3B Q4 | Recommended | All (-1) |
| 16 GB | large-v3 | Llama-3.2-3B Q4 | Recommended | All (-1) |
| 32 GB | large-v3 | Phi-3-mini Q4 | Optional | All (-1) |
| 32 GB+ | large-v3 | Any 7B Q4 | Optional | All (-1) |

🍎 macOS-specific: Apple Silicon’s unified memory architecture means GPU and CPU share the same physical RAM pool. Setting gpuLayers: -1 (offload all layers to Metal) does not consume additional memory beyond what the model already uses — it simply tells the GPU to read from the same memory addresses. On Intel Macs with discrete GPUs, GPU offloading requires a separate copy of the offloaded layers in VRAM.


Plugin Architecture

VaulType is designed for future extensibility through a plugin system. While plugins are not yet implemented in the initial release, the architecture anticipates them so they can be added without breaking changes.

import Foundation

/// A VaulType plugin that can process text at specific points in the pipeline.
///
/// Plugins are discovered at launch, instantiated in sandboxed containers,
/// and invoked at well-defined pipeline stages.
protocol VaulTypePlugin: AnyObject, Sendable {
    /// Unique reverse-DNS identifier (e.g., "com.example.myplugin").
    static var identifier: String { get }
    /// Human-readable plugin name shown in Settings.
    static var displayName: String { get }
    /// Plugin version following semver.
    static var version: String { get }
    /// Which pipeline stages this plugin hooks into.
    static var hooks: Set<PluginHook> { get }

    /// Called once when the plugin is loaded. Use for setup.
    func activate() async throws
    /// Called when the plugin is being unloaded. Use for cleanup.
    func deactivate() async
    /// Process text at the given pipeline stage.
    /// Return the (possibly modified) text to pass to the next stage.
    func process(
        text: String,
        context: PluginContext,
        hook: PluginHook
    ) async throws -> String
}

/// Points in the pipeline where plugins can intercept and modify text.
enum PluginHook: String, Sendable, CaseIterable {
    /// After whisper.cpp transcription, before command parsing.
    case postTranscription
    /// After command parsing (only for non-command text), before LLM.
    case preLLM
    /// After LLM post-processing, before vocabulary replacement.
    case postLLM
    /// After vocabulary replacement, before text injection.
    case preInjection
}

/// Read-only context provided to plugins during processing.
struct PluginContext: Sendable {
    /// The current processing mode.
    let mode: ProcessingMode
    /// Detected language of the transcription.
    let language: String
    /// Bundle ID of the focused application.
    let targetAppBundleId: String?
    /// Duration of the audio recording in seconds.
    let audioDuration: TimeInterval
    /// Whisper confidence score (0.0 - 1.0).
    let confidence: Double
}

/// Manages plugin discovery, lifecycle, and execution.
actor PluginManager {
    private var loadedPlugins: [String: any VaulTypePlugin] = [:]
    /// Identifiers in registration order; dictionaries are unordered,
    /// so execution order is tracked separately.
    private var registrationOrder: [String] = []
    private var enabledPlugins: Set<String> = []

    /// Plugin search paths (in priority order).
    private let searchPaths: [URL] = [
        // User plugins
        FileManager.default.urls(
            for: .applicationSupportDirectory,
            in: .userDomainMask
        ).first!.appendingPathComponent("VaulType/Plugins"),
        // Built-in plugins
        Bundle.main.builtInPlugInsURL
    ].compactMap { $0 }

    /// Discover and load all plugins from search paths.
    func discoverPlugins() async throws {
        for path in searchPaths {
            guard FileManager.default.fileExists(atPath: path.path) else {
                continue
            }
            let contents = try FileManager.default.contentsOfDirectory(
                at: path,
                includingPropertiesForKeys: nil
            )
            for item in contents where item.pathExtension == "hushplugin" {
                try await loadPlugin(at: item)
            }
        }
    }

    /// Execute all enabled plugins for the given hook.
    func executeHook(
        _ hook: PluginHook,
        text: String,
        context: PluginContext
    ) async throws -> String {
        var result = text
        // Plugins execute in registration order
        for id in registrationOrder {
            guard enabledPlugins.contains(id),
                  let plugin = loadedPlugins[id],
                  type(of: plugin).hooks.contains(hook) else { continue }
            result = try await plugin.process(
                text: result,
                context: context,
                hook: hook
            )
        }
        return result
    }
}
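
To make the protocol concrete, a hypothetical example plugin (illustrative only, not shipped with VaulType):

/// Hypothetical plugin: collapses runs of whitespace after the LLM stage.
final class WhitespaceNormalizerPlugin: VaulTypePlugin {
    static let identifier = "com.example.whitespace-normalizer"
    static let displayName = "Whitespace Normalizer"
    static let version = "1.0.0"
    static let hooks: Set<PluginHook> = [.postLLM]

    func activate() async throws {}
    func deactivate() async {}

    func process(
        text: String,
        context: PluginContext,
        hook: PluginHook
    ) async throws -> String {
        text.replacingOccurrences(
            of: #"\s+"#,
            with: " ",
            options: .regularExpression
        )
    }
}
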
┌─────────────────────────────────────────────────────────────────────────┐
│ PLUGIN SANDBOX MODEL │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ VaulType Main Process │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────────────┐ │ │
│ │ │ TranscriptionCoordinator │ │ │
│ │ │ │ │ │ │
│ │ │ ▼ │ │ │
│ │ │ PluginManager.executeHook(.postTranscription, ...) │ │ │
│ │ └──────┬────────────────────────────────────────────────┘ │ │
│ │ │ │ │
│ └─────────┼─────────────────────────────────────────────────────┘ │
│ │ XPC connection (future) │
│ │ or in-process with restrictions (v1) │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Plugin Sandbox │ │
│ │ │ │
│ │ Restrictions: │ │
│ │ • No network access (URLSession blocked) │ │
│ │ • No file system access outside plugin's own container │ │
│ │ • No access to system APIs (CGEvent, NSWorkspace, etc.) │ │
│ │ • No access to SwiftData or other VaulType internal state │ │
│ │ • 5-second timeout per process() call │ │
│ │ • 50 MB memory limit per plugin │ │
│ │ │ │
│ │ Allowed: │ │
│ │ • Read PluginContext (read-only metadata) │ │
│ │ • Receive text (String) │ │
│ │ • Return modified text (String) │ │
│ │ • Use Foundation string processing │ │
│ │ • Use own bundled resources │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘

ℹ️ Info: The initial plugin architecture (v1) runs plugins in-process with soft restrictions enforced by API design (plugins only receive String and PluginContext, not service references). A future version (v2) will use XPC Services for true process-level isolation, enabling untrusted third-party plugins with hardware-enforced sandboxing.

⚠️ Warning: Plugin support is a planned feature for a future release. The protocols and architecture described here are subject to change. The initial release of VaulType does not load or execute plugins.


Error Handling

VaulType uses a structured error handling strategy with typed error domains, fallback chains for graceful degradation, and consistent user-facing error presentation.

┌─────────────────────────────────────────────────────────────────────────┐
│ ERROR DOMAIN HIERARCHY │
│ │
│ VaulTypeError (top-level) │
│ │ │
│ ├── AudioError │
│ │ ├── .microphonePermissionDenied │
│ │ ├── .noInputDeviceAvailable │
│ │ ├── .formatCreationFailed │
│ │ ├── .converterCreationFailed │
│ │ ├── .engineStartFailed(underlying: Error) │
│ │ └── .bufferOverflow │
│ │ │
│ ├── WhisperError │
│ │ ├── .modelLoadFailed(path: URL) │
│ │ ├── .contextNotLoaded │
│ │ ├── .inferenceFailed(code: Int32) │
│ │ ├── .emptyTranscription │
│ │ └── .modelFileCorrupted(path: URL) │
│ │ │
│ ├── LLMError │
│ │ ├── .modelLoadFailed(path: URL) │
│ │ ├── .contextCreationFailed │
│ │ ├── .inferenceFailed(underlying: Error) │
│ │ ├── .tokenizationFailed │
│ │ ├── .outputTruncated(maxTokens: Int) │
│ │ └── .modelNotLoaded │
│ │ │
│ ├── InjectionError │
│ │ ├── .accessibilityPermissionDenied │
│ │ ├── .cgEventCreationFailed │
│ │ ├── .clipboardWriteFailed │
│ │ ├── .noFocusedApplication │
│ │ └── .pasteTimeout │
│ │ │
│ ├── CommandError │
│ │ ├── .unrecognizedCommand(String) │
│ │ ├── .appNotFound(name: String) │
│ │ ├── .actionFailed(underlying: Error) │
│ │ └── .llmParsingFailed │
│ │ │
│ ├── ModelFileError │
│ │ ├── .downloadFailed(url: URL, underlying: Error) │
│ │ ├── .insufficientDiskSpace(required: UInt64, available: UInt64) │
│ │ ├── .checksumMismatch(expected: String, actual: String) │
│ │ └── .fileNotFound(path: URL) │
│ │ │
│ ├── PromptError │
│ │ ├── .noTemplateForMode(ProcessingMode) │
│ │ ├── .variableNotProvided(name: String) │
│ │ └── .templateRenderFailed │
│ │ │
│ └── PluginError │
│ ├── .loadFailed(identifier: String, underlying: Error) │
│ ├── .executionTimeout(identifier: String, hook: PluginHook) │
│ ├── .memoryLimitExceeded(identifier: String) │
│ └── .invalidOutput(identifier: String) │
│ │
└─────────────────────────────────────────────────────────────────────────┘

/// Top-level error type encompassing all VaulType error domains.
enum VaulTypeError: Error, LocalizedError {
    case audio(AudioError)
    case whisper(WhisperError)
    case llm(LLMError)
    case injection(InjectionError)
    case command(CommandError)
    case modelFile(ModelFileError)
    case prompt(PromptError)
    case plugin(PluginError)

    var errorDescription: String? {
        switch self {
        case .audio(let e): return e.localizedDescription
        case .whisper(let e): return e.localizedDescription
        case .llm(let e): return e.localizedDescription
        case .injection(let e): return e.localizedDescription
        case .command(let e): return e.localizedDescription
        case .modelFile(let e): return e.localizedDescription
        case .prompt(let e): return e.localizedDescription
        case .plugin(let e): return e.localizedDescription
        }
    }
}

enum AudioError: Error, LocalizedError {
    case microphonePermissionDenied
    case noInputDeviceAvailable
    case formatCreationFailed
    case converterCreationFailed
    case engineStartFailed(underlying: Error)
    case bufferOverflow

    var errorDescription: String? {
        switch self {
        case .microphonePermissionDenied:
            return "Microphone access is required. Grant permission in System Settings > Privacy & Security > Microphone."
        case .noInputDeviceAvailable:
            return "No microphone detected. Connect a microphone and try again."
        case .engineStartFailed(let err):
            return "Audio engine failed to start: \(err.localizedDescription)"
        default:
            return "An audio error occurred."
        }
    }
}

VaulType implements fallback chains so that partial failures degrade functionality gracefully rather than blocking the user entirely.

┌─────────────────────────────────────────────────────────────────────────┐
│ FALLBACK CHAINS │
│ │
│ CHAIN 1: LLM Post-Processing Failure │
│ ───────────────────────────────────── │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ LLM │────▶│ Retry │────▶│ Inject │────▶│ Show │ │
│ │ inference │ │ once with│ │ raw text │ │ warning │ │
│ │ fails │ │ shorter │ │ (skip │ │ to user │ │
│ │ │ │ context │ │ post- │ │ "Text │ │
│ │ │ │ │ │ process) │ │ injected │ │
│ └──────────┘ └──────────┘ └──────────┘ │ without │ │
│ │ │ cleanup" │ │
│ (if retry fails) └──────────┘ │
│ │
│ CHAIN 2: Text Injection Failure │
│ ──────────────────────────────── │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ CGEvent │────▶│ Fall back│────▶│ Copy to │────▶│ Show │ │
│ │ injection│ │ to │ │ clipboard│ │ notifi- │ │
│ │ fails │ │ clipboard│ │ only │ │ cation: │ │
│ │ (no │ │ paste │ │ (no │ │ "Text │ │
│ │ a11y │ │ │ │ paste) │ │ copied" │ │
│ │ perm) │ │ │ │ │ │ │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │ │
│ (if paste also fails) │
│ │
│ CHAIN 3: Whisper Inference Failure │
│ ────────────────────────────────── │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Whisper │────▶│ Retry │────▶│ Show │ │
│ │ inference│ │ with │ │ error │ │
│ │ fails │ │ smaller │ │ "Trans- │ │
│ │ │ │ model │ │ cription │ │
│ │ │ │ (if │ │ failed. │ │
│ │ │ │ avail- │ │ Try │ │
│ │ │ │ able) │ │ again." │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │ │
│ (if no fallback model) │
│ │
│ CHAIN 4: Audio Capture Failure │
│ ────────────────────────────── │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Audio │────▶│ Try │────▶│ Show │ │
│ │ engine │ │ system │ │ error │ │
│ │ fails │ │ default │ │ with │ │
│ │ with │ │ device │ │ link to │ │
│ │ selected │ │ │ │ Sound │ │
│ │ device │ │ │ │ settings │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │ │
│ (if default also fails) │
│ │
│ CHAIN 5: Model Loading Failure │
│ ────────────────────────────── │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Model │────▶│ Verify │────▶│ Offer │────▶│ Open │ │
│ │ fails to │ │ file │ │ re- │ │ model │ │
│ │ load │ │ integrity│ │ download │ │ manager │ │
│ │ │ │ (check │ │ (delete │ │ in │ │
│ │ │ │ size, │ │ corrupt │ │ settings │ │
│ │ │ │ header) │ │ file) │ │ │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
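
Chain 1 translated into a minimal sketch (the parameter passing and the truncation-based retry are illustrative simplifications of "retry once with shorter context"):

/// Sketch of fallback chain 1: LLM failure degrades to raw-text injection
/// instead of blocking the user.
func postProcessWithFallback(
    _ rawText: String,
    mode: ProcessingMode,
    llmService: LLMService,
    engine: PromptTemplateEngine
) async -> (text: String, usedFallback: Bool) {
    do {
        let cleaned = try await llmService.process(
            rawText: rawText, mode: mode, templateEngine: engine
        )
        return (cleaned, false)
    } catch {
        // Retry once with a shorter input (stand-in for "shorter context").
        if let retried = try? await llmService.process(
            rawText: String(rawText.prefix(1_000)), mode: mode, templateEngine: engine
        ) {
            return (retried, false)
        }
        return (rawText, true) // caller shows "Text injected without cleanup"
    }
}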

/// Converts internal errors into user-friendly presentation.
struct ErrorPresenter {
    /// Determine the appropriate presentation style for an error.
    static func presentation(for error: VaulTypeError) -> ErrorPresentation {
        switch error {
        case .audio(.microphonePermissionDenied):
            return ErrorPresentation(
                title: "Microphone Access Required",
                message: "VaulType needs microphone access to transcribe your speech.",
                style: .alert,
                actions: [
                    .openSystemSettings("Privacy & Security > Microphone"),
                    .dismiss
                ],
                severity: .blocking
            )
        case .whisper(.inferenceFailed):
            return ErrorPresentation(
                title: "Transcription Failed",
                message: "The speech-to-text engine encountered an error. Please try again.",
                style: .notification,
                actions: [.retry, .dismiss],
                severity: .recoverable
            )
        case .llm(.modelNotLoaded):
            return ErrorPresentation(
                title: "Text Processing Unavailable",
                message: "The language model is not loaded. Raw transcription will be used.",
                style: .toast,
                actions: [.openModelManager, .dismiss],
                severity: .degraded
            )
        case .injection(.accessibilityPermissionDenied):
            return ErrorPresentation(
                title: "Accessibility Permission Required",
                message: "VaulType needs Accessibility access to type text into applications. Text has been copied to your clipboard instead.",
                style: .alert,
                actions: [
                    .openSystemSettings("Privacy & Security > Accessibility"),
                    .dismiss
                ],
                severity: .degraded
            )
        default:
            return ErrorPresentation(
                title: "Something Went Wrong",
                message: error.localizedDescription,
                style: .notification,
                actions: [.dismiss],
                severity: .recoverable
            )
        }
    }
}

struct ErrorPresentation {
    let title: String
    let message: String
    let style: PresentationStyle
    let actions: [ErrorAction]
    let severity: ErrorSeverity

    enum PresentationStyle {
        case alert        // Modal alert dialog (blocking errors)
        case notification // macOS notification center (transient errors)
        case toast        // In-app toast overlay (informational)
        case menuBarBadge // Red badge on menu bar icon (persistent warnings)
    }

    enum ErrorAction {
        case dismiss
        case retry
        case openSystemSettings(String)
        case openModelManager
        case contactSupport
    }

    enum ErrorSeverity {
        case blocking      // App cannot function (no mic permission)
        case degraded      // App works with reduced functionality
        case recoverable   // Temporary failure, retry may succeed
        case informational // No action needed
    }
}

Do: Always provide a clear, actionable error message. Tell the user what happened, why it happened, and what they can do about it. Include a direct action (button, link) to resolve the issue.

Don’t: Expose raw error codes, stack traces, or internal component names in user-facing errors. The user does not need to know that whisper_full() returned error code -7.

💡 Tip: All errors are also logged to the unified logging system (os_log) with the com.vaultype subsystem. Users can collect diagnostic logs via Console.app for bug reports. Sensitive data (transcription text) is never included in log messages.
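
A minimal sketch of that convention using the unified logging API (the category name and message fields are illustrative):

import os

/// Only metadata (counts, durations) is interpolated into log messages,
/// never the transcription text itself.
let pipelineLog = Logger(subsystem: "com.vaultype", category: "pipeline")

func logTranscriptionFinished(sampleCount: Int, seconds: Double) {
    pipelineLog.info("Transcription finished: \(sampleCount) samples in \(seconds, format: .fixed(precision: 2)) s")
}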



This document is part of the VaulType Documentation. For questions or corrections, please open an issue on the GitHub repository.