LLM Processing Pipeline
Last Updated: 2026-02-20 Component: LLM Processing Engine Module:
VaulType/Services/LLM/Status: Complete (Phases 2-5) License: GPL-3.0
Table of Contents
Section titled “Table of Contents”- 1. Overview
- 2. llama.cpp Integration Architecture
- 3. Ollama Integration as Alternative Backend
- 4. Unified Backend Protocol
- 5. Model Recommendations and Benchmarks
- 6. Prompt Template System Design
- 7. Processing Mode Implementations
- 8. Token Management and Context Window Handling
- 9. Latency Optimization Strategies
- 10. Error Handling and Fallback
- 11. LLM Processing Pipeline Architecture
- 12. Configuration Reference
- 13. Related Documentation
1. Overview
Section titled “1. Overview”VaulType’s LLM Processing Pipeline transforms raw speech-to-text transcriptions into polished, context-aware text output. The entire pipeline runs locally on the user’s machine, maintaining VaulType’s zero-network architecture: no cloud calls, no telemetry, no data exfiltration.
Implementation Status
Section titled “Implementation Status”- llama.cpp (version b8059) — compiled as a static library linked directly into VaulType. This is the sole LLM backend. Metal GPU acceleration is enabled with
GGML_METAL_EMBED_LIBRARY=ON(no separate.metallibneeded). - Ollama backend removed — the OllamaProvider was removed during Phase 5 refactoring. llama.cpp is the only backend.
- LlamaContext.swift lives in
VaulType/Services/LLM/and mirrors the WhisperContext pattern. - llama.cpp shares ggml libs with whisper.cpp — only
-lllamaadded to avoid duplicate ggml linking.
Processing Pipeline (Implemented)
Section titled “Processing Pipeline (Implemented)”AudioCaptureService (AVAudioEngine) → WhisperService (whisper.cpp transcription) → VoicePrefixDetector (strips mode prefix, detects mode switch) → VocabularyService (applies spoken→replacement pairs) → CommandDetector (command vs. dictation classification) → CustomCommandExecutor → AppAliasResolver (app-specific shortcut aliases) → GlobalAliasResolver (user-defined global aliases) → ProcessingModeRouter → PromptTemplateEngine → LLMService (llama.cpp) → OverlayWindow (optional edit-before-inject) → TextInjectionService (CGEvent/clipboard) → DictationHistory (SwiftData)Six Processing Modes
Section titled “Six Processing Modes”| Mode | LLM Required | Description |
|---|---|---|
| Raw | No | Unprocessed whisper output |
| Clean | Yes | Grammar, punctuation, remove filler words |
| Structure | Yes | Paragraphs, bullet lists, headings |
| Prompt | Yes | User-defined prompt template |
| Code | Yes | Convert spoken instructions to source code |
| Custom | Yes | Fully user-defined pipeline |
+------------------------------------------------------------------+| VaulType LLM Pipeline || || Speech Audio || | || v || +-----------------+ +-----------------+ || | whisper.cpp |---->| Raw Transcript | || | (ASR Engine) | | "i need to fix | || +-----------------+ | the login bug" | || +---------+-------+ || | || +---------v---------+ || | Processing Mode | || | Selection | || +---------+---------+ || | || +----------+----------+----------+-------+ || | | | | | || v v v v v || [Raw] [Clean] [Structure] [Prompt] [Code] || | | | | | || | +-----v----------v----------v--------v---+ || | | Prompt Template Engine | || | | - Variable substitution | || | | - Context injection | || | | - Token budget calculation | || | +-----+----------------------------------+ || | | || | +-----v----------------------------------+ || | | LLM Backend (llama.cpp / Ollama) | || | | - Model loading | || | | - Inference | || | | - Token generation | || | +-----+----------------------------------+ || | | || +-----+----+ || | || +-----v-----------+ || | Post-Processing | || | & Injection | || +-----------------+ |+------------------------------------------------------------------+Security: All LLM inference occurs on-device via llama.cpp. No model data, prompts, or outputs ever leave the machine. The Ollama backend was removed in Phase 5.
2. llama.cpp Integration Architecture
Section titled “2. llama.cpp Integration Architecture”2.1 Build and Compilation
Section titled “2.1 Build and Compilation”llama.cpp (version b8059) is compiled as a static library (.a) via CMake and linked directly into the VaulType binary. This eliminates any runtime dependency on external executables or dynamic libraries.
Build script: scripts/setup-llama.sh
# Run the setup script (also executed by Xcode build phase and CI)./scripts/setup-llama.shThe script clones llama.cpp b8059 into vendor/llama.cpp/, builds it with CMake, and installs the static libraries into vendor/llama.cpp/build/lib/.
Key CMake flags:
| Flag | Purpose |
|---|---|
GGML_METAL=ON | Enable Metal GPU acceleration |
GGML_METAL_EMBED_LIBRARY=ON | Embed Metal shaders in binary (no separate .metallib) |
LLAMA_BUILD_EXAMPLES=OFF | Skip building example executables |
CMAKE_BUILD_TYPE=Release | Optimized build |
Linking in Xcode: The Xcode project links against vendor/llama.cpp/build/lib/libllama.a and the shared ggml libraries. llama.cpp shares ggml with whisper.cpp — only -lllama is added; ggml is not duplicated.
llama.cpp is pinned to tag b8059 to ensure reproducible builds. Updates must be tested against all supported GGUF models before merging.
GGML_METAL_EMBED_LIBRARY=ONembeds the Metal shader library directly into the binary, eliminating the need to ship a separate.metallibfile. Metal GPU acceleration works on Apple Silicon and Intel Macs with AMD GPUs.
2.2 Bridging Headers and Swift-C Interop
Section titled “2.2 Bridging Headers and Swift-C Interop”Since llama.cpp is a C/C++ library, a bridging header and Swift-friendly wrapper are required.
Public bridging header (Sources/CLlama/include/llama_bridge.h):
// This is a C header file -- shown here for completeness.#ifndef LLAMA_BRIDGE_H#define LLAMA_BRIDGE_H
#include "llama.h"#include "ggml.h"
// Re-export the core llama.cpp API functions that VaulType uses.// Swift can call these directly through the CLlama module.
#endif /* LLAMA_BRIDGE_H */Swift module map (Sources/CLlama/include/module.modulemap):
module CLlama { header "llama_bridge.h" export *}Swift interop wrapper (Sources/LlamaSwift/LlamaContext.swift):
import Foundationimport CLlama
/// Wraps a llama.cpp context for safe use from Swift./// Manages the lifecycle of the underlying C model and context pointers.public final class LlamaContext: @unchecked Sendable {
// MARK: - Properties
private let model: OpaquePointer // llama_model * private let context: OpaquePointer // llama_context * private let sampler: OpaquePointer // llama_sampler * private let queue = DispatchQueue(label: "com.vaultype.llama", qos: .userInitiated)
public let contextSize: Int32 public let modelPath: String
// MARK: - Initialization
/// Loads a GGUF model and creates an inference context. /// - Parameters: /// - path: Absolute path to the `.gguf` model file. /// - contextSize: Maximum context window size in tokens. /// - gpuLayers: Number of layers to offload to Metal GPU (-1 for all). /// - seed: Random seed for sampling (0 for random). /// - Throws: `LlamaError` if the model cannot be loaded. public init( modelPath path: String, contextSize: Int32 = 4096, gpuLayers: Int32 = -1, seed: UInt32 = 0 ) throws { self.modelPath = path self.contextSize = contextSize
// Configure model parameters var modelParams = llama_model_default_params() modelParams.n_gpu_layers = gpuLayers modelParams.use_mmap = true // Memory-map the model file modelParams.use_mlock = false // Do not pin in RAM
// Load model from GGUF file guard let loadedModel = llama_load_model_from_file(path, modelParams) else { throw LlamaError.modelLoadFailed(path: path) } self.model = loadedModel
// Configure context parameters var ctxParams = llama_context_default_params() ctxParams.n_ctx = UInt32(contextSize) ctxParams.n_batch = 512 ctxParams.n_threads = UInt32(ProcessInfo.processInfo.activeProcessorCount) ctxParams.n_threads_batch = UInt32(ProcessInfo.processInfo.activeProcessorCount) ctxParams.seed = seed
// Create inference context guard let ctx = llama_new_context_with_model(loadedModel, ctxParams) else { llama_free_model(loadedModel) throw LlamaError.contextCreationFailed } self.context = ctx
// Configure sampler chain let samplerParams = llama_sampler_chain_default_params() guard let samplerChain = llama_sampler_chain_init(samplerParams) else { llama_free(ctx) llama_free_model(loadedModel) throw LlamaError.samplerCreationFailed } // Temperature + top-p sampling llama_sampler_chain_add(samplerChain, llama_sampler_init_temp(0.3)) llama_sampler_chain_add(samplerChain, llama_sampler_init_top_p(0.9, 1)) llama_sampler_chain_add(samplerChain, llama_sampler_init_dist(seed)) self.sampler = samplerChain }
deinit { llama_sampler_free(sampler) llama_free(context) llama_free_model(model) }
// MARK: - Tokenization
/// Tokenizes a string into llama tokens. public func tokenize(_ text: String, addBos: Bool = true) -> [llama_token] { let utf8 = text.utf8CString let maxTokens = Int32(utf8.count) + (addBos ? 1 : 0) var tokens = [llama_token](repeating: 0, count: Int(maxTokens)) let count = llama_tokenize(model, text, Int32(text.utf8.count), &tokens, maxTokens, addBos, false) guard count >= 0 else { return [] } return Array(tokens.prefix(Int(count))) }
/// Detokenizes a token back to its string representation. public func detokenize(_ token: llama_token) -> String { var buffer = [CChar](repeating: 0, count: 256) let length = llama_token_to_piece(model, token, &buffer, 256, 0, false) guard length > 0 else { return "" } return String(cString: buffer) }
// MARK: - Inference
/// Runs inference on the given prompt and returns generated text. /// - Parameters: /// - prompt: The input prompt string. /// - maxTokens: Maximum number of tokens to generate. /// - stopSequences: Strings that cause generation to halt. /// - Returns: The generated text. public func generate( prompt: String, maxTokens: Int = 512, stopSequences: [String] = [] ) async throws -> String { try await withCheckedThrowingContinuation { continuation in queue.async { [self] in do { let result = try self.syncGenerate( prompt: prompt, maxTokens: maxTokens, stopSequences: stopSequences ) continuation.resume(returning: result) } catch { continuation.resume(throwing: error) } } } }
private func syncGenerate( prompt: String, maxTokens: Int, stopSequences: [String] ) throws -> String { // Clear KV cache for fresh generation llama_kv_cache_clear(context)
// Tokenize prompt let tokens = tokenize(prompt, addBos: true) guard !tokens.isEmpty else { throw LlamaError.tokenizationFailed } guard tokens.count < contextSize else { throw LlamaError.contextOverflow( tokenCount: tokens.count, contextSize: Int(contextSize) ) }
// Create batch and decode prompt var batch = llama_batch_init(Int32(tokens.count), 0, 1) defer { llama_batch_free(batch) }
for (i, token) in tokens.enumerated() { llama_batch_add(&batch, token, Int32(i), [0], i == tokens.count - 1) }
guard llama_decode(context, batch) == 0 else { throw LlamaError.decodeFailed }
// Generate tokens var output = "" var generatedCount = 0
while generatedCount < maxTokens { let newToken = llama_sampler_sample(sampler, context, -1)
// Check for end-of-sequence if llama_token_is_eog(model, newToken) { break }
let piece = detokenize(newToken) output += piece generatedCount += 1
// Check stop sequences if stopSequences.contains(where: { output.hasSuffix($0) }) { for stop in stopSequences where output.hasSuffix(stop) { output = String(output.dropLast(stop.count)) } break }
// Prepare next batch (single token) llama_batch_clear(&batch) llama_batch_add(&batch, newToken, Int32(tokens.count + generatedCount), [0], true)
guard llama_decode(context, batch) == 0 else { throw LlamaError.decodeFailed } }
return output.trimmingCharacters(in: .whitespacesAndNewlines) }}:warning: Warning:
LlamaContextis marked@unchecked Sendablebecause all access to the underlying C pointers is serialized through a dedicatedDispatchQueue. Do not access the C pointers from multiple threads without synchronization.
2.3 Metal GPU Acceleration
Section titled “2.3 Metal GPU Acceleration”llama.cpp uses Apple’s Metal framework to offload tensor operations to the GPU. On Apple Silicon, this provides a substantial performance boost — typically 3-5x faster token generation compared to CPU-only inference.
How GPU offloading works:
+-------------------------------------------------+| Model Layer Distribution || || Layer 0 [GPU] ======================== || Layer 1 [GPU] ======================== || Layer 2 [GPU] ======================== || ... || Layer N-2 [GPU] ======================== || Layer N-1 [GPU] ======================== || Layer N [CPU] ======================== || (embedding/output layers may stay on CPU) || || gpuLayers = -1 --> offload ALL layers to GPU || gpuLayers = 0 --> CPU only || gpuLayers = 20 --> first 20 layers on GPU |+-------------------------------------------------+Configuration for Metal acceleration:
/// Determines optimal GPU layer count based on available memory.func optimalGPULayers(for modelSizeBytes: UInt64) -> Int32 { let device = MTLCreateSystemDefaultDevice() let availableVRAM = device?.recommendedMaxWorkingSetSize ?? 0
// Reserve 512 MB for KV cache and Metal overhead let reservedBytes: UInt64 = 512 * 1024 * 1024 let usableVRAM = availableVRAM > reservedBytes ? availableVRAM - reservedBytes : 0
if modelSizeBytes <= usableVRAM { return -1 // Offload all layers to GPU } else { // Estimate partial offload: proportional to available VRAM let ratio = Double(usableVRAM) / Double(modelSizeBytes) let estimatedLayers = Int32(ratio * 35) // Approximate for 3B models return max(estimatedLayers, 0) }}:apple: macOS-specific: Metal GPU acceleration is available on all Apple Silicon Macs and on Intel Macs with discrete AMD GPUs. Intel Macs with integrated graphics will fall back to CPU inference via the Accelerate framework.
:bulb: Tip: For best performance on Apple Silicon, set
gpuLayers = -1to offload the entire model to the GPU. A 3B parameter Q4_K_M model requires approximately 2 GB of VRAM, well within the unified memory of any M1 or later chip.
2.4 Memory-Mapped GGUF Models
Section titled “2.4 Memory-Mapped GGUF Models”GGUF (GGML Universal File Format) is the model file format used by llama.cpp. VaulType uses memory mapping (mmap) to load models, which provides:
- Fast startup — The OS maps the file into virtual memory without reading the entire file into RAM upfront.
- Shared memory — If multiple processes load the same model file, the OS can share the physical memory pages.
- Low resident memory — Only the pages actually accessed during inference are loaded into physical RAM.
+----------------------------------+| GGUF File on Disk || +----------------------------+ || | Header (metadata, vocab) | | <-- Read into RAM immediately| +----------------------------+ || | Tensor Data (weights) | | <-- Memory-mapped (mmap)| | - Layer 0 weights | | Pages loaded on demand| | - Layer 1 weights | | by the OS virtual memory| | - ... | | subsystem| | - Layer N weights | || +----------------------------+ |+----------------------------------+Memory mapping is enabled by default:
var modelParams = llama_model_default_params()modelParams.use_mmap = true // Enable memory-mapped loadingmodelParams.use_mlock = false // Do not pin pages in RAM (let OS manage):information_source: Info: Setting
use_mlock = truepins all model pages in RAM, preventing the OS from swapping them out. This reduces latency jitter but increases resident memory. Only enable this if the system has ample free RAM (model size + 2 GB headroom).
2.5 LlamaContext Wrapper
Section titled “2.5 LlamaContext Wrapper”The complete LlamaContext wrapper is shown in Section 2.2. Additional helper types used throughout the LLM pipeline:
// MARK: - Error Types
/// Errors that can occur during llama.cpp operations.public enum LlamaError: LocalizedError { case modelLoadFailed(path: String) case contextCreationFailed case samplerCreationFailed case tokenizationFailed case contextOverflow(tokenCount: Int, contextSize: Int) case decodeFailed case generationTimeout case memoryPressure(available: UInt64, required: UInt64) case cancelled
public var errorDescription: String? { switch self { case .modelLoadFailed(let path): return "Failed to load model from: \(path)" case .contextCreationFailed: return "Failed to create llama context" case .samplerCreationFailed: return "Failed to initialize token sampler" case .tokenizationFailed: return "Failed to tokenize input text" case .contextOverflow(let count, let size): return "Input tokens (\(count)) exceed context window (\(size))" case .decodeFailed: return "Model decode operation failed" case .generationTimeout: return "Text generation timed out" case .memoryPressure(let available, let required): return "Insufficient memory: \(available / 1_048_576) MB available, " + "\(required / 1_048_576) MB required" case .cancelled: return "Generation was cancelled" } }}
// MARK: - Generation Options
/// Configuration options for a single generation request.public struct GenerationOptions: Sendable { public var maxTokens: Int = 512 public var temperature: Float = 0.3 public var topP: Float = 0.9 public var topK: Int32 = 40 public var repeatPenalty: Float = 1.1 public var stopSequences: [String] = [] public var timeout: TimeInterval = 30.0
public static let `default` = GenerationOptions()
public static let creative = GenerationOptions( temperature: 0.7, topP: 0.95, topK: 50 )
public static let deterministic = GenerationOptions( temperature: 0.1, topP: 0.5, topK: 10 )
public init( maxTokens: Int = 512, temperature: Float = 0.3, topP: Float = 0.9, topK: Int32 = 40, repeatPenalty: Float = 1.1, stopSequences: [String] = [], timeout: TimeInterval = 30.0 ) { self.maxTokens = maxTokens self.temperature = temperature self.topP = topP self.topK = topK self.repeatPenalty = repeatPenalty self.stopSequences = stopSequences self.timeout = timeout }}3. Ollama Integration (Removed)
Section titled “3. Ollama Integration (Removed)”Note: The Ollama backend was removed during Phase 5 refactoring. VaulType uses llama.cpp exclusively. The sections below are retained for historical reference only.
3.1 When to Use Ollama vs llama.cpp
Section titled “3.1 When to Use Ollama vs llama.cpp”| Criteria | llama.cpp (Embedded) | Ollama (Local Server) |
|---|---|---|
| Latency | Lower (in-process) | Slightly higher (HTTP overhead) |
| Setup | Zero (bundled) | Requires Ollama install |
| Model management | Manual GGUF files | Ollama CLI (ollama pull) |
| Model switching | Requires reload (~2-5s) | Near-instant (Ollama caches) |
| Memory control | Fine-grained | Ollama manages |
| GPU layers | Configurable per-model | Ollama auto-configures |
| Multiple models | One at a time | Ollama can serve multiple |
| Offline guarantee | Always works | Requires Ollama daemon running |
| Recommended for | Default / production use | Power users, model experimentation |
:bulb: Tip: Use the embedded llama.cpp backend as the default. Switch to Ollama if you frequently experiment with different models or want to share a single model instance across multiple applications.
3.2 Ollama Setup Instructions
Section titled “3.2 Ollama Setup Instructions”Install Ollama:
# Install via Homebrewbrew install ollama
# Or download from https://ollama.com# The macOS app installs the CLI and daemon automaticallyPull a recommended model:
# Pull Qwen2.5-3B-Instruct (recommended for VaulType)ollama pull qwen2.5:3b-instruct-q4_K_M
# Pull Phi-3.5-mini as an alternativeollama pull phi3.5:3.8b-mini-instruct-q4_K_M
# Pull Llama-3.2-3B-Instructollama pull llama3.2:3b-instruct-q4_K_M
# Verify the model is availableollama listStart the Ollama server (if not running as a macOS service):
# Start the Ollama server daemonollama serve
# It will listen on localhost:11434 by default# Verify it is running:curl http://localhost:11434/api/tags:lock: Security: Ollama binds to
localhost(127.0.0.1) by default. VaulType only connects tolocalhost:11434. No data leaves the machine. If Ollama is configured to listen on0.0.0.0, VaulType will still only connect to127.0.0.1.
3.3 API Endpoints Used
Section titled “3.3 API Endpoints Used”VaulType uses two Ollama REST API endpoints:
| Endpoint | Method | Purpose |
|---|---|---|
/api/tags | GET | List available models |
/api/generate | POST | Run inference (non-streaming) |
/api/generate (stream) | POST | Run inference (streaming) |
Request format for /api/generate:
{ "model": "qwen2.5:3b-instruct-q4_K_M", "prompt": "<the full prompt text>", "stream": false, "options": { "temperature": 0.3, "top_p": 0.9, "top_k": 40, "repeat_penalty": 1.1, "num_predict": 512, "stop": ["\n\n---"] }}Response format:
{ "model": "qwen2.5:3b-instruct-q4_K_M", "response": "The generated text output...", "done": true, "total_duration": 1234567890, "load_duration": 123456789, "prompt_eval_count": 42, "eval_count": 128, "eval_duration": 987654321}3.4 OllamaClient Implementation
Section titled “3.4 OllamaClient Implementation”import Foundation
/// Client for the Ollama REST API running on localhost.public actor OllamaClient {
// MARK: - Types
public struct OllamaGenerateRequest: Codable, Sendable { let model: String let prompt: String let stream: Bool let options: OllamaOptions?
struct OllamaOptions: Codable, Sendable { let temperature: Float? let top_p: Float? let top_k: Int? let repeat_penalty: Float? let num_predict: Int? let stop: [String]? } }
public struct OllamaGenerateResponse: Codable, Sendable { let model: String let response: String let done: Bool let total_duration: UInt64? let load_duration: UInt64? let prompt_eval_count: Int? let eval_count: Int? let eval_duration: UInt64? }
public struct OllamaTagsResponse: Codable, Sendable { let models: [OllamaModelInfo]
struct OllamaModelInfo: Codable, Sendable { let name: String let size: UInt64 let digest: String } }
public enum OllamaError: LocalizedError { case serverNotRunning case modelNotFound(String) case requestFailed(statusCode: Int, body: String) case decodingFailed case timeout
public var errorDescription: String? { switch self { case .serverNotRunning: return "Ollama server is not running on localhost:11434" case .modelNotFound(let model): return "Model '\(model)' not found. Run: ollama pull \(model)" case .requestFailed(let code, let body): return "Ollama request failed (\(code)): \(body)" case .decodingFailed: return "Failed to decode Ollama response" case .timeout: return "Ollama request timed out" } } }
// MARK: - Properties
private let baseURL: URL private let session: URLSession
// MARK: - Initialization
public init( host: String = "127.0.0.1", port: Int = 11434, timeout: TimeInterval = 60.0 ) { self.baseURL = URL(string: "http://\(host):\(port)")! let config = URLSessionConfiguration.ephemeral config.timeoutIntervalForRequest = timeout config.timeoutIntervalForResource = timeout config.waitsForConnectivity = false self.session = URLSession(configuration: config) }
// MARK: - Public API
/// Lists all models available in the local Ollama instance. public func listModels() async throws -> [String] { let url = baseURL.appendingPathComponent("api/tags") let (data, response) = try await performRequest(url: url) guard let httpResponse = response as? HTTPURLResponse, httpResponse.statusCode == 200 else { throw OllamaError.serverNotRunning } let tagsResponse = try JSONDecoder().decode( OllamaTagsResponse.self, from: data ) return tagsResponse.models.map(\.name) }
/// Checks whether the Ollama server is reachable. public func isAvailable() async -> Bool { do { _ = try await listModels() return true } catch { return false } }
/// Generates text using the specified model and prompt. public func generate( model: String, prompt: String, options: GenerationOptions = .default ) async throws -> String { let url = baseURL.appendingPathComponent("api/generate")
let requestBody = OllamaGenerateRequest( model: model, prompt: prompt, stream: false, options: .init( temperature: options.temperature, top_p: options.topP, top_k: Int(options.topK), repeat_penalty: options.repeatPenalty, num_predict: options.maxTokens, stop: options.stopSequences.isEmpty ? nil : options.stopSequences ) )
var request = URLRequest(url: url) request.httpMethod = "POST" request.setValue("application/json", forHTTPHeaderField: "Content-Type") request.httpBody = try JSONEncoder().encode(requestBody)
let (data, response) = try await performRequest(request: request)
guard let httpResponse = response as? HTTPURLResponse else { throw OllamaError.requestFailed(statusCode: 0, body: "No HTTP response") }
switch httpResponse.statusCode { case 200: let generateResponse = try JSONDecoder().decode( OllamaGenerateResponse.self, from: data ) return generateResponse.response .trimmingCharacters(in: .whitespacesAndNewlines) case 404: throw OllamaError.modelNotFound(model) default: let body = String(data: data, encoding: .utf8) ?? "Unknown error" throw OllamaError.requestFailed( statusCode: httpResponse.statusCode, body: body ) } }
// MARK: - Private Helpers
private func performRequest(url: URL) async throws -> (Data, URLResponse) { do { return try await session.data(from: url) } catch let error as URLError where error.code == .cannotConnectToHost { throw OllamaError.serverNotRunning } catch let error as URLError where error.code == .timedOut { throw OllamaError.timeout } }
private func performRequest(request: URLRequest) async throws -> (Data, URLResponse) { do { return try await session.data(for: request) } catch let error as URLError where error.code == .cannotConnectToHost { throw OllamaError.serverNotRunning } catch let error as URLError where error.code == .timedOut { throw OllamaError.timeout } }}:warning: Warning: The
URLSessionis configured withephemeralconfiguration to avoid caching any prompt data or model responses to disk. This preserves the privacy-first architecture.
4. Unified Backend Protocol
Section titled “4. Unified Backend Protocol”4.1 LLMBackend Protocol
Section titled “4.1 LLMBackend Protocol”Both llama.cpp and Ollama conform to a unified protocol, enabling transparent backend switching.
import Foundation
/// Protocol that all LLM backends must conform to./// Enables transparent switching between llama.cpp and Ollama.public protocol LLMBackend: Sendable {
/// A human-readable name for this backend (e.g., "llama.cpp", "Ollama"). var name: String { get }
/// Whether the backend is currently ready to serve requests. var isReady: Bool { get async }
/// The name or identifier of the currently loaded model. var currentModel: String? { get async }
/// Loads a model, preparing the backend for inference. /// - Parameter modelIdentifier: Model path (llama.cpp) or model name (Ollama). func loadModel(_ modelIdentifier: String) async throws
/// Unloads the current model, freeing resources. func unloadModel() async
/// Generates text from the given prompt. /// - Parameters: /// - prompt: The full prompt string. /// - options: Generation configuration. /// - Returns: The generated text. func generate( prompt: String, options: GenerationOptions ) async throws -> String
/// Counts the approximate number of tokens in the given text. /// - Parameter text: The text to tokenize. /// - Returns: Approximate token count. func countTokens(in text: String) async -> Int}llama.cpp backend adapter:
/// Adapts LlamaContext to the LLMBackend protocol.public actor LlamaCppBackend: LLMBackend {
public let name = "llama.cpp"
private var context: LlamaContext? private var loadedModelPath: String?
public var isReady: Bool { context != nil }
public var currentModel: String? { loadedModelPath }
public func loadModel(_ modelIdentifier: String) async throws { // Unload previous model if any await unloadModel()
let gpuLayers = optimalGPULayers( for: try fileSize(at: modelIdentifier) )
context = try LlamaContext( modelPath: modelIdentifier, contextSize: 4096, gpuLayers: gpuLayers ) loadedModelPath = modelIdentifier }
public func unloadModel() async { context = nil loadedModelPath = nil }
public func generate( prompt: String, options: GenerationOptions ) async throws -> String { guard let context else { throw LlamaError.contextCreationFailed } return try await context.generate( prompt: prompt, maxTokens: options.maxTokens, stopSequences: options.stopSequences ) }
public func countTokens(in text: String) async -> Int { guard let context else { return text.count / 4 } // rough estimate return context.tokenize(text, addBos: false).count }
private func fileSize(at path: String) throws -> UInt64 { let attrs = try FileManager.default.attributesOfItem(atPath: path) return attrs[.size] as? UInt64 ?? 0 }}Ollama backend adapter:
/// Adapts OllamaClient to the LLMBackend protocol.public actor OllamaBackend: LLMBackend {
public let name = "Ollama"
private let client: OllamaClient private var modelName: String?
public init(client: OllamaClient = OllamaClient()) { self.client = client }
public var isReady: Bool { get async { await client.isAvailable() && modelName != nil } }
public var currentModel: String? { modelName }
public func loadModel(_ modelIdentifier: String) async throws { // Verify model exists in Ollama let models = try await client.listModels() guard models.contains(where: { $0.hasPrefix(modelIdentifier) }) else { throw OllamaClient.OllamaError.modelNotFound(modelIdentifier) } modelName = modelIdentifier }
public func unloadModel() async { modelName = nil }
public func generate( prompt: String, options: GenerationOptions ) async throws -> String { guard let modelName else { throw OllamaClient.OllamaError.modelNotFound("No model loaded") } return try await client.generate( model: modelName, prompt: prompt, options: options ) }
public func countTokens(in text: String) async -> Int { // Approximate: Ollama does not expose tokenization directly. // Use a rough heuristic of ~4 characters per token for English text. return max(1, text.utf8.count / 4) }}4.2 Backend Switching
Section titled “4.2 Backend Switching”import SwiftUIimport Combine
/// Manages the active LLM backend and provides a unified interface for/// the rest of the application.@Observablepublic final class LLMService {
// MARK: - Types
public enum BackendType: String, Codable, CaseIterable, Sendable { case llamaCpp = "llama.cpp" case ollama = "Ollama" }
public enum ServiceState: Sendable { case idle case loading case ready case processing case error(Error) }
// MARK: - Published State
public private(set) var state: ServiceState = .idle public private(set) var activeBackendType: BackendType = .llamaCpp
// MARK: - Private
private var activeBackend: (any LLMBackend)? private let llamaCppBackend = LlamaCppBackend() private let ollamaBackend = OllamaBackend()
// MARK: - Backend Switching
/// Switches to the specified backend and loads the given model. /// - Parameters: /// - backendType: The backend to switch to. /// - modelIdentifier: The model path (llama.cpp) or name (Ollama). public func switchBackend( to backendType: BackendType, model modelIdentifier: String ) async throws { state = .loading
// Unload current backend if let current = activeBackend { await current.unloadModel() }
// Select and load new backend let backend: any LLMBackend = switch backendType { case .llamaCpp: llamaCppBackend case .ollama: ollamaBackend }
do { try await backend.loadModel(modelIdentifier) activeBackend = backend activeBackendType = backendType state = .ready } catch { state = .error(error) throw error } }
/// Generates text using the active backend. public func generate( prompt: String, options: GenerationOptions = .default ) async throws -> String { guard let backend = activeBackend else { throw LLMServiceError.noBackendLoaded } state = .processing defer { state = .ready }
return try await backend.generate(prompt: prompt, options: options) }
/// Counts tokens in the given text using the active backend. public func countTokens(in text: String) async -> Int { guard let backend = activeBackend else { return text.count / 4 } return await backend.countTokens(in: text) }}
public enum LLMServiceError: LocalizedError { case noBackendLoaded
public var errorDescription: String? { switch self { case .noBackendLoaded: return "No LLM backend is loaded. Please select a model first." } }}Example: Switching backends at runtime:
// In a ViewModel or Settings handler:let llmService = LLMService()
// Start with llama.cpp (default)try await llmService.switchBackend( to: .llamaCpp, model: "/Users/me/Library/Application Support/VaulType/Models/qwen2.5-3b-instruct-q4_K_M.gguf")
// Later, switch to Ollamatry await llmService.switchBackend( to: .ollama, model: "qwen2.5:3b-instruct-q4_K_M")
// Generate text -- same API regardless of backendlet result = try await llmService.generate( prompt: "Fix this text: i went to teh store", options: .deterministic)5. Model Recommendations and Benchmarks
Section titled “5. Model Recommendations and Benchmarks”5.1 Recommended Models
Section titled “5.1 Recommended Models”VaulType is optimized for small, fast instruction-following models in the 1-4B parameter range. These models balance quality, speed, and memory usage for real-time text processing tasks.
| Model | Parameters | Quant | File Size | Context Window |
|---|---|---|---|---|
| Qwen2.5-3B-Instruct (recommended) | 3.09B | Q4_K_M | 2.0 GB | 32,768 |
| Phi-3.5-mini-instruct | 3.82B | Q4_K_M | 2.4 GB | 128,000 |
| Llama-3.2-3B-Instruct | 3.21B | Q4_K_M | 2.0 GB | 8,192 |
| Qwen2.5-1.5B-Instruct (lightweight) | 1.54B | Q4_K_M | 1.0 GB | 32,768 |
| Gemma-2-2B-IT (alternative) | 2.61B | Q4_K_M | 1.7 GB | 8,192 |
5.2 Performance Benchmarks
Section titled “5.2 Performance Benchmarks”Benchmarks were conducted on three representative Apple Silicon configurations. All tests used Q4_K_M quantization with all layers offloaded to GPU (gpuLayers = -1).
Test methodology: Average of 50 inference runs. Input: 150 tokens. Output: 100 tokens. Temperature: 0.3.
Apple Silicon M1 (8 GB Unified Memory)
Section titled “Apple Silicon M1 (8 GB Unified Memory)”| Model | Prompt Eval (tok/s) | Generation (tok/s) | Total Latency | RAM Usage | VRAM Usage |
|---|---|---|---|---|---|
| Qwen2.5-3B-Instruct | 285 | 32 | 3.8s | 2.3 GB | 2.0 GB |
| Phi-3.5-mini | 240 | 27 | 4.5s | 2.8 GB | 2.4 GB |
| Llama-3.2-3B-Instruct | 275 | 30 | 4.0s | 2.3 GB | 2.0 GB |
| Qwen2.5-1.5B-Instruct | 410 | 48 | 2.6s | 1.4 GB | 1.0 GB |
Apple Silicon M2 Pro (16 GB Unified Memory)
Section titled “Apple Silicon M2 Pro (16 GB Unified Memory)”| Model | Prompt Eval (tok/s) | Generation (tok/s) | Total Latency | RAM Usage | VRAM Usage |
|---|---|---|---|---|---|
| Qwen2.5-3B-Instruct | 420 | 48 | 2.6s | 2.3 GB | 2.0 GB |
| Phi-3.5-mini | 355 | 40 | 3.1s | 2.8 GB | 2.4 GB |
| Llama-3.2-3B-Instruct | 400 | 45 | 2.8s | 2.3 GB | 2.0 GB |
| Qwen2.5-1.5B-Instruct | 620 | 72 | 1.8s | 1.4 GB | 1.0 GB |
Apple Silicon M3 Max (36 GB Unified Memory)
Section titled “Apple Silicon M3 Max (36 GB Unified Memory)”| Model | Prompt Eval (tok/s) | Generation (tok/s) | Total Latency | RAM Usage | VRAM Usage |
|---|---|---|---|---|---|
| Qwen2.5-3B-Instruct | 580 | 65 | 1.9s | 2.3 GB | 2.0 GB |
| Phi-3.5-mini | 490 | 55 | 2.3s | 2.8 GB | 2.4 GB |
| Llama-3.2-3B-Instruct | 550 | 62 | 2.1s | 2.3 GB | 2.0 GB |
| Qwen2.5-1.5B-Instruct | 850 | 95 | 1.3s | 1.4 GB | 1.0 GB |
Intel Mac (i7-9750H + AMD Radeon Pro 5500M, 16 GB RAM)
Section titled “Intel Mac (i7-9750H + AMD Radeon Pro 5500M, 16 GB RAM)”| Model | Prompt Eval (tok/s) | Generation (tok/s) | Total Latency | RAM Usage | VRAM Usage |
|---|---|---|---|---|---|
| Qwen2.5-3B-Instruct | 95 | 12 | 10.2s | 2.3 GB | N/A* |
| Phi-3.5-mini | 80 | 10 | 12.1s | 2.8 GB | N/A* |
| Llama-3.2-3B-Instruct | 90 | 11 | 10.8s | 2.3 GB | N/A* |
| Qwen2.5-1.5B-Instruct | 140 | 18 | 7.0s | 1.4 GB | N/A* |
*Intel Macs use CPU inference via the Accelerate framework. AMD GPU support through Metal is available but yields inconsistent improvements for LLM workloads on discrete AMD GPUs.
5.3 Model Selection Guidance
Section titled “5.3 Model Selection Guidance”Decision Tree: Which Model Should I Use?
+------------------+ | Available RAM? | +--------+---------+ | +--------------+--------------+ | | | < 6 GB 6-12 GB > 12 GB | | | v v v +------------------+ +-----------+ +-----------+ | Qwen2.5-1.5B | | Qwen2.5 | | Phi-3.5 | | (lightweight, | | -3B | | -mini | | fits in 4 GB | | (best | | (largest | | total RAM) | | balance) | | context) | +------------------+ +-----------+ +-----------+:information_source: Info: Qwen2.5-3B-Instruct is the recommended default because it offers the best combination of output quality, speed, and memory efficiency for text post-processing tasks. It consistently outperforms Llama-3.2-3B on grammar correction and formatting tasks while using the same memory footprint.
:bulb: Tip: If you need to process very long transcriptions (over 4,000 words), consider Phi-3.5-mini with its 128K context window. For most real-time dictation use, Qwen2.5-3B’s 32K context is more than sufficient.
6. Prompt Template System Design
Section titled “6. Prompt Template System Design”6.1 Template Variables
Section titled “6.1 Template Variables”VaulType’s prompt template system uses variable substitution with {variable_name} syntax. The following variables are available in all templates:
| Variable | Description | Example Value |
|---|---|---|
{text} | The raw transcription from whisper.cpp | "i need to fix the login bug before tomorrow" |
{language} | Detected or configured language code | "en" |
{app_name} | Name of the frontmost application | "Xcode" |
{context} | Optional user-provided context string | "Writing a pull request description" |
{timestamp} | Current ISO 8601 timestamp | "2026-02-13T14:30:00Z" |
{word_count} | Approximate word count of input text | "42" |
{mode} | Current processing mode name | "Clean" |
Variable resolution order:
- Built-in variables (
{text},{language},{timestamp},{word_count},{mode}) - Application context (
{app_name},{context}) - User-defined variables (from Custom mode template configuration)
6.2 SwiftData PromptTemplate Model
Section titled “6.2 SwiftData PromptTemplate Model”import Foundationimport SwiftData
/// Represents a reusable prompt template stored in SwiftData.@Modelpublic final class PromptTemplate {
// MARK: - Stored Properties
/// Unique identifier for the template. @Attribute(.unique) public var id: UUID
/// Human-readable name shown in the UI. public var name: String
/// The template string with {variable} placeholders. public var templateBody: String
/// Description of what this template does. public var templateDescription: String
/// The processing mode this template belongs to. public var mode: ProcessingMode
/// Whether this is a built-in template (cannot be deleted). public var isBuiltIn: Bool
/// Display order within the mode's template list. public var sortOrder: Int
/// Creation timestamp. public var createdAt: Date
/// Last modified timestamp. public var updatedAt: Date
/// Optional system prompt prepended to the template. public var systemPrompt: String?
/// Stop sequences for this template. public var stopSequences: [String]
/// Recommended temperature for this template. public var temperature: Float
/// Maximum tokens to generate. public var maxTokens: Int
// MARK: - Initialization
public init( name: String, templateBody: String, description: String = "", mode: ProcessingMode = .custom, isBuiltIn: Bool = false, sortOrder: Int = 0, systemPrompt: String? = nil, stopSequences: [String] = [], temperature: Float = 0.3, maxTokens: Int = 512 ) { self.id = UUID() self.name = name self.templateBody = templateBody self.templateDescription = description self.mode = mode self.isBuiltIn = isBuiltIn self.sortOrder = sortOrder self.createdAt = Date() self.updatedAt = Date() self.systemPrompt = systemPrompt self.stopSequences = stopSequences self.temperature = temperature self.maxTokens = maxTokens }}
// MARK: - Processing Mode Enum
/// All available text processing modes.public enum ProcessingMode: String, Codable, CaseIterable, Sendable { case raw = "Raw" case clean = "Clean" case structure = "Structure" case prompt = "Prompt" case code = "Code" case custom = "Custom"
/// Whether this mode requires LLM processing. public var requiresLLM: Bool { switch self { case .raw: return false default: return true } }
/// Default generation options for this mode. public var defaultOptions: GenerationOptions { switch self { case .raw: return .default case .clean: return GenerationOptions( maxTokens: 512, temperature: 0.1, topP: 0.5, topK: 10 ) case .structure: return GenerationOptions( maxTokens: 768, temperature: 0.2, topP: 0.7, topK: 20 ) case .prompt: return GenerationOptions( maxTokens: 1024, temperature: 0.4, topP: 0.9, topK: 40 ) case .code: return GenerationOptions( maxTokens: 512, temperature: 0.1, topP: 0.5, topK: 10 ) case .custom: return .default } }}6.3 Built-in Templates
Section titled “6.3 Built-in Templates”VaulType ships with one built-in template per processing mode (except Raw, which has no template). These templates are seeded into SwiftData on first launch.
/// Seeds the default built-in prompt templates into the SwiftData store.func seedBuiltInTemplates(context: ModelContext) { let builtIns: [PromptTemplate] = [ // Clean Mode PromptTemplate( name: "Standard Clean", templateBody: BuiltInTemplates.clean, description: "Fix spelling, grammar, punctuation, and capitalization.", mode: .clean, isBuiltIn: true, sortOrder: 0, temperature: 0.1, maxTokens: 512 ), // Structure Mode PromptTemplate( name: "Auto-Structure", templateBody: BuiltInTemplates.structure, description: "Format text as bullet lists, numbered steps, or tables.", mode: .structure, isBuiltIn: true, sortOrder: 0, temperature: 0.2, maxTokens: 768 ), // Prompt Mode PromptTemplate( name: "Prompt Formatter", templateBody: BuiltInTemplates.prompt, description: "Reformat speech into a well-structured AI prompt.", mode: .prompt, isBuiltIn: true, sortOrder: 0, temperature: 0.4, maxTokens: 1024 ), // Code Mode PromptTemplate( name: "Code Documentation", templateBody: BuiltInTemplates.code, description: "Format as code comments, commit messages, or docs.", mode: .code, isBuiltIn: true, sortOrder: 0, temperature: 0.1, maxTokens: 512 ), ]
for template in builtIns { context.insert(template) } try? context.save()}6.4 Custom Templates
Section titled “6.4 Custom Templates”Users can create, edit, and manage custom templates through the Settings UI. Custom templates support all built-in variables plus user-defined ones.
/// Resolves template variables and produces a final prompt string.public struct TemplateEngine {
/// Resolves all {variable} placeholders in the template body. /// - Parameters: /// - template: The prompt template to resolve. /// - text: The raw transcription text. /// - context: Additional contextual information. /// - Returns: The fully resolved prompt string. public static func resolve( template: PromptTemplate, text: String, language: String = "en", appName: String = "Unknown", context: String = "", customVariables: [String: String] = [:] ) -> String { var resolved = ""
// Prepend system prompt if present if let systemPrompt = template.systemPrompt, !systemPrompt.isEmpty { resolved += systemPrompt + "\n\n" }
resolved += template.templateBody
// Built-in variables let builtInVars: [String: String] = [ "text": text, "language": language, "app_name": appName, "context": context, "timestamp": ISO8601DateFormatter().string(from: Date()), "word_count": "\(text.split(separator: " ").count)", "mode": template.mode.rawValue, ]
// Merge built-in with custom (custom takes precedence) let allVars = builtInVars.merging(customVariables) { _, custom in custom }
// Replace all {variable} placeholders for (key, value) in allVars { resolved = resolved.replacingOccurrences( of: "{\(key)}", with: value ) }
return resolved }}6.5 Template Import and Export
Section titled “6.5 Template Import and Export”Templates can be exported as JSON files for sharing and backup, and imported from JSON.
/// Handles import/export of prompt templates as JSON files.public struct TemplateExporter {
// MARK: - Codable DTO
public struct TemplateDTO: Codable { let name: String let templateBody: String let description: String let mode: String let systemPrompt: String? let stopSequences: [String] let temperature: Float let maxTokens: Int let version: Int // Schema version for forward compatibility }
public struct TemplateBundle: Codable { let bundleVersion: Int let exportedAt: String let templates: [TemplateDTO] }
// MARK: - Export
/// Exports templates to a JSON file. public static func export( templates: [PromptTemplate], to url: URL ) throws { let dtos = templates.map { template in TemplateDTO( name: template.name, templateBody: template.templateBody, description: template.templateDescription, mode: template.mode.rawValue, systemPrompt: template.systemPrompt, stopSequences: template.stopSequences, temperature: template.temperature, maxTokens: template.maxTokens, version: 1 ) } let bundle = TemplateBundle( bundleVersion: 1, exportedAt: ISO8601DateFormatter().string(from: Date()), templates: dtos ) let data = try JSONEncoder.prettyPrinted.encode(bundle) try data.write(to: url, options: .atomic) }
// MARK: - Import
/// Imports templates from a JSON file. /// - Returns: Array of PromptTemplate objects (not yet inserted into context). public static func importTemplates( from url: URL ) throws -> [PromptTemplate] { let data = try Data(contentsOf: url) let bundle = try JSONDecoder().decode(TemplateBundle.self, from: data)
return bundle.templates.compactMap { dto in guard let mode = ProcessingMode(rawValue: dto.mode) else { return nil } return PromptTemplate( name: dto.name, templateBody: dto.templateBody, description: dto.description, mode: mode, isBuiltIn: false, systemPrompt: dto.systemPrompt, stopSequences: dto.stopSequences, temperature: dto.temperature, maxTokens: dto.maxTokens ) } }}
private extension JSONEncoder { static let prettyPrinted: JSONEncoder = { let encoder = JSONEncoder() encoder.outputFormatting = [.prettyPrinted, .sortedKeys] return encoder }()}:information_source: Info: Exported template bundles include a
bundleVersionfield for forward compatibility. Future versions of VaulType can migrate older template formats automatically.
7. Processing Mode Implementations
Section titled “7. Processing Mode Implementations”7.1 Raw Mode
Section titled “7.1 Raw Mode”Purpose: Pass-through mode. No LLM processing. The raw transcription from whisper.cpp is injected directly into the target application.
When to use: When you want exact, unmodified speech-to-text output, or when LLM processing is disabled or unavailable.
Template: None (no LLM call is made).
// Raw mode implementation -- trivially returns the input text.func processRaw(text: String) -> String { return text}Input: "i went to the store and bought some milk and bread and eggs"Output: "i went to the store and bought some milk and bread and eggs":information_source: Info: Raw mode is the automatic fallback if the LLM backend is not loaded or encounters an error. See Section 10: Error Handling and Fallback.
7.2 Clean Mode
Section titled “7.2 Clean Mode”Purpose: Fix spelling, grammar, punctuation, and capitalization while preserving the original meaning and tone.
Actual prompt template:
You are a text editor. Fix the spelling, grammar, punctuation, and capitalization of the following transcribed speech. Preserve the original meaning, tone, and intent exactly. Do not add, remove, or rephrase content. Do not add explanations or commentary. Output only the corrected text.
Language: {language}
Input text:{text}
Corrected text:Generation options: Temperature 0.1, Top-P 0.5, Top-K 10, Max tokens 512.
Example:
Input: "i went to teh store and buoght some milk and bread and egs"Output: "I went to the store and bought some milk, bread, and eggs."Input: "so basically what im trying to say is that the feature isnt working right and we need to look into it before the release"Output: "So basically, what I'm trying to say is that the feature isn't working right, and we need to look into it before the release."7.3 Structure Mode
Section titled “7.3 Structure Mode”Purpose: Reorganize transcribed speech into structured formats: bullet lists, numbered steps, tables, or headings.
Actual prompt template:
You are a text formatter. Reorganize the following transcribed speech into a well-structured format. Use the most appropriate structure based on the content:- If the text describes steps or a process, use a numbered list.- If the text lists items or ideas, use bullet points.- If the text contains comparisons or data, use a markdown table.- If the text covers multiple topics, use headings with bullet points.
Preserve all information from the original text. Do not add new information or commentary. Output only the structured text.
Language: {language}
Input text:{text}
Structured output:Generation options: Temperature 0.2, Top-P 0.7, Top-K 20, Max tokens 768.
Example:
Input: "so for the project we need to first set up the database then create the api endpoints and after that build the frontend and finally deploy to production also we need to write tests"
Output: "1. Set up the database 2. Create the API endpoints 3. Build the frontend 4. Write tests 5. Deploy to production"Input: "the pros of using swift are type safety good performance and nice syntax but the cons are that it only works on apple platforms and the compile times can be slow"
Output: "## Swift: Pros and Cons
**Pros:** - Type safety - Good performance - Nice syntax
**Cons:** - Only works on Apple platforms - Compile times can be slow"7.4 Prompt Mode
Section titled “7.4 Prompt Mode”Purpose: Transform spoken instructions into well-structured AI prompts suitable for use with ChatGPT, Claude, or similar tools.
Actual prompt template:
You are a prompt engineer. Transform the following transcribed speech into a well-structured, clear AI prompt. The output should be a prompt that a user can paste directly into an AI assistant (like ChatGPT or Claude).
Follow these guidelines:- Start with a clear role or task description.- Include specific constraints and requirements mentioned in the speech.- Organize multi-part requests into numbered sections.- Add formatting instructions if the speech implies a specific output format.- Preserve all technical details and requirements from the original speech.
Do not execute the prompt. Only output the formatted prompt text. Do not add meta-commentary.
Application context: {app_name}Additional context: {context}
Transcribed speech:{text}
Formatted prompt:Generation options: Temperature 0.4, Top-P 0.9, Top-K 40, Max tokens 1024.
Example:
Input: "i need you to help me write a python function that takes a list of numbers and returns the top three largest ones and it should handle edge cases like empty lists and lists with fewer than three items and make sure its well documented"
Output: "Write a Python function with the following specifications:
1. **Function**: Accept a list of numbers as input and return the top 3 largest values.
2. **Edge cases to handle**: - Empty list (return an empty list or raise a descriptive error) - List with fewer than 3 items (return all items, sorted descending)
3. **Requirements**: - Include a complete docstring with parameters, return type, and examples - Add inline comments for complex logic - Follow PEP 8 style conventions"7.5 Code Mode
Section titled “7.5 Code Mode”Purpose: Format transcribed speech into code-related artifacts: code comments, commit messages, documentation, or inline annotations.
Actual prompt template:
You are a developer assistant. Transform the following transcribed speech into well-formatted code documentation. Based on the content, output the most appropriate format:
- If describing a change or fix: format as a Git commit message (subject line + body).- If describing functionality: format as a code documentation comment (using the appropriate doc-comment syntax for the language context).- If describing a bug or issue: format as a structured bug report.- If describing a task: format as a TODO comment or task description.
Use concise, technical language. Follow conventional formatting standards.
Application: {app_name}Context: {context}Language: {language}
Transcribed speech:{text}
Formatted output:Generation options: Temperature 0.1, Top-P 0.5, Top-K 10, Max tokens 512.
Example (commit message):
Input: "fixed the bug where the login screen would crash if the user entered a really long password because we werent validating the input length before sending it to the api"
Output: "fix: prevent crash on login with oversized password input
Validate password length before sending to the authentication API. Previously, passwords exceeding 256 characters caused a buffer overflow in the API serialization layer, crashing the login screen.
- Add input length validation (max 256 chars) to LoginViewController - Show user-facing error message for passwords exceeding the limit - Add unit test for boundary password lengths"Example (code comment):
Input: "this function calculates the total price including tax and it takes the subtotal and tax rate as parameters and returns the final amount rounded to two decimal places"
Output: "/// Calculates the total price including tax. /// /// - Parameters: /// - subtotal: The pre-tax subtotal amount. /// - taxRate: The tax rate as a decimal (e.g., 0.08 for 8%). /// - Returns: The total price rounded to two decimal places."7.6 Custom Mode
Section titled “7.6 Custom Mode”Purpose: User-defined templates. Users create their own prompt templates using the available variables.
Template: Defined by the user. Example custom template:
Translate the following text from {language} to formal business English.Maintain professional tone and corporate terminology.
Context: {context}
Original text:{text}
Translation:Custom templates can reference any built-in variable and any user-defined variables.
7.7 Processing Mode Selection and Execution
Section titled “7.7 Processing Mode Selection and Execution”The TextProcessor class orchestrates the full pipeline from raw transcription through mode-specific LLM processing.
import Foundationimport SwiftData
/// Orchestrates the text processing pipeline./// Receives raw transcription, applies the selected processing mode,/// and returns the final processed text.public actor TextProcessor {
// MARK: - Dependencies
private let llmService: LLMService private let templateEngine: TemplateEngine.Type private let modelContext: ModelContext
// MARK: - State
public private(set) var currentMode: ProcessingMode = .clean
// MARK: - Initialization
public init( llmService: LLMService, modelContext: ModelContext ) { self.llmService = llmService self.templateEngine = TemplateEngine.self self.modelContext = modelContext }
// MARK: - Mode Selection
/// Sets the active processing mode. public func setMode(_ mode: ProcessingMode) { currentMode = mode }
// MARK: - Processing
/// Processes raw transcription text through the active mode's pipeline. /// - Parameters: /// - rawText: The raw transcription from whisper.cpp. /// - language: The detected language code. /// - appName: The frontmost application name. /// - context: Optional user-provided context. /// - Returns: The processed text, or raw text if processing fails. public func process( rawText: String, language: String = "en", appName: String = "Unknown", context: String = "" ) async -> ProcessingResult { // Raw mode: no LLM needed guard currentMode.requiresLLM else { return ProcessingResult( text: rawText, mode: .raw, processingTime: 0, tokensUsed: 0, didFallback: false ) }
let startTime = CFAbsoluteTimeGetCurrent()
do { // Fetch the active template for the current mode let template = try fetchActiveTemplate(for: currentMode)
// Resolve template variables let prompt = templateEngine.resolve( template: template, text: rawText, language: language, appName: appName, context: context )
// Check token budget let tokenCount = await llmService.countTokens(in: prompt) guard tokenCount < 3500 else { // Prompt is too long; truncate input text and retry let truncatedText = truncateToFit( text: rawText, maxTokens: 2500, currentTokens: tokenCount ) let truncatedPrompt = templateEngine.resolve( template: template, text: truncatedText, language: language, appName: appName, context: context ) let result = try await llmService.generate( prompt: truncatedPrompt, options: currentMode.defaultOptions ) let elapsed = CFAbsoluteTimeGetCurrent() - startTime return ProcessingResult( text: result, mode: currentMode, processingTime: elapsed, tokensUsed: tokenCount, didFallback: false ) }
// Run inference let result = try await llmService.generate( prompt: prompt, options: currentMode.defaultOptions )
let elapsed = CFAbsoluteTimeGetCurrent() - startTime return ProcessingResult( text: result, mode: currentMode, processingTime: elapsed, tokensUsed: tokenCount, didFallback: false )
} catch { // Fallback: return raw text let elapsed = CFAbsoluteTimeGetCurrent() - startTime return ProcessingResult( text: rawText, mode: currentMode, processingTime: elapsed, tokensUsed: 0, didFallback: true, fallbackReason: error.localizedDescription ) } }
// MARK: - Private Helpers
private func fetchActiveTemplate( for mode: ProcessingMode ) throws -> PromptTemplate { let descriptor = FetchDescriptor<PromptTemplate>( predicate: #Predicate { $0.mode == mode }, sortBy: [SortDescriptor(\.sortOrder)] ) let templates = try modelContext.fetch(descriptor) guard let template = templates.first else { throw ProcessingError.noTemplateFound(mode: mode) } return template }
private func truncateToFit( text: String, maxTokens: Int, currentTokens: Int ) -> String { let ratio = Double(maxTokens) / Double(currentTokens) let targetCharCount = Int(Double(text.count) * ratio * 0.9) if targetCharCount < text.count { let index = text.index( text.startIndex, offsetBy: targetCharCount, limitedBy: text.endIndex ) ?? text.endIndex return String(text[..<index]) + "..." } return text }}
// MARK: - Supporting Types
public struct ProcessingResult: Sendable { public let text: String public let mode: ProcessingMode public let processingTime: TimeInterval public let tokensUsed: Int public let didFallback: Bool public var fallbackReason: String?}
public enum ProcessingError: LocalizedError { case noTemplateFound(mode: ProcessingMode)
public var errorDescription: String? { switch self { case .noTemplateFound(let mode): return "No prompt template found for mode: \(mode.rawValue)" } }}8. Token Management and Context Window Handling
Section titled “8. Token Management and Context Window Handling”8.1 Context Window Sizes
Section titled “8.1 Context Window Sizes”Each model has a maximum context window that limits the total number of tokens (prompt + generated output) that can be processed in a single inference call.
| Model | Max Context Window | Recommended Max for VaulType | Prompt Budget | Output Budget |
|---|---|---|---|---|
| Qwen2.5-3B-Instruct | 32,768 | 4,096 | 3,500 | 596 |
| Phi-3.5-mini | 128,000 | 4,096 | 3,500 | 596 |
| Llama-3.2-3B-Instruct | 8,192 | 4,096 | 3,500 | 596 |
| Qwen2.5-1.5B-Instruct | 32,768 | 2,048 | 1,700 | 348 |
:warning: Warning: While models like Phi-3.5-mini support up to 128K tokens, VaulType defaults to a 4,096 context window to optimize latency and memory usage. Larger context windows require proportionally more memory for the KV cache. Users can increase this in settings if needed.
Context window allocation:
+------------------------------------------------------------+| Context Window (4,096 tokens) || || +------------------------+-----------+------------------+ || | System Prompt + Template| Input Text| Generated Output | || | (~200 tokens) | (variable) | (up to 512 tok) | || +------------------------+-----------+------------------+ || || <-------- Prompt Budget (3,500) -------->|<-- Output -->| || (596 max) |+------------------------------------------------------------+8.2 Token Counting
Section titled “8.2 Token Counting”/// Utility for managing token budgets within the context window.public struct TokenBudget {
/// The total context window size in tokens. public let contextSize: Int
/// Tokens reserved for the system prompt and template chrome. public let templateOverhead: Int
/// Maximum tokens for generated output. public let maxOutputTokens: Int
/// Tokens available for the input text. public var inputBudget: Int { contextSize - templateOverhead - maxOutputTokens }
public init( contextSize: Int = 4096, templateOverhead: Int = 200, maxOutputTokens: Int = 512 ) { self.contextSize = contextSize self.templateOverhead = templateOverhead self.maxOutputTokens = maxOutputTokens }
/// Checks whether the given text fits within the input budget. /// - Parameters: /// - text: The input text. /// - tokenCounter: A closure that counts tokens in a string. /// - Returns: Whether the text fits, and the token count. public func fits( text: String, tokenCounter: (String) async -> Int ) async -> (fits: Bool, tokenCount: Int) { let count = await tokenCounter(text) return (count <= inputBudget, count) }
/// Describes the current budget allocation. public var description: String { """ Context: \(contextSize) tokens Template overhead: \(templateOverhead) tokens Input budget: \(inputBudget) tokens Output budget: \(maxOutputTokens) tokens """ }}8.3 Truncation Strategies
Section titled “8.3 Truncation Strategies”When input text exceeds the token budget, VaulType applies one of three truncation strategies:
/// Strategies for truncating text that exceeds the token budget.public enum TruncationStrategy: String, Codable, CaseIterable, Sendable { /// Keep the end of the text (most recent speech). Default for dictation. case keepEnd = "Keep End"
/// Keep the beginning of the text. case keepBeginning = "Keep Beginning"
/// Keep both the beginning and end, removing the middle. case keepEdges = "Keep Edges"}
extension TruncationStrategy {
/// Truncates the given text to approximately the target token count. /// - Parameters: /// - text: The text to truncate. /// - targetTokens: The target number of tokens. /// - currentTokens: The current number of tokens in the text. /// - Returns: The truncated text with an ellipsis marker. public func truncate( text: String, targetTokens: Int, currentTokens: Int ) -> String { guard currentTokens > targetTokens else { return text }
let ratio = Double(targetTokens) / Double(currentTokens) let targetChars = Int(Double(text.count) * ratio * 0.9) // 10% safety margin
switch self { case .keepEnd: let startIndex = text.index( text.endIndex, offsetBy: -targetChars, limitedBy: text.startIndex ) ?? text.startIndex return "..." + text[startIndex...]
case .keepBeginning: let endIndex = text.index( text.startIndex, offsetBy: targetChars, limitedBy: text.endIndex ) ?? text.endIndex return text[..<endIndex] + "..."
case .keepEdges: let halfChars = targetChars / 2 let frontEnd = text.index( text.startIndex, offsetBy: halfChars, limitedBy: text.endIndex ) ?? text.endIndex let backStart = text.index( text.endIndex, offsetBy: -halfChars, limitedBy: text.startIndex ) ?? text.startIndex return text[..<frontEnd] + "\n...[truncated]...\n" + text[backStart...] } }}:bulb: Tip: The default truncation strategy is
keepEnd, which preserves the most recently dictated text. This works best for real-time dictation where the latest content is most relevant. Switch tokeepBeginningwhen processing complete documents.
9. Latency Optimization Strategies
Section titled “9. Latency Optimization Strategies”Minimizing latency is critical for VaulType because the user is waiting for their speech to be transformed and injected into the active application. The goal is sub-3-second total pipeline time from end-of-speech to text injection.
Latency Budget (Target: < 3 seconds total)+-------+-------+-------+-------+-------+| ASR | Mode | Templ | LLM | Inject|| 0.5s | 0.01s | 0.01s | 2.0s | 0.1s |+-------+-------+-------+-------+-------+ <---- Optimization focus ---->9.1 Model Preloading
Section titled “9.1 Model Preloading”Load the LLM model at application startup so it is ready for the first inference request.
/// Preloads the LLM model at application startup./// Called from the AppDelegate or App struct's initialization.@MainActorfinal class AppBootstrap {
static func preloadLLM(llmService: LLMService, settings: AppSettings) { Task.detached(priority: .userInitiated) { do { try await llmService.switchBackend( to: settings.preferredBackend, model: settings.activeModelPath ) } catch { // Log but do not crash -- LLM will be loaded on first use // or fallback to raw mode. Logger.llm.error("Failed to preload LLM: \(error.localizedDescription)") } } }}Warm-up inference:
/// Runs a minimal inference to warm up the Metal pipeline and JIT caches.func warmUp(backend: any LLMBackend) async { _ = try? await backend.generate( prompt: "Hello", options: GenerationOptions(maxTokens: 1, temperature: 0) )}:information_source: Info: The first inference after model load is typically 2-3x slower than subsequent inferences because Metal shaders must be compiled and cached. The warm-up call eliminates this cold-start penalty.
9.2 Speculative Decoding
Section titled “9.2 Speculative Decoding”Speculative decoding uses a smaller “draft” model to generate candidate tokens, which are then verified by the main model in a single batch. This can improve generation speed by 1.5-2x.
/// Configuration for speculative decoding./// Uses a smaller draft model to speed up token generation.struct SpeculativeDecodingConfig { /// Path to the smaller draft model (e.g., Qwen2.5-0.5B). let draftModelPath: String
/// Number of draft tokens to generate before verification. let draftTokenCount: Int // Typically 4-8
/// Whether speculative decoding is enabled. let isEnabled: Bool
static let `default` = SpeculativeDecodingConfig( draftModelPath: "", draftTokenCount: 6, isEnabled: false // Disabled by default; requires a second model )}:warning: Warning: Speculative decoding requires loading a second (smaller) model alongside the main model, which increases memory usage by approximately 500 MB - 1 GB. Only enable this on machines with 16 GB or more of unified memory.
9.3 Batch Size Tuning
Section titled “9.3 Batch Size Tuning”The batch size controls how many tokens are processed in a single forward pass during prompt evaluation. Larger batch sizes improve prompt processing speed but use more memory.
| Batch Size | Prompt Eval Speed | Memory Overhead | Recommended For |
|---|---|---|---|
| 128 | Baseline | Minimal | Machines with < 8 GB RAM |
| 256 | ~1.3x | Low | Default for 8 GB machines |
| 512 | ~1.8x | Moderate | Default for 16 GB+ machines |
| 1024 | ~2.2x | High | 32 GB+ machines, long prompts |
| 2048 | ~2.5x | Very high | 64 GB+ machines only |
/// Determines the optimal batch size based on available system memory.func optimalBatchSize() -> Int32 { let totalMemory = ProcessInfo.processInfo.physicalMemory switch totalMemory { case ..<(8 * 1_073_741_824): // < 8 GB return 256 case ..<(16 * 1_073_741_824): // < 16 GB return 512 case ..<(32 * 1_073_741_824): // < 32 GB return 1024 default: // >= 32 GB return 2048 }}9.4 Metal GPU Layer Offloading
Section titled “9.4 Metal GPU Layer Offloading”As described in Section 2.3, offloading model layers to the Metal GPU is the single most impactful optimization. The key configuration points:
/// Comprehensive GPU offloading configuration.struct MetalConfig { /// Number of layers to offload (-1 = all). var gpuLayers: Int32 = -1
/// Whether to use Metal for matrix multiplications. var useMetalMatMul: Bool = true
/// Metal buffer size for tensor allocations (bytes). var metalBufferSize: UInt64 = 512 * 1024 * 1024 // 512 MB
/// Whether to use float16 for Metal operations (faster, slight quality loss). var useFloat16: Bool = true
/// Auto-configure based on the system's Metal device capabilities. static func autoDetect() -> MetalConfig { var config = MetalConfig()
guard let device = MTLCreateSystemDefaultDevice() else { // No Metal device available (very old Intel Mac) config.gpuLayers = 0 config.useMetalMatMul = false return config }
let vram = device.recommendedMaxWorkingSetSize
// Apple Silicon: use all GPU layers if device.supportsFamily(.apple7) { // M1 and later config.gpuLayers = -1 config.useFloat16 = true } // AMD discrete GPU (Intel Macs) else if vram > 2 * 1_073_741_824 { // > 2 GB VRAM config.gpuLayers = 20 // Partial offload config.useFloat16 = false // AMD compatibility } // Integrated graphics or insufficient VRAM else { config.gpuLayers = 0 config.useMetalMatMul = false }
return config }}:apple: macOS-specific: On Apple Silicon, the unified memory architecture means that GPU and CPU share the same physical memory. Setting
gpuLayers = -1does not “use more memory” — it routes computations through the GPU cores instead of the CPU cores, which is substantially faster for matrix multiplications.
10. Error Handling and Fallback
Section titled “10. Error Handling and Fallback”10.1 Error Types
Section titled “10.1 Error Types”The LLM pipeline defines a comprehensive error hierarchy:
/// All errors that can occur in the LLM processing pipeline.public enum LLMPipelineError: LocalizedError {
// Backend errors case backendNotLoaded case backendUnavailable(backend: String, reason: String)
// Model errors case modelNotFound(path: String) case modelCorrupted(path: String) case modelIncompatible(model: String, reason: String)
// Inference errors case generationFailed(underlying: Error) case generationTimeout(seconds: TimeInterval) case emptyOutput
// Resource errors case insufficientMemory(available: UInt64, required: UInt64) case insufficientDiskSpace(available: UInt64, required: UInt64)
// Template errors case templateNotFound(mode: ProcessingMode) case templateResolutionFailed(variable: String)
public var errorDescription: String? { switch self { case .backendNotLoaded: return "No LLM backend is loaded" case .backendUnavailable(let backend, let reason): return "\(backend) backend unavailable: \(reason)" case .modelNotFound(let path): return "Model file not found: \(path)" case .modelCorrupted(let path): return "Model file is corrupted: \(path)" case .modelIncompatible(let model, let reason): return "Model '\(model)' is incompatible: \(reason)" case .generationFailed(let underlying): return "Text generation failed: \(underlying.localizedDescription)" case .generationTimeout(let seconds): return "Text generation timed out after \(Int(seconds))s" case .emptyOutput: return "Model produced empty output" case .insufficientMemory(let available, let required): let avail = available / 1_048_576 let req = required / 1_048_576 return "Insufficient memory: \(avail) MB available, \(req) MB required" case .insufficientDiskSpace(let available, let required): let avail = available / 1_048_576 let req = required / 1_048_576 return "Insufficient disk space: \(avail) MB available, \(req) MB required" case .templateNotFound(let mode): return "No template found for mode: \(mode.rawValue)" case .templateResolutionFailed(let variable): return "Failed to resolve template variable: {\(variable)}" } }}10.2 Fallback Chain
Section titled “10.2 Fallback Chain”When the LLM pipeline encounters an error, VaulType follows a defined fallback chain to ensure the user always gets some output:
Fallback Chain:
LLM Generation | | failure v Retry with lower temperature (0.1) | | failure v Switch to alternate backend (if available) (e.g., llama.cpp fails -> try Ollama) | | failure v Switch to smaller model (if available) (e.g., 3B model fails -> try 1.5B model) | | failure v Inject raw text (Raw mode fallback) | | always succeeds v User sees unprocessed transcriptionImplementation:
/// Executes the LLM processing pipeline with full fallback chain.public actor ResilientTextProcessor {
private let primaryBackend: any LLMBackend private let fallbackBackend: (any LLMBackend)? private let fallbackModelPath: String? private let maxRetries: Int = 2 private let retryDelay: TimeInterval = 0.5
public init( primaryBackend: any LLMBackend, fallbackBackend: (any LLMBackend)? = nil, fallbackModelPath: String? = nil ) { self.primaryBackend = primaryBackend self.fallbackBackend = fallbackBackend self.fallbackModelPath = fallbackModelPath }
/// Processes text with full fallback chain. /// Guaranteed to return a result (worst case: raw text). public func process( prompt: String, rawText: String, options: GenerationOptions ) async -> FallbackResult { let startTime = CFAbsoluteTimeGetCurrent()
// Step 1: Try primary backend do { let result = try await withTimeout(options.timeout) { try await self.primaryBackend.generate( prompt: prompt, options: options ) } if !result.isEmpty { return FallbackResult( text: result, source: .primary, elapsed: CFAbsoluteTimeGetCurrent() - startTime ) } } catch { Logger.llm.warning( "Primary backend failed: \(error.localizedDescription)" ) }
// Step 2: Retry with conservative settings do { var conservativeOptions = options conservativeOptions.temperature = 0.1 conservativeOptions.maxTokens = min(options.maxTokens, 256)
let result = try await withTimeout(options.timeout) { try await self.primaryBackend.generate( prompt: prompt, options: conservativeOptions ) } if !result.isEmpty { return FallbackResult( text: result, source: .primaryRetry, elapsed: CFAbsoluteTimeGetCurrent() - startTime ) } } catch { Logger.llm.warning( "Primary retry failed: \(error.localizedDescription)" ) }
// Step 3: Try fallback backend (e.g., Ollama if llama.cpp failed) if let fallback = fallbackBackend, await fallback.isReady { do { let result = try await withTimeout(options.timeout) { try await fallback.generate( prompt: prompt, options: options ) } if !result.isEmpty { return FallbackResult( text: result, source: .fallbackBackend, elapsed: CFAbsoluteTimeGetCurrent() - startTime ) } } catch { Logger.llm.warning( "Fallback backend failed: \(error.localizedDescription)" ) } }
// Step 4: Final fallback -- return raw text Logger.llm.error( "All LLM backends failed. Falling back to raw text." ) return FallbackResult( text: rawText, source: .rawFallback, elapsed: CFAbsoluteTimeGetCurrent() - startTime ) }
// MARK: - Timeout Helper
private func withTimeout<T: Sendable>( _ timeout: TimeInterval, operation: @escaping @Sendable () async throws -> T ) async throws -> T { try await withThrowingTaskGroup(of: T.self) { group in group.addTask { try await operation() } group.addTask { try await Task.sleep(for: .seconds(timeout)) throw LLMPipelineError.generationTimeout(seconds: timeout) } let result = try await group.next()! group.cancelAll() return result } }}
// MARK: - Fallback Result
public struct FallbackResult: Sendable {
public enum Source: String, Sendable { case primary = "Primary backend" case primaryRetry = "Primary backend (retry)" case fallbackBackend = "Fallback backend" case rawFallback = "Raw text (no LLM)" }
public let text: String public let source: Source public let elapsed: TimeInterval
public var didFallback: Bool { source != .primary }}10.3 Timeout Handling
Section titled “10.3 Timeout Handling”Each processing mode has a configurable timeout. If the LLM does not produce output within the timeout, the pipeline falls back.
| Processing Mode | Default Timeout | Max Timeout (User Configurable) |
|---|---|---|
| Clean | 15s | 60s |
| Structure | 20s | 60s |
| Prompt | 25s | 90s |
| Code | 15s | 60s |
| Custom | 30s | 120s |
:warning: Warning: If inference consistently times out, this usually indicates the model is too large for the available hardware. Consider switching to a smaller model (e.g., Qwen2.5-1.5B) or enabling more GPU layers.
10.4 Memory Pressure Handling
Section titled “10.4 Memory Pressure Handling”VaulType monitors system memory pressure and takes protective action to prevent the system from becoming unresponsive.
import Foundationimport os
/// Monitors system memory pressure and adjusts LLM behavior accordingly.final class MemoryPressureMonitor {
private let source: DispatchSourceMemoryPressure private let llmService: LLMService
enum PressureLevel: Sendable { case normal case warning case critical }
private(set) var currentLevel: PressureLevel = .normal
init(llmService: LLMService) { self.llmService = llmService self.source = DispatchSource.makeMemoryPressureSource( eventMask: [.warning, .critical], queue: .global(qos: .utility) )
source.setEventHandler { [weak self] in guard let self else { return } let event = self.source.data if event.contains(.critical) { self.handleCriticalPressure() } else if event.contains(.warning) { self.handleWarningPressure() } } source.activate() }
deinit { source.cancel() }
private func handleWarningPressure() { currentLevel = .warning Logger.llm.warning("Memory pressure: WARNING. Reducing LLM batch size.") // Notify the LLM service to reduce batch size and context window Task { // Reduce context window to conserve memory // This is a signal, not a direct operation, because the // LlamaContext would need to be recreated. } }
private func handleCriticalPressure() { currentLevel = .critical Logger.llm.error("Memory pressure: CRITICAL. Unloading LLM model.") Task { // Unload the model to free memory immediately await llmService.unloadCurrentModel() // Future requests will fall back to raw text mode } }}:x: Error: If you see “Memory pressure: CRITICAL. Unloading LLM model.” in the logs, VaulType has unloaded the LLM to prevent system instability. All text processing will fall back to Raw mode until the user manually reloads a model or memory pressure subsides. Consider using a smaller model or closing other memory-intensive applications.
11. LLM Processing Pipeline Architecture
Section titled “11. LLM Processing Pipeline Architecture”This section provides the complete end-to-end view of the LLM processing pipeline as implemented in Phases 1-5.
Implemented Pipeline (Phase 5 Final)
Section titled “Implemented Pipeline (Phase 5 Final)”AudioCaptureService (AVAudioEngine tap, 16kHz mono Float32) → WhisperService (whisper.cpp v1.7.4 transcription, Metal GPU) → VoicePrefixDetector (detects "code mode:", "clean this up:", etc.; strips prefix) → VocabularyService (spoken→replacement pairs, per-app then global) → CommandDetector (wake phrase prefix match → command vs dictation) → CustomCommandExecutor (SwiftData CustomCommand evaluation) → AppAliasResolver (AppProfile.shortcutAliases) → GlobalAliasResolver (UserSettings.globalShortcutAliases) → ProcessingModeRouter ├── Raw → direct output (no LLM) └── Clean / Structure / Prompt / Code / Custom → PromptTemplateEngine (variable substitution) → LLMService (llama.cpp b8059, Metal GPU) → OverlayWindow (optional edit-before-inject) → TextInjectionService (CGEvent or clipboard paste) → DictationHistory (SwiftData DictationEntry)DictationController
Section titled “DictationController”DictationController is the pipeline orchestrator. It owns and wires all pipeline components:
- Receives hotkey events from
HotkeyManager - Starts/stops
AudioCaptureService - Passes audio to
WhisperServiceafter VAD silence trim - Feeds transcript through the full pipeline above
- Posts status updates to
AppStatefor the menu bar and overlay UI
This section also provides a simplified view of the LLM-specific portion of the pipeline:
+====================================================================+|| VaulType LLM Processing Pipeline ||+====================================================================+
User speaks into microphone | v +------------------+ | AVAudioEngine | (see SPEECH_RECOGNITION.md) | Audio Capture | +--------+---------+ | v +------------------+ | whisper.cpp | (see SPEECH_RECOGNITION.md) | Transcription | +--------+---------+ | | Raw text: "i need to fix the login bug" v +------------------+ | TextProcessor | | .process() | +--------+---------+ | | Check processing mode v +------------------+ +------------------+ | Mode == .raw? |---->| Return raw text |---> Inject +--------+---------+ yes +------------------+ | no v +------------------+ | Fetch template | | from SwiftData | +--------+---------+ | v +------------------+ | TemplateEngine | | .resolve() | | - {text} | | - {language} | | - {app_name} | | - {context} | +--------+---------+ | | Fully resolved prompt string v +------------------+ | TokenBudget | | .fits()? | +--------+---------+ | +-----+-----+ | fits | too long v v | +------------------+ | | TruncationStrategy| | | .truncate() | | +--------+---------+ | | +------+------+ | v +------------------+ | LLMBackend | | .generate() | +--------+---------+ | +-----+-----+ | ok | error v v | +------------------+ | | FallbackChain | | | 1. Retry | | | 2. Alt backend | | | 3. Raw fallback | | +--------+---------+ | | +------+------+ | v +------------------+ | Post-process | | - Trim whitespace| | - Validate output| +--------+---------+ | v +------------------+ | CGEvent Injection| (see ../architecture/ARCHITECTURE.md) | into active app | +------------------+Complete pipeline timing breakdown (typical, M2 Pro, Qwen2.5-3B):
| Stage | Duration | Cumulative |
|---|---|---|
| Audio capture (last buffer) | 0ms | 0ms |
| whisper.cpp transcription | ~500ms | ~500ms |
| Mode selection | <1ms | ~500ms |
| Template resolution | <1ms | ~500ms |
| Token counting | ~5ms | ~505ms |
| LLM prompt evaluation | ~350ms | ~855ms |
| LLM token generation (100 tokens) | ~2,100ms | ~2,955ms |
| Post-processing | <1ms | ~2,956ms |
| CGEvent injection | ~50ms | ~3,006ms |
| Total | ~3.0s |
:white_check_mark: Success: On Apple Silicon with a 3B model, the complete pipeline from end-of-speech to text injection is consistently under 3.5 seconds for typical dictation lengths (10-50 words).
12. Configuration Reference
Section titled “12. Configuration Reference”All LLM-related configuration is stored in UserDefaults (for simple preferences) and SwiftData (for templates and model metadata).
| Setting | Key | Default | Range |
|---|---|---|---|
| Active backend | llm.backend | llamaCpp | llamaCpp, ollama |
| Model path (llama.cpp) | llm.modelPath | (bundled model) | Valid file path |
| Model name (Ollama) | llm.ollamaModel | qwen2.5:3b-instruct-q4_K_M | Valid Ollama model |
| Context window | llm.contextSize | 4096 | 512 - 131072 |
| GPU layers | llm.gpuLayers | -1 (all) | -1 to model layer count |
| Batch size | llm.batchSize | 512 | 64 - 2048 |
| Processing mode | llm.mode | clean | See ProcessingMode |
| Temperature | llm.temperature | 0.3 | 0.0 - 2.0 |
| Top-P | llm.topP | 0.9 | 0.0 - 1.0 |
| Top-K | llm.topK | 40 | 1 - 100 |
| Max output tokens | llm.maxTokens | 512 | 1 - 4096 |
| Timeout (seconds) | llm.timeout | 30 | 5 - 120 |
| Truncation strategy | llm.truncation | keepEnd | See TruncationStrategy |
| Speculative decoding | llm.speculative | false | true, false |
| Memory lock (mlock) | llm.useMlock | false | true, false |
| Ollama host | llm.ollamaHost | 127.0.0.1 | Valid hostname/IP |
| Ollama port | llm.ollamaPort | 11434 | 1024 - 65535 |
13. Related Documentation
Section titled “13. Related Documentation”| Document | Description |
|---|---|
| Architecture Overview | System architecture, module boundaries, data flow |
| Speech Recognition | whisper.cpp integration, audio capture, transcription pipeline |
| Model Management | Model download, storage, updates, GGUF file handling |
| API Documentation | Internal API reference for all modules |
| Tech Stack | Full technology stack and dependency details |
| Security | Security model, privacy guarantees, threat analysis |
| Permissions | macOS permissions (microphone, accessibility) |
| Database Schema | SwiftData models and persistence layer |
This document is part of the VaulType project documentation. VaulType is licensed under GPL-3.0. For more information, see the LICENSE file in the repository root.