
LLM Processing Pipeline

Last Updated: 2026-02-20 · Component: LLM Processing Engine · Module: VaulType/Services/LLM/ · Status: Complete (Phases 2-5) · License: GPL-3.0



VaulType’s LLM Processing Pipeline transforms raw speech-to-text transcriptions into polished, context-aware text output. The entire pipeline runs locally on the user’s machine, maintaining VaulType’s zero-network architecture: no cloud calls, no telemetry, no data exfiltration.

  • llama.cpp (version b8059) — compiled as a static library linked directly into VaulType. This is the sole LLM backend. Metal GPU acceleration is enabled with GGML_METAL_EMBED_LIBRARY=ON (no separate .metallib needed).
  • Ollama backend removed — the OllamaProvider was removed during Phase 5 refactoring. llama.cpp is the only backend.
  • LlamaContext.swift lives in VaulType/Services/LLM/ and mirrors the WhisperContext pattern.
  • llama.cpp shares ggml libs with whisper.cpp — only -lllama added to avoid duplicate ggml linking.
AudioCaptureService (AVAudioEngine)
→ WhisperService (whisper.cpp transcription)
→ VoicePrefixDetector (strips mode prefix, detects mode switch)
→ VocabularyService (applies spoken→replacement pairs)
→ CommandDetector (command vs. dictation classification)
→ CustomCommandExecutor
→ AppAliasResolver (app-specific shortcut aliases)
→ GlobalAliasResolver (user-defined global aliases)
→ ProcessingModeRouter → PromptTemplateEngine → LLMService (llama.cpp)
→ OverlayWindow (optional edit-before-inject)
→ TextInjectionService (CGEvent/clipboard)
→ DictationHistory (SwiftData)
| Mode | LLM Required | Description |
| --- | --- | --- |
| Raw | No | Unprocessed whisper output |
| Clean | Yes | Grammar, punctuation, remove filler words |
| Structure | Yes | Paragraphs, bullet lists, headings |
| Prompt | Yes | User-defined prompt template |
| Code | Yes | Convert spoken instructions to source code |
| Custom | Yes | Fully user-defined pipeline |
+------------------------------------------------------------------+
| VaulType LLM Pipeline |
| |
| Speech Audio |
| | |
| v |
| +-----------------+ +-----------------+ |
| | whisper.cpp |---->| Raw Transcript | |
| | (ASR Engine) | | "i need to fix | |
| +-----------------+ | the login bug" | |
| +---------+-------+ |
| | |
| +---------v---------+ |
| | Processing Mode | |
| | Selection | |
| +---------+---------+ |
| | |
| +----------+----------+----------+-------+ |
| | | | | | |
| v v v v v |
| [Raw] [Clean] [Structure] [Prompt] [Code] |
| | | | | | |
| | +-----v----------v----------v--------v---+ |
| | | Prompt Template Engine | |
| | | - Variable substitution | |
| | | - Context injection | |
| | | - Token budget calculation | |
| | +-----+----------------------------------+ |
| | | |
| | +-----v----------------------------------+ |
| | | LLM Backend (llama.cpp) | |
| | | - Model loading | |
| | | - Inference | |
| | | - Token generation | |
| | +-----+----------------------------------+ |
| | | |
| +-----+----+ |
| | |
| +-----v-----------+ |
| | Post-Processing | |
| | & Injection | |
| +-----------------+ |
+------------------------------------------------------------------+

Security: All LLM inference occurs on-device via llama.cpp. No model data, prompts, or outputs ever leave the machine. The Ollama backend was removed in Phase 5.


llama.cpp (version b8059) is compiled as a static library (.a) via CMake and linked directly into the VaulType binary. This eliminates any runtime dependency on external executables or dynamic libraries.

Build script: scripts/setup-llama.sh

Terminal window
# Run the setup script (also executed by Xcode build phase and CI)
./scripts/setup-llama.sh

The script clones llama.cpp b8059 into vendor/llama.cpp/, builds it with CMake, and installs the static libraries into vendor/llama.cpp/build/lib/.

Key CMake flags:

| Flag | Purpose |
| --- | --- |
| GGML_METAL=ON | Enable Metal GPU acceleration |
| GGML_METAL_EMBED_LIBRARY=ON | Embed Metal shaders in the binary (no separate .metallib) |
| LLAMA_BUILD_EXAMPLES=OFF | Skip building example executables |
| CMAKE_BUILD_TYPE=Release | Optimized build |
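
For reference, these flags correspond roughly to the following CMake invocation. This is an illustrative sketch of what scripts/setup-llama.sh does, not the script verbatim; consult the script for the authoritative configuration.

Terminal window
# Approximate configuration performed by scripts/setup-llama.sh (illustrative)
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_METAL=ON \
  -DGGML_METAL_EMBED_LIBRARY=ON \
  -DLLAMA_BUILD_EXAMPLES=OFF
cmake --build build --config Release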

Linking in Xcode: The Xcode project links against vendor/llama.cpp/build/lib/libllama.a and the shared ggml libraries. llama.cpp shares ggml with whisper.cpp — only -lllama is added; ggml is not duplicated.

llama.cpp is pinned to tag b8059 to ensure reproducible builds. Updates must be tested against all supported GGUF models before merging.

GGML_METAL_EMBED_LIBRARY=ON embeds the Metal shader library directly into the binary, eliminating the need to ship a separate .metallib file. Metal GPU acceleration works on Apple Silicon and Intel Macs with AMD GPUs.

Since llama.cpp is a C/C++ library, a bridging header and Swift-friendly wrapper are required.

Public bridging header (Sources/CLlama/include/llama_bridge.h):

Sources/CLlama/include/llama_bridge.h
// This is a C header file -- shown here for completeness.
#ifndef LLAMA_BRIDGE_H
#define LLAMA_BRIDGE_H
#include "llama.h"
#include "ggml.h"
// Re-export the core llama.cpp API functions that VaulType uses.
// Swift can call these directly through the CLlama module.
#endif /* LLAMA_BRIDGE_H */

Swift module map (Sources/CLlama/include/module.modulemap):

module CLlama {
    header "llama_bridge.h"
    export *
}

Swift interop wrapper (Sources/LlamaSwift/LlamaContext.swift):

import Foundation
import CLlama
/// Wraps a llama.cpp context for safe use from Swift.
/// Manages the lifecycle of the underlying C model and context pointers.
public final class LlamaContext: @unchecked Sendable {
// MARK: - Properties
private let model: OpaquePointer // llama_model *
private let context: OpaquePointer // llama_context *
private let sampler: OpaquePointer // llama_sampler *
private let queue = DispatchQueue(label: "com.vaultype.llama", qos: .userInitiated)
public let contextSize: Int32
public let modelPath: String
// MARK: - Initialization
/// Loads a GGUF model and creates an inference context.
/// - Parameters:
/// - path: Absolute path to the `.gguf` model file.
/// - contextSize: Maximum context window size in tokens.
/// - gpuLayers: Number of layers to offload to Metal GPU (-1 for all).
/// - seed: Random seed for sampling (0 for random).
/// - Throws: `LlamaError` if the model cannot be loaded.
public init(
modelPath path: String,
contextSize: Int32 = 4096,
gpuLayers: Int32 = -1,
seed: UInt32 = 0
) throws {
self.modelPath = path
self.contextSize = contextSize
// Configure model parameters
var modelParams = llama_model_default_params()
modelParams.n_gpu_layers = gpuLayers
modelParams.use_mmap = true // Memory-map the model file
modelParams.use_mlock = false // Do not pin in RAM
// Load model from GGUF file
guard let loadedModel = llama_load_model_from_file(path, modelParams) else {
throw LlamaError.modelLoadFailed(path: path)
}
self.model = loadedModel
// Configure context parameters
var ctxParams = llama_context_default_params()
ctxParams.n_ctx = UInt32(contextSize)
ctxParams.n_batch = 512
ctxParams.n_threads = UInt32(ProcessInfo.processInfo.activeProcessorCount)
ctxParams.n_threads_batch = UInt32(ProcessInfo.processInfo.activeProcessorCount)
ctxParams.seed = seed
// Create inference context
guard let ctx = llama_new_context_with_model(loadedModel, ctxParams) else {
llama_free_model(loadedModel)
throw LlamaError.contextCreationFailed
}
self.context = ctx
// Configure sampler chain
let samplerParams = llama_sampler_chain_default_params()
guard let samplerChain = llama_sampler_chain_init(samplerParams) else {
llama_free(ctx)
llama_free_model(loadedModel)
throw LlamaError.samplerCreationFailed
}
// Temperature + top-p sampling
llama_sampler_chain_add(samplerChain, llama_sampler_init_temp(0.3))
llama_sampler_chain_add(samplerChain, llama_sampler_init_top_p(0.9, 1))
llama_sampler_chain_add(samplerChain, llama_sampler_init_dist(seed))
self.sampler = samplerChain
}
deinit {
llama_sampler_free(sampler)
llama_free(context)
llama_free_model(model)
}
// MARK: - Tokenization
/// Tokenizes a string into llama tokens.
public func tokenize(_ text: String, addBos: Bool = true) -> [llama_token] {
let utf8 = text.utf8CString
let maxTokens = Int32(utf8.count) + (addBos ? 1 : 0)
var tokens = [llama_token](repeating: 0, count: Int(maxTokens))
let count = llama_tokenize(model, text, Int32(text.utf8.count),
&tokens, maxTokens, addBos, false)
guard count >= 0 else { return [] }
return Array(tokens.prefix(Int(count)))
}
/// Detokenizes a token back to its string representation.
public func detokenize(_ token: llama_token) -> String {
var buffer = [CChar](repeating: 0, count: 256)
let length = llama_token_to_piece(model, token, &buffer, 256, 0, false)
guard length > 0 else { return "" }
return String(cString: buffer)
}
// MARK: - Inference
/// Runs inference on the given prompt and returns generated text.
/// - Parameters:
/// - prompt: The input prompt string.
/// - maxTokens: Maximum number of tokens to generate.
/// - stopSequences: Strings that cause generation to halt.
/// - Returns: The generated text.
public func generate(
prompt: String,
maxTokens: Int = 512,
stopSequences: [String] = []
) async throws -> String {
try await withCheckedThrowingContinuation { continuation in
queue.async { [self] in
do {
let result = try self.syncGenerate(
prompt: prompt,
maxTokens: maxTokens,
stopSequences: stopSequences
)
continuation.resume(returning: result)
} catch {
continuation.resume(throwing: error)
}
}
}
}
private func syncGenerate(
prompt: String,
maxTokens: Int,
stopSequences: [String]
) throws -> String {
// Clear KV cache for fresh generation
llama_kv_cache_clear(context)
// Tokenize prompt
let tokens = tokenize(prompt, addBos: true)
guard !tokens.isEmpty else {
throw LlamaError.tokenizationFailed
}
guard tokens.count < contextSize else {
throw LlamaError.contextOverflow(
tokenCount: tokens.count,
contextSize: Int(contextSize)
)
}
// Create batch and decode prompt
var batch = llama_batch_init(Int32(tokens.count), 0, 1)
defer { llama_batch_free(batch) }
for (i, token) in tokens.enumerated() {
llama_batch_add(&batch, token, Int32(i), [0], i == tokens.count - 1)
}
guard llama_decode(context, batch) == 0 else {
throw LlamaError.decodeFailed
}
// Generate tokens
var output = ""
var generatedCount = 0
while generatedCount < maxTokens {
let newToken = llama_sampler_sample(sampler, context, -1)
// Check for end-of-sequence
if llama_token_is_eog(model, newToken) {
break
}
let piece = detokenize(newToken)
output += piece
generatedCount += 1
// Check stop sequences
if stopSequences.contains(where: { output.hasSuffix($0) }) {
for stop in stopSequences where output.hasSuffix(stop) {
output = String(output.dropLast(stop.count))
}
break
}
// Prepare next batch (single token)
llama_batch_clear(&batch)
llama_batch_add(&batch, newToken,
Int32(tokens.count + generatedCount), [0], true)
guard llama_decode(context, batch) == 0 else {
throw LlamaError.decodeFailed
}
}
return output.trimmingCharacters(in: .whitespacesAndNewlines)
}
}

:warning: Warning: LlamaContext is marked @unchecked Sendable because all access to the underlying C pointers is serialized through a dedicated DispatchQueue. Do not access the C pointers from multiple threads without synchronization.

llama.cpp uses Apple’s Metal framework to offload tensor operations to the GPU. On Apple Silicon, this provides a substantial performance boost — typically 3-5x faster token generation compared to CPU-only inference.

How GPU offloading works:

+-------------------------------------------------+
| Model Layer Distribution |
| |
| Layer 0 [GPU] ======================== |
| Layer 1 [GPU] ======================== |
| Layer 2 [GPU] ======================== |
| ... |
| Layer N-2 [GPU] ======================== |
| Layer N-1 [GPU] ======================== |
| Layer N [CPU] ======================== |
| (embedding/output layers may stay on CPU) |
| |
| gpuLayers = -1 --> offload ALL layers to GPU |
| gpuLayers = 0 --> CPU only |
| gpuLayers = 20 --> first 20 layers on GPU |
+-------------------------------------------------+

Configuration for Metal acceleration:

import Foundation
import Metal

/// Determines the optimal GPU layer count based on available memory.
func optimalGPULayers(for modelSizeBytes: UInt64) -> Int32 {
    let device = MTLCreateSystemDefaultDevice()
    let availableVRAM = device?.recommendedMaxWorkingSetSize ?? 0
    // Reserve 512 MB for the KV cache and Metal overhead
    let reservedBytes: UInt64 = 512 * 1024 * 1024
    let usableVRAM = availableVRAM > reservedBytes
        ? availableVRAM - reservedBytes
        : 0
    if modelSizeBytes <= usableVRAM {
        return -1 // Offload all layers to GPU
    } else {
        // Estimate partial offload: proportional to available VRAM
        let ratio = Double(usableVRAM) / Double(modelSizeBytes)
        let estimatedLayers = Int32(ratio * 35) // Approximate layer count for 3B models
        return max(estimatedLayers, 0)
    }
}
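
A brief usage sketch: read the GGUF file size with FileManager and pass it to optimalGPULayers before constructing the context. The model path below is the same illustrative path used elsewhere in this document.

// Determine GPU offload based on the model file size (illustrative usage).
let modelPath = "/Users/me/Library/Application Support/VaulType/Models/qwen2.5-3b-instruct-q4_K_M.gguf"
let attrs = try FileManager.default.attributesOfItem(atPath: modelPath)
let modelSize = (attrs[.size] as? UInt64) ?? 0
let context = try LlamaContext(
    modelPath: modelPath,
    contextSize: 4096,
    gpuLayers: optimalGPULayers(for: modelSize)
)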

:apple: macOS-specific: Metal GPU acceleration is available on all Apple Silicon Macs and on Intel Macs with discrete AMD GPUs. Intel Macs with integrated graphics will fall back to CPU inference via the Accelerate framework.

:bulb: Tip: For best performance on Apple Silicon, set gpuLayers = -1 to offload the entire model to the GPU. A 3B parameter Q4_K_M model requires approximately 2 GB of VRAM, well within the unified memory of any M1 or later chip.

GGUF (GGML Universal File Format) is the model file format used by llama.cpp. VaulType uses memory mapping (mmap) to load models, which provides:

  • Fast startup — The OS maps the file into virtual memory without reading the entire file into RAM upfront.
  • Shared memory — If multiple processes load the same model file, the OS can share the physical memory pages.
  • Low resident memory — Only the pages actually accessed during inference are loaded into physical RAM.
+----------------------------------+
| GGUF File on Disk |
| +----------------------------+ |
| | Header (metadata, vocab) | | <-- Read into RAM immediately
| +----------------------------+ |
| | Tensor Data (weights) | | <-- Memory-mapped (mmap)
| | - Layer 0 weights | | Pages loaded on demand
| | - Layer 1 weights | | by the OS virtual memory
| | - ... | | subsystem
| | - Layer N weights | |
| +----------------------------+ |
+----------------------------------+

Memory mapping is enabled by default:

var modelParams = llama_model_default_params()
modelParams.use_mmap = true // Enable memory-mapped loading
modelParams.use_mlock = false // Do not pin pages in RAM (let OS manage)

:information_source: Info: Setting use_mlock = true pins all model pages in RAM, preventing the OS from swapping them out. This reduces latency jitter but increases resident memory. Only enable this if the system has ample free RAM (model size + 2 GB headroom).
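
If you do enable use_mlock, one way to respect the "ample free RAM" guideline is to compare the model size plus the 2 GB headroom against total physical memory before opting in. This is a minimal sketch of such a heuristic (it uses total RAM, since free RAM fluctuates), not VaulType's actual policy:

import Foundation

/// Illustrative heuristic: only pin model pages when total RAM comfortably
/// exceeds the model size plus 2 GB of headroom.
func shouldUseMlock(modelSizeBytes: UInt64) -> Bool {
    let headroom: UInt64 = 2 * 1024 * 1024 * 1024 // 2 GB
    return ProcessInfo.processInfo.physicalMemory > modelSizeBytes + headroom
}

var modelParams = llama_model_default_params()
modelParams.use_mmap = true
modelParams.use_mlock = shouldUseMlock(modelSizeBytes: 2_000_000_000) // ~2 GB model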

The complete LlamaContext wrapper is shown in Section 2.2. Additional helper types used throughout the LLM pipeline:

// MARK: - Error Types
/// Errors that can occur during llama.cpp operations.
public enum LlamaError: LocalizedError {
case modelLoadFailed(path: String)
case contextCreationFailed
case samplerCreationFailed
case tokenizationFailed
case contextOverflow(tokenCount: Int, contextSize: Int)
case decodeFailed
case generationTimeout
case memoryPressure(available: UInt64, required: UInt64)
case cancelled
public var errorDescription: String? {
switch self {
case .modelLoadFailed(let path):
return "Failed to load model from: \(path)"
case .contextCreationFailed:
return "Failed to create llama context"
case .samplerCreationFailed:
return "Failed to initialize token sampler"
case .tokenizationFailed:
return "Failed to tokenize input text"
case .contextOverflow(let count, let size):
return "Input tokens (\(count)) exceed context window (\(size))"
case .decodeFailed:
return "Model decode operation failed"
case .generationTimeout:
return "Text generation timed out"
case .memoryPressure(let available, let required):
return "Insufficient memory: \(available / 1_048_576) MB available, "
+ "\(required / 1_048_576) MB required"
case .cancelled:
return "Generation was cancelled"
}
}
}
// MARK: - Generation Options
/// Configuration options for a single generation request.
public struct GenerationOptions: Sendable {
public var maxTokens: Int = 512
public var temperature: Float = 0.3
public var topP: Float = 0.9
public var topK: Int32 = 40
public var repeatPenalty: Float = 1.1
public var stopSequences: [String] = []
public var timeout: TimeInterval = 30.0
public static let `default` = GenerationOptions()
public static let creative = GenerationOptions(
temperature: 0.7, topP: 0.95, topK: 50
)
public static let deterministic = GenerationOptions(
temperature: 0.1, topP: 0.5, topK: 10
)
public init(
maxTokens: Int = 512,
temperature: Float = 0.3,
topP: Float = 0.9,
topK: Int32 = 40,
repeatPenalty: Float = 1.1,
stopSequences: [String] = [],
timeout: TimeInterval = 30.0
) {
self.maxTokens = maxTokens
self.temperature = temperature
self.topP = topP
self.topK = topK
self.repeatPenalty = repeatPenalty
self.stopSequences = stopSequences
self.timeout = timeout
}
}

Note: The Ollama backend was removed during Phase 5 refactoring. VaulType uses llama.cpp exclusively. The sections below are retained for historical reference only.

| Criteria | llama.cpp (Embedded) | Ollama (Local Server) |
| --- | --- | --- |
| Latency | Lower (in-process) | Slightly higher (HTTP overhead) |
| Setup | Zero (bundled) | Requires Ollama install |
| Model management | Manual GGUF files | Ollama CLI (ollama pull) |
| Model switching | Requires reload (~2-5s) | Near-instant (Ollama caches) |
| Memory control | Fine-grained | Ollama manages |
| GPU layers | Configurable per model | Ollama auto-configures |
| Multiple models | One at a time | Ollama can serve multiple |
| Offline guarantee | Always works | Requires Ollama daemon running |
| Recommended for | Default / production use | Power users, model experimentation |

:bulb: Tip: Use the embedded llama.cpp backend as the default. Switch to Ollama if you frequently experiment with different models or want to share a single model instance across multiple applications.

Install Ollama:

Terminal window
# Install via Homebrew
brew install ollama
# Or download from https://ollama.com
# The macOS app installs the CLI and daemon automatically

Pull a recommended model:

Terminal window
# Pull Qwen2.5-3B-Instruct (recommended for VaulType)
ollama pull qwen2.5:3b-instruct-q4_K_M
# Pull Phi-3.5-mini as an alternative
ollama pull phi3.5:3.8b-mini-instruct-q4_K_M
# Pull Llama-3.2-3B-Instruct
ollama pull llama3.2:3b-instruct-q4_K_M
# Verify the model is available
ollama list

Start the Ollama server (if not running as a macOS service):

Terminal window
# Start the Ollama server daemon
ollama serve
# It will listen on localhost:11434 by default
# Verify it is running:
curl http://localhost:11434/api/tags

:lock: Security: Ollama binds to localhost (127.0.0.1) by default. VaulType only connects to localhost:11434. No data leaves the machine. If Ollama is configured to listen on 0.0.0.0, VaulType will still only connect to 127.0.0.1.

VaulType uses two Ollama REST API endpoints:

| Endpoint | Method | Purpose |
| --- | --- | --- |
| /api/tags | GET | List available models |
| /api/generate | POST | Run inference (non-streaming) |
| /api/generate (stream) | POST | Run inference (streaming) |

Request format for /api/generate:

{
  "model": "qwen2.5:3b-instruct-q4_K_M",
  "prompt": "<the full prompt text>",
  "stream": false,
  "options": {
    "temperature": 0.3,
    "top_p": 0.9,
    "top_k": 40,
    "repeat_penalty": 1.1,
    "num_predict": 512,
    "stop": ["\n\n---"]
  }
}

Response format:

{
  "model": "qwen2.5:3b-instruct-q4_K_M",
  "response": "The generated text output...",
  "done": true,
  "total_duration": 1234567890,
  "load_duration": 123456789,
  "prompt_eval_count": 42,
  "eval_count": 128,
  "eval_duration": 987654321
}
import Foundation
/// Client for the Ollama REST API running on localhost.
public actor OllamaClient {
// MARK: - Types
public struct OllamaGenerateRequest: Codable, Sendable {
let model: String
let prompt: String
let stream: Bool
let options: OllamaOptions?
struct OllamaOptions: Codable, Sendable {
let temperature: Float?
let top_p: Float?
let top_k: Int?
let repeat_penalty: Float?
let num_predict: Int?
let stop: [String]?
}
}
public struct OllamaGenerateResponse: Codable, Sendable {
let model: String
let response: String
let done: Bool
let total_duration: UInt64?
let load_duration: UInt64?
let prompt_eval_count: Int?
let eval_count: Int?
let eval_duration: UInt64?
}
public struct OllamaTagsResponse: Codable, Sendable {
let models: [OllamaModelInfo]
struct OllamaModelInfo: Codable, Sendable {
let name: String
let size: UInt64
let digest: String
}
}
public enum OllamaError: LocalizedError {
case serverNotRunning
case modelNotFound(String)
case requestFailed(statusCode: Int, body: String)
case decodingFailed
case timeout
public var errorDescription: String? {
switch self {
case .serverNotRunning:
return "Ollama server is not running on localhost:11434"
case .modelNotFound(let model):
return "Model '\(model)' not found. Run: ollama pull \(model)"
case .requestFailed(let code, let body):
return "Ollama request failed (\(code)): \(body)"
case .decodingFailed:
return "Failed to decode Ollama response"
case .timeout:
return "Ollama request timed out"
}
}
}
// MARK: - Properties
private let baseURL: URL
private let session: URLSession
// MARK: - Initialization
public init(
host: String = "127.0.0.1",
port: Int = 11434,
timeout: TimeInterval = 60.0
) {
self.baseURL = URL(string: "http://\(host):\(port)")!
let config = URLSessionConfiguration.ephemeral
config.timeoutIntervalForRequest = timeout
config.timeoutIntervalForResource = timeout
config.waitsForConnectivity = false
self.session = URLSession(configuration: config)
}
// MARK: - Public API
/// Lists all models available in the local Ollama instance.
public func listModels() async throws -> [String] {
let url = baseURL.appendingPathComponent("api/tags")
let (data, response) = try await performRequest(url: url)
guard let httpResponse = response as? HTTPURLResponse,
httpResponse.statusCode == 200 else {
throw OllamaError.serverNotRunning
}
let tagsResponse = try JSONDecoder().decode(
OllamaTagsResponse.self, from: data
)
return tagsResponse.models.map(\.name)
}
/// Checks whether the Ollama server is reachable.
public func isAvailable() async -> Bool {
do {
_ = try await listModels()
return true
} catch {
return false
}
}
/// Generates text using the specified model and prompt.
public func generate(
model: String,
prompt: String,
options: GenerationOptions = .default
) async throws -> String {
let url = baseURL.appendingPathComponent("api/generate")
let requestBody = OllamaGenerateRequest(
model: model,
prompt: prompt,
stream: false,
options: .init(
temperature: options.temperature,
top_p: options.topP,
top_k: Int(options.topK),
repeat_penalty: options.repeatPenalty,
num_predict: options.maxTokens,
stop: options.stopSequences.isEmpty ? nil : options.stopSequences
)
)
var request = URLRequest(url: url)
request.httpMethod = "POST"
request.setValue("application/json", forHTTPHeaderField: "Content-Type")
request.httpBody = try JSONEncoder().encode(requestBody)
let (data, response) = try await performRequest(request: request)
guard let httpResponse = response as? HTTPURLResponse else {
throw OllamaError.requestFailed(statusCode: 0, body: "No HTTP response")
}
switch httpResponse.statusCode {
case 200:
let generateResponse = try JSONDecoder().decode(
OllamaGenerateResponse.self, from: data
)
return generateResponse.response
.trimmingCharacters(in: .whitespacesAndNewlines)
case 404:
throw OllamaError.modelNotFound(model)
default:
let body = String(data: data, encoding: .utf8) ?? "Unknown error"
throw OllamaError.requestFailed(
statusCode: httpResponse.statusCode, body: body
)
}
}
// MARK: - Private Helpers
private func performRequest(url: URL) async throws -> (Data, URLResponse) {
do {
return try await session.data(from: url)
} catch let error as URLError where error.code == .cannotConnectToHost {
throw OllamaError.serverNotRunning
} catch let error as URLError where error.code == .timedOut {
throw OllamaError.timeout
}
}
private func performRequest(request: URLRequest) async throws -> (Data, URLResponse) {
do {
return try await session.data(for: request)
} catch let error as URLError where error.code == .cannotConnectToHost {
throw OllamaError.serverNotRunning
} catch let error as URLError where error.code == .timedOut {
throw OllamaError.timeout
}
}
}

:warning: Warning: The URLSession is configured with ephemeral configuration to avoid caching any prompt data or model responses to disk. This preserves the privacy-first architecture.


Both the llama.cpp backend and the (now-removed) Ollama backend conform to a unified protocol, enabling transparent backend switching.

import Foundation
/// Protocol that all LLM backends must conform to.
/// Enables transparent switching between llama.cpp and Ollama.
public protocol LLMBackend: Sendable {
/// A human-readable name for this backend (e.g., "llama.cpp", "Ollama").
var name: String { get }
/// Whether the backend is currently ready to serve requests.
var isReady: Bool { get async }
/// The name or identifier of the currently loaded model.
var currentModel: String? { get async }
/// Loads a model, preparing the backend for inference.
/// - Parameter modelIdentifier: Model path (llama.cpp) or model name (Ollama).
func loadModel(_ modelIdentifier: String) async throws
/// Unloads the current model, freeing resources.
func unloadModel() async
/// Generates text from the given prompt.
/// - Parameters:
/// - prompt: The full prompt string.
/// - options: Generation configuration.
/// - Returns: The generated text.
func generate(
prompt: String,
options: GenerationOptions
) async throws -> String
/// Counts the approximate number of tokens in the given text.
/// - Parameter text: The text to tokenize.
/// - Returns: Approximate token count.
func countTokens(in text: String) async -> Int
}
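
Because every backend sits behind this protocol, a lightweight stub can stand in for a real model in unit tests. The type below is a hypothetical test double, not part of VaulType's shipping code:

/// Hypothetical stub backend for tests: echoes a canned response.
actor StubBackend: LLMBackend {
    let name = "stub"
    private var model: String?
    var isReady: Bool { model != nil }
    var currentModel: String? { model }
    func loadModel(_ modelIdentifier: String) async throws { model = modelIdentifier }
    func unloadModel() async { model = nil }
    func generate(prompt: String, options: GenerationOptions) async throws -> String {
        "stubbed output for: \(prompt.prefix(20))..."
    }
    func countTokens(in text: String) async -> Int { max(1, text.utf8.count / 4) }
}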

llama.cpp backend adapter:

/// Adapts LlamaContext to the LLMBackend protocol.
public actor LlamaCppBackend: LLMBackend {
public let name = "llama.cpp"
private var context: LlamaContext?
private var loadedModelPath: String?
public var isReady: Bool {
context != nil
}
public var currentModel: String? {
loadedModelPath
}
public func loadModel(_ modelIdentifier: String) async throws {
// Unload previous model if any
await unloadModel()
let gpuLayers = optimalGPULayers(
for: try fileSize(at: modelIdentifier)
)
context = try LlamaContext(
modelPath: modelIdentifier,
contextSize: 4096,
gpuLayers: gpuLayers
)
loadedModelPath = modelIdentifier
}
public func unloadModel() async {
context = nil
loadedModelPath = nil
}
public func generate(
prompt: String,
options: GenerationOptions
) async throws -> String {
guard let context else {
throw LlamaError.contextCreationFailed
}
return try await context.generate(
prompt: prompt,
maxTokens: options.maxTokens,
stopSequences: options.stopSequences
)
}
public func countTokens(in text: String) async -> Int {
guard let context else { return text.count / 4 } // rough estimate
return context.tokenize(text, addBos: false).count
}
private func fileSize(at path: String) throws -> UInt64 {
let attrs = try FileManager.default.attributesOfItem(atPath: path)
return attrs[.size] as? UInt64 ?? 0
}
}

Ollama backend adapter:

/// Adapts OllamaClient to the LLMBackend protocol.
public actor OllamaBackend: LLMBackend {
public let name = "Ollama"
private let client: OllamaClient
private var modelName: String?
public init(client: OllamaClient = OllamaClient()) {
self.client = client
}
public var isReady: Bool {
get async {
await client.isAvailable() && modelName != nil
}
}
public var currentModel: String? {
modelName
}
public func loadModel(_ modelIdentifier: String) async throws {
// Verify model exists in Ollama
let models = try await client.listModels()
guard models.contains(where: { $0.hasPrefix(modelIdentifier) }) else {
throw OllamaClient.OllamaError.modelNotFound(modelIdentifier)
}
modelName = modelIdentifier
}
public func unloadModel() async {
modelName = nil
}
public func generate(
prompt: String,
options: GenerationOptions
) async throws -> String {
guard let modelName else {
throw OllamaClient.OllamaError.modelNotFound("No model loaded")
}
return try await client.generate(
model: modelName,
prompt: prompt,
options: options
)
}
public func countTokens(in text: String) async -> Int {
// Approximate: Ollama does not expose tokenization directly.
// Use a rough heuristic of ~4 characters per token for English text.
return max(1, text.utf8.count / 4)
}
}
import SwiftUI
import Combine
/// Manages the active LLM backend and provides a unified interface for
/// the rest of the application.
@Observable
public final class LLMService {
// MARK: - Types
public enum BackendType: String, Codable, CaseIterable, Sendable {
case llamaCpp = "llama.cpp"
case ollama = "Ollama"
}
public enum ServiceState: Sendable {
case idle
case loading
case ready
case processing
case error(Error)
}
// MARK: - Published State
public private(set) var state: ServiceState = .idle
public private(set) var activeBackendType: BackendType = .llamaCpp
// MARK: - Private
private var activeBackend: (any LLMBackend)?
private let llamaCppBackend = LlamaCppBackend()
private let ollamaBackend = OllamaBackend()
// MARK: - Backend Switching
/// Switches to the specified backend and loads the given model.
/// - Parameters:
/// - backendType: The backend to switch to.
/// - modelIdentifier: The model path (llama.cpp) or name (Ollama).
public func switchBackend(
to backendType: BackendType,
model modelIdentifier: String
) async throws {
state = .loading
// Unload current backend
if let current = activeBackend {
await current.unloadModel()
}
// Select and load new backend
let backend: any LLMBackend = switch backendType {
case .llamaCpp: llamaCppBackend
case .ollama: ollamaBackend
}
do {
try await backend.loadModel(modelIdentifier)
activeBackend = backend
activeBackendType = backendType
state = .ready
} catch {
state = .error(error)
throw error
}
}
/// Generates text using the active backend.
public func generate(
prompt: String,
options: GenerationOptions = .default
) async throws -> String {
guard let backend = activeBackend else {
throw LLMServiceError.noBackendLoaded
}
state = .processing
defer { state = .ready }
return try await backend.generate(prompt: prompt, options: options)
}
/// Counts tokens in the given text using the active backend.
public func countTokens(in text: String) async -> Int {
guard let backend = activeBackend else { return text.count / 4 }
return await backend.countTokens(in: text)
}
}
public enum LLMServiceError: LocalizedError {
case noBackendLoaded
public var errorDescription: String? {
switch self {
case .noBackendLoaded:
return "No LLM backend is loaded. Please select a model first."
}
}
}

Example: Switching backends at runtime:

// In a ViewModel or Settings handler:
let llmService = LLMService()
// Start with llama.cpp (default)
try await llmService.switchBackend(
to: .llamaCpp,
model: "/Users/me/Library/Application Support/VaulType/Models/qwen2.5-3b-instruct-q4_K_M.gguf"
)
// Later, switch to Ollama
try await llmService.switchBackend(
to: .ollama,
model: "qwen2.5:3b-instruct-q4_K_M"
)
// Generate text -- same API regardless of backend
let result = try await llmService.generate(
prompt: "Fix this text: i went to teh store",
options: .deterministic
)

VaulType is optimized for small, fast instruction-following models in the 1-4B parameter range. These models balance quality, speed, and memory usage for real-time text processing tasks.

| Model | Parameters | Quant | File Size | Context Window |
| --- | --- | --- | --- | --- |
| Qwen2.5-3B-Instruct (recommended) | 3.09B | Q4_K_M | 2.0 GB | 32,768 |
| Phi-3.5-mini-instruct | 3.82B | Q4_K_M | 2.4 GB | 128,000 |
| Llama-3.2-3B-Instruct | 3.21B | Q4_K_M | 2.0 GB | 8,192 |
| Qwen2.5-1.5B-Instruct (lightweight) | 1.54B | Q4_K_M | 1.0 GB | 32,768 |
| Gemma-2-2B-IT (alternative) | 2.61B | Q4_K_M | 1.7 GB | 8,192 |

Benchmarks were conducted on three representative Apple Silicon configurations and one Intel Mac. All tests used Q4_K_M quantization with all layers offloaded to GPU (gpuLayers = -1) where Metal was available.

Test methodology: Average of 50 inference runs. Input: 150 tokens. Output: 100 tokens. Temperature: 0.3.

| Model | Prompt Eval (tok/s) | Generation (tok/s) | Total Latency | RAM Usage | VRAM Usage |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-3B-Instruct | 285 | 32 | 3.8s | 2.3 GB | 2.0 GB |
| Phi-3.5-mini | 240 | 27 | 4.5s | 2.8 GB | 2.4 GB |
| Llama-3.2-3B-Instruct | 275 | 30 | 4.0s | 2.3 GB | 2.0 GB |
| Qwen2.5-1.5B-Instruct | 410 | 48 | 2.6s | 1.4 GB | 1.0 GB |

Apple Silicon M2 Pro (16 GB Unified Memory)

| Model | Prompt Eval (tok/s) | Generation (tok/s) | Total Latency | RAM Usage | VRAM Usage |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-3B-Instruct | 420 | 48 | 2.6s | 2.3 GB | 2.0 GB |
| Phi-3.5-mini | 355 | 40 | 3.1s | 2.8 GB | 2.4 GB |
| Llama-3.2-3B-Instruct | 400 | 45 | 2.8s | 2.3 GB | 2.0 GB |
| Qwen2.5-1.5B-Instruct | 620 | 72 | 1.8s | 1.4 GB | 1.0 GB |

Apple Silicon M3 Max (36 GB Unified Memory)

| Model | Prompt Eval (tok/s) | Generation (tok/s) | Total Latency | RAM Usage | VRAM Usage |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-3B-Instruct | 580 | 65 | 1.9s | 2.3 GB | 2.0 GB |
| Phi-3.5-mini | 490 | 55 | 2.3s | 2.8 GB | 2.4 GB |
| Llama-3.2-3B-Instruct | 550 | 62 | 2.1s | 2.3 GB | 2.0 GB |
| Qwen2.5-1.5B-Instruct | 850 | 95 | 1.3s | 1.4 GB | 1.0 GB |

Intel Mac (i7-9750H + AMD Radeon Pro 5500M, 16 GB RAM)

| Model | Prompt Eval (tok/s) | Generation (tok/s) | Total Latency | RAM Usage | VRAM Usage |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-3B-Instruct | 95 | 12 | 10.2s | 2.3 GB | N/A* |
| Phi-3.5-mini | 80 | 10 | 12.1s | 2.8 GB | N/A* |
| Llama-3.2-3B-Instruct | 90 | 11 | 10.8s | 2.3 GB | N/A* |
| Qwen2.5-1.5B-Instruct | 140 | 18 | 7.0s | 1.4 GB | N/A* |

*Intel Macs use CPU inference via the Accelerate framework. AMD GPU support through Metal is available but yields inconsistent improvements for LLM workloads on discrete AMD GPUs.

Decision Tree: Which Model Should I Use?
+------------------+
| Available RAM? |
+--------+---------+
|
+--------------+--------------+
| | |
< 6 GB 6-12 GB > 12 GB
| | |
v v v
+------------------+ +-----------+ +-----------+
| Qwen2.5-1.5B | | Qwen2.5 | | Phi-3.5 |
| (lightweight, | | -3B | | -mini |
| fits in 4 GB | | (best | | (largest |
| total RAM) | | balance) | | context) |
+------------------+ +-----------+ +-----------+

:information_source: Info: Qwen2.5-3B-Instruct is the recommended default because it offers the best combination of output quality, speed, and memory efficiency for text post-processing tasks. It consistently outperforms Llama-3.2-3B on grammar correction and formatting tasks while using the same memory footprint.

:bulb: Tip: If you need to process very long transcriptions (over 4,000 words), consider Phi-3.5-mini with its 128K context window. For most real-time dictation use, Qwen2.5-3B’s 32K context is more than sufficient.


VaulType’s prompt template system uses variable substitution with {variable_name} syntax. The following variables are available in all templates:

| Variable | Description | Example Value |
| --- | --- | --- |
| {text} | The raw transcription from whisper.cpp | "i need to fix the login bug before tomorrow" |
| {language} | Detected or configured language code | "en" |
| {app_name} | Name of the frontmost application | "Xcode" |
| {context} | Optional user-provided context string | "Writing a pull request description" |
| {timestamp} | Current ISO 8601 timestamp | "2026-02-13T14:30:00Z" |
| {word_count} | Approximate word count of input text | "42" |
| {mode} | Current processing mode name | "Clean" |

Variable resolution order:

  1. Built-in variables ({text}, {language}, {timestamp}, {word_count}, {mode})
  2. Application context ({app_name}, {context})
  3. User-defined variables (from Custom mode template configuration)
import Foundation
import SwiftData
/// Represents a reusable prompt template stored in SwiftData.
@Model
public final class PromptTemplate {
// MARK: - Stored Properties
/// Unique identifier for the template.
@Attribute(.unique)
public var id: UUID
/// Human-readable name shown in the UI.
public var name: String
/// The template string with {variable} placeholders.
public var templateBody: String
/// Description of what this template does.
public var templateDescription: String
/// The processing mode this template belongs to.
public var mode: ProcessingMode
/// Whether this is a built-in template (cannot be deleted).
public var isBuiltIn: Bool
/// Display order within the mode's template list.
public var sortOrder: Int
/// Creation timestamp.
public var createdAt: Date
/// Last modified timestamp.
public var updatedAt: Date
/// Optional system prompt prepended to the template.
public var systemPrompt: String?
/// Stop sequences for this template.
public var stopSequences: [String]
/// Recommended temperature for this template.
public var temperature: Float
/// Maximum tokens to generate.
public var maxTokens: Int
// MARK: - Initialization
public init(
name: String,
templateBody: String,
description: String = "",
mode: ProcessingMode = .custom,
isBuiltIn: Bool = false,
sortOrder: Int = 0,
systemPrompt: String? = nil,
stopSequences: [String] = [],
temperature: Float = 0.3,
maxTokens: Int = 512
) {
self.id = UUID()
self.name = name
self.templateBody = templateBody
self.templateDescription = description
self.mode = mode
self.isBuiltIn = isBuiltIn
self.sortOrder = sortOrder
self.createdAt = Date()
self.updatedAt = Date()
self.systemPrompt = systemPrompt
self.stopSequences = stopSequences
self.temperature = temperature
self.maxTokens = maxTokens
}
}
// MARK: - Processing Mode Enum
/// All available text processing modes.
public enum ProcessingMode: String, Codable, CaseIterable, Sendable {
case raw = "Raw"
case clean = "Clean"
case structure = "Structure"
case prompt = "Prompt"
case code = "Code"
case custom = "Custom"
/// Whether this mode requires LLM processing.
public var requiresLLM: Bool {
switch self {
case .raw: return false
default: return true
}
}
/// Default generation options for this mode.
public var defaultOptions: GenerationOptions {
switch self {
case .raw:
return .default
case .clean:
return GenerationOptions(
maxTokens: 512, temperature: 0.1,
topP: 0.5, topK: 10
)
case .structure:
return GenerationOptions(
maxTokens: 768, temperature: 0.2,
topP: 0.7, topK: 20
)
case .prompt:
return GenerationOptions(
maxTokens: 1024, temperature: 0.4,
topP: 0.9, topK: 40
)
case .code:
return GenerationOptions(
maxTokens: 512, temperature: 0.1,
topP: 0.5, topK: 10
)
case .custom:
return .default
}
}
}

VaulType ships with one built-in template per processing mode (except Raw, which has no template). These templates are seeded into SwiftData on first launch.

/// Seeds the default built-in prompt templates into the SwiftData store.
func seedBuiltInTemplates(context: ModelContext) {
let builtIns: [PromptTemplate] = [
// Clean Mode
PromptTemplate(
name: "Standard Clean",
templateBody: BuiltInTemplates.clean,
description: "Fix spelling, grammar, punctuation, and capitalization.",
mode: .clean,
isBuiltIn: true,
sortOrder: 0,
temperature: 0.1,
maxTokens: 512
),
// Structure Mode
PromptTemplate(
name: "Auto-Structure",
templateBody: BuiltInTemplates.structure,
description: "Format text as bullet lists, numbered steps, or tables.",
mode: .structure,
isBuiltIn: true,
sortOrder: 0,
temperature: 0.2,
maxTokens: 768
),
// Prompt Mode
PromptTemplate(
name: "Prompt Formatter",
templateBody: BuiltInTemplates.prompt,
description: "Reformat speech into a well-structured AI prompt.",
mode: .prompt,
isBuiltIn: true,
sortOrder: 0,
temperature: 0.4,
maxTokens: 1024
),
// Code Mode
PromptTemplate(
name: "Code Documentation",
templateBody: BuiltInTemplates.code,
description: "Format as code comments, commit messages, or docs.",
mode: .code,
isBuiltIn: true,
sortOrder: 0,
temperature: 0.1,
maxTokens: 512
),
]
for template in builtIns {
context.insert(template)
}
try? context.save()
}
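
Seeding is intended to run once. A simple guard, sketched here under the assumption that it runs during app startup with the main ModelContext, is to check whether any built-in templates already exist before inserting:

import SwiftData

/// Illustrative first-launch guard: seed only if no built-in templates exist yet.
func seedBuiltInTemplatesIfNeeded(context: ModelContext) {
    let descriptor = FetchDescriptor<PromptTemplate>(
        predicate: #Predicate { $0.isBuiltIn == true }
    )
    let existingCount = (try? context.fetchCount(descriptor)) ?? 0
    if existingCount == 0 {
        seedBuiltInTemplates(context: context)
    }
}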

Users can create, edit, and manage custom templates through the Settings UI. Custom templates support all built-in variables plus user-defined ones.

/// Resolves template variables and produces a final prompt string.
public struct TemplateEngine {
/// Resolves all {variable} placeholders in the template body.
/// - Parameters:
/// - template: The prompt template to resolve.
/// - text: The raw transcription text.
/// - context: Additional contextual information.
/// - Returns: The fully resolved prompt string.
public static func resolve(
template: PromptTemplate,
text: String,
language: String = "en",
appName: String = "Unknown",
context: String = "",
customVariables: [String: String] = [:]
) -> String {
var resolved = ""
// Prepend system prompt if present
if let systemPrompt = template.systemPrompt, !systemPrompt.isEmpty {
resolved += systemPrompt + "\n\n"
}
resolved += template.templateBody
// Built-in variables
let builtInVars: [String: String] = [
"text": text,
"language": language,
"app_name": appName,
"context": context,
"timestamp": ISO8601DateFormatter().string(from: Date()),
"word_count": "\(text.split(separator: " ").count)",
"mode": template.mode.rawValue,
]
// Merge built-in with custom (custom takes precedence)
let allVars = builtInVars.merging(customVariables) { _, custom in custom }
// Replace all {variable} placeholders
for (key, value) in allVars {
resolved = resolved.replacingOccurrences(
of: "{\(key)}",
with: value
)
}
return resolved
}
}
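
For example, resolving a minimal Clean-style template against a short transcription (the values below are illustrative):

let template = PromptTemplate(
    name: "Standard Clean",
    templateBody: "Fix the grammar of the following text.\n\nInput text:\n{text}\n\nCorrected text:",
    mode: .clean
)
let prompt = TemplateEngine.resolve(
    template: template,
    text: "i went to teh store",
    language: "en",
    appName: "Notes",
    context: ""
)
// `prompt` now contains the template body with {text} replaced by the transcription.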

Templates can be exported as JSON files for sharing and backup, and imported from JSON.

/// Handles import/export of prompt templates as JSON files.
public struct TemplateExporter {
// MARK: - Codable DTO
public struct TemplateDTO: Codable {
let name: String
let templateBody: String
let description: String
let mode: String
let systemPrompt: String?
let stopSequences: [String]
let temperature: Float
let maxTokens: Int
let version: Int // Schema version for forward compatibility
}
public struct TemplateBundle: Codable {
let bundleVersion: Int
let exportedAt: String
let templates: [TemplateDTO]
}
// MARK: - Export
/// Exports templates to a JSON file.
public static func export(
templates: [PromptTemplate],
to url: URL
) throws {
let dtos = templates.map { template in
TemplateDTO(
name: template.name,
templateBody: template.templateBody,
description: template.templateDescription,
mode: template.mode.rawValue,
systemPrompt: template.systemPrompt,
stopSequences: template.stopSequences,
temperature: template.temperature,
maxTokens: template.maxTokens,
version: 1
)
}
let bundle = TemplateBundle(
bundleVersion: 1,
exportedAt: ISO8601DateFormatter().string(from: Date()),
templates: dtos
)
let data = try JSONEncoder.prettyPrinted.encode(bundle)
try data.write(to: url, options: .atomic)
}
// MARK: - Import
/// Imports templates from a JSON file.
/// - Returns: Array of PromptTemplate objects (not yet inserted into context).
public static func importTemplates(
from url: URL
) throws -> [PromptTemplate] {
let data = try Data(contentsOf: url)
let bundle = try JSONDecoder().decode(TemplateBundle.self, from: data)
return bundle.templates.compactMap { dto in
guard let mode = ProcessingMode(rawValue: dto.mode) else { return nil }
return PromptTemplate(
name: dto.name,
templateBody: dto.templateBody,
description: dto.description,
mode: mode,
isBuiltIn: false,
systemPrompt: dto.systemPrompt,
stopSequences: dto.stopSequences,
temperature: dto.temperature,
maxTokens: dto.maxTokens
)
}
}
}
private extension JSONEncoder {
static let prettyPrinted: JSONEncoder = {
let encoder = JSONEncoder()
encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
return encoder
}()
}
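
A usage sketch for round-tripping templates through a JSON file, assuming customTemplates and modelContext are already in scope (the file location is illustrative):

let exportURL = URL(fileURLWithPath: NSTemporaryDirectory())
    .appendingPathComponent("vaultype-templates.json")

// Export the user's custom templates.
try TemplateExporter.export(templates: customTemplates, to: exportURL)

// Later, import them and insert into the SwiftData store.
let imported = try TemplateExporter.importTemplates(from: exportURL)
for template in imported {
    modelContext.insert(template)
}
try modelContext.save()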

:information_source: Info: Exported template bundles include a bundleVersion field for forward compatibility. Future versions of VaulType can migrate older template formats automatically.


Purpose: Pass-through mode. No LLM processing. The raw transcription from whisper.cpp is injected directly into the target application.

When to use: When you want exact, unmodified speech-to-text output, or when LLM processing is disabled or unavailable.

Template: None (no LLM call is made).

// Raw mode implementation -- trivially returns the input text.
func processRaw(text: String) -> String {
    return text
}
Input: "i went to the store and bought some milk and bread and eggs"
Output: "i went to the store and bought some milk and bread and eggs"

:information_source: Info: Raw mode is the automatic fallback if the LLM backend is not loaded or encounters an error. See Section 10: Error Handling and Fallback.

Purpose: Fix spelling, grammar, punctuation, and capitalization while preserving the original meaning and tone.

Actual prompt template:

You are a text editor. Fix the spelling, grammar, punctuation, and capitalization of the following transcribed speech. Preserve the original meaning, tone, and intent exactly. Do not add, remove, or rephrase content. Do not add explanations or commentary. Output only the corrected text.
Language: {language}
Input text:
{text}
Corrected text:

Generation options: Temperature 0.1, Top-P 0.5, Top-K 10, Max tokens 512.

Example:

Input: "i went to teh store and buoght some milk and bread and egs"
Output: "I went to the store and bought some milk, bread, and eggs."
Input: "so basically what im trying to say is that the feature isnt working right
and we need to look into it before the release"
Output: "So basically, what I'm trying to say is that the feature isn't working right,
and we need to look into it before the release."

Purpose: Reorganize transcribed speech into structured formats: bullet lists, numbered steps, tables, or headings.

Actual prompt template:

You are a text formatter. Reorganize the following transcribed speech into a well-structured format. Use the most appropriate structure based on the content:
- If the text describes steps or a process, use a numbered list.
- If the text lists items or ideas, use bullet points.
- If the text contains comparisons or data, use a markdown table.
- If the text covers multiple topics, use headings with bullet points.
Preserve all information from the original text. Do not add new information or commentary. Output only the structured text.
Language: {language}
Input text:
{text}
Structured output:

Generation options: Temperature 0.2, Top-P 0.7, Top-K 20, Max tokens 768.

Example:

Input: "so for the project we need to first set up the database then create
the api endpoints and after that build the frontend and finally
deploy to production also we need to write tests"
Output: "1. Set up the database
2. Create the API endpoints
3. Build the frontend
4. Write tests
5. Deploy to production"
Input: "the pros of using swift are type safety good performance and nice
syntax but the cons are that it only works on apple platforms and
the compile times can be slow"
Output: "## Swift: Pros and Cons
**Pros:**
- Type safety
- Good performance
- Nice syntax
**Cons:**
- Only works on Apple platforms
- Compile times can be slow"

Purpose: Transform spoken instructions into well-structured AI prompts suitable for use with ChatGPT, Claude, or similar tools.

Actual prompt template:

You are a prompt engineer. Transform the following transcribed speech into a well-structured, clear AI prompt. The output should be a prompt that a user can paste directly into an AI assistant (like ChatGPT or Claude).
Follow these guidelines:
- Start with a clear role or task description.
- Include specific constraints and requirements mentioned in the speech.
- Organize multi-part requests into numbered sections.
- Add formatting instructions if the speech implies a specific output format.
- Preserve all technical details and requirements from the original speech.
Do not execute the prompt. Only output the formatted prompt text. Do not add meta-commentary.
Application context: {app_name}
Additional context: {context}
Transcribed speech:
{text}
Formatted prompt:

Generation options: Temperature 0.4, Top-P 0.9, Top-K 40, Max tokens 1024.

Example:

Input: "i need you to help me write a python function that takes a list of
numbers and returns the top three largest ones and it should handle
edge cases like empty lists and lists with fewer than three items
and make sure its well documented"
Output: "Write a Python function with the following specifications:
1. **Function**: Accept a list of numbers as input and return the
top 3 largest values.
2. **Edge cases to handle**:
- Empty list (return an empty list or raise a descriptive error)
- List with fewer than 3 items (return all items, sorted descending)
3. **Requirements**:
- Include a complete docstring with parameters, return type, and examples
- Add inline comments for complex logic
- Follow PEP 8 style conventions"

Purpose: Format transcribed speech into code-related artifacts: code comments, commit messages, documentation, or inline annotations.

Actual prompt template:

You are a developer assistant. Transform the following transcribed speech into well-formatted code documentation. Based on the content, output the most appropriate format:
- If describing a change or fix: format as a Git commit message (subject line + body).
- If describing functionality: format as a code documentation comment (using the appropriate doc-comment syntax for the language context).
- If describing a bug or issue: format as a structured bug report.
- If describing a task: format as a TODO comment or task description.
Use concise, technical language. Follow conventional formatting standards.
Application: {app_name}
Context: {context}
Language: {language}
Transcribed speech:
{text}
Formatted output:

Generation options: Temperature 0.1, Top-P 0.5, Top-K 10, Max tokens 512.

Example (commit message):

Input: "fixed the bug where the login screen would crash if the user
entered a really long password because we werent validating
the input length before sending it to the api"
Output: "fix: prevent crash on login with oversized password input
Validate password length before sending to the authentication API.
Previously, passwords exceeding 256 characters caused a buffer
overflow in the API serialization layer, crashing the login screen.
- Add input length validation (max 256 chars) to LoginViewController
- Show user-facing error message for passwords exceeding the limit
- Add unit test for boundary password lengths"

Example (code comment):

Input: "this function calculates the total price including tax and it
takes the subtotal and tax rate as parameters and returns the
final amount rounded to two decimal places"
Output: "/// Calculates the total price including tax.
///
/// - Parameters:
/// - subtotal: The pre-tax subtotal amount.
/// - taxRate: The tax rate as a decimal (e.g., 0.08 for 8%).
/// - Returns: The total price rounded to two decimal places."

Purpose: User-defined templates. Users create their own prompt templates using the available variables.

Template: Defined by the user. Example custom template:

Translate the following text from {language} to formal business English.
Maintain professional tone and corporate terminology.
Context: {context}
Original text:
{text}
Translation:

Custom templates can reference any built-in variable and any user-defined variables.

7.7 Processing Mode Selection and Execution


The TextProcessor class orchestrates the full pipeline from raw transcription through mode-specific LLM processing.

import Foundation
import SwiftData
/// Orchestrates the text processing pipeline.
/// Receives raw transcription, applies the selected processing mode,
/// and returns the final processed text.
public actor TextProcessor {
// MARK: - Dependencies
private let llmService: LLMService
private let templateEngine: TemplateEngine.Type
private let modelContext: ModelContext
// MARK: - State
public private(set) var currentMode: ProcessingMode = .clean
// MARK: - Initialization
public init(
llmService: LLMService,
modelContext: ModelContext
) {
self.llmService = llmService
self.templateEngine = TemplateEngine.self
self.modelContext = modelContext
}
// MARK: - Mode Selection
/// Sets the active processing mode.
public func setMode(_ mode: ProcessingMode) {
currentMode = mode
}
// MARK: - Processing
/// Processes raw transcription text through the active mode's pipeline.
/// - Parameters:
/// - rawText: The raw transcription from whisper.cpp.
/// - language: The detected language code.
/// - appName: The frontmost application name.
/// - context: Optional user-provided context.
/// - Returns: The processed text, or raw text if processing fails.
public func process(
rawText: String,
language: String = "en",
appName: String = "Unknown",
context: String = ""
) async -> ProcessingResult {
// Raw mode: no LLM needed
guard currentMode.requiresLLM else {
return ProcessingResult(
text: rawText,
mode: .raw,
processingTime: 0,
tokensUsed: 0,
didFallback: false
)
}
let startTime = CFAbsoluteTimeGetCurrent()
do {
// Fetch the active template for the current mode
let template = try fetchActiveTemplate(for: currentMode)
// Resolve template variables
let prompt = templateEngine.resolve(
template: template,
text: rawText,
language: language,
appName: appName,
context: context
)
// Check token budget
let tokenCount = await llmService.countTokens(in: prompt)
guard tokenCount < 3500 else {
// Prompt is too long; truncate input text and retry
let truncatedText = truncateToFit(
text: rawText,
maxTokens: 2500,
currentTokens: tokenCount
)
let truncatedPrompt = templateEngine.resolve(
template: template,
text: truncatedText,
language: language,
appName: appName,
context: context
)
let result = try await llmService.generate(
prompt: truncatedPrompt,
options: currentMode.defaultOptions
)
let elapsed = CFAbsoluteTimeGetCurrent() - startTime
return ProcessingResult(
text: result,
mode: currentMode,
processingTime: elapsed,
tokensUsed: tokenCount,
didFallback: false
)
}
// Run inference
let result = try await llmService.generate(
prompt: prompt,
options: currentMode.defaultOptions
)
let elapsed = CFAbsoluteTimeGetCurrent() - startTime
return ProcessingResult(
text: result,
mode: currentMode,
processingTime: elapsed,
tokensUsed: tokenCount,
didFallback: false
)
} catch {
// Fallback: return raw text
let elapsed = CFAbsoluteTimeGetCurrent() - startTime
return ProcessingResult(
text: rawText,
mode: currentMode,
processingTime: elapsed,
tokensUsed: 0,
didFallback: true,
fallbackReason: error.localizedDescription
)
}
}
// MARK: - Private Helpers
private func fetchActiveTemplate(
for mode: ProcessingMode
) throws -> PromptTemplate {
let descriptor = FetchDescriptor<PromptTemplate>(
predicate: #Predicate { $0.mode == mode },
sortBy: [SortDescriptor(\.sortOrder)]
)
let templates = try modelContext.fetch(descriptor)
guard let template = templates.first else {
throw ProcessingError.noTemplateFound(mode: mode)
}
return template
}
private func truncateToFit(
text: String,
maxTokens: Int,
currentTokens: Int
) -> String {
let ratio = Double(maxTokens) / Double(currentTokens)
let targetCharCount = Int(Double(text.count) * ratio * 0.9)
if targetCharCount < text.count {
let index = text.index(
text.startIndex,
offsetBy: targetCharCount,
limitedBy: text.endIndex
) ?? text.endIndex
return String(text[..<index]) + "..."
}
return text
}
}
// MARK: - Supporting Types
public struct ProcessingResult: Sendable {
public let text: String
public let mode: ProcessingMode
public let processingTime: TimeInterval
public let tokensUsed: Int
public let didFallback: Bool
public var fallbackReason: String? = nil
}
public enum ProcessingError: LocalizedError {
case noTemplateFound(mode: ProcessingMode)
public var errorDescription: String? {
switch self {
case .noTemplateFound(let mode):
return "No prompt template found for mode: \(mode.rawValue)"
}
}
}
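
Typical usage from the dictation flow, assuming llmService and modelContext are already set up:

let processor = TextProcessor(llmService: llmService, modelContext: modelContext)
await processor.setMode(.clean)

let result = await processor.process(
    rawText: "i went to teh store and buoght some milk",
    language: "en",
    appName: "Notes"
)

if result.didFallback {
    // The LLM was unavailable or failed; the raw transcription came back unchanged.
    print("Fallback reason: \(result.fallbackReason ?? "unknown")")
}
print(result.text)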

8. Token Management and Context Window Handling


Each model has a maximum context window that limits the total number of tokens (prompt + generated output) that can be processed in a single inference call.

| Model | Max Context Window | Recommended Max for VaulType | Prompt Budget | Output Budget |
| --- | --- | --- | --- | --- |
| Qwen2.5-3B-Instruct | 32,768 | 4,096 | 3,500 | 596 |
| Phi-3.5-mini | 128,000 | 4,096 | 3,500 | 596 |
| Llama-3.2-3B-Instruct | 8,192 | 4,096 | 3,500 | 596 |
| Qwen2.5-1.5B-Instruct | 32,768 | 2,048 | 1,700 | 348 |

:warning: Warning: While models like Phi-3.5-mini support up to 128K tokens, VaulType defaults to a 4,096 context window to optimize latency and memory usage. Larger context windows require proportionally more memory for the KV cache. Users can increase this in settings if needed.

Context window allocation:

+------------------------------------------------------------+
| Context Window (4,096 tokens) |
| |
| +------------------------+-----------+------------------+ |
| | System Prompt + Template| Input Text| Generated Output | |
| | (~200 tokens) | (variable) | (up to 512 tok) | |
| +------------------------+-----------+------------------+ |
| |
| <-------- Prompt Budget (3,500) -------->|<-- Output -->| |
| (596 max) |
+------------------------------------------------------------+
/// Utility for managing token budgets within the context window.
public struct TokenBudget {
/// The total context window size in tokens.
public let contextSize: Int
/// Tokens reserved for the system prompt and template chrome.
public let templateOverhead: Int
/// Maximum tokens for generated output.
public let maxOutputTokens: Int
/// Tokens available for the input text.
public var inputBudget: Int {
contextSize - templateOverhead - maxOutputTokens
}
public init(
contextSize: Int = 4096,
templateOverhead: Int = 200,
maxOutputTokens: Int = 512
) {
self.contextSize = contextSize
self.templateOverhead = templateOverhead
self.maxOutputTokens = maxOutputTokens
}
/// Checks whether the given text fits within the input budget.
/// - Parameters:
/// - text: The input text.
/// - tokenCounter: A closure that counts tokens in a string.
/// - Returns: Whether the text fits, and the token count.
public func fits(
text: String,
tokenCounter: (String) async -> Int
) async -> (fits: Bool, tokenCount: Int) {
let count = await tokenCounter(text)
return (count <= inputBudget, count)
}
/// Describes the current budget allocation.
public var description: String {
"""
Context: \(contextSize) tokens
Template overhead: \(templateOverhead) tokens
Input budget: \(inputBudget) tokens
Output budget: \(maxOutputTokens) tokens
"""
}
}
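
A brief usage sketch follows; here transcript is the raw input string, and backend.countTokens(in:) stands in for whatever tokenizer entry point the active backend exposes:

let budget = TokenBudget()  // 4,096 context, 200 overhead, 512 output by default
let (fits, tokenCount) = await budget.fits(text: transcript) { text in
    await backend.countTokens(in: text)  // assumed tokenizer call
}
if !fits {
    // Over budget: hand the text to a TruncationStrategy (see below).
}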

When input text exceeds the token budget, VaulType applies one of three truncation strategies:

/// Strategies for truncating text that exceeds the token budget.
public enum TruncationStrategy: String, Codable, CaseIterable, Sendable {
/// Keep the end of the text (most recent speech). Default for dictation.
case keepEnd = "Keep End"
/// Keep the beginning of the text.
case keepBeginning = "Keep Beginning"
/// Keep both the beginning and end, removing the middle.
case keepEdges = "Keep Edges"
}
extension TruncationStrategy {
/// Truncates the given text to approximately the target token count.
/// - Parameters:
/// - text: The text to truncate.
/// - targetTokens: The target number of tokens.
/// - currentTokens: The current number of tokens in the text.
/// - Returns: The truncated text with an ellipsis marker.
public func truncate(
text: String,
targetTokens: Int,
currentTokens: Int
) -> String {
guard currentTokens > targetTokens else { return text }
let ratio = Double(targetTokens) / Double(currentTokens)
let targetChars = Int(Double(text.count) * ratio * 0.9) // 10% safety margin
switch self {
case .keepEnd:
let startIndex = text.index(
text.endIndex,
offsetBy: -targetChars,
limitedBy: text.startIndex
) ?? text.startIndex
return "..." + text[startIndex...]
case .keepBeginning:
let endIndex = text.index(
text.startIndex,
offsetBy: targetChars,
limitedBy: text.endIndex
) ?? text.endIndex
return text[..<endIndex] + "..."
case .keepEdges:
let halfChars = targetChars / 2
let frontEnd = text.index(
text.startIndex,
offsetBy: halfChars,
limitedBy: text.endIndex
) ?? text.endIndex
let backStart = text.index(
text.endIndex,
offsetBy: -halfChars,
limitedBy: text.startIndex
) ?? text.startIndex
return text[..<frontEnd] + "\n...[truncated]...\n" + text[backStart...]
}
}
}
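
Continuing the TokenBudget sketch above, an over-budget transcript could be trimmed like this:

let strategy: TruncationStrategy = .keepEnd  // default for live dictation
let trimmed = strategy.truncate(
    text: transcript,
    targetTokens: budget.inputBudget,
    currentTokens: tokenCount
)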

:bulb: Tip: The default truncation strategy is keepEnd, which preserves the most recently dictated text. This works best for real-time dictation where the latest content is most relevant. Switch to keepBeginning when processing complete documents.


Minimizing latency is critical for VaulType because the user is waiting for their speech to be transformed and injected into the active application. The goal is sub-3-second total pipeline time from end-of-speech to text injection.

Latency Budget (Target: < 3 seconds total)
+-------+-------+-------+-------+-------+
| ASR | Mode | Templ | LLM | Inject|
| 0.5s | 0.01s | 0.01s | 2.0s | 0.1s |
+-------+-------+-------+-------+-------+
<---- Optimization focus ---->
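
To verify where time actually goes on a given machine, each stage can be timed the same way elapsed time is measured elsewhere in this pipeline. This is a sketch; backend, resolvedPrompt, and options are assumed to come from the surrounding code:

let start = CFAbsoluteTimeGetCurrent()
_ = try? await backend.generate(prompt: resolvedPrompt, options: options)
let elapsed = CFAbsoluteTimeGetCurrent() - start
Logger.llm.info("LLM stage took \(elapsed, format: .fixed(precision: 2))s")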

Load the LLM model at application startup so it is ready for the first inference request.

/// Preloads the LLM model at application startup.
/// Called from the AppDelegate or App struct's initialization.
@MainActor
final class AppBootstrap {
static func preloadLLM(llmService: LLMService, settings: AppSettings) {
Task.detached(priority: .userInitiated) {
do {
try await llmService.switchBackend(
to: settings.preferredBackend,
model: settings.activeModelPath
)
} catch {
// Log but do not crash -- LLM will be loaded on first use
// or fallback to raw mode.
Logger.llm.error("Failed to preload LLM: \(error.localizedDescription)")
}
}
}
}

Warm-up inference:

/// Runs a minimal inference to warm up the Metal pipeline and JIT caches.
func warmUp(backend: any LLMBackend) async {
_ = try? await backend.generate(
prompt: "Hello",
options: GenerationOptions(maxTokens: 1, temperature: 0)
)
}

:information_source: Info: The first inference after model load is typically 2-3x slower than subsequent inferences because Metal shaders must be compiled and cached. The warm-up call eliminates this cold-start penalty.
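
The warm-up can be kicked off as soon as the preload finishes. The currentBackend accessor below is a hypothetical name for however LLMService exposes its loaded backend:

Task.detached(priority: .utility) {
    // Hypothetical accessor; warm up only once a backend is actually loaded.
    if let backend = await llmService.currentBackend {
        await warmUp(backend: backend)
    }
}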

Speculative decoding uses a smaller “draft” model to generate candidate tokens, which are then verified by the main model in a single batch. This can improve generation speed by 1.5-2x.

/// Configuration for speculative decoding.
/// Uses a smaller draft model to speed up token generation.
struct SpeculativeDecodingConfig {
/// Path to the smaller draft model (e.g., Qwen2.5-0.5B).
let draftModelPath: String
/// Number of draft tokens to generate before verification.
let draftTokenCount: Int // Typically 4-8
/// Whether speculative decoding is enabled.
let isEnabled: Bool
static let `default` = SpeculativeDecodingConfig(
draftModelPath: "",
draftTokenCount: 6,
isEnabled: false // Disabled by default; requires a second model
)
}

:warning: Warning: Speculative decoding requires loading a second (smaller) model alongside the main model, which increases memory usage by approximately 500 MB - 1 GB. Only enable this on machines with 16 GB or more of unified memory.
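
A sketch of gating the feature on total memory, following the 16 GB rule above; draftPath is an assumed local path to a small draft-model GGUF:

let has16GB = ProcessInfo.processInfo.physicalMemory >= 16 * 1_073_741_824
let speculative = SpeculativeDecodingConfig(
    draftModelPath: draftPath,  // e.g. a Qwen2.5-0.5B GGUF (assumed)
    draftTokenCount: 6,
    isEnabled: has16GB
)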

The batch size controls how many tokens are processed in a single forward pass during prompt evaluation. Larger batch sizes improve prompt processing speed but use more memory.

| Batch Size | Prompt Eval Speed | Memory Overhead | Recommended For |
| --- | --- | --- | --- |
| 128 | Baseline | Minimal | Machines with < 8 GB RAM |
| 256 | ~1.3x | Low | Default for 8 GB machines |
| 512 | ~1.8x | Moderate | Default for 16 GB+ machines |
| 1024 | ~2.2x | High | 32 GB+ machines, long prompts |
| 2048 | ~2.5x | Very high | 64 GB+ machines only |
/// Determines the optimal batch size based on available system memory.
func optimalBatchSize() -> Int32 {
let totalMemory = ProcessInfo.processInfo.physicalMemory
switch totalMemory {
case ..<(8 * 1_073_741_824): // < 8 GB
return 256
case ..<(16 * 1_073_741_824): // < 16 GB
return 512
case ..<(32 * 1_073_741_824): // < 32 GB
return 1024
default: // >= 32 GB
return 2048
}
}

As described in Section 2.3, offloading model layers to the Metal GPU is the single most impactful optimization. The key configuration points:

import Metal

/// Comprehensive GPU offloading configuration.
struct MetalConfig {
/// Number of layers to offload (-1 = all).
var gpuLayers: Int32 = -1
/// Whether to use Metal for matrix multiplications.
var useMetalMatMul: Bool = true
/// Metal buffer size for tensor allocations (bytes).
var metalBufferSize: UInt64 = 512 * 1024 * 1024 // 512 MB
/// Whether to use float16 for Metal operations (faster, slight quality loss).
var useFloat16: Bool = true
/// Auto-configure based on the system's Metal device capabilities.
static func autoDetect() -> MetalConfig {
var config = MetalConfig()
guard let device = MTLCreateSystemDefaultDevice() else {
// No Metal device available (very old Intel Mac)
config.gpuLayers = 0
config.useMetalMatMul = false
return config
}
let vram = device.recommendedMaxWorkingSetSize
// Apple Silicon: use all GPU layers
if device.supportsFamily(.apple7) { // M1 and later
config.gpuLayers = -1
config.useFloat16 = true
}
// AMD discrete GPU (Intel Macs)
else if vram > 2 * 1_073_741_824 { // > 2 GB VRAM
config.gpuLayers = 20 // Partial offload
config.useFloat16 = false // AMD compatibility
}
// Integrated graphics or insufficient VRAM
else {
config.gpuLayers = 0
config.useMetalMatMul = false
}
return config
}
}

:apple: macOS-specific: On Apple Silicon, the unified memory architecture means that GPU and CPU share the same physical memory. Setting gpuLayers = -1 does not “use more memory” — it routes computations through the GPU cores instead of the CPU cores, which is substantially faster for matrix multiplications.
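
As a sketch of how the detected configuration could reach llama.cpp (assuming the C API is bridged into Swift, as it is for LlamaContext.swift; field names follow the llama.cpp C headers):

let metal = MetalConfig.autoDetect()
var modelParams = llama_model_default_params()
modelParams.n_gpu_layers = metal.gpuLayers  // -1 offloads every layer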


The LLM pipeline defines a comprehensive error hierarchy:

/// All errors that can occur in the LLM processing pipeline.
public enum LLMPipelineError: LocalizedError {
// Backend errors
case backendNotLoaded
case backendUnavailable(backend: String, reason: String)
// Model errors
case modelNotFound(path: String)
case modelCorrupted(path: String)
case modelIncompatible(model: String, reason: String)
// Inference errors
case generationFailed(underlying: Error)
case generationTimeout(seconds: TimeInterval)
case emptyOutput
// Resource errors
case insufficientMemory(available: UInt64, required: UInt64)
case insufficientDiskSpace(available: UInt64, required: UInt64)
// Template errors
case templateNotFound(mode: ProcessingMode)
case templateResolutionFailed(variable: String)
public var errorDescription: String? {
switch self {
case .backendNotLoaded:
return "No LLM backend is loaded"
case .backendUnavailable(let backend, let reason):
return "\(backend) backend unavailable: \(reason)"
case .modelNotFound(let path):
return "Model file not found: \(path)"
case .modelCorrupted(let path):
return "Model file is corrupted: \(path)"
case .modelIncompatible(let model, let reason):
return "Model '\(model)' is incompatible: \(reason)"
case .generationFailed(let underlying):
return "Text generation failed: \(underlying.localizedDescription)"
case .generationTimeout(let seconds):
return "Text generation timed out after \(Int(seconds))s"
case .emptyOutput:
return "Model produced empty output"
case .insufficientMemory(let available, let required):
let avail = available / 1_048_576
let req = required / 1_048_576
return "Insufficient memory: \(avail) MB available, \(req) MB required"
case .insufficientDiskSpace(let available, let required):
let avail = available / 1_048_576
let req = required / 1_048_576
return "Insufficient disk space: \(avail) MB available, \(req) MB required"
case .templateNotFound(let mode):
return "No template found for mode: \(mode.rawValue)"
case .templateResolutionFailed(let variable):
return "Failed to resolve template variable: {\(variable)}"
}
}
}

When the LLM pipeline encounters an error, VaulType follows a defined fallback chain to ensure the user always gets some output:

Fallback Chain:
LLM Generation
|
| failure
v
Retry with lower temperature (0.1)
|
| failure
v
Switch to alternate backend (if available)
(e.g., llama.cpp fails -> try Ollama)
|
| failure
v
Switch to smaller model (if available)
(e.g., 3B model fails -> try 1.5B model)
|
| failure
v
Inject raw text (Raw mode fallback)
|
| always succeeds
v
User sees unprocessed transcription

Implementation:

/// Executes the LLM processing pipeline with full fallback chain.
public actor ResilientTextProcessor {
private let primaryBackend: any LLMBackend
private let fallbackBackend: (any LLMBackend)?
private let fallbackModelPath: String?
private let maxRetries: Int = 2
private let retryDelay: TimeInterval = 0.5
public init(
primaryBackend: any LLMBackend,
fallbackBackend: (any LLMBackend)? = nil,
fallbackModelPath: String? = nil
) {
self.primaryBackend = primaryBackend
self.fallbackBackend = fallbackBackend
self.fallbackModelPath = fallbackModelPath
}
/// Processes text with full fallback chain.
/// Guaranteed to return a result (worst case: raw text).
public func process(
prompt: String,
rawText: String,
options: GenerationOptions
) async -> FallbackResult {
let startTime = CFAbsoluteTimeGetCurrent()
// Step 1: Try primary backend
do {
let result = try await withTimeout(options.timeout) {
try await self.primaryBackend.generate(
prompt: prompt, options: options
)
}
if !result.isEmpty {
return FallbackResult(
text: result,
source: .primary,
elapsed: CFAbsoluteTimeGetCurrent() - startTime
)
}
} catch {
Logger.llm.warning(
"Primary backend failed: \(error.localizedDescription)"
)
}
// Step 2: Retry with conservative settings
do {
var conservativeOptions = options
conservativeOptions.temperature = 0.1
conservativeOptions.maxTokens = min(options.maxTokens, 256)
let result = try await withTimeout(options.timeout) {
try await self.primaryBackend.generate(
prompt: prompt, options: conservativeOptions
)
}
if !result.isEmpty {
return FallbackResult(
text: result,
source: .primaryRetry,
elapsed: CFAbsoluteTimeGetCurrent() - startTime
)
}
} catch {
Logger.llm.warning(
"Primary retry failed: \(error.localizedDescription)"
)
}
// Step 3: Try fallback backend (e.g., Ollama if llama.cpp failed)
if let fallback = fallbackBackend, await fallback.isReady {
do {
let result = try await withTimeout(options.timeout) {
try await fallback.generate(
prompt: prompt, options: options
)
}
if !result.isEmpty {
return FallbackResult(
text: result,
source: .fallbackBackend,
elapsed: CFAbsoluteTimeGetCurrent() - startTime
)
}
} catch {
Logger.llm.warning(
"Fallback backend failed: \(error.localizedDescription)"
)
}
}
// Step 4: Final fallback -- return raw text
Logger.llm.error(
"All LLM backends failed. Falling back to raw text."
)
return FallbackResult(
text: rawText,
source: .rawFallback,
elapsed: CFAbsoluteTimeGetCurrent() - startTime
)
}
// MARK: - Timeout Helper
private func withTimeout<T: Sendable>(
_ timeout: TimeInterval,
operation: @escaping @Sendable () async throws -> T
) async throws -> T {
try await withThrowingTaskGroup(of: T.self) { group in
group.addTask {
try await operation()
}
group.addTask {
try await Task.sleep(for: .seconds(timeout))
throw LLMPipelineError.generationTimeout(seconds: timeout)
}
let result = try await group.next()!
group.cancelAll()
return result
}
}
}
// MARK: - Fallback Result
public struct FallbackResult: Sendable {
public enum Source: String, Sendable {
case primary = "Primary backend"
case primaryRetry = "Primary backend (retry)"
case fallbackBackend = "Fallback backend"
case rawFallback = "Raw text (no LLM)"
}
public let text: String
public let source: Source
public let elapsed: TimeInterval
public var didFallback: Bool {
source != .primary
}
}
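
A usage sketch follows; llamaBackend, resolvedPrompt, and transcript are assumed to come from the surrounding pipeline:

let processor = ResilientTextProcessor(primaryBackend: llamaBackend)
let result = await processor.process(
    prompt: resolvedPrompt,
    rawText: transcript,
    options: GenerationOptions(maxTokens: 512, temperature: 0.3)
)
if result.didFallback {
    Logger.llm.notice("Output produced by: \(result.source.rawValue)")
}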

Each processing mode has a configurable timeout. If the LLM does not produce output within the timeout, the pipeline falls back.

| Processing Mode | Default Timeout | Max Timeout (User Configurable) |
| --- | --- | --- |
| Clean | 15s | 60s |
| Structure | 20s | 60s |
| Prompt | 25s | 90s |
| Code | 15s | 60s |
| Custom | 30s | 120s |
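
For illustration, the defaults above could live in a small ProcessingMode extension. This is a sketch only; the shipping app reads the user-configured value from settings, and the case names assume the modes listed earlier:

extension ProcessingMode {
    /// Default LLM timeout in seconds for this mode (see table above).
    var defaultTimeout: TimeInterval {
        switch self {
        case .raw:       return 0   // Raw mode never calls the LLM
        case .clean:     return 15
        case .structure: return 20
        case .prompt:    return 25
        case .code:      return 15
        case .custom:    return 30
        }
    }
}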

:warning: Warning: If inference consistently times out, this usually indicates the model is too large for the available hardware. Consider switching to a smaller model (e.g., Qwen2.5-1.5B) or enabling more GPU layers.

VaulType monitors system memory pressure and takes protective action to prevent the system from becoming unresponsive.

import Foundation
import os
/// Monitors system memory pressure and adjusts LLM behavior accordingly.
final class MemoryPressureMonitor {
private let source: DispatchSourceMemoryPressure
private let llmService: LLMService
enum PressureLevel: Sendable {
case normal
case warning
case critical
}
private(set) var currentLevel: PressureLevel = .normal
init(llmService: LLMService) {
self.llmService = llmService
self.source = DispatchSource.makeMemoryPressureSource(
eventMask: [.warning, .critical],
queue: .global(qos: .utility)
)
source.setEventHandler { [weak self] in
guard let self else { return }
let event = self.source.data
if event.contains(.critical) {
self.handleCriticalPressure()
} else if event.contains(.warning) {
self.handleWarningPressure()
}
}
source.activate()
}
deinit {
source.cancel()
}
private func handleWarningPressure() {
currentLevel = .warning
Logger.llm.warning("Memory pressure: WARNING. Reducing LLM batch size.")
// Notify the LLM service to reduce batch size and context window
Task {
// Reduce context window to conserve memory
// This is a signal, not a direct operation, because the
// LlamaContext would need to be recreated.
}
}
private func handleCriticalPressure() {
currentLevel = .critical
Logger.llm.error("Memory pressure: CRITICAL. Unloading LLM model.")
Task {
// Unload the model to free memory immediately
await llmService.unloadCurrentModel()
// Future requests will fall back to raw text mode
}
}
}

:x: Error: If you see “Memory pressure: CRITICAL. Unloading LLM model.” in the logs, VaulType has unloaded the LLM to prevent system instability. All text processing will fall back to Raw mode until the user manually reloads a model or memory pressure subsides. Consider using a smaller model or closing other memory-intensive applications.


This section provides the complete end-to-end view of the LLM processing pipeline as implemented in Phases 1-5.

AudioCaptureService (AVAudioEngine tap, 16kHz mono Float32)
→ WhisperService (whisper.cpp v1.7.4 transcription, Metal GPU)
→ VoicePrefixDetector (detects "code mode:", "clean this up:", etc.; strips prefix)
→ VocabularyService (spoken→replacement pairs, per-app then global)
→ CommandDetector (wake phrase prefix match → command vs dictation)
→ CustomCommandExecutor (SwiftData CustomCommand evaluation)
→ AppAliasResolver (AppProfile.shortcutAliases)
→ GlobalAliasResolver (UserSettings.globalShortcutAliases)
→ ProcessingModeRouter
├── Raw → direct output (no LLM)
└── Clean / Structure / Prompt / Code / Custom
→ PromptTemplateEngine (variable substitution)
→ LLMService (llama.cpp b8059, Metal GPU)
→ OverlayWindow (optional edit-before-inject)
→ TextInjectionService (CGEvent or clipboard paste)
→ DictationHistory (SwiftData DictationEntry)

DictationController is the pipeline orchestrator. It owns and wires all pipeline components:

  • Receives hotkey events from HotkeyManager
  • Starts/stops AudioCaptureService
  • Passes audio to WhisperService after VAD silence trim
  • Feeds transcript through the full pipeline above
  • Posts status updates to AppState for the menu bar and overlay UI

The diagram below gives a simplified view of the LLM-specific portion of the pipeline:

+====================================================================+
||                 VaulType LLM Processing Pipeline                 ||
+====================================================================+

User speaks into microphone
          |
          v
 +------------------+
 |  AVAudioEngine   |   (see SPEECH_RECOGNITION.md)
 |  Audio Capture   |
 +--------+---------+
          |
          v
 +------------------+
 |   whisper.cpp    |   (see SPEECH_RECOGNITION.md)
 |  Transcription   |
 +--------+---------+
          |
          |  Raw text: "i need to fix the login bug"
          v
 +------------------+
 |  TextProcessor   |
 |   .process()     |
 +--------+---------+
          |
          |  Check processing mode
          v
 +------------------+       +------------------+
 |  Mode == .raw?   |------>|  Return raw text |---> Inject
 +--------+---------+  yes  +------------------+
          | no
          v
 +------------------+
 |  Fetch template  |
 |  from SwiftData  |
 +--------+---------+
          |
          v
 +------------------+
 |  TemplateEngine  |
 |   .resolve()     |
 |   - {text}       |
 |   - {language}   |
 |   - {app_name}   |
 |   - {context}    |
 +--------+---------+
          |
          |  Fully resolved prompt string
          v
 +------------------+
 |   TokenBudget    |
 |    .fits()?      |
 +--------+---------+
          |
    +-----+-------+
    | fits        | too long
    v             v
    |   +--------------------+
    |   | TruncationStrategy |
    |   |    .truncate()     |
    |   +---------+----------+
    |             |
    +-----+-------+
          |
          v
 +------------------+
 |    LLMBackend    |
 |   .generate()    |
 +--------+---------+
          |
    +-----+------+
    | ok         | error
    v            v
    |   +------------------+
    |   |  FallbackChain   |
    |   |  1. Retry        |
    |   |  2. Alt backend  |
    |   |  3. Raw fallback |
    |   +--------+---------+
    |            |
    +-----+------+
          |
          v
 +------------------+
 |   Post-process   |
 | - Trim whitespace|
 | - Validate output|
 +--------+---------+
          |
          v
 +------------------+
 | CGEvent Injection|   (see ../architecture/ARCHITECTURE.md)
 |  into active app |
 +------------------+

Complete pipeline timing breakdown (typical, M2 Pro, Qwen2.5-3B):

| Stage | Duration | Cumulative |
| --- | --- | --- |
| Audio capture (last buffer) | 0ms | 0ms |
| whisper.cpp transcription | ~500ms | ~500ms |
| Mode selection | <1ms | ~500ms |
| Template resolution | <1ms | ~500ms |
| Token counting | ~5ms | ~505ms |
| LLM prompt evaluation | ~350ms | ~855ms |
| LLM token generation (100 tokens) | ~2,100ms | ~2,955ms |
| Post-processing | <1ms | ~2,956ms |
| CGEvent injection | ~50ms | ~3,006ms |
| Total | | ~3.0s |

:white_check_mark: Success: On Apple Silicon with a 3B model, the complete pipeline from end-of-speech to text injection is consistently under 3.5 seconds for typical dictation lengths (10-50 words).


All LLM-related configuration is stored in UserDefaults (for simple preferences) and SwiftData (for templates and model metadata).

| Setting | Key | Default | Range |
| --- | --- | --- | --- |
| Active backend | llm.backend | llamaCpp | llamaCpp, ollama |
| Model path (llama.cpp) | llm.modelPath | (bundled model) | Valid file path |
| Model name (Ollama) | llm.ollamaModel | qwen2.5:3b-instruct-q4_K_M | Valid Ollama model |
| Context window | llm.contextSize | 4096 | 512 - 131072 |
| GPU layers | llm.gpuLayers | -1 (all) | -1 to model layer count |
| Batch size | llm.batchSize | 512 | 64 - 2048 |
| Processing mode | llm.mode | clean | See ProcessingMode |
| Temperature | llm.temperature | 0.3 | 0.0 - 2.0 |
| Top-P | llm.topP | 0.9 | 0.0 - 1.0 |
| Top-K | llm.topK | 40 | 1 - 100 |
| Max output tokens | llm.maxTokens | 512 | 1 - 4096 |
| Timeout (seconds) | llm.timeout | 30 | 5 - 120 |
| Truncation strategy | llm.truncation | keepEnd | See TruncationStrategy |
| Speculative decoding | llm.speculative | false | true, false |
| Memory lock (mlock) | llm.useMlock | false | true, false |
| Ollama host | llm.ollamaHost | 127.0.0.1 | Valid hostname/IP |
| Ollama port | llm.ollamaPort | 11434 | 1024 - 65535 |
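
The simple preferences can be read straight from UserDefaults using the keys above. This is a sketch; the app itself presumably goes through its AppSettings layer rather than reading the keys directly:

let defaults = UserDefaults.standard
let contextSize = defaults.object(forKey: "llm.contextSize") as? Int ?? 4096
let temperature = defaults.object(forKey: "llm.temperature") as? Double ?? 0.3
let mode = defaults.string(forKey: "llm.mode") ?? "clean"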

| Document | Description |
| --- | --- |
| Architecture Overview | System architecture, module boundaries, data flow |
| Speech Recognition | whisper.cpp integration, audio capture, transcription pipeline |
| Model Management | Model download, storage, updates, GGUF file handling |
| API Documentation | Internal API reference for all modules |
| Tech Stack | Full technology stack and dependency details |
| Security | Security model, privacy guarantees, threat analysis |
| Permissions | macOS permissions (microphone, accessibility) |
| Database Schema | SwiftData models and persistence layer |

This document is part of the VaulType project documentation. VaulType is licensed under GPL-3.0. For more information, see the LICENSE file in the repository root.