Speech Recognition
Last Updated: 2026-02-13
VaulType’s speech recognition engine is built on whisper.cpp, a high-performance C/C++ port of OpenAI’s Whisper model, compiled as a static library with Metal GPU acceleration for macOS. Every byte of audio is processed locally — no network calls, no cloud APIs, no telemetry.
Table of Contents
- 1. Whisper.cpp Integration Architecture
- 2. Model Management
- 3. Audio Preprocessing Pipeline
- 4. Streaming vs Batch Transcription
- 5. Language Detection and Selection
- 6. Performance Tuning Parameters
- 7. Custom Vocabulary Integration
- 8. Accuracy Optimization Techniques
- 9. Handling Edge Cases
- Related Documentation
1. Whisper.cpp Integration Architecture
VaulType integrates whisper.cpp as a statically linked C library, bridged into Swift through a custom C-to-Swift interop layer. This approach avoids dynamic linking pitfalls, keeps the binary self-contained, and enables fine-grained control over GPU acceleration via Metal.
1.1 Static Library Compilation
whisper.cpp is compiled as a static library (libwhisper.a) as part of VaulType’s build process. The library is built with Metal support enabled and optimized for the target architecture.
🍎 macOS-specific: The build process uses xcrun and targets the macOS SDK directly, ensuring compatibility with Apple’s code signing and notarization requirements.
The build configuration in the Xcode project references a custom CMake-based build step:
# Build whisper.cpp as a static library with Metal support
cd vendor/whisper.cpp
mkdir -p build && cd build
cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DWHISPER_METAL=ON \
  -DWHISPER_COREML=OFF \
  -DWHISPER_NO_AVX=OFF \
  -DWHISPER_NO_AVX2=OFF \
  -DWHISPER_NO_F16C=OFF \
  -DBUILD_SHARED_LIBS=OFF \
  -DCMAKE_OSX_ARCHITECTURES="arm64;x86_64" \
  -DCMAKE_OSX_DEPLOYMENT_TARGET="14.0" \
  -DCMAKE_INSTALL_PREFIX="../install"
cmake --build . --config Release --target whisper -- -j$(sysctl -n hw.ncpu)
cmake --install . --config Release
This produces:
- install/lib/libwhisper.a: the static library (universal binary)
- install/include/whisper.h: the public C header
- install/share/whisper/ggml-metal.metal: the Metal shader source
⚠️ Warning: When building a universal binary (arm64;x86_64), ensure both architecture slices link correctly. Use lipo -info libwhisper.a to verify.
The Xcode project includes the static library via:
- Library Search Paths: $(PROJECT_DIR)/vendor/whisper.cpp/install/lib
- Header Search Paths: $(PROJECT_DIR)/vendor/whisper.cpp/install/include
- Other Linker Flags: -lwhisper -lc++ -framework Metal -framework MetalKit -framework Accelerate
1.2 Bridging Headers and Swift-C Interop
Swift communicates with whisper.cpp through a bridging header that exposes the C API, combined with a Swift wrapper layer that provides type-safe, idiomatic Swift interfaces.
Bridging Header (VaulType-Bridging-Header.h):
//
// VaulType
//
// Bridges whisper.cpp C API into Swift
//

#ifndef VaulType_Bridging_Header_h
#define VaulType_Bridging_Header_h

#include "whisper.h"

// Additional helper declarations for Swift interop
// whisper.h uses opaque pointer types that Swift can consume directly

#endif /* VaulType_Bridging_Header_h */
Swift Wrapper (WhisperBridge.swift):
import Foundation
/// Thread-safe bridge to whisper.cpp C API.
/// All whisper context operations are serialized on a dedicated dispatch queue.
final class WhisperBridge: @unchecked Sendable {
// MARK: - Types
struct TranscriptionResult: Sendable { let text: String let segments: [Segment] let language: String let languageProbability: Float let processingTimeMs: Int64 }
struct Segment: Sendable { let text: String let startMs: Int64 let endMs: Int64 let probability: Float let isPartial: Bool }
enum WhisperError: Error, LocalizedError { case modelNotLoaded case contextInitFailed(path: String) case transcriptionFailed(code: Int32) case invalidAudioFormat case modelFileNotFound(path: String) case metalInitFailed
var errorDescription: String? { switch self { case .modelNotLoaded: return "No whisper model is currently loaded." case .contextInitFailed(let path): return "Failed to initialize whisper context from: \(path)" case .transcriptionFailed(let code): return "Transcription failed with error code: \(code)" case .invalidAudioFormat: return "Audio data must be 16kHz mono Float32 PCM." case .modelFileNotFound(let path): return "Model file not found at: \(path)" case .metalInitFailed: return "Metal GPU acceleration initialization failed." } } }
// MARK: - Properties
private var context: OpaquePointer? private let queue = DispatchQueue(label: "com.vaultype.whisper", qos: .userInitiated) private(set) var isModelLoaded: Bool = false private(set) var currentModelPath: String?
// MARK: - Lifecycle
deinit { if let ctx = context { whisper_free(ctx) } }
// MARK: - Model Loading
/// Load a whisper model from disk. /// - Parameter path: Absolute path to the .bin model file. /// - Throws: `WhisperError` if the file is missing or context init fails. func loadModel(at path: String) async throws { guard FileManager.default.fileExists(atPath: path) else { throw WhisperError.modelFileNotFound(path: path) }
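        try await withCheckedThrowingContinuation { (continuation: CheckedContinuation<Void, Error>) in
            queue.async { [weak self] in
                // Always resume the continuation, even if the bridge was deallocated;
                // otherwise the awaiting task would hang forever.
                guard let self else {
                    continuation.resume(throwing: WhisperError.modelNotLoaded)
                    return
                }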
// Free existing context if any if let existingCtx = self.context { whisper_free(existingCtx) self.context = nil self.isModelLoaded = false }
// Initialize context parameters with Metal enabled var params = whisper_context_default_params() params.use_gpu = true params.gpu_device = 0
guard let ctx = whisper_init_from_file_with_params(path, params) else { continuation.resume(throwing: WhisperError.contextInitFailed(path: path)) return }
self.context = ctx self.isModelLoaded = true self.currentModelPath = path continuation.resume() } } }
    /// Unload the current model and free all associated resources.
    func unloadModel() async {
        await withCheckedContinuation { (continuation: CheckedContinuation<Void, Never>) in
            queue.async { [weak self] in
                // Resume even if the bridge is gone so the caller never hangs.
                guard let self else {
                    continuation.resume()
                    return
                }
                if let ctx = self.context {
                    whisper_free(ctx)
                    self.context = nil
                }
                self.isModelLoaded = false
                self.currentModelPath = nil
                continuation.resume()
            }
        }
    }
}
ℹ️ Info: The @unchecked Sendable conformance is intentional: all mutable state access is serialized through the dedicated queue. This is safe under Swift’s concurrency model only as long as every access goes through that queue, including reads of isModelLoaded and currentModelPath from outside the class.
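To make the queue-confinement idea concrete, here is a minimal sketch that is independent of whisper.cpp. `SerializedCounter` and `runCounterDemo` are hypothetical names, not part of VaulType; the sketch only illustrates why routing every read and write through one serial queue can justify an `@unchecked Sendable` conformance.

```swift
import Foundation

/// Hypothetical example type: every access to `value` goes through one serial queue.
final class SerializedCounter: @unchecked Sendable {
    private var value = 0
    private let queue = DispatchQueue(label: "example.serialized-counter")

    /// Mutations are serialized on the queue.
    func increment() {
        queue.sync { value += 1 }
    }

    /// Reads use the same queue, so they never race with writes.
    func read() -> Int {
        queue.sync { value }
    }
}

/// Increments the counter from many threads at once and returns the final value.
func runCounterDemo(iterations: Int = 1_000) -> Int {
    let counter = SerializedCounter()
    DispatchQueue.concurrentPerform(iterations: iterations) { _ in
        counter.increment()
    }
    return counter.read()
}
```

Because no increment can interleave with another, `runCounterDemo()` returns exactly the number of iterations. If `read()` bypassed the queue, the same guarantee would no longer hold.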
1.3 Metal GPU Acceleration
whisper.cpp uses Metal shaders to accelerate matrix multiplication and other compute-heavy operations on Apple GPUs. VaulType bundles the Metal shader source and compiles it at runtime.
import Foundation
import Metal
import os  // Logger; `Logger.speech` is assumed to be a project-defined os.Logger instance

extension WhisperBridge {

    /// Verifies that Metal GPU acceleration is available and functional.
    /// - Returns: `true` if Metal is available and the GPU device was found.
    func verifyMetalAvailability() -> Bool {
        guard let device = MTLCreateSystemDefaultDevice() else {
            Logger.speech.warning("No Metal-capable GPU device found")
            return false
        }

        // os.Logger expects a single interpolated string literal, not concatenated Strings.
        let workingSetMB = device.recommendedMaxWorkingSetSize / 1024 / 1024
        Logger.speech.info("Metal GPU available: \(device.name), recommended max working set: \(workingSetMB) MB")
        return true
    }

    /// Returns the path to the bundled Metal shader source file.
    /// whisper.cpp compiles this at runtime when `use_gpu = true`.
    static var metalShaderPath: String? {
        Bundle.main.path(forResource: "ggml-metal", ofType: "metal")
    }
}
🍎 macOS-specific: Metal acceleration is available on all Apple Silicon Macs and Intel Macs with discrete or integrated GPUs that support Metal. On Apple Silicon, the unified memory architecture allows the GPU to access model weights without copying, significantly reducing latency.
The Metal shader file (ggml-metal.metal) must be included in the app bundle’s Resources. Add it to the Xcode project’s “Copy Bundle Resources” build phase.
💡 Tip: To verify Metal is being used at runtime, set the environment variable GGML_METAL_LOG_LEVEL=2 during development. This prints detailed Metal kernel dispatch information to the console.
1.4 Architecture Overview Diagram
┌──────────────────────────────────────────────────────────────────┐
│                       VaulType Application                       │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────────┐    ┌──────────────────┐    ┌────────────┐   │
│  │ SwiftUI Views   │    │  WhisperBridge   │    │   Model    │   │
│  │                 │───▶│  (Swift Wrapper) │───▶│  Manager   │   │
│  │ - RecordButton  │    │                  │    │            │   │
│  │ - Transcript    │    │  - loadModel()   │    │  - download│   │
│  │ - Settings      │    │  - transcribe()  │    │  - verify  │   │
│  └────────┬────────┘    │  - streaming()   │    │  - select  │   │
│           │             └────────┬─────────┘    └────────────┘   │
│           │                      │                               │
│           ▼                      ▼                               │
│  ┌─────────────────┐    ┌──────────────────┐                     │
│  │ AudioCapture    │    │  Bridging Header │                     │
│  │ Pipeline        │    │  (C ↔ Swift)     │                     │
│  │                 │    └────────┬─────────┘                     │
│  │ - AVAudioEngine │             │                               │
│  │ - VAD           │             ▼                               │
│  │ - NoiseGate     │    ┌──────────────────┐                     │
│  │ - Resampler     │    │  libwhisper.a    │                     │
│  └────────┬────────┘    │ (Static Library) │                     │
│           │             │                  │                     │
│           │             │  ┌────────────┐  │                     │
│           └────────────▶│  │ C API      │  │                     │
│            PCM Float32  │  ├────────────┤  │                     │
│            16kHz mono   │  │ ggml       │  │                     │
│                         │  ├────────────┤  │                     │
│                         │  │ Metal GPU  │──┼──▶ Apple GPU        │
│                         │  └────────────┘  │                     │
│                         └──────────────────┘                     │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
2. Model Management
VaulType supports multiple Whisper model sizes, allowing users to choose the optimal balance between speed, accuracy, and resource consumption.
2.1 Supported Models
| Model | Params | Disk Size | VRAM (approx.) | Speed (Apple Silicon) | Speed (Intel) | Relative Accuracy | Best For |
|---|---|---|---|---|---|---|---|
| tiny | 39M | 75 MB | ~200 MB | ~10x realtime | ~4x realtime | Baseline | Quick drafts, low-resource machines |
| base | 74M | 148 MB | ~350 MB | ~7x realtime | ~3x realtime | +10% vs tiny | Daily use on constrained hardware |
| small | 244M | 488 MB | ~850 MB | ~4x realtime | ~1.5x realtime | +25% vs tiny | Recommended default |
| medium | 769M | 1.5 GB | ~2.5 GB | ~2x realtime | ~0.5x realtime | +35% vs tiny | High-accuracy needs |
| large-v3 | 1550M | 3.1 GB | ~4.8 GB | ~1x realtime | ~0.2x realtime | +40% vs tiny | Maximum accuracy, Apple Silicon only |
⚠️ Warning: The large-v3 model requires substantial memory and is impractical on Intel Macs with less than 16 GB RAM. On Apple Silicon, the unified memory architecture makes it feasible on machines with 16 GB or more.
💡 Tip: For most users, the small model offers the best balance of speed and accuracy. Start there and adjust based on your hardware and accuracy needs.
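The realtime factors in the table translate directly into expected wait times. The sketch below is only an illustration: `ModelSpeed`, `modelSpeeds`, and `estimatedTranscriptionSeconds` are hypothetical names, and the factors are the approximate values from the table, so real timings will vary with hardware and audio content.

```swift
import Foundation

/// Approximate realtime factors from the table above
/// (seconds of audio processed per wall-clock second).
struct ModelSpeed: Sendable {
    let name: String
    let appleSiliconFactor: Double
    let intelFactor: Double
}

let modelSpeeds: [ModelSpeed] = [
    ModelSpeed(name: "tiny", appleSiliconFactor: 10, intelFactor: 4),
    ModelSpeed(name: "base", appleSiliconFactor: 7, intelFactor: 3),
    ModelSpeed(name: "small", appleSiliconFactor: 4, intelFactor: 1.5),
    ModelSpeed(name: "medium", appleSiliconFactor: 2, intelFactor: 0.5),
    ModelSpeed(name: "large-v3", appleSiliconFactor: 1, intelFactor: 0.2),
]

/// Estimated wall-clock seconds needed to transcribe `audioSeconds` of audio.
func estimatedTranscriptionSeconds(audioSeconds: Double, realtimeFactor: Double) -> Double {
    precondition(realtimeFactor > 0, "realtime factor must be positive")
    return audioSeconds / realtimeFactor
}
```

For example, a 60-second dictation with the small model at ~4x realtime is estimated at about 15 seconds on Apple Silicon, while large-v3 at ~0.2x realtime on Intel would need roughly 300 seconds for the same clip.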
2.2 Download Flow
Models are downloaded from Hugging Face’s CDN on first use. The download is performed in the background with progress reporting, integrity verification via SHA-256 checksums, and automatic retry on failure.
import Foundation
import CryptoKit

/// Manages Whisper model downloads, verification, and storage.
actor ModelManager {
// MARK: - Types
struct ModelInfo: Sendable, Codable, Identifiable { let id: String // e.g., "ggml-small" let displayName: String // e.g., "Small (488 MB)" let fileName: String // e.g., "ggml-small.bin" let sizeBytes: Int64 let sha256: String let downloadURL: URL }
enum DownloadState: Sendable { case idle case downloading(progress: Double) case verifying case ready case failed(Error) }
enum ModelError: Error, LocalizedError { case checksumMismatch(expected: String, actual: String) case downloadFailed(statusCode: Int) case insufficientDiskSpace(required: Int64, available: Int64) case modelNotFound(id: String)
var errorDescription: String? { switch self { case .checksumMismatch(let expected, let actual): return "Checksum mismatch. Expected: \(expected.prefix(12))..., got: \(actual.prefix(12))..." case .downloadFailed(let code): return "Download failed with HTTP status \(code)." case .insufficientDiskSpace(let required, let available): let req = ByteCountFormatter.string(fromByteCount: required, countStyle: .file) let avail = ByteCountFormatter.string(fromByteCount: available, countStyle: .file) return "Insufficient disk space. Required: \(req), available: \(avail)." case .modelNotFound(let id): return "Model '\(id)' not found in the registry." } } }
// MARK: - Properties
static let modelsDirectory: URL = { let appSupport = FileManager.default.urls(for: .applicationSupportDirectory, in: .userDomainMask).first! return appSupport .appendingPathComponent("VaulType", isDirectory: true) .appendingPathComponent("Models", isDirectory: true) .appendingPathComponent("whisper", isDirectory: true) }()
private var downloadTasks: [String: URLSessionDownloadTask] = [:] private var stateCallbacks: [String: (DownloadState) -> Void] = [:]
// MARK: - Registry
static let availableModels: [ModelInfo] = [ ModelInfo( id: "ggml-tiny", displayName: "Tiny (75 MB)", fileName: "ggml-tiny.bin", sizeBytes: 75_000_000, sha256: "bd577a113a864445d4c7f519f9b0822db", // abbreviated downloadURL: URL(string: "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.bin")! ), ModelInfo( id: "ggml-base", displayName: "Base (148 MB)", fileName: "ggml-base.bin", sizeBytes: 148_000_000, sha256: "465707469ff3a37a2b9b8d8f89f97f2c3", downloadURL: URL(string: "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin")! ), ModelInfo( id: "ggml-small", displayName: "Small (488 MB)", fileName: "ggml-small.bin", sizeBytes: 488_000_000, sha256: "55356645c2b361a969dfd0ef2c5a50d72", downloadURL: URL(string: "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.bin")! ), ModelInfo( id: "ggml-medium", displayName: "Medium (1.5 GB)", fileName: "ggml-medium.bin", sizeBytes: 1_500_000_000, sha256: "fd9727b63525adb262b8ec317dd2ad8b5", downloadURL: URL(string: "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-medium.bin")! ), ModelInfo( id: "ggml-large-v3", displayName: "Large v3 (3.1 GB)", fileName: "ggml-large-v3.bin", sizeBytes: 3_100_000_000, sha256: "ad82bf6a9043cba5e2577e0c9c1c8a9b2", downloadURL: URL(string: "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin")! ), ]
// MARK: - Download
/// Download a model by ID with progress reporting. func downloadModel( id: String, onStateChange: @escaping @Sendable (DownloadState) -> Void ) async throws { guard let model = Self.availableModels.first(where: { $0.id == id }) else { throw ModelError.modelNotFound(id: id) }
// Check disk space let availableSpace = try availableDiskSpace() let requiredSpace = model.sizeBytes + (model.sizeBytes / 10) // 10% buffer guard availableSpace > requiredSpace else { throw ModelError.insufficientDiskSpace(required: requiredSpace, available: availableSpace) }
// Ensure directory exists try FileManager.default.createDirectory( at: Self.modelsDirectory, withIntermediateDirectories: true )
onStateChange(.downloading(progress: 0))
let destinationURL = Self.modelsDirectory.appendingPathComponent(model.fileName) let tempURL = try await downloadFile(from: model.downloadURL) { progress in onStateChange(.downloading(progress: progress)) }
// Verify checksum onStateChange(.verifying) let actualHash = try sha256Hash(of: tempURL) guard actualHash == model.sha256 else { try? FileManager.default.removeItem(at: tempURL) throw ModelError.checksumMismatch(expected: model.sha256, actual: actualHash) }
        // Move to final location, replacing any existing file.
        // Note: remove-then-move is not atomic; a crash in between leaves no model file on disk.
        if FileManager.default.fileExists(atPath: destinationURL.path) {
            try FileManager.default.removeItem(at: destinationURL)
        }
        try FileManager.default.moveItem(at: tempURL, to: destinationURL)
onStateChange(.ready) }
/// Check if a model is already downloaded and ready. func isModelAvailable(id: String) -> Bool { guard let model = Self.availableModels.first(where: { $0.id == id }) else { return false } let path = Self.modelsDirectory.appendingPathComponent(model.fileName) return FileManager.default.fileExists(atPath: path.path) }
/// Get the file path for a downloaded model. func modelPath(for id: String) -> URL? { guard let model = Self.availableModels.first(where: { $0.id == id }) else { return nil } let path = Self.modelsDirectory.appendingPathComponent(model.fileName) return FileManager.default.fileExists(atPath: path.path) ? path : nil }
// MARK: - Private Helpers
    private func downloadFile(
        from url: URL,
        onProgress: @escaping @Sendable (Double) -> Void
    ) async throws -> URL {
        // Note: this simplified version never calls `onProgress`. Incremental progress
        // would need a URLSessionDownloadDelegate (or byte streaming) instead.
        let (tempURL, response) = try await URLSession.shared.download(from: url, delegate: nil)
        guard let httpResponse = response as? HTTPURLResponse,
              (200...299).contains(httpResponse.statusCode) else {
            let code = (response as? HTTPURLResponse)?.statusCode ?? -1
            throw ModelError.downloadFailed(statusCode: code)
        }
        return tempURL
    }
private func sha256Hash(of fileURL: URL) throws -> String { let data = try Data(contentsOf: fileURL, options: .mappedIfSafe) let digest = SHA256.hash(data: data) return digest.map { String(format: "%02x", $0) }.joined() }
    private func availableDiskSpace() throws -> Int64 {
        let values = try URL(fileURLWithPath: NSHomeDirectory())
            .resourceValues(forKeys: [.volumeAvailableCapacityForImportantUsageKey])
        return values.volumeAvailableCapacityForImportantUsage ?? 0
    }
}
🔒 Security: Model files are verified with SHA-256 checksums before being moved to the final storage location. If the hash does not match, the downloaded temporary file is deleted. This protects against corrupted or tampered downloads. The sha256 values in the registry listing are abbreviated; the equality check only passes against full 64-character hex digests.
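The download flow mentions automatic retry, which the listing above does not show. One common way to add it is exponential backoff, sketched below. `backoffDelays` and `withRetry` are hypothetical helpers, and the attempt count, initial delay, and multiplier are assumptions rather than VaulType's actual settings.

```swift
import Foundation

/// Delays (in seconds) to wait before each retry:
/// initialDelay, initialDelay * multiplier, initialDelay * multiplier^2, ...
func backoffDelays(maxAttempts: Int, initialDelay: Double, multiplier: Double = 2.0) -> [Double] {
    guard maxAttempts > 1 else { return [] }
    return (0..<(maxAttempts - 1)).map { initialDelay * pow(multiplier, Double($0)) }
}

/// Runs `operation`, retrying after any thrown error until `maxAttempts` is reached.
/// The error from the final attempt is rethrown to the caller.
func withRetry<T>(
    maxAttempts: Int = 3,
    initialDelay: Double = 1.0,
    operation: () async throws -> T
) async throws -> T {
    precondition(maxAttempts >= 1, "need at least one attempt")
    let delays = backoffDelays(maxAttempts: maxAttempts, initialDelay: initialDelay)
    var attempt = 0
    while true {
        do {
            return try await operation()
        } catch {
            // No retries left: surface the last error.
            guard attempt < delays.count else { throw error }
            // Task.sleep throws on cancellation, which also ends the retry loop.
            try await Task.sleep(nanoseconds: UInt64(delays[attempt] * 1_000_000_000))
            attempt += 1
        }
    }
}
```

Inside `ModelManager`, the call to `downloadFile(from:onProgress:)` could be wrapped in `withRetry { ... }`. In practice it is usually worth retrying only transient failures such as timeouts or 5xx responses, not a checksum mismatch.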
2.3 Storage Layout
~/Library/Application Support/VaulType/
└── Models/
    └── whisper/
        ├── ggml-tiny.bin        # 75 MB
        ├── ggml-base.bin        # 148 MB
        ├── ggml-small.bin       # 488 MB (default)
        ├── ggml-medium.bin      # 1.5 GB
        └── ggml-large-v3.bin    # 3.1 GB
🍎 macOS-specific: The ~/Library/Application Support/ directory is the standard location for application data on macOS. Its contents are not synced to iCloud by default. VaulType does not sync models to iCloud, so they must be downloaded on each machine.
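Given this layout, the downloaded models and the disk space they occupy can be listed with Foundation alone. The helper below is a sketch; `downloadedModelFiles(in:)` is a hypothetical name and is not part of the `ModelManager` listing.

```swift
import Foundation

/// Returns the name and size of every `.bin` file directly inside `directory`, sorted by name.
func downloadedModelFiles(in directory: URL) throws -> [(name: String, bytes: Int64)] {
    let urls = try FileManager.default.contentsOfDirectory(
        at: directory,
        includingPropertiesForKeys: nil
    )
    return try urls
        .filter { $0.pathExtension == "bin" }
        .map { url -> (name: String, bytes: Int64) in
            // Read the file size from the item's attributes.
            let attributes = try FileManager.default.attributesOfItem(atPath: url.path)
            let size = (attributes[.size] as? NSNumber)?.int64Value ?? 0
            return (name: url.lastPathComponent, bytes: size)
        }
        .sorted { $0.name < $1.name }
}
```

Called with `ModelManager.modelsDirectory`, this would report one entry per downloaded model. Summing the `bytes` fields gives the total space used.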
2.4 Model Selection UI
The model selection interface displays each model’s status (downloaded, downloading, not available), size, and a brief description of its accuracy and speed characteristics. Users can download, delete, or select a model as the active model.
import SwiftUI
struct ModelSelectionView: View { @State private var downloadStates: [String: ModelManager.DownloadState] = [:] @AppStorage("selectedModelId") private var selectedModelId: String = "ggml-small"
private let modelManager = ModelManager()
var body: some View { Form { Section("Whisper Models") { ForEach(ModelManager.availableModels) { model in ModelRowView( model: model, isSelected: selectedModelId == model.id, downloadState: downloadStates[model.id] ?? .idle, onSelect: { selectedModelId = model.id }, onDownload: { downloadModel(model) }, onDelete: { deleteModel(model) } ) } }
Section { Text("Selected model: **\(selectedModelId)**") .font(.caption) .foregroundStyle(.secondary) } footer: { Text("Larger models provide better accuracy but require more memory and are slower. The Small model is recommended for most users.") } } .formStyle(.grouped) }
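    private func downloadModel(_ model: ModelManager.ModelInfo) {
        Task {
            do {
                try await modelManager.downloadModel(id: model.id) { state in
                    Task { @MainActor in
                        downloadStates[model.id] = state
                    }
                }
            } catch {
                // Surface the failure in the UI instead of silently dropping it.
                await MainActor.run {
                    downloadStates[model.id] = .failed(error)
                }
            }
        }
    }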
    private func deleteModel(_ model: ModelManager.ModelInfo) {
        let path = ModelManager.modelsDirectory.appendingPathComponent(model.fileName)
        try? FileManager.default.removeItem(at: path)
        downloadStates[model.id] = .idle
        if selectedModelId == model.id {
            selectedModelId = "ggml-small"
        }
    }
}
2.5 Model Loading and Switching
Switching between models at runtime involves unloading the current context and loading a new one. This operation is asynchronous and blocks transcription until complete.
/// Coordinates model loading and switching for the speech recognition engine.
@MainActor
final class SpeechRecognitionEngine: ObservableObject {
@Published var currentModelId: String? @Published var isLoading: Bool = false @Published var loadError: String?
private let whisperBridge = WhisperBridge() private let modelManager = ModelManager()
/// Switch to a different Whisper model. /// Unloads the current model, loads the new one, and updates published state. func switchModel(to modelId: String) async { isLoading = true loadError = nil
// Unload current model await whisperBridge.unloadModel() currentModelId = nil
// Resolve path guard let modelPath = await modelManager.modelPath(for: modelId) else { loadError = "Model '\(modelId)' is not downloaded." isLoading = false return }
do { try await whisperBridge.loadModel(at: modelPath.path) currentModelId = modelId Logger.speech.info("Switched to model: \(modelId)") } catch { loadError = error.localizedDescription Logger.speech.error("Failed to load model \(modelId): \(error)") }
        isLoading = false
    }
}
ℹ️ Info: Model switching typically takes 200ms–2s depending on model size and whether the previous model’s memory has been fully reclaimed. The large-v3 model may take longer on Intel Macs due to memory pressure.
3. Audio Preprocessing Pipeline
The audio preprocessing pipeline captures microphone input, converts it to the format Whisper expects (16kHz mono Float32 PCM), applies voice activity detection and noise gating, and manages audio buffers for both streaming and batch transcription.
3.1 AVAudioEngine Setup
VaulType uses AVAudioEngine for audio capture, which provides low-latency access to the system’s audio input hardware through a tap on the input node.
import AVFoundation
import Combine

/// Captures audio from the system microphone using AVAudioEngine.
/// Produces 16kHz mono Float32 PCM samples for Whisper consumption.
final class AudioCapturePipeline: ObservableObject {
// MARK: - Published State
@Published var isCapturing: Bool = false @Published var audioLevel: Float = 0.0 // 0.0 – 1.0, for UI meters @Published var isVoiceDetected: Bool = false
// MARK: - Properties
private let engine = AVAudioEngine() private let targetSampleRate: Double = 16_000 private let targetChannelCount: AVAudioChannelCount = 1 private var converter: AVAudioConverter?
private var audioBuffer: [Float] = [] private let bufferLock = NSLock()
private let vad = VoiceActivityDetector() private let noiseGate = NoiseGate()
// Callback for streaming: called with new PCM samples as they arrive var onAudioSamples: (([Float]) -> Void)?
// MARK: - Target Format
private var targetFormat: AVAudioFormat { AVAudioFormat( commonFormat: .pcmFormatFloat32, sampleRate: targetSampleRate, channels: targetChannelCount, interleaved: false )! }
// MARK: - Start / Stop
/// Begin capturing audio from the default input device. /// - Throws: If the audio engine cannot start or the input node is unavailable. func startCapture() throws { guard !isCapturing else { return }
let inputNode = engine.inputNode let inputFormat = inputNode.outputFormat(forBus: 0)
guard inputFormat.sampleRate > 0 else { throw AudioCaptureError.noInputDevice }
// Create the sample-rate converter guard let conv = AVAudioConverter(from: inputFormat, to: targetFormat) else { throw AudioCaptureError.converterInitFailed } self.converter = conv
// Reset buffers bufferLock.lock() audioBuffer.removeAll(keepingCapacity: true) bufferLock.unlock()
// Install tap on input node let bufferSize: AVAudioFrameCount = 4096 inputNode.installTap(onBus: 0, bufferSize: bufferSize, format: inputFormat) { [weak self] buffer, time in self?.processInputBuffer(buffer, time: time) }
try engine.start() isCapturing = true }
/// Stop capturing audio and release resources. func stopCapture() { guard isCapturing else { return }
engine.inputNode.removeTap(onBus: 0) engine.stop() isCapturing = false }
/// Returns the accumulated audio buffer and clears it. /// Used for batch transcription after recording stops. func drainBuffer() -> [Float] { bufferLock.lock() let samples = audioBuffer audioBuffer.removeAll(keepingCapacity: true) bufferLock.unlock() return samples }
// MARK: - Private Processing
private func processInputBuffer(_ buffer: AVAudioPCMBuffer, time: AVAudioTime) { guard let converter = self.converter else { return }
// Convert to 16kHz mono guard let convertedBuffer = convertBuffer(buffer, using: converter) else { return } guard let channelData = convertedBuffer.floatChannelData else { return }
let frameCount = Int(convertedBuffer.frameLength) let samples = Array(UnsafeBufferPointer(start: channelData[0], count: frameCount))
// Update audio level for UI (RMS) let rms = computeRMS(samples) DispatchQueue.main.async { [weak self] in self?.audioLevel = min(rms * 3.0, 1.0) // Scale for visual range }
// Apply noise gate let gatedSamples = noiseGate.process(samples)
// Voice activity detection let hasVoice = vad.detect(in: gatedSamples) DispatchQueue.main.async { [weak self] in self?.isVoiceDetected = hasVoice }
// Accumulate for batch mode bufferLock.lock() audioBuffer.append(contentsOf: gatedSamples) bufferLock.unlock()
// Notify streaming listeners if hasVoice { onAudioSamples?(gatedSamples) } }
private func convertBuffer( _ input: AVAudioPCMBuffer, using converter: AVAudioConverter ) -> AVAudioPCMBuffer? { let ratio = targetSampleRate / input.format.sampleRate let outputFrameCapacity = AVAudioFrameCount(Double(input.frameLength) * ratio) + 1
guard let outputBuffer = AVAudioPCMBuffer( pcmFormat: targetFormat, frameCapacity: outputFrameCapacity ) else { return nil }
var error: NSError? var hasData = true
converter.convert(to: outputBuffer, error: &error) { _, outStatus in if hasData { outStatus.pointee = .haveData hasData = false return input } outStatus.pointee = .noDataNow return nil }
if let error { Logger.audio.error("Audio conversion error: \(error)") return nil }
return outputBuffer }
private func computeRMS(_ samples: [Float]) -> Float { guard !samples.isEmpty else { return 0 } let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 } return sqrt(sumOfSquares / Float(samples.count)) }}
enum AudioCaptureError: Error, LocalizedError { case noInputDevice case converterInitFailed case permissionDenied
    var errorDescription: String? {
        switch self {
        case .noInputDevice: return "No audio input device found."
        case .converterInitFailed: return "Failed to create audio format converter."
        case .permissionDenied: return "Microphone access was denied."
        }
    }
}
🍎 macOS-specific: Microphone access requires explicit user permission (enforced since macOS 10.14, so it applies to every system VaulType supports). VaulType requests this via AVCaptureDevice.requestAccess(for: .audio) at first launch. The NSMicrophoneUsageDescription key must be present in Info.plist.
3.2 Sample Rate Conversion
Whisper models expect audio at 16kHz mono. Most macOS input devices operate at 44.1kHz or 48kHz stereo. The AVAudioConverter handles the downsampling and channel mixing automatically.
  Input Device         AVAudioConverter              Whisper
┌──────────────┐     ┌──────────────────┐     ┌──────────────┐
│ 48kHz Stereo │────▶│ Resample + Mix   │────▶│ 16kHz Mono   │
│ Float32      │     │ to 16kHz Mono    │     │ Float32 PCM  │
└──────────────┘     └──────────────────┘     └──────────────┘
ℹ️ Info: The AVAudioConverter uses Apple’s high-quality sample rate conversion algorithm internally. No additional anti-aliasing filters are needed.
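The output buffer size for a conversion follows from the ratio of the two sample rates. The sketch below mirrors the capacity formula used in `convertBuffer` from section 3.1 (scale the frame count by the rate ratio, then add one frame of slack). `outputFrameCapacity` is a hypothetical standalone helper written for illustration.

```swift
/// Frames to allocate for the converted buffer: input frames scaled by the
/// sample-rate ratio, truncated, plus one frame of slack for rounding.
func outputFrameCapacity(
    inputFrames: UInt32,
    inputSampleRate: Double,
    targetSampleRate: Double = 16_000
) -> UInt32 {
    let ratio = targetSampleRate / inputSampleRate
    return UInt32(Double(inputFrames) * ratio) + 1
}
```

With the 4096-frame tap used earlier, a 48kHz device needs room for 1366 output frames and a 44.1kHz device for 1487, so each tap callback yields roughly 85 to 93 ms of 16kHz audio.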
3.3 Voice Activity Detection (VAD)
VAD determines whether a given audio segment contains speech. VaulType uses an energy-based VAD with zero-crossing rate analysis to distinguish speech from silence and background noise.
import Accelerate
/// Energy-based voice activity detector with zero-crossing rate analysis.
/// Distinguishes speech from silence/noise using adaptive thresholds.
final class VoiceActivityDetector {
// MARK: - Configuration
struct Configuration { var energyThreshold: Float = 0.005 // Minimum RMS energy for speech var zeroCrossingThreshold: Float = 0.15 // Max zero-crossing rate for speech var hangoverFrames: Int = 8 // Frames to hold "voice" after drop var adaptiveAlpha: Float = 0.02 // Noise floor adaptation rate }
// MARK: - State
private var config: Configuration private var noiseFloor: Float = 0.001 private var hangoverCounter: Int = 0 private var isCurrentlyActive: Bool = false
init(config: Configuration = Configuration()) { self.config = config }
// MARK: - Detection
/// Analyze a chunk of audio samples and return whether voice activity is detected. /// - Parameter samples: Audio samples (16kHz mono Float32 PCM). /// - Returns: `true` if voice activity is detected in this chunk. func detect(in samples: [Float]) -> Bool { guard !samples.isEmpty else { return false }
// Compute RMS energy let energy = computeRMS(samples)
// Compute zero-crossing rate let zcr = computeZeroCrossingRate(samples)
// Adapt noise floor (slowly track ambient noise level) if !isCurrentlyActive { noiseFloor = noiseFloor * (1 - config.adaptiveAlpha) + energy * config.adaptiveAlpha }
        // Dynamic threshold: the larger of the configured minimum and 3x the adaptive noise floor
        let dynamicThreshold = max(config.energyThreshold, noiseFloor * 3.0)
// Speech detection: energy above threshold AND zero-crossing below speech range let speechDetected = energy > dynamicThreshold && zcr < config.zeroCrossingThreshold
// Hangover logic: keep detecting voice for a few frames after energy drops if speechDetected { hangoverCounter = config.hangoverFrames isCurrentlyActive = true } else if hangoverCounter > 0 { hangoverCounter -= 1 isCurrentlyActive = true } else { isCurrentlyActive = false }
return isCurrentlyActive }
/// Reset the detector state (call when starting a new recording session). func reset() { noiseFloor = 0.001 hangoverCounter = 0 isCurrentlyActive = false }
// MARK: - Private
private func computeRMS(_ samples: [Float]) -> Float { var rms: Float = 0 vDSP_rmsqv(samples, 1, &rms, vDSP_Length(samples.count)) return rms }
    private func computeZeroCrossingRate(_ samples: [Float]) -> Float {
        guard samples.count > 1 else { return 0 }
        var crossings = 0
        for i in 1..<samples.count {
            if (samples[i] >= 0) != (samples[i - 1] >= 0) {
                crossings += 1
            }
        }
        return Float(crossings) / Float(samples.count - 1)
    }
}
💡 Tip: The hangoverFrames parameter prevents the VAD from cutting off the end of words during brief pauses in speech. Increase it if you notice clipped word endings; decrease it for tighter silence trimming.
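To build intuition for the zero-crossing threshold, it helps to measure the rate on synthetic tones. The sketch below reuses the same zero-crossing definition as the detector. `zeroCrossingRate` and `sineTone` are hypothetical free functions, and pure sine tones are only a rough stand-in for real speech and noise.

```swift
import Foundation

/// Fraction of adjacent sample pairs whose sign differs (same definition as the detector above).
func zeroCrossingRate(_ samples: [Float]) -> Float {
    guard samples.count > 1 else { return 0 }
    var crossings = 0
    for i in 1..<samples.count where (samples[i] >= 0) != (samples[i - 1] >= 0) {
        crossings += 1
    }
    return Float(crossings) / Float(samples.count - 1)
}

/// Generates `count` samples of a unit-amplitude sine tone.
func sineTone(frequency: Double, sampleRate: Double = 16_000, count: Int = 16_000) -> [Float] {
    (0..<count).map { Float(sin(2.0 * Double.pi * frequency * Double($0) / sampleRate)) }
}

// A 150 Hz tone (in the range of a voiced fundamental) crosses zero rarely;
// a 3 kHz tone (hiss-like) crosses zero often.
let lowToneRate = zeroCrossingRate(sineTone(frequency: 150))
let highToneRate = zeroCrossingRate(sineTone(frequency: 3_000))
```

A sine at frequency f crosses zero about 2f times per second, so at 16kHz the two tones give rates of roughly 0.019 and 0.375. The default `zeroCrossingThreshold` of 0.15 therefore corresponds to a pure tone near 1.2kHz, which is why low-frequency voiced speech passes while high-frequency hiss is rejected.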
3.4 Noise Gate
The noise gate suppresses audio below a configurable threshold, reducing low-level background noise that can confuse the recognition model.
import Foundation  // exp()

/// Simple noise gate that suppresses audio below a threshold.
final class NoiseGate {

    var threshold: Float = 0.003    // Gate opens when a sample's absolute value exceeds this level
    var attackTime: Float = 0.005   // Time constant (seconds) for opening
    var releaseTime: Float = 0.05   // Time constant (seconds) for closing
    var sampleRate: Float = 16_000
private var gateLevel: Float = 0.0
/// Process a chunk of samples, applying the noise gate. /// - Parameter samples: Raw audio samples. /// - Returns: Gated audio samples. func process(_ samples: [Float]) -> [Float] { var output = [Float](repeating: 0, count: samples.count)
let attackCoeff = 1.0 - exp(-1.0 / (attackTime * sampleRate)) let releaseCoeff = 1.0 - exp(-1.0 / (releaseTime * sampleRate))
for i in 0..<samples.count { let absSample = abs(samples[i])
if absSample > threshold { gateLevel += attackCoeff * (1.0 - gateLevel) } else { gateLevel += releaseCoeff * (0.0 - gateLevel) }
output[i] = samples[i] * gateLevel }
return output }
    func reset() {
        gateLevel = 0.0
    }
}
3.5 Audio Level Monitoring
Audio levels are published to the UI layer for real-time feedback (microphone level meter). The RMS value is computed per buffer and smoothed for display.
/// Smoothed audio level suitable for driving a UI meter.
final class AudioLevelMonitor: ObservableObject {
    @Published var level: Float = 0.0
    @Published var peakLevel: Float = 0.0
private var smoothingFactor: Float = 0.3 private var peakDecayRate: Float = 0.95
func update(rms: Float) { // Exponential smoothing let smoothedLevel = level * (1 - smoothingFactor) + rms * smoothingFactor
// Peak tracking with decay let newPeak = max(peakLevel * peakDecayRate, rms)
DispatchQueue.main.async { [weak self] in self?.level = smoothedLevel self?.peakLevel = newPeak } }
func reset() { level = 0.0 peakLevel = 0.0 }}3.6 Buffer Management
Section titled “3.6 Buffer Management”Audio buffers accumulate samples during recording. For batch mode, the entire buffer is sent to Whisper after recording stops. For streaming mode, overlapping windows of audio are sent at regular intervals.
| Parameter | Batch Mode | Streaming Mode |
|---|---|---|
| Buffer strategy | Accumulate all | Sliding window |
| Window size | Full recording | 5 seconds |
| Overlap | N/A | 1 second |
| Memory growth | Linear with duration | Fixed (~320 KB) |
| Max duration | 5 minutes (configurable) | Unlimited |
⚠️ Warning: For batch mode, audio buffers grow linearly. A 5-minute recording at 16kHz mono produces approximately 19.2 MB of Float32 data (300 s × 16,000 samples/s × 4 bytes). VaulType enforces a configurable maximum recording duration (default: 5 minutes) to prevent excessive memory use.
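The memory figures for both modes come from the same arithmetic: sample count times 4 bytes per Float32 sample. The sketch below checks them; `bufferBytes` is a hypothetical helper name used only for illustration.

```swift
/// Bytes needed to hold `seconds` of mono Float32 audio at `sampleRate` Hz.
/// Illustrative helper; the name is not taken from VaulType's source.
func bufferBytes(seconds: Double, sampleRate: Double = 16_000) -> Int {
    let sampleCount = Int(seconds * sampleRate)
    return sampleCount * MemoryLayout<Float>.size  // 4 bytes per Float32 sample
}

let batchBytes = bufferBytes(seconds: 5 * 60)  // 5-minute batch recording
let windowBytes = bufferBytes(seconds: 5)      // 5-second streaming window
print(batchBytes, windowBytes)
```

The 5-second window comes to 320,000 bytes, which matches the fixed ~320 KB listed for streaming mode in the table.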
3.7 Pipeline Diagram
┌─────────────┐
│ Microphone  │
│ (Hardware)  │
└──────┬──────┘
       │ 48kHz Stereo (typical)
       ▼
┌─────────────────┐
│ AVAudioEngine   │
│ Input Node      │
│ (installTap)    │
└──────┬──────────┘
       │ Raw PCM buffers
       ▼
┌─────────────────┐
│ AVAudioConverter│
│ 48kHz → 16kHz   │
│ Stereo → Mono   │
└──────┬──────────┘
       │ 16kHz Mono Float32
       ▼
┌─────────────────┐
│ Noise Gate      │
│ Suppress low-   │
│ level noise     │
└──────┬──────────┘
       │
       ▼
┌─────────────────┐       ┌─────────────────┐
│ VAD (Voice      │──────▶│ Audio Level     │
│ Activity        │       │ Monitor (UI)    │
│ Detection)      │       └─────────────────┘
└──────┬──────────┘
       │ Voice-gated samples
       ▼
┌─────────────────┐
│ Buffer Manager  │
│                 │
│ ┌────────────┐  │    ┌───────────────────────┐
│ │ Batch:     │──┼───▶│ Full buffer → Whisper │
│ │ Accumulate │  │    │ (after recording)     │
│ └────────────┘  │    └───────────────────────┘
│                 │
│ ┌────────────┐  │    ┌───────────────────────┐
│ │ Streaming: │──┼───▶│ Sliding window →      │
│ │ Window     │  │    │ Whisper (periodic)    │
│ └────────────┘  │    └───────────────────────┘
│                 │
└─────────────────┘

4. Streaming vs Batch Transcription

VaulType supports two transcription modes: batch (process after recording) and streaming (real-time partial results). Each mode has distinct performance characteristics and user experience implications.
4.1 Batch Transcription (Process After Recording)
In batch mode, audio is recorded in its entirety, then sent to Whisper for processing. This produces the highest-quality transcription because the model has access to the full audio context.

extension WhisperBridge {

    /// Transcribe a complete audio buffer in batch mode.
    /// - Parameters:
    ///   - samples: 16kHz mono Float32 PCM audio samples.
    ///   - language: ISO 639-1 language code, or `nil` for auto-detection.
    ///   - options: Transcription configuration options.
    /// - Returns: The transcription result with segments and timing info.
    func transcribeBatch(
        samples: [Float],
        language: String? = nil,
        options: TranscriptionOptions = .default
    ) async throws -> TranscriptionResult {
        guard isModelLoaded, context != nil else {
            throw WhisperError.modelNotLoaded
        }

        guard !samples.isEmpty else {
            return TranscriptionResult(
                text: "",
                segments: [],
                language: language ?? "en",
                languageProbability: 0,
                processingTimeMs: 0
            )
        }

        return try await withCheckedThrowingContinuation { continuation in
            queue.async { [weak self] in
                guard let self, let ctx = self.context else {
                    continuation.resume(throwing: WhisperError.modelNotLoaded)
                    return
                }

                let startTime = DispatchTime.now()

                // Configure parameters
                var params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY)

                params.n_threads = Int32(options.threadCount)
                params.translate = false
                params.no_timestamps = false
                params.single_segment = false
                params.print_special = false
                params.print_progress = false
                params.print_realtime = false
                params.print_timestamps = true

                // Beam search configuration
                if options.beamSize > 1 {
                    params.strategy = WHISPER_SAMPLING_BEAM_SEARCH
                    params.beam_search.beam_size = Int32(options.beamSize)
                }

                params.temperature = options.temperature
                params.temperature_inc = options.temperatureIncrement

                // Language setting. A pointer obtained inside withCString is only valid
                // during that call, so the C strings are duplicated here and freed when
                // this closure exits, after whisper_full() has finished with them.
                // "auto" asks whisper.cpp to detect the language and then transcribe;
                // detect_language = true would make whisper_full() return right after detection.
                let languageCString = strdup(language ?? "auto")
                defer { free(languageCString) }
                params.language = UnsafePointer(languageCString)

                // Prompt conditioning
                let promptCString = options.initialPrompt.flatMap { strdup($0) }
                defer { free(promptCString) }
                params.initial_prompt = UnsafePointer(promptCString)

                // Run inference
                let result = samples.withUnsafeBufferPointer { bufferPtr in
                    whisper_full(ctx, params, bufferPtr.baseAddress, Int32(samples.count))
                }

                guard result == 0 else {
                    continuation.resume(throwing: WhisperError.transcriptionFailed(code: result))
                    return
                }

                // Extract results
                let segmentCount = whisper_full_n_segments(ctx)
                var segments: [Segment] = []
                var fullText = ""

                for i in 0..<segmentCount {
                    let text = String(cString: whisper_full_get_segment_text(ctx, i))
                    // Segment timestamps are reported in units of 10 ms
                    let startMs = whisper_full_get_segment_t0(ctx, i) * 10
                    let endMs = whisper_full_get_segment_t1(ctx, i) * 10

                    // Average token probability for this segment
                    let tokenCount = whisper_full_n_tokens(ctx, i)
                    var probSum: Float = 0
                    for t in 0..<tokenCount {
                        probSum += whisper_full_get_token_p(ctx, i, t)
                    }
                    let avgProb = tokenCount > 0 ? probSum / Float(tokenCount) : 0

                    segments.append(Segment(
                        text: text,
                        startMs: Int64(startMs),
                        endMs: Int64(endMs),
                        probability: avgProb,
                        isPartial: false
                    ))
                    fullText += text
                }

                // Detect language
                let detectedLang: String
                let langProb: Float
                if let language {
                    detectedLang = language
                    langProb = 1.0
                } else {
                    let langId = whisper_full_lang_id(ctx)
                    detectedLang = String(cString: whisper_lang_str(langId))
                    langProb = 0.0  // Placeholder; per-language probabilities require a separate API
                }

                let elapsed = DispatchTime.now().uptimeNanoseconds - startTime.uptimeNanoseconds
                let elapsedMs = Int64(elapsed / 1_000_000)

                continuation.resume(returning: TranscriptionResult(
                    text: fullText.trimmingCharacters(in: .whitespacesAndNewlines),
                    segments: segments,
                    language: detectedLang,
                    languageProbability: langProb,
                    processingTimeMs: elapsedMs
                ))
            }
        }
    }
}
/// Configurable transcription options.
struct TranscriptionOptions: Sendable {
    var threadCount: Int = ProcessInfo.processInfo.activeProcessorCount
    var beamSize: Int = 1                  // 1 = greedy, >1 = beam search
    var temperature: Float = 0.0           // 0.0 = deterministic
    var temperatureIncrement: Float = 0.2
    var initialPrompt: String? = nil

    static let `default` = TranscriptionOptions()

    static let highAccuracy = TranscriptionOptions(
        beamSize: 5,
        temperature: 0.0,
        temperatureIncrement: 0.1
    )
}

✅ Success: Batch mode produces the most accurate transcription because Whisper processes the full audio context. It is the default mode for VaulType’s push-to-talk workflow.
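The batch code multiplies each segment timestamp by 10 because whisper.cpp reports segment times in units of 10 ms. As a small sketch of how those values might be turned into display strings (both function names below are hypothetical helpers, not VaulType API):

```swift
/// Convert a whisper.cpp segment timestamp (units of 10 ms) to milliseconds.
func milliseconds(fromWhisperTicks ticks: Int64) -> Int64 {
    ticks * 10
}

/// Left-pad a non-negative number with zeros to the given width.
func zeroPadded(_ value: Int64, width: Int) -> String {
    let digits = String(value)
    return String(repeating: "0", count: max(0, width - digits.count)) + digits
}

/// Format milliseconds as "mm:ss.SSS" for display next to a segment.
func formatTimestamp(ms: Int64) -> String {
    let minutes = ms / 60_000
    let seconds = (ms % 60_000) / 1_000
    let millis = ms % 1_000
    return "\(zeroPadded(minutes, width: 2)):\(zeroPadded(seconds, width: 2)).\(zeroPadded(millis, width: 3))"
}

// 6,150 ticks of 10 ms each is 61.5 seconds.
print(formatTimestamp(ms: milliseconds(fromWhisperTicks: 6_150)))  // prints "01:01.500"
```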
4.2 Streaming Transcription (Real-Time Partial Results)
Streaming mode provides partial transcription results as the user speaks. This creates a more responsive user experience but with somewhat lower accuracy, since the model processes audio in overlapping windows without full future context.

/// Streaming transcription controller that feeds audio windows to Whisper periodically.
actor StreamingTranscriber {

    // MARK: - Types

    struct StreamingOptions: Sendable {
        var windowDurationSeconds: Double = 5.0
        var overlapDurationSeconds: Double = 1.0
        var updateIntervalSeconds: Double = 0.5
        var language: String? = nil
        var transcriptionOptions: TranscriptionOptions = .default
    }

    enum StreamingEvent: Sendable {
        case partialResult(text: String, isFinal: Bool)
        case languageDetected(language: String, probability: Float)
        case error(Error)
    }

    // MARK: - Properties

    private let whisperBridge: WhisperBridge
    private var audioWindow: [Float] = []
    private var confirmedText: String = ""
    private var isRunning: Bool = false
    private var options: StreamingOptions
    private var eventHandler: ((StreamingEvent) -> Void)?

    private let windowSizeInSamples: Int
    private let overlapSizeInSamples: Int

    // MARK: - Init

    init(whisperBridge: WhisperBridge, options: StreamingOptions = StreamingOptions()) {
        self.whisperBridge = whisperBridge
        self.options = options
        self.windowSizeInSamples = Int(options.windowDurationSeconds * 16_000)
        self.overlapSizeInSamples = Int(options.overlapDurationSeconds * 16_000)
    }

    // MARK: - Control

    /// Start streaming transcription.
    /// - Parameter handler: Called on each transcription update.
    func start(handler: @escaping @Sendable (StreamingEvent) -> Void) {
        self.eventHandler = handler
        self.isRunning = true
        self.audioWindow = []
        self.confirmedText = ""
    }

    /// Feed new audio samples into the streaming window.
    /// Triggers transcription when enough audio has accumulated.
    func feedAudio(_ samples: [Float]) async {
        guard isRunning else { return }

        audioWindow.append(contentsOf: samples)

        // Trim window to max size (keep most recent audio + overlap)
        if audioWindow.count > windowSizeInSamples {
            let excess = audioWindow.count - windowSizeInSamples
            audioWindow.removeFirst(excess)
        }

        // Only transcribe when we have enough audio
        let minSamplesForTranscription = Int(0.5 * 16_000)  // 500ms minimum
        guard audioWindow.count >= minSamplesForTranscription else { return }

        await transcribeCurrentWindow()
    }

    /// Stop streaming and produce the final transcription.
    func stop() async -> String {
        isRunning = false

        // Final transcription of remaining audio
        if !audioWindow.isEmpty {
            await transcribeCurrentWindow(isFinal: true)
        }

        let finalText = confirmedText
        audioWindow = []
        confirmedText = ""
        return finalText
    }

    // MARK: - Private

    private func transcribeCurrentWindow(isFinal: Bool = false) async {
        do {
            let result = try await whisperBridge.transcribeBatch(
                samples: audioWindow,
                language: options.language,
                options: options.transcriptionOptions
            )

            if isFinal {
                confirmedText += result.text
                eventHandler?(.partialResult(text: confirmedText, isFinal: true))
            } else {
                // Partial: show confirmed + current window result
                let displayText = confirmedText + result.text
                eventHandler?(.partialResult(text: displayText, isFinal: false))
            }

            if options.language == nil && !result.language.isEmpty {
                eventHandler?(.languageDetected(
                    language: result.language,
                    probability: result.languageProbability
                ))
            }
        } catch {
            eventHandler?(.error(error))
        }
    }
}

4.3 Tradeoffs Comparison

| Aspect | Batch Mode | Streaming Mode |
|---|---|---|
| Latency | Full recording duration + processing | Near real-time (~500ms delay) |
| Accuracy | Highest (full context) | Slightly lower (partial context) |
| Memory | Linear with recording length | Fixed window size |
| CPU/GPU usage | Spike after recording | Continuous moderate load |
| User experience | Wait for result | Live preview as you speak |
| Use case | Push-to-talk, short utterances | Long dictation, live captioning |
| Implementation | Simpler | More complex (overlap, merging) |
💡 Tip: VaulType defaults to batch mode for its push-to-talk workflow (record → release → transcribe). Streaming mode can be enabled in Settings for users who prefer real-time feedback during longer dictations.
4.4 Implementation Details
Sliding Window Strategy for Streaming:

Time ─────────────────────────────────────────────────▶

Audio: [==========|==========|==========|==========]

Window 1: [==========XXXXX]
Window 2:           [XXXXX==========XXXXX]
Window 3:                          [XXXXX==========XXXXX]
Window 4:                                         [XXXXX==========]

          ─────── overlap ───────

Each window contains the most recent N seconds of audio. The overlap region (marked XXXXX) ensures no speech is lost between windows. Whisper processes each window independently, and the streaming controller merges results by aligning overlapping text segments.
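To make the window arithmetic concrete, here is a minimal sketch that computes the sample ranges of overlapping windows. It assumes a fixed hop of window size minus overlap; the function name `slidingWindows` is hypothetical, and the real controller may schedule windows differently.

```swift
/// Sample ranges for fixed-size windows that overlap by a fixed amount.
/// A sketch only: window scheduling in VaulType may differ.
func slidingWindows(totalSamples: Int, windowSize: Int, overlap: Int) -> [Range<Int>] {
    precondition(windowSize > 0 && overlap >= 0 && overlap < windowSize)
    let hop = windowSize - overlap  // how far each window advances
    var windows: [Range<Int>] = []
    var start = 0
    while start < totalSamples {
        let end = min(start + windowSize, totalSamples)
        windows.append(start..<end)
        if end == totalSamples { break }  // last window reached the end of the audio
        start += hop
    }
    return windows
}

let sampleRate = 16_000
let windows = slidingWindows(
    totalSamples: 13 * sampleRate,  // 13 seconds of audio
    windowSize: 5 * sampleRate,     // 5-second window
    overlap: 1 * sampleRate         // 1-second overlap
)
for window in windows {
    print("\(window.lowerBound / sampleRate)s ..< \(window.upperBound / sampleRate)s")
}
```

With a 5-second window and 1-second overlap, each window starts 4 seconds after the previous one, so 13 seconds of audio yields three windows: 0 to 5 s, 4 to 9 s, and 8 to 13 s.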
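Result Merging: When two consecutive windows produce overlapping text, the streaming controller compares them word by word. It looks for the longest run of words at the end of the existing text that also begins the incoming text, and appends only the words that follow that run: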
extension StreamingTranscriber {

    /// Merge overlapping text from consecutive transcription windows.
    /// Finds the longest run of words at the end of `existing` that also begins `incoming`.
    static func mergeOverlappingText(existing: String, incoming: String) -> String {
        let existingWords = existing.split(separator: " ").map(String.init)
        let incomingWords = incoming.split(separator: " ").map(String.init)

        guard !existingWords.isEmpty, !incomingWords.isEmpty else {
            return existing + incoming
        }

        // Find longest overlap (up to min of both arrays)
        let maxOverlap = min(existingWords.count, incomingWords.count)
        var bestOverlap = 0

        for overlapLength in stride(from: maxOverlap, through: 1, by: -1) {
            let existingSuffix = existingWords.suffix(overlapLength)
            let incomingPrefix = incomingWords.prefix(overlapLength)

            if Array(existingSuffix) == Array(incomingPrefix) {
                bestOverlap = overlapLength
                break
            }
        }

        if bestOverlap > 0 {
            let newWords = incomingWords.dropFirst(bestOverlap)
            if newWords.isEmpty { return existing }
            return existing + " " + newWords.joined(separator: " ")
        }

        return existing + " " + incoming
    }
}

5. Language Detection and Selection

Whisper’s multilingual models natively support 99 languages with automatic language identification. English-only models are also available for faster performance when multilingual support is not needed. VaulType exposes this capability through both automatic detection and manual language selection.
5.1 Automatic Language Detection
When no language is specified, Whisper analyzes the first 30 seconds of audio to identify the language. This works well for monolingual speech but may struggle with very short utterances or heavily accented speech.

extension WhisperBridge {

    /// Detect the language of an audio sample without performing full transcription.
    /// - Parameter samples: At least 1 second of 16kHz mono Float32 PCM audio.
    /// - Returns: The top detected language. Its probability is a placeholder (1.0);
    ///   a ranked list would require reading the per-language probabilities from whisper.cpp.
    func detectLanguage(samples: [Float]) async throws -> [(language: String, probability: Float)] {
        guard isModelLoaded, let ctx = context else {
            throw WhisperError.modelNotLoaded
        }

        return try await withCheckedThrowingContinuation { continuation in
            queue.async {
                var params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY)
                params.detect_language = true  // Detect only; no transcription is produced
                params.n_threads = 4

                // Only the first 30 seconds are needed for detection
                let result = samples.withUnsafeBufferPointer { bufferPtr in
                    whisper_full(ctx, params, bufferPtr.baseAddress, Int32(min(samples.count, 16_000 * 30)))
                }

                guard result == 0 else {
                    continuation.resume(throwing: WhisperError.transcriptionFailed(code: result))
                    return
                }

                let detectedId = whisper_full_lang_id(ctx)
                let detectedLang = String(cString: whisper_lang_str(detectedId))

                // Return the top detected language
                let topResult = [(language: detectedLang, probability: 1.0 as Float)]
                continuation.resume(returning: topResult)
            }
        }
    }
}

5.2 Manual Language Selection

Users can manually select a language to bypass auto-detection. This improves accuracy for known languages and eliminates the detection overhead.
/// Language selection model for the settings UI.
struct LanguageOption: Identifiable, Hashable {
    let id: String          // ISO 639-1 code (e.g., "en", "tr", "de")
    let name: String        // English name (e.g., "English")
    let nativeName: String  // Native name (e.g., "Turkce")

    static let autoDetect = LanguageOption(
        id: "auto",
        name: "Auto-Detect",
        nativeName: "Auto-Detect"
    )
}

/// Commonly used language options for the UI.
extension LanguageOption {
    static let commonLanguages: [LanguageOption] = [
        .autoDetect,
        LanguageOption(id: "en", name: "English", nativeName: "English"),
        LanguageOption(id: "tr", name: "Turkish", nativeName: "Turkce"),
        LanguageOption(id: "de", name: "German", nativeName: "Deutsch"),
        LanguageOption(id: "fr", name: "French", nativeName: "Francais"),
        LanguageOption(id: "es", name: "Spanish", nativeName: "Espanol"),
        LanguageOption(id: "it", name: "Italian", nativeName: "Italiano"),
        LanguageOption(id: "pt", name: "Portuguese", nativeName: "Portugues"),
        LanguageOption(id: "nl", name: "Dutch", nativeName: "Nederlands"),
        LanguageOption(id: "pl", name: "Polish", nativeName: "Polski"),
        LanguageOption(id: "ru", name: "Russian", nativeName: "Russkiy"),
        LanguageOption(id: "zh", name: "Chinese", nativeName: "Zhongwen"),
        LanguageOption(id: "ja", name: "Japanese", nativeName: "Nihongo"),
        LanguageOption(id: "ko", name: "Korean", nativeName: "Hangugeo"),
        LanguageOption(id: "ar", name: "Arabic", nativeName: "Al-Arabiyyah"),
        LanguageOption(id: "hi", name: "Hindi", nativeName: "Hindi"),
    ]
}

5.3 Supported Languages

Whisper supports 99 languages, grouped below into approximate accuracy tiers. Performance and accuracy vary by language, with English having the highest accuracy and less-resourced languages showing lower performance.
| Tier | Languages | Expected WER |
|---|---|---|
| Tier 1 (Excellent) | English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Chinese, Japanese | < 5% |
| Tier 2 (Good) | Turkish, Korean, Polish, Czech, Swedish, Danish, Norwegian, Finnish, Greek, Romanian, Hungarian, Bulgarian, Croatian, Slovak, Slovenian, Lithuanian, Latvian, Estonian | 5–10% |
| Tier 3 (Fair) | Arabic, Hindi, Thai, Vietnamese, Indonesian, Malay, Filipino, Ukrainian, Serbian, Catalan, Galician, Basque, Welsh, Irish, Icelandic | 10–20% |
| Tier 4 (Experimental) | Remaining languages (Swahili, Yoruba, Hausa, Amharic, etc.) | > 20% |
ℹ️ Info: Word Error Rate (WER) estimates are approximate and based on common benchmarks. Actual performance depends heavily on audio quality, accent, speaking pace, and domain-specific vocabulary.
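For readers who want to measure WER on their own recordings, the metric is the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The sketch below is a plain-Swift illustration; the function name is hypothetical, and it normalizes only by lowercasing, whereas published benchmarks usually apply heavier text normalization.

```swift
/// Word error rate: (substitutions + deletions + insertions) / reference word count,
/// computed as a word-level Levenshtein distance.
func wordErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = reference.lowercased().split(separator: " ").map(String.init)
    let hyp = hypothesis.lowercased().split(separator: " ").map(String.init)
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }

    // previous[j] = distance between the reference words seen so far and the first j hypothesis words
    var previous = Array(0...hyp.count)
    for i in 1...ref.count {
        var current = [i] + Array(repeating: 0, count: hyp.count)
        for j in 0..<hyp.count {
            let substitution = previous[j] + (ref[i - 1] == hyp[j] ? 0 : 1)
            let deletion = previous[j + 1] + 1
            let insertion = current[j] + 1
            current[j + 1] = min(substitution, deletion, insertion)
        }
        previous = current
    }
    return Double(previous[hyp.count]) / Double(ref.count)
}

// One substituted word out of four reference words.
print(wordErrorRate(reference: "the quick brown fox", hypothesis: "the quick brown box"))  // prints 0.25
```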
5.4 Language-Specific Optimizations
For certain languages, VaulType applies specific post-processing rules to improve output quality:

/// Language-specific post-processing rules applied after Whisper transcription.
struct LanguagePostProcessor {

    /// Apply language-specific corrections to transcribed text.
    static func process(_ text: String, language: String) -> String {
        var result = text

        switch language {
        case "tr":
            // Turkish: Fix common i/I dotted vs dotless confusion
            result = fixTurkishDotting(result)
        case "de":
            // German: Capitalize nouns (Whisper sometimes lowercases them)
            result = fixGermanCapitalization(result)
        case "zh":
            // Chinese: Remove extraneous spaces between characters
            result = result.replacingOccurrences(of: " ", with: "")
        case "ja":
            // Japanese: Normalize punctuation
            result = normalizeJapanesePunctuation(result)
        default:
            break
        }

        return result
    }

    private static func fixTurkishDotting(_ text: String) -> String {
        // Turkish has both dotted (i, İ) and dotless (ı, I) i characters.
        // Whisper sometimes confuses these; apply common corrections.
        // The replacement values use the dotted capital İ.
        var result = text
        let turkishCorrections: [String: String] = [
            "Istambul": "İstanbul",
            "Izmir": "İzmir",
        ]
        for (wrong, correct) in turkishCorrections {
            result = result.replacingOccurrences(of: wrong, with: correct)
        }
        return result
    }

    private static func fixGermanCapitalization(_ text: String) -> String {
        // Simplified: in practice, this would use a noun dictionary
        return text
    }

    private static func normalizeJapanesePunctuation(_ text: String) -> String {
        text.replacingOccurrences(of: ".", with: "。")
            .replacingOccurrences(of: ",", with: "、")
    }
}

6. Performance Tuning Parameters

Whisper.cpp exposes several parameters that significantly affect transcription speed and accuracy. VaulType provides sensible defaults and allows advanced users to tune these parameters.
6.1 Core Parameters
| Parameter | Default | Range | Effect |
|---|---|---|---|
| `n_threads` | CPU core count | 1–16 | Number of CPU threads for inference |
| `beam_size` | 1 (greedy) | 1–10 | Beam search width; higher = more accurate but slower |
| `temperature` | 0.0 | 0.0–1.0 | Sampling temperature; 0 = deterministic |
| `temperature_inc` | 0.2 | 0.0–1.0 | Temperature increment on fallback attempts |
| `no_timestamps` | false | true/false | Skip timestamp generation (slight speedup) |
| `single_segment` | false | true/false | Force single-segment output |
| `max_tokens` | 0 (unlimited) | 0–448 | Max tokens per segment |
| `use_gpu` | true | true/false | Use Metal GPU acceleration |
/// Performance presets for different use cases.
enum PerformancePreset: String, CaseIterable, Identifiable {
    case fast = "Fast"
    case balanced = "Balanced"
    case accurate = "Accurate"

    var id: String { rawValue }

    var options: TranscriptionOptions {
        switch self {
        case .fast:
            return TranscriptionOptions(
                threadCount: ProcessInfo.processInfo.activeProcessorCount,
                beamSize: 1,
                temperature: 0.0,
                temperatureIncrement: 0.4,
                initialPrompt: nil
            )
        case .balanced:
            return TranscriptionOptions(
                threadCount: max(ProcessInfo.processInfo.activeProcessorCount - 2, 4),
                beamSize: 3,
                temperature: 0.0,
                temperatureIncrement: 0.2,
                initialPrompt: nil
            )
        case .accurate:
            return TranscriptionOptions(
                threadCount: max(ProcessInfo.processInfo.activeProcessorCount - 2, 4),
                beamSize: 5,
                temperature: 0.0,
                temperatureIncrement: 0.1,
                initialPrompt: nil
            )
        }
    }
}

6.2 Apple Silicon Optimization

Apple Silicon Macs benefit from the unified memory architecture and Metal GPU acceleration. (whisper.cpp reaches the Neural Engine only through Core ML, which the build configuration shown earlier disables.) VaulType automatically detects the chip family and adjusts parameters accordingly.
/// Determines optimal Whisper parameters for the current hardware.
struct HardwareOptimizer {

    struct HardwareProfile: Sendable {
        let chipFamily: ChipFamily
        let coreCount: Int
        let performanceCores: Int
        let efficiencyCores: Int
        let memoryGB: Int
        let gpuCores: Int
        let hasNeuralEngine: Bool
    }

    enum ChipFamily: Sendable {
        case appleSilicon(generation: String)  // "M1", "M2", "M3", "M4"
        case intel
    }

    /// Detect current hardware and return optimized transcription options.
    static func optimizedOptions(for modelSize: String) -> TranscriptionOptions {
        let profile = detectHardware()

        switch profile.chipFamily {
        case .appleSilicon:
            return appleSiliconOptions(profile: profile, modelSize: modelSize)
        case .intel:
            return intelOptions(profile: profile, modelSize: modelSize)
        }
    }

    private static func appleSiliconOptions(
        profile: HardwareProfile,
        modelSize: String
    ) -> TranscriptionOptions {
        // On Apple Silicon, use performance cores and leave efficiency cores for UI
        let threads = max(profile.performanceCores, 4)

        // Beam search is affordable on Apple Silicon
        let beamSize: Int
        switch modelSize {
        case "ggml-tiny", "ggml-base":
            beamSize = 5
        case "ggml-small":
            beamSize = 3
        case "ggml-medium":
            beamSize = 2
        case "ggml-large-v3":
            beamSize = 1  // Greedy to keep latency manageable
        default:
            beamSize = 3
        }

        return TranscriptionOptions(
            threadCount: threads,
            beamSize: beamSize,
            temperature: 0.0,
            temperatureIncrement: 0.2
        )
    }

    private static func intelOptions(
        profile: HardwareProfile,
        modelSize: String
    ) -> TranscriptionOptions {
        // On Intel, be more conservative with thread count
        let threads = max(profile.coreCount - 2, 2)

        // Greedy decoding for speed on Intel
        return TranscriptionOptions(
            threadCount: threads,
            beamSize: 1,
            temperature: 0.0,
            temperatureIncrement: 0.3
        )
    }

    private static func detectHardware() -> HardwareProfile {
        let processInfo = ProcessInfo.processInfo
        let coreCount = processInfo.activeProcessorCount
        let memoryGB = Int(processInfo.physicalMemory / (1024 * 1024 * 1024))

        // Detect Apple Silicon vs Intel
        #if arch(arm64)
        return HardwareProfile(
            chipFamily: .appleSilicon(generation: "M-series"),
            coreCount: coreCount,
            performanceCores: max(coreCount / 2, 4),
            efficiencyCores: coreCount / 2,
            memoryGB: memoryGB,
            gpuCores: 0,  // GPU core count not easily queryable
            hasNeuralEngine: true
        )
        #else
        return HardwareProfile(
            chipFamily: .intel,
            coreCount: coreCount,
            performanceCores: coreCount,
            efficiencyCores: 0,
            memoryGB: memoryGB,
            gpuCores: 0,
            hasNeuralEngine: false
        )
        #endif
    }
}

🍎 macOS-specific: On Apple Silicon, whisper.cpp offloads matrix multiplications to the Metal GPU via `ggml-metal`. The unified memory means no explicit data transfer between CPU and GPU is needed, resulting in lower latency than on discrete GPU systems.
6.3 Intel Mac Configuration
| Consideration | Recommendation |
|---|---|
| Max recommended model | medium (1.5 GB) |
| Thread count | Total cores minus 2 (minimum 2) |
| Beam search | Greedy (beam_size = 1) |
| GPU acceleration | Available if Metal-capable GPU present |
| Expected speed | 0.2x–3x realtime depending on model |
| Memory warning | 16 GB RAM minimum for medium |
⚠️ Warning: The `large-v3` model on Intel Macs will likely run slower than realtime and may cause memory pressure on systems with less than 32 GB RAM. VaulType shows a warning when selecting `large-v3` on Intel hardware.
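The "x realtime" figures above express audio duration divided by processing time. As a sketch of how such a figure could be derived from a transcription's sample count and its `processingTimeMs` value (the function name `realtimeSpeed` is hypothetical):

```swift
/// Speed relative to realtime: audio duration divided by processing time.
/// Values above 1.0 mean transcription finished faster than the audio's own length.
func realtimeSpeed(sampleCount: Int, processingTimeMs: Int64, sampleRate: Double = 16_000) -> Double {
    guard processingTimeMs > 0 else { return .infinity }
    let audioSeconds = Double(sampleCount) / sampleRate
    let processingSeconds = Double(processingTimeMs) / 1_000
    return audioSeconds / processingSeconds
}

// 10 seconds of 16kHz audio processed in 5 seconds is 2x realtime;
// the same audio processed in 50 seconds is 0.2x realtime.
print(realtimeSpeed(sampleCount: 160_000, processingTimeMs: 5_000))
print(realtimeSpeed(sampleCount: 160_000, processingTimeMs: 50_000))
```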
6.4 Memory Management
Whisper models are memory-mapped where possible, reducing the application’s resident memory footprint. However, during inference, the working memory can be substantial:

extension WhisperBridge {

    /// Estimated memory requirements for a given model (totals are rounded).
    static func estimatedMemory(forModel modelId: String) -> (modelMB: Int, workingMB: Int, totalMB: Int) {
        switch modelId {
        case "ggml-tiny":     return (modelMB: 75, workingMB: 125, totalMB: 200)
        case "ggml-base":     return (modelMB: 148, workingMB: 200, totalMB: 350)
        case "ggml-small":    return (modelMB: 488, workingMB: 360, totalMB: 850)
        case "ggml-medium":   return (modelMB: 1500, workingMB: 1000, totalMB: 2500)
        case "ggml-large-v3": return (modelMB: 3100, workingMB: 1700, totalMB: 4800)
        default:              return (modelMB: 0, workingMB: 0, totalMB: 0)
        }
    }

    /// Check whether the system has enough memory for a model.
    /// This is a conservative estimate based on total physical RAM;
    /// it does not measure how much memory is currently in use.
    static func canLoadModel(_ modelId: String) -> (canLoad: Bool, availableMB: Int, requiredMB: Int) {
        let (_, _, totalRequired) = estimatedMemory(forModel: modelId)

        let processInfo = ProcessInfo.processInfo
        let totalPhysical = Int(processInfo.physicalMemory / (1024 * 1024))
        // Conservative estimate: allow model to use up to 60% of total RAM
        let availableForModel = totalPhysical * 60 / 100

        return (
            canLoad: availableForModel >= totalRequired,
            availableMB: availableForModel,
            requiredMB: totalRequired
        )
    }
}

7. Custom Vocabulary Integration

VaulType allows users to define custom vocabulary entries (domain-specific terms, proper nouns, abbreviations, and technical jargon) that improve recognition accuracy. Custom vocabulary works through two mechanisms: prompt conditioning and post-processing corrections.
7.1 Vocabulary Entry Format
import SwiftData

/// A user-defined custom vocabulary entry stored in SwiftData.
@Model
final class VocabularyEntry {
    /// The term as it should appear in the final text.
    var term: String

    /// Alternative spoken forms that should map to this term.
    /// Example: For term "PostgreSQL", spoken forms might be ["postgres", "post gres Q L"].
    var spokenForms: [String]

    /// Whether this entry should be included in Whisper prompt conditioning.
    var useForPromptConditioning: Bool

    /// Whether this entry should be applied as a post-processing replacement.
    var useForPostProcessing: Bool

    /// Optional category for organization (e.g., "Technical", "Names", "Medical").
    var category: String?

    /// Language code this entry applies to, or nil for all languages.
    var language: String?

    var createdAt: Date
    var updatedAt: Date

    init(
        term: String,
        spokenForms: [String] = [],
        useForPromptConditioning: Bool = true,
        useForPostProcessing: Bool = true,
        category: String? = nil,
        language: String? = nil
    ) {
        self.term = term
        self.spokenForms = spokenForms
        self.useForPromptConditioning = useForPromptConditioning
        self.useForPostProcessing = useForPostProcessing
        self.category = category
        self.language = language
        self.createdAt = Date()
        self.updatedAt = Date()
    }
}

7.2 Prompt Conditioning with Custom Vocabulary

Whisper supports an “initial prompt” parameter that biases the model toward specific vocabulary. VaulType builds this prompt from the user’s custom vocabulary entries.
/// Builds a Whisper prompt conditioning string from custom vocabulary.
struct VocabularyPromptBuilder {

    /// Build a prompt string from vocabulary entries.
    /// The prompt contains terms separated by commas, encouraging Whisper
    /// to recognize these terms during transcription.
    ///
    /// - Parameters:
    ///   - entries: Custom vocabulary entries.
    ///   - language: Current transcription language (for filtering).
    ///   - maxLength: Maximum prompt length in characters (Whisper limit ~224 tokens).
    /// - Returns: A prompt string suitable for `whisper_full_params.initial_prompt`.
    static func buildPrompt(
        from entries: [VocabularyEntry],
        language: String?,
        maxLength: Int = 500
    ) -> String? {
        let relevantEntries = entries.filter { entry in
            guard entry.useForPromptConditioning else { return false }
            if let entryLang = entry.language, let targetLang = language {
                return entryLang == targetLang
            }
            return true
        }

        guard !relevantEntries.isEmpty else { return nil }

        // Build prompt: "The following terms may appear: VaulType, PostgreSQL, Kubernetes, ..."
        var prompt = "The following terms may appear: "
        var currentLength = prompt.count

        for (index, entry) in relevantEntries.enumerated() {
            let separator = index == 0 ? "" : ", "
            let addition = separator + entry.term

            if currentLength + addition.count > maxLength { break }

            prompt += addition
            currentLength += addition.count
        }

        return prompt + "."
    }
}

ℹ️ Info: Prompt conditioning is a soft bias: it increases the probability of the listed terms appearing in the transcription but does not guarantee them. It works best when the terms are actually spoken in the audio. Overly long prompts may degrade overall accuracy.

7.3 Post-Processing Corrections

After Whisper produces a transcription, VaulType applies post-processing rules based on custom vocabulary to correct common misrecognitions.
/// Applies post-processing corrections to transcribed text based on custom vocabulary.
struct VocabularyPostProcessor {

    /// Apply vocabulary-based corrections to transcribed text.
    /// - Parameters:
    ///   - text: Raw transcription from Whisper.
    ///   - entries: Custom vocabulary entries.
    ///   - language: Transcription language.
    /// - Returns: Corrected text.
    static func apply(
        to text: String,
        entries: [VocabularyEntry],
        language: String?
    ) -> String {
        var result = text

        let relevantEntries = entries.filter { entry in
            guard entry.useForPostProcessing else { return false }
            guard !entry.spokenForms.isEmpty else { return false }
            if let entryLang = entry.language, let targetLang = language {
                return entryLang == targetLang
            }
            return true
        }

        for entry in relevantEntries {
            // Escape the term so "$" or "\" in it is inserted literally
            // rather than being interpreted as a template reference.
            let template = NSRegularExpression.escapedTemplate(for: entry.term)

            for spokenForm in entry.spokenForms {
                // Case-insensitive whole-word replacement
                let pattern = "\\b\(NSRegularExpression.escapedPattern(for: spokenForm))\\b"
                if let regex = try? NSRegularExpression(pattern: pattern, options: .caseInsensitive) {
                    let range = NSRange(result.startIndex..., in: result)
                    result = regex.stringByReplacingMatches(
                        in: result,
                        options: [],
                        range: range,
                        withTemplate: template
                    )
                }
            }
        }

        return result
    }
}

Example vocabulary entries and their effect:
| Term | Spoken Forms | Before Correction | After Correction |
|---|---|---|---|
| VaulType | hush type, vaultype | "Open hush type settings" | "Open VaulType settings" |
| PostgreSQL | postgres, post gres | "Connect to the postgres database" | "Connect to the PostgreSQL database" |
| Kubernetes | kubernetes, k8s | "Deploy to kubernetes" | "Deploy to Kubernetes" |
| async/await | async await, a sync a wait | "Use a sync a wait pattern" | "Use async/await pattern" |
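The rows above can be reproduced with a small standalone sketch. `DemoEntry` and `applyCorrections` are hypothetical names introduced for illustration; they mirror the whole-word, case-insensitive replacement described in this section rather than VaulType's actual types. Sorting spoken forms longest-first is an added assumption, so that a longer form is tried before a shorter overlapping one.

```swift
import Foundation

/// Hypothetical, trimmed-down stand-in for a vocabulary entry.
struct DemoEntry {
    let term: String
    let spokenForms: [String]
}

/// Replaces each spoken form with its canonical term
/// (case-insensitive, whole words only).
func applyCorrections(to text: String, entries: [DemoEntry]) -> String {
    var result = text
    for entry in entries {
        let template = NSRegularExpression.escapedTemplate(for: entry.term)
        // Assumption: try longer spoken forms before shorter ones.
        for spoken in entry.spokenForms.sorted(by: { $0.count > $1.count }) {
            let pattern = "\\b" + NSRegularExpression.escapedPattern(for: spoken) + "\\b"
            guard let regex = try? NSRegularExpression(
                pattern: pattern,
                options: [.caseInsensitive]
            ) else { continue }
            let range = NSRange(result.startIndex..., in: result)
            result = regex.stringByReplacingMatches(
                in: result, options: [], range: range, withTemplate: template
            )
        }
    }
    return result
}

let demoEntries = [
    DemoEntry(term: "VaulType", spokenForms: ["hush type", "vaultype"]),
    DemoEntry(term: "PostgreSQL", spokenForms: ["postgres", "post gres"]),
    DemoEntry(term: "async/await", spokenForms: ["async await", "a sync a wait"]),
]

print(applyCorrections(to: "Open hush type settings", entries: demoEntries))
// prints "Open VaulType settings"
```

Because the match is anchored on word boundaries, an already corrected term such as "PostgreSQL" is not matched again by the shorter spoken form "postgres".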
8. Accuracy Optimization Techniques
8.1 Model Selection Strategy
Choosing the right model is the single most impactful decision for transcription accuracy. Use this decision tree:
```
                   Start Here
                        │
                        ▼
             ┌─────────────────────┐
             │ Apple Silicon Mac?  │
             └──────────┬──────────┘
                        │
                ┌───────┴───────┐
                ▼               ▼
               Yes          No (Intel)
                │               │
                ▼               ▼
         ┌─────────────┐  ┌───────────┐
         │ RAM >= 16GB │  │ Use small │
         └──────┬──────┘  │ (488 MB)  │
                │         └───────────┘
         ┌──────┴──────┐
         ▼             ▼
        Yes            No
         │             │
         ▼             ▼
   ┌───────────┐ ┌───────────┐
   │ Need max  │ │ Use small │
   │ accuracy? │ │ (488 MB)  │
   └─────┬─────┘ └───────────┘
         │
     ┌───┴───┐
     ▼       ▼
    Yes      No
     │       │
     ▼       ▼
    Use     Use small
  large-v3  or medium
  (3.1 GB)
```

Recommended Model by Use Case:
| Use Case | Recommended Model | Reason |
|---|---|---|
| Quick notes, drafts | tiny or base | Speed over accuracy |
| Daily use, emails | small | Best balance |
| Professional transcription | medium or large-v3 | Accuracy critical |
| Technical dictation (code) | medium + custom vocab | Handles jargon better |
| Multi-language | large-v3 | Best cross-language performance |
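The decision tree in this section can also be written as a small function. This is a minimal sketch: `WhisperModel` and `recommendedModels` are hypothetical names, and how the hardware facts (chip type, RAM) are obtained is left out.

```swift
/// Model sizes referenced by the decision tree (illustrative enum).
enum WhisperModel: String {
    case small
    case medium
    case largeV3 = "large-v3"
}

/// Returns the candidate models for a machine, following the decision tree:
/// Intel or less than 16 GB RAM -> small; otherwise large-v3 when maximum
/// accuracy is needed, else small or medium.
func recommendedModels(
    isAppleSilicon: Bool,
    ramGB: Int,
    needsMaxAccuracy: Bool
) -> [WhisperModel] {
    guard isAppleSilicon, ramGB >= 16 else { return [.small] }
    return needsMaxAccuracy ? [.largeV3] : [.small, .medium]
}

let candidates = recommendedModels(isAppleSilicon: true, ramGB: 32, needsMaxAccuracy: false)
print(candidates.map(\.rawValue))
// prints ["small", "medium"]
```

Returning a list rather than a single model keeps the "small or medium" leaf of the tree intact; the use-case table above can then narrow the choice.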
8.2 Prompt Conditioning
Beyond custom vocabulary, Whisper's initial prompt can be used to set context and improve recognition of domain-specific content:
```swift
/// Context-aware prompt conditioning based on the active processing mode.
struct ContextPromptBuilder {

    /// Build a context prompt based on the processing mode and user context.
    static func buildPrompt(
        mode: ProcessingMode,
        customVocab: [VocabularyEntry],
        language: String?,
        previousText: String? = nil
    ) -> String? {
        var parts: [String] = []

        // Mode-specific context
        switch mode {
        case .raw:
            break // No prompt conditioning for raw mode
        case .clean:
            parts.append("Transcribe clearly with proper punctuation and grammar.")
        case .structure:
            parts.append("This is structured content with headings, lists, or paragraphs.")
        case .prompt:
            parts.append("This is a prompt or instruction being dictated.")
        case .code:
            parts.append(
                "This is programming-related content. " +
                "Terms like function, variable, class, struct, enum, protocol, " +
                "async, await, import, return may appear."
            )
        case .custom:
            break // User defines their own context
        }

        // Add custom vocabulary terms
        if let vocabPrompt = VocabularyPromptBuilder.buildPrompt(
            from: customVocab,
            language: language,
            maxLength: 300
        ) {
            parts.append(vocabPrompt)
        }

        // Add previous text for continuity (last ~100 chars)
        if let prev = previousText, !prev.isEmpty {
            let suffix = String(prev.suffix(100))
            parts.append(suffix)
        }

        let prompt = parts.joined(separator: " ")
        return prompt.isEmpty ? nil : prompt
    }
}

/// VaulType processing modes.
enum ProcessingMode: String, CaseIterable, Identifiable, Codable {
    case raw = "Raw"
    case clean = "Clean"
    case structure = "Structure"
    case prompt = "Prompt"
    case code = "Code"
    case custom = "Custom"

    var id: String { rawValue }
}
```

💡 Tip: For best results with prompt conditioning, keep the prompt concise (under 224 tokens) and relevant to the expected content. A prompt that does not match the actual speech can decrease accuracy.
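Taking a fixed-length character suffix of the previous text can cut the first word in half, which hands Whisper a fragment as context. The sketch below shows one way to trim at a word boundary instead. `tailAtWordBoundary` is a hypothetical helper, not part of VaulType, and it counts characters rather than tokens, so it only approximates the token budget mentioned in the tip.

```swift
import Foundation

/// Returns at most `maxLength` trailing characters of `text`,
/// dropping a leading partial word if the cut landed mid-word.
/// Assumes `maxLength` is non-negative.
func tailAtWordBoundary(_ text: String, maxLength: Int) -> String {
    guard text.count > maxLength else { return text }
    let tail = text.suffix(maxLength)

    // The character just before the tail tells us whether we cut mid-word.
    let before = text[text.index(before: tail.startIndex)]
    if before.isWhitespace || (tail.first?.isWhitespace ?? true) {
        return tail.trimmingCharacters(in: .whitespaces)
    }

    // Mid-word cut: skip ahead to the first whitespace inside the tail.
    guard let firstSpace = tail.firstIndex(where: { $0.isWhitespace }) else {
        return ""
    }
    return String(tail[firstSpace...]).trimmingCharacters(in: .whitespaces)
}

print(tailAtWordBoundary("hello world foo", maxLength: 7))
// prints "foo"
```

With `maxLength: 7` the raw suffix would be "rld foo"; the helper drops the fragment "rld" and keeps only the complete word.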
8.3 Audio Quality Tips
Audio quality has a dramatic impact on transcription accuracy. VaulType provides the following guidance to users:
| Factor | Impact | Recommendation |
|---|---|---|
| Microphone distance | High | Keep microphone 6–12 inches from mouth |
| Background noise | High | Use in a quiet environment or use a directional mic |
| Speaking pace | Medium | Speak naturally; avoid rushing or extreme slowness |
| Microphone quality | Medium | USB condenser or headset mic > built-in laptop mic |
| Pop filter | Low | Reduces plosive sounds (p, b, t) that cause artifacts |
| Sample rate | Low | VaulType handles conversion; native 48kHz is fine |
| Echo | Medium | Avoid large, reverberant rooms |
9. Handling Edge Cases
9.1 Background Noise
Background noise is the most common source of transcription errors. VaulType mitigates noise through the audio preprocessing pipeline (noise gate + VAD) and model-level robustness.
```swift
import Accelerate
import Combine

/// Adaptive noise profile that adjusts to the ambient environment.
final class AdaptiveNoiseProfile: ObservableObject {
    @Published var estimatedNoiseLevel: Float = 0.0
    @Published var noiseDescription: String = "Quiet"

    private var samples: [Float] = []
    private let calibrationDuration: Int = 16_000 // 1 second at 16kHz

    /// Calibrate the noise profile from a silent recording.
    /// Call this when the user is NOT speaking to establish a noise baseline.
    func calibrate(silentSamples: [Float]) {
        guard !silentSamples.isEmpty else { return }

        var rms: Float = 0
        vDSP_rmsqv(silentSamples, 1, &rms, vDSP_Length(silentSamples.count))

        estimatedNoiseLevel = rms

        // Classify noise level
        switch rms {
        case 0..<0.002:
            noiseDescription = "Quiet"
        case 0.002..<0.01:
            noiseDescription = "Low Background Noise"
        case 0.01..<0.05:
            noiseDescription = "Moderate Background Noise"
        default:
            noiseDescription = "High Background Noise"
        }
    }

    /// Suggest adjustments based on the noise profile.
    /// Thresholds use >= so they line up with the classification ranges above.
    var recommendations: [String] {
        var tips: [String] = []

        if estimatedNoiseLevel >= 0.05 {
            tips.append("High noise detected. Consider moving to a quieter environment.")
            tips.append("Use a directional or headset microphone to reduce ambient noise.")
            tips.append("Consider using a larger model (medium or large-v3) for better noise robustness.")
        } else if estimatedNoiseLevel >= 0.01 {
            tips.append("Moderate noise detected. Results should be acceptable with the small model or larger.")
        }

        return tips
    }
}
```

Noise mitigation strategies by severity:
| Noise Level | VAD Threshold | Noise Gate Threshold | Recommended Model | Expected Impact |
|---|---|---|---|---|
| Quiet (< 0.002) | Default (0.005) | Default (0.003) | Any | Minimal |
| Low (0.002–0.01) | 0.008 | 0.005 | small or larger | Low WER increase |
| Moderate (0.01–0.05) | 0.015 | 0.01 | medium or larger | Moderate WER increase |
| High (> 0.05) | 0.03 | 0.02 | large-v3 | Significant WER increase |
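The table above can be expressed as a lookup from a measured RMS level to the suggested thresholds. This is a sketch only: `NoiseMitigation`, `rmsLevel`, and `mitigation(forNoiseRMS:)` are hypothetical names, and `rmsLevel` is a plain-Swift stand-in for `vDSP_rmsqv` so the example runs without Accelerate.

```swift
/// Hypothetical settings bundle; field names are illustrative.
struct NoiseMitigation: Equatable {
    let vadThreshold: Float
    let noiseGateThreshold: Float
    let recommendedModel: String
}

/// Root-mean-square level of a buffer (plain-Swift stand-in for vDSP_rmsqv).
func rmsLevel(_ samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
    return (sumOfSquares / Float(samples.count)).squareRoot()
}

/// Maps a noise RMS level to the thresholds listed in the table.
func mitigation(forNoiseRMS rms: Float) -> NoiseMitigation {
    switch rms {
    case ..<0.002:
        return NoiseMitigation(vadThreshold: 0.005, noiseGateThreshold: 0.003, recommendedModel: "any")
    case ..<0.01:
        return NoiseMitigation(vadThreshold: 0.008, noiseGateThreshold: 0.005, recommendedModel: "small or larger")
    case ..<0.05:
        return NoiseMitigation(vadThreshold: 0.015, noiseGateThreshold: 0.01, recommendedModel: "medium or larger")
    default:
        return NoiseMitigation(vadThreshold: 0.03, noiseGateThreshold: 0.02, recommendedModel: "large-v3")
    }
}

let settings = mitigation(forNoiseRMS: 0.02)
print(settings.recommendedModel)
// prints "medium or larger"
```

A level of exactly 0.05 falls into the "High" row here, which matches the `default` branch of the classification in `AdaptiveNoiseProfile`.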
9.2 Accents and Dialects
Whisper is trained on a diverse multilingual dataset and handles most accents reasonably well. However, strong accents can still cause errors.
Best practices for accented speech:
- Use a larger model: `medium` and `large-v3` handle accents significantly better than `tiny` or `base`.
- Set the language explicitly: Auto-detection may misidentify the language for heavily accented speech.
- Use prompt conditioning: Include region-specific terms in the initial prompt.
- Build custom vocabulary: Add frequently misrecognized words to the vocabulary list.
9.3 Technical Jargon
Programming terms, acronyms, and domain-specific vocabulary are challenging for general-purpose speech recognition. VaulType addresses this through the Code processing mode and custom vocabulary.
```swift
/// Built-in vocabulary for common programming terms.
/// Loaded automatically when the Code processing mode is active.
struct ProgrammingVocabulary {

    static let commonTerms: [(term: String, spokenForms: [String])] = [
        ("async/await", ["async await", "a sync a wait"]),
        ("boolean", ["boolean", "bool"]),
        ("StringBuilder", ["string builder"]),
        ("HashMap", ["hash map"]),
        ("GitHub", ["git hub", "github"]),
        ("GitLab", ["git lab", "gitlab"]),
        ("API", ["A P I", "api"]),
        ("URL", ["U R L", "url"]),
        ("JSON", ["J son", "jason"]),
        ("YAML", ["yam L", "yaml"]),
        ("HTTP", ["H T T P"]),
        ("HTTPS", ["H T T P S"]),
        ("SQL", ["S Q L", "sequel"]),
        ("NoSQL", ["no S Q L", "no sequel"]),
        ("REST", ["rest", "R E S T"]),
        ("GraphQL", ["graph Q L", "graph QL"]),
        ("OAuth", ["O auth", "oh auth"]),
        ("JWT", ["J W T", "jot"]),
        ("CRUD", ["crud", "C R U D"]),
        ("IDE", ["I D E"]),
        ("CLI", ["C L I"]),
        ("npm", ["N P M"]),
        ("pip", ["pip", "P I P"]),
        ("regex", ["regex", "reg ex", "regular expression"]),
        ("localhost", ["local host"]),
        ("sudo", ["sue doo", "pseudo"]),
        ("kubectl", ["cube control", "cube C T L", "kube control"]),
        ("Docker", ["docker"]),
        ("Kubernetes", ["kubernetes", "K 8 S"]),
        ("SwiftUI", ["swift U I", "swift you eye"]),
        ("UIKit", ["U I kit", "you eye kit"]),
        ("CoreData", ["core data"]),
        ("SwiftData", ["swift data"]),
        ("Xcode", ["X code", "ex code"]),
    ]
}
```

💡 Tip: When dictating code, speak punctuation explicitly: “open paren”, “close bracket”, “semicolon”. VaulType’s Code processing mode with llama.cpp post-processing handles the conversion from spoken punctuation to symbols. See Processing Modes for details.
9.4 Mixed-Language Speech
Mixed-language speech (code-switching) occurs frequently in multilingual environments, for example when switching between Turkish and English mid-sentence. This is one of Whisper's weaker areas.
Mitigation strategies:
- Use `large-v3`: The largest model handles code-switching best, as it was trained on the most diverse dataset.
- Set language to the dominant language: If most speech is in Turkish with occasional English terms, set language to `tr`. Whisper will still attempt to recognize the English portions.
- Use custom vocabulary for foreign terms: Add commonly used English terms to the vocabulary when speaking primarily in another language.
- Avoid auto-detect for mixed speech: Auto-detection locks to a single language based on the first 30 seconds, which may not represent the full recording.
```swift
/// Configuration for mixed-language scenarios.
struct MixedLanguageConfig {
    /// The primary language of the recording.
    let primaryLanguage: String

    /// Secondary language terms that may appear (for vocabulary conditioning).
    let secondaryTerms: [String]

    /// Build a prompt that hints at mixed-language content.
    func buildPrompt() -> String {
        var prompt = "The speaker primarily uses \(primaryLanguage)."
        if !secondaryTerms.isEmpty {
            let terms = secondaryTerms.prefix(20).joined(separator: ", ")
            prompt += " English terms such as \(terms) may also appear."
        }
        return prompt
    }
}
```

9.5 Long Utterances
Whisper processes audio in 30-second chunks internally. For recordings longer than 30 seconds, whisper.cpp automatically segments the audio. However, segment boundaries can occasionally split words or sentences awkwardly.
VaulType’s handling of long recordings:
- Automatic segmentation — whisper.cpp handles chunking internally. VaulType passes the full audio buffer and receives segmented results.
- Segment merging — Adjacent segments are merged with attention to sentence boundaries to produce coherent text.
- Maximum duration — Batch mode enforces a configurable maximum (default: 5 minutes). For longer dictations, streaming mode is recommended.
- Progress reporting — For long recordings, VaulType reports transcription progress based on the segment being processed.
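The segment-merging step listed above could look roughly like the following. This is a sketch under stated assumptions: `DemoSegment` is a simplified stand-in for the document's `Segment` type, `mergeAtSentenceBoundaries` is a hypothetical name, and a "sentence boundary" is approximated as text ending in `.`, `!`, or `?`.

```swift
import Foundation

/// Simplified stand-in for a transcription segment (illustrative only).
struct DemoSegment: Equatable {
    var text: String
    var startMs: Int64
    var endMs: Int64
}

/// Merges adjacent segments until the accumulated text ends a sentence.
func mergeAtSentenceBoundaries(_ segments: [DemoSegment]) -> [DemoSegment] {
    let terminators: Set<Character> = [".", "!", "?"]
    var merged: [DemoSegment] = []
    var pending: DemoSegment?

    for segment in segments {
        let piece = segment.text.trimmingCharacters(in: .whitespaces)
        guard !piece.isEmpty else { continue }

        if var current = pending {
            // Extend the open sentence with this segment's text and end time.
            current.text += " " + piece
            current.endMs = segment.endMs
            pending = current
        } else {
            pending = DemoSegment(text: piece, startMs: segment.startMs, endMs: segment.endMs)
        }

        // Flush once the accumulated text ends with sentence punctuation.
        if let current = pending, let last = current.text.last, terminators.contains(last) {
            merged.append(current)
            pending = nil
        }
    }

    // Keep trailing text that never reached a terminator.
    if let current = pending { merged.append(current) }
    return merged
}

let rawSegments = [
    DemoSegment(text: " The quick brown", startMs: 0, endMs: 1000),
    DemoSegment(text: " fox jumps.", startMs: 1000, endMs: 2000),
    DemoSegment(text: " Next one", startMs: 2000, endMs: 3000),
]
print(mergeAtSentenceBoundaries(rawSegments).map(\.text))
// prints ["The quick brown fox jumps.", "Next one"]
```

Each merged segment keeps the start time of its first piece and the end time of its last, so timestamps stay usable after merging.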
```swift
extension WhisperBridge {

    /// Transcribe a long recording with progress reporting.
    /// - Parameters:
    ///   - samples: Audio samples (may exceed 30 seconds).
    ///   - language: Target language or nil for auto-detect.
    ///   - options: Transcription options.
    ///   - onProgress: Called with progress (0.0 to 1.0) as segments are processed.
    /// - Returns: Complete transcription result.
    func transcribeLong(
        samples: [Float],
        language: String? = nil,
        options: TranscriptionOptions = .default,
        onProgress: @escaping @Sendable (Double) -> Void
    ) async throws -> TranscriptionResult {
        guard isModelLoaded, context != nil else {
            throw WhisperError.modelNotLoaded
        }

        let totalDuration = Double(samples.count) / 16_000.0
        Logger.speech.info("Starting long transcription: \(String(format: "%.1f", totalDuration))s of audio")

        // whisper_full handles long audio internally via its segmentation logic.
        // We use the progress callback to report status.
        return try await withCheckedThrowingContinuation { continuation in
            queue.async { [weak self] in
                guard let self, let ctx = self.context else {
                    continuation.resume(throwing: WhisperError.modelNotLoaded)
                    return
                }

                let startTime = DispatchTime.now()

                var params = whisper_full_default_params(
                    options.beamSize > 1 ? WHISPER_SAMPLING_BEAM_SEARCH : WHISPER_SAMPLING_GREEDY
                )
                params.n_threads = Int32(options.threadCount)
                params.translate = false

                if options.beamSize > 1 {
                    params.beam_search.beam_size = Int32(options.beamSize)
                }

                params.temperature = options.temperature
                params.temperature_inc = options.temperatureIncrement

                // whisper.cpp reads these C strings during whisper_full, so they must
                // outlive any withCString closure. Duplicate them and free them when
                // this block exits.
                // "auto" requests language detection followed by transcription;
                // setting detect_language = true would make whisper_full return
                // right after detection without transcribing.
                let languageCString = strdup(language ?? "auto")
                let promptCString = options.initialPrompt.flatMap { strdup($0) }
                defer {
                    free(languageCString)
                    free(promptCString)
                }
                params.language = UnsafePointer(languageCString)
                if let promptCString {
                    params.initial_prompt = UnsafePointer(promptCString)
                }

                // Progress callback
                let progressContext = UnsafeMutablePointer<(@Sendable (Double) -> Void)?>.allocate(capacity: 1)
                progressContext.initialize(to: onProgress)

                params.progress_callback_user_data = UnsafeMutableRawPointer(progressContext)
                params.progress_callback = { (_, _, progress, userData) in
                    guard let userData else { return }
                    let callback = userData.assumingMemoryBound(
                        to: ((@Sendable (Double) -> Void)?).self
                    ).pointee
                    callback?(Double(progress) / 100.0)
                }

                let result = samples.withUnsafeBufferPointer { bufferPtr in
                    whisper_full(ctx, params, bufferPtr.baseAddress, Int32(samples.count))
                }

                progressContext.deinitialize(count: 1)
                progressContext.deallocate()

                guard result == 0 else {
                    continuation.resume(throwing: WhisperError.transcriptionFailed(code: result))
                    return
                }

                // Extract all segments
                let segmentCount = whisper_full_n_segments(ctx)
                var segments: [Segment] = []
                var fullText = ""

                for i in 0..<segmentCount {
                    let text = String(cString: whisper_full_get_segment_text(ctx, i))
                    // whisper.cpp timestamps are in 10 ms units; convert to milliseconds.
                    let t0 = whisper_full_get_segment_t0(ctx, i) * 10
                    let t1 = whisper_full_get_segment_t1(ctx, i) * 10

                    let tokenCount = whisper_full_n_tokens(ctx, i)
                    var probSum: Float = 0
                    for t in 0..<tokenCount {
                        probSum += whisper_full_get_token_p(ctx, i, t)
                    }
                    let avgProb = tokenCount > 0 ? probSum / Float(tokenCount) : 0

                    segments.append(Segment(
                        text: text,
                        startMs: Int64(t0),
                        endMs: Int64(t1),
                        probability: avgProb,
                        isPartial: false
                    ))
                    fullText += text
                }

                // Use the requested language, or the one whisper.cpp detected.
                let detectedLang = language
                    ?? String(cString: whisper_lang_str(whisper_full_lang_id(ctx)))

                let elapsed = DispatchTime.now().uptimeNanoseconds - startTime.uptimeNanoseconds
                let elapsedMs = Int64(elapsed / 1_000_000)

                Logger.speech.info(
                    "Long transcription complete: \(segmentCount) segments, " +
                    "\(elapsedMs)ms processing time"
                )

                continuation.resume(returning: TranscriptionResult(
                    text: fullText.trimmingCharacters(in: .whitespacesAndNewlines),
                    segments: segments,
                    language: detectedLang,
                    languageProbability: 0,
                    processingTimeMs: elapsedMs
                ))
            }
        }
    }
}
```

Segment boundary handling:
| Issue | Cause | Mitigation |
|---|---|---|
| Split words | 30s chunk boundary falls mid-word | whisper.cpp uses overlap to minimize this |
| Repeated text | Overlap region causes duplicate tokens | Deduplication in post-processing |
| Lost context | Each chunk processes independently | Initial prompt carries forward from previous chunk |
| Hallucination | Silent segments may generate spurious text | VAD filtering removes silence before transcription |
❌ Error: If you see repeated phrases or “hallucinated” text in the output (text that was not spoken), this is typically caused by silent audio being sent to Whisper. Ensure VAD is enabled and the noise gate threshold is correctly calibrated. Whisper tends to hallucinate on silent or near-silent input.
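The "Repeated text" row in the table above mentions deduplication in post-processing. One simple way to do this is to drop words at the start of a segment that repeat the end of the previous one. The sketch below illustrates that idea only; `removingOverlap` is a hypothetical helper, and VaulType's actual deduplication may work differently. It compares whole words case-insensitively and requires at least two matching words by default, to avoid removing a single word that repeats by coincidence.

```swift
/// Removes words at the start of `next` that repeat the end of `previous`.
/// Only overlaps between `minWords` and `maxWords` words long are considered.
func removingOverlap(
    previous: String,
    next: String,
    minWords: Int = 2,
    maxWords: Int = 8
) -> String {
    let previousWords = previous.split(separator: " ").map { $0.lowercased() }
    let nextWords = next.split(separator: " ")
    let limit = min(maxWords, previousWords.count, nextWords.count)

    // Try the longest candidate overlap first.
    for length in stride(from: limit, through: minWords, by: -1) {
        let tail = Array(previousWords.suffix(length))
        let head = nextWords.prefix(length).map { $0.lowercased() }
        if tail == head {
            return nextWords.dropFirst(length).joined(separator: " ")
        }
    }
    return next
}

print(removingOverlap(
    previous: "we should deploy the new build",
    next: "the new build to staging tonight"
))
// prints "to staging tonight"
```

Punctuation attached to a word ("build." versus "build") prevents a match in this sketch; stripping punctuation before comparing would be a natural refinement.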
Related Documentation
- Architecture Overview: System-level architecture, including how the speech recognition engine fits into the overall pipeline.
- Model Management: Detailed guide on downloading, managing, and selecting both Whisper and llama.cpp models.
- API Documentation: Public API reference for the speech recognition engine, audio pipeline, and processing modes.