News & Updates

Development progress, feature announcements, and behind-the-scenes updates on Voxtype.

v0.7.5: Quickshell OSD, Soniox streaming, meeting-mode rework

A larger feature release that bundles work queued behind three hotfix cycles. Three themes: an opt-in Quickshell OSD frontend, a Soniox cloud streaming backend, and a rebuild of meeting mode around source diarization. Release artifacts are now signed end-to-end in CI by a dedicated release key, removing the manual signing step from every cut.

Quickshell OSD frontend (opt-in)

A QML-native on-screen display that replaces the GTK4 surface for users who already run a Quickshell-based desktop. Three composed surfaces ship with the package: a waveform overlay during recording, an engine picker overlay for switching transcription backends without leaving the keyboard, and a meeting controls overlay for starting and stopping meeting captures. A new voxtype-audio-bridge sidecar streams audio levels to the QML side as NDJSON over a UNIX socket.

Why use it: if you already run Quickshell, the OSD blends into your existing shell instead of pulling in a GTK4 surface. The QML stack composes with Hyprland / Niri layer-shell rules cleanly, and the waveform renders at full compositor refresh rate.

config.toml
[osd]
frontend = "quickshell"

The default frontend stays gtk4. Run voxtype setup quickshell to install the QML files and the sidecar binary into the right Quickshell config paths. Known issue: the Quickshell overlay captures pointer events across the whole screen (#440). A fix is planned for v0.7.6.
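
As a rough illustration of the sidecar's wire format, the Python sketch below reads newline-delimited JSON from a socket and yields one object per line. It is not the shipped QML consumer: the "level" field name is hypothetical, and socket.socketpair() stands in for the real UNIX socket, whose path is not given here.

```python
import json
import socket


def iter_ndjson(sock):
    """Yield one decoded JSON object per newline-terminated line read from sock."""
    buffer = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # peer closed the connection
            break
        buffer += chunk
        # A single recv may carry several lines, or only part of one.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)


def demo():
    # socketpair stands in for the real socket; "level" is a hypothetical field.
    writer, reader = socket.socketpair()
    writer.sendall(b'{"level": 0.12}\n{"level": 0.48}\n')
    writer.close()
    levels = [message["level"] for message in iter_ndjson(reader)]
    reader.close()
    return levels
```

Buffering until a newline arrives matters because a stream socket gives no guarantee that one recv call lines up with one message. Any trailing bytes without a final newline are discarded in this sketch.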

Soniox cloud streaming backend (#411)

A new engine that streams audio over a WebSocket to Soniox and types each partial transcript at the cursor as it arrives. It offers lower end-to-end latency than the local Parakeet streaming pipeline, at the cost of network dependence and a third-party API key.

Why use it: the lowest-latency option when you can tolerate cloud transcription. Good for live captioning, long-form interview dictation, or any case where the half-second cost of local Parakeet streaming feels too long.

config.toml
engine = "soniox"

[soniox]
api_key = "your-soniox-api-key"

Meeting mode: source diarization and VAD sub-windows

Three changes converge: a source-diarization path that labels each speaker by the input device their audio arrived on (#341, contributed by sjug), VAD sub-windows that let the ECAPA-TDNN ML diarizer run on shorter chunks for better turn-by-turn accuracy (#418), and a per-meeting --diarization CLI flag so you can switch backends without editing the config file (#420).

Why use it: meetings with a clear host-plus-remote structure (loopback for remote audio, mic for the host) now get correct speaker labels without any ML cost. The per-meeting flag means you can run one meeting with full ML diarization and the next with cheap source diarization without restarting the daemon.

per-meeting override
# Source-based (fast, works when speakers are on different input devices)
voxtype meeting start --diarization source

# ML-based (slower, works on a single mixed audio stream)
voxtype meeting start --diarization ml

Release signing migrated to a dedicated CI key (#437)

Every release artifact (binaries, companion .so files, SHA256SUMS.txt, .deb, .rpm, and the GitHub source archive) is now signed in CI by a dedicated release primary key (9CCF7915...) that is cross-signed by the offline maintainer key. Both fingerprints are in voxtype-bin's validpgpkeys, so yay auto-fetches the new key from a keyserver on first upgrade with no manual gpg --recv-keys step.

Why this matters: v0.7.4 used a back-signed signing subkey that yay could not auto-fetch by keyid, which broke yay -Syu voxtype-bin for every user with a stale local keyring. v0.7.5's standalone primary key resolves correctly on every keyserver implementation. The signing step itself is also now fully in CI, removing the maintainer-laptop step between tag and release.
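
Signature verification aside, the integrity half of the check can be sketched in a few lines of Python. This assumes SHA256SUMS.txt uses the usual sha256sum layout (hex digest, whitespace, filename); the artifact name in the demo is made up, and verifying the GPG signature over the sums file is a separate step not shown here.

```python
import hashlib
import os
import tempfile


def parse_sums(text):
    """Map filename -> lowercase hex digest from sha256sum-style lines."""
    sums = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, name = line.split(maxsplit=1)
        # sha256sum prefixes the name with '*' in binary mode.
        sums[name.lstrip("*")] = digest.lower()
    return sums


def verify(path, sums):
    """Return True when the file's SHA-256 matches the entry for its basename."""
    with open(path, "rb") as handle:
        actual = hashlib.sha256(handle.read()).hexdigest()
    return sums.get(os.path.basename(path)) == actual


def demo():
    with tempfile.TemporaryDirectory() as root:
        artifact = os.path.join(root, "example-artifact")  # made-up name
        with open(artifact, "wb") as handle:
            handle.write(b"hello")
        good = hashlib.sha256(b"hello").hexdigest() + "  example-artifact\n"
        bad = "0" * 64 + "  example-artifact\n"
        return verify(artifact, parse_sums(good)), verify(artifact, parse_sums(bad))
```

A matching checksum only proves the download is intact. Trust in the sums file itself still comes from the signature made by the release key.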

Variant safety improvements

Two changes around the /usr/bin/voxtype dispatcher wrapper:

  • Wrapper decision uses basename, not canonicalize (#443). The previous logic followed symlinks and could pick the wrong dispatch shape when /usr/bin/voxtype already pointed at a non-canonical path. Closed-set basename match is more reliable and doesn't depend on filesystem state.
  • TUI surfaces engine-vs-binary mismatch (#450). A persistent banner appears across every section when the configured engine cannot be served by the running binary. F2 jumps to the General section's variant picker. The daemon also fires a desktop notification at startup so users who do not open the TUI still see the warning.
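
To illustrate why a basename match is less fragile than canonicalization, here is a minimal Python sketch. The real wrapper is not written in Python, and its accepted names are not listed in these notes, so KNOWN_VARIANTS below is a hypothetical closed set. A symlink named after a variant resolves to whatever it points at under realpath, while its basename stays stable.

```python
import os
import tempfile

# Hypothetical closed set; the names are borrowed from binaries mentioned elsewhere in these notes.
KNOWN_VARIANTS = {"voxtype-onnx-cuda-12", "voxtype-onnx-cuda-13", "voxtype-onnx-migraphx"}


def variant_by_basename(path):
    """Match only the final path component against the closed set."""
    name = os.path.basename(path)
    return name if name in KNOWN_VARIANTS else None


def variant_by_canonicalize(path):
    """Follow symlinks first, then take the final component of the result."""
    return os.path.basename(os.path.realpath(path))


def demo():
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "some-build-output")
        open(target, "w").close()
        link = os.path.join(root, "voxtype-onnx-cuda-13")
        os.symlink(target, link)
        return variant_by_basename(link), variant_by_canonicalize(link)
```

In the demo the basename check returns the variant name, while the canonicalized path reports the link target, which depends entirely on how the filesystem happens to be laid out.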

Smaller fixes

  • Streaming Parakeet validator accepts the bobNight naming convention (closes the v0.7.4 known issue).
  • Per-language XKB variants for eitype (#424, contributed by AlexCzar).
  • Cohere ONNX remote-path prefix fix for the new onnx/ subdirectory layout (#363, contributed by jpds).
  • CPU __cpuid wrapped in unsafe block for rustc 1.91+ (#419, contributed by materemias).
  • NixOS CI workflow now builds on every PR (closes #369).
  • wl-copy fallback under X11 sessions (#346).
  • dotool daemon fast path (#410).
  • Audio socket listener watchdog (#391, #392).

Acknowledgments

  • sjug for the source-diarization patch (#341).
  • AlexCzar for the per-language XKB variants (#424).
  • jpds for the Cohere ONNX path fix (#363).
  • materemias for the rustc 1.91 __cpuid fix (#419).

Downloads

Full changelog, signed binaries, and SHA256 sums: v0.7.5 on GitHub.

v0.7.3: Kaby Lake Vulkan SIGILL fix, RTX 5090 Blackwell support

A two-issue hotfix. The Vulkan binary regression came in via the move to GitHub-hosted CI runners; the CUDA 13 issue surfaced once RTX 5090 owners ran a transcription. Both fixes also harden the build pipeline so the same class of regression cannot ship again.

Vulkan binary works on Kaby Lake and older Intel CPUs (#393)

The v0.7.2 Vulkan binary failed with SIGILL on Intel CPUs that lack SSE4a, including the i7-7700HQ in the reporter's machine. The instruction came from RUSTFLAGS="-C target-cpu=native" in Dockerfile.vulkan. That flag was safe when the build host was a pre-AVX-512 Coffee Lake server, but the v0.7.2 pipeline moved onto GitHub's ubuntu-24.04 runner, which is AMD Zen 3 and has SSE4a. The compiler emitted insertq instructions, and Intel CPUs SIGILL on those.

Why use it: if you run the Vulkan backend on a pre-Cascade-Lake Intel CPU (roughly 7th-gen Kaby Lake or older), the voxtype 0.7.2 daemon would not start. v0.7.3 boots cleanly on the same hardware.

The build now bounds Rust at target-cpu=haswell with -sse4a,-gfni,-avxvnni disables, matches the ggml CMake flags to the same floor, and adds a CI gate that fails the build on any AVX-512 EVEX, GFNI, AVX-VNNI, or SSE4a instruction in the AVX2 or Vulkan variants. The previous gate only checked zmm registers, so SSE4a slipped through unnoticed.
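
A gate like this can be approximated by scanning disassembly text for forbidden mnemonics. The Python sketch below is an assumption about the approach, not the project's actual CI script: it expects objdump-style text as input and flags SSE4a, GFNI, and AVX-VNNI mnemonics plus any zmm register use (a simplification of "AVX-512 EVEX", since EVEX encodings can also target xmm and ymm registers).

```python
import re

FORBIDDEN_MNEMONICS = {
    "sse4a": {"extrq", "insertq", "movntsd", "movntss"},
    "gfni": {"gf2p8affineqb", "gf2p8affineinvqb", "gf2p8mulb"},
    "avx-vnni": {"vpdpbusd", "vpdpbusds", "vpdpwssd", "vpdpwssds"},
}
ZMM_REGISTER = re.compile(r"\bzmm\d+\b")
WORD = re.compile(r"[a-z0-9_]+")


def forbidden_features(disassembly):
    """Return the set of forbidden feature names seen in the disassembly text."""
    hits = set()
    for line in disassembly.lower().splitlines():
        words = set(WORD.findall(line))
        for feature, mnemonics in FORBIDDEN_MNEMONICS.items():
            if words & mnemonics:
                hits.add(feature)
        if ZMM_REGISTER.search(line):
            hits.add("avx-512")
    return hits
```

A CI job would fail the build whenever the returned set is non-empty. Checking mnemonics as well as registers is what catches a case like insertq, which uses only xmm registers.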

NVIDIA Blackwell (RTX 5090) works on the CUDA 13 binary (#386)

voxtype-onnx-cuda-13 shipped a CUDA execution provider with kernel coverage up to sm_90a (Hopper). Blackwell needs sm_120, so the RTX 5090 hit the EP and got cudaErrorNoKernelImageForDevice on every transcription. The upstream prebuilt ORT that the ort crate downloads does not include Blackwell, and the maintainer has marked Blackwell support as not planned.

Why use it: if you own an RTX 50-series card, voxtype 0.7.2's CUDA 13 binary would error on every transcription. v0.7.3 transcribes correctly on Blackwell.

Dockerfile.onnx-cuda-13 now pulls Microsoft's official ORT 1.24.4 cu13 prebuilt and links to it with ort/load-dynamic. The bundled CUDA EP covers sm_70 through sm_120a. The runtime ships alongside the binary at /usr/lib/voxtype/cuda-13/libonnxruntime.so.1.24.4 plus a libonnxruntime.so symlink, which is exactly where ort/load-dynamic looks first (relative to /proc/self/exe). No environment variable or LD_LIBRARY_PATH setup required.

Two build gates went in alongside the fix: a check that sm_120 markers are present in the bundled provider library, and a check that the binary did not silently fall back to static linking (which would have brought the old kernel coverage back).

voxtype-onnx-cuda-12 is unchanged. Blackwell requires CUDA 13 and driver 580+, so cuda-12 users are not affected.

Acknowledgments

  • Piotr Lipski for reporting the Kaby Lake Vulkan SIGILL with a complete environment dump.
  • tyvsmith for the RTX 5090 report including the source-build comparison that confirmed the diagnosis.

Downloads

Full changelog, signed binaries, and SHA256 sums: v0.7.3 on GitHub.

v0.7.2: Streaming Dictation, Modifier-Release Guard, Notification Cleanup

Live transcription appears at the cursor as you speak, modifier keys no longer collide with typed output, and Waybar gets a dedicated streaming icon. Streaming requires toggle activation, not push-to-talk.

Streaming Dictation (experimental)

Parakeet streaming output types text incrementally while you dictate. Powered by the parakeet-rs cache-aware streaming pipeline, it works with parakeet-unified-en-0.6b (English) and runs on CPU, NVIDIA CUDA, or AMD MIGraphX.

Why use it: Words appear as you speak instead of after you release a key. Useful for long-form dictation, live notes, and interactive editing where waiting for batch transcription breaks flow.

Enable in the TUI Advanced section or in config:

config.toml
engine = "parakeet"

[parakeet]
model = "parakeet-unified-en-0.6b"
streaming = true

[hotkey]
mode = "toggle"

Streaming requires toggle activation. Push-to-talk does not work with streaming output: voxtype types characters at the cursor while you are still holding the key, and the synthetic key events from wtype or dotool clobber the held-key state tracker that Hyprland, Sway, and River use through libinput. After a few seconds the compositor decides the key was already released, so the actual release event never fires and the daemon stays in streaming mode.

Toggle activation sidesteps the issue entirely: no key is held during the session. The daemon emits a strong warning and auto-promotes push_to_talk to toggle for the running session when streaming is enabled, and the TUI rewrites the config to match.

Tunable parameters control the latency/accuracy tradeoff:

config.toml
[parakeet]
streaming_chunk_secs = 0.32         # how often partials are emitted
streaming_left_context_secs = 5.6   # history fed to each chunk
streaming_right_context_secs = 0.32 # lookahead per chunk
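
A back-of-the-envelope reading of these three values, sketched in Python. It assumes a chunk cannot be decoded until its lookahead audio has arrived, so the first audio in a chunk waits at most chunk plus right context before it can appear as text. Model compute time is ignored, and the function name is illustrative.

```python
def streaming_budget(chunk_secs, left_context_secs, right_context_secs):
    """Derive rough timing figures from the three streaming parameters."""
    return {
        # How many partial results are emitted per second of audio.
        "partials_per_second": round(1.0 / chunk_secs, 3),
        # Longest wait before the oldest audio in a chunk can be decoded.
        "worst_case_wait_secs": round(chunk_secs + right_context_secs, 3),
        # Total audio the model sees for each chunk.
        "window_secs": round(left_context_secs + chunk_secs + right_context_secs, 3),
    }
```

With the values shown above (0.32, 5.6, 0.32) this gives about 3.1 partials per second, a worst-case wait of 0.64 s, and a 6.24 s window per chunk. Shrinking the chunk or the right context lowers the wait but gives the model less audio to work with.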

Modifier-Release Guard (#350)

When you bind voxtype to a chord like Super+Ctrl+X, the modifiers stay physically pressed while wtype types the transcription. On some compositors this caused the first typed letter to combine with the modifiers and fire an unrelated keybinding (closing windows, switching workspaces, opening menus).

The daemon now takes an EVIOCGKEY snapshot of keyboard state via evdev before typing output, and waits for any held modifier to be released before sending keystrokes. If the timeout is hit (default 750ms) the output falls back to clipboard so the transcription is not lost, and a desktop notification tells the user where the text went.

Why use it: No more "voxtype just closed my window" surprises after a long dictation. Works on Hyprland, Sway, River, and any compositor-agnostic setup where the user is in the input group.

config.toml
[output]
wait_for_modifier_release = true     # default when input group is available
modifier_release_timeout_ms = 750

The guard is disabled automatically during streaming (your modifiers are still down because you are still actively dictating; the guard would never fire).
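
The wait-then-fall-back behaviour can be sketched as a polling loop with a deadline. This Python version is illustrative only: the real daemon reads key state through evdev, whereas here modifiers_held is a caller-supplied function, and the 10 ms poll interval is an assumption.

```python
import time


def wait_for_modifier_release(modifiers_held, timeout_ms=750, poll_ms=10,
                              clock=time.monotonic, sleep=time.sleep):
    """Return "type" once no modifier is held, or "clipboard" if the timeout expires."""
    deadline = clock() + timeout_ms / 1000.0
    while modifiers_held():
        if clock() >= deadline:
            return "clipboard"  # keep the transcription instead of dropping it
        sleep(poll_ms / 1000.0)
    return "type"
```

Passing the clock and sleep functions as parameters keeps the loop testable without real delays.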

Notification Stacking (#345)

Voxtype's status notifications ("Recording...", "Transcribing...", "Transcribed", "Modifier key held...") now use the x-canonical-private-synchronous and transient libnotify hints, so a compositor with a notification daemon (mako, dunst, GNOME Shell, KDE) overwrites the previous Voxtype notification in place instead of stacking them in the notification history.

Why use it: The notification panel stays clean on GNOME/Ubuntu and similar setups instead of accumulating a long trail of "Recording..." / "Transcribing..." / "Transcribed" entries every time you dictate.

Patch contributed by Stephan Schuster.

Streaming Status Icon

Waybar status followers now distinguish "recording" (batch capture) from "streaming" (live capture with cursor output). Custom themes can override the streaming glyph via [status.icons]:

config.toml
[status.icons]
streaming = "󰜟"  # nf-md-radio-handheld in Nerd Font

The built-in themes (omarchy, nerd-font, material, phosphor, codicons, emoji, minimal, dots, arrows, text) each ship a sensible streaming glyph. The Waybar format-icons map has a new "streaming" key. The omarchy-voxtype-status script reports the new state.

Bug Fixes

  • Streaming cursor protection: the parakeet backend used to emit one last Final event after SIGUSR2, which would type into whichever window had focus by the time the event arrived. The daemon now disowns the streaming session synchronously on stop so post-stop emissions are dropped instead of typed.
  • voxtype record toggle during streaming: the toggle command checked for "recording" state only, so toggling during a streaming session would send SIGUSR1 and start a second concurrent session instead of stopping the first. Both "recording" and "streaming" are now treated as active states.
  • Cohere model validator: updated to match HuggingFace Optimum filenames so voxtype setup model validates the Cohere variants correctly.
  • GTK4 OSD startup visibility: the OSD no longer flashes its chrome on the first frame at boot. Fix from Andre Silva.
  • Nix flake: the OSD binaries are now packaged in the flake. From Andre Silva.
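
The toggle fix in the second bullet boils down to a membership test. A minimal Python sketch, where the "idle" state name is hypothetical and the signal names follow the descriptions above (SIGUSR1 starts a session, SIGUSR2 stops one):

```python
ACTIVE_STATES = {"recording", "streaming"}


def toggle_signal(state):
    """Pick the signal a toggle should send for the daemon's current state."""
    return "SIGUSR2" if state in ACTIVE_STATES else "SIGUSR1"
```

Before the fix, the equivalent check covered "recording" only, so a streaming session was treated as idle and a second session was started.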

Experimental aarch64 binaries

Two new arm64 Linux binaries land in this release: aarch64-cpu (the Whisper engine, equivalent to the x86_64 avx2 variant) and aarch64-onnx (all ONNX engines: Parakeet, Moonshine, SenseVoice, Paraformer, Dolphin, Omnilingual, Cohere). Both are CPU-only. No CUDA or Vulkan on arm64 in this release, because the Jetson CUDA toolchain is awkward and there is no mainstream consumer arm64 hardware with Vulkan GPUs to support yet. Target hardware is Raspberry Pi 4/5, Ampere Altra servers, Snapdragon X laptops, and AWS Graviton instances. These binaries are marked experimental: nobody on the maintenance team owns arm64 hardware to smoke-test against, so users who run these and report back are the ones validating them in practice. File issues with hardware details if something breaks.

Manual install only in 0.7.2. The .deb / .rpm / AUR packages stay x86_64-only for this release because voxtype's installed-binary switching logic (in src/setup/gpu.rs) does not yet recognize arm64 variants. Download the binary directly from the GitHub release page and drop it into /usr/local/bin/voxtype:

curl -L https://github.com/peteonrails/voxtype/releases/download/v0.7.2/voxtype-0.7.2-linux-aarch64-cpu \
  -o /usr/local/bin/voxtype
chmod 755 /usr/local/bin/voxtype
voxtype --version

(Substitute aarch64-onnx for the ONNX-engine build.) The aarch64 binaries get full package-tooling integration in v0.7.3 alongside a chosen naming convention and arch=('x86_64' 'aarch64') in the PKGBUILDs.

Acknowledgments

  • Andre Silva for the OSD startup-visibility fix and the Nix flake OSD packaging.
  • Stephan Schuster for the notification-stacking patch (#345), including the libnotify hint research and a working diff.
  • Jean-Paul van Tillo for early streaming feedback that informed the partial-typing implementation.

Downloads

Full changelog, signed binaries, and SHA256 sums: v0.7.2 on GitHub.

v0.7.1: NixOS source build hotfix

Hotfix for v0.7.0 that fixes source builds on NixOS and lands two community contributions for the Nix flake. Binary package users (AUR, .deb, .rpm) get the OSD visibility fix and flake improvements; v0.7.0 binaries already worked correctly for them.

NixOS vulkan build no longer needs GTK3 system libs

v0.7.0 added tray-icon and rdev as top-level Cargo dependencies. Both crates are consumed only by macOS-specific code, but because they were declared as platform-agnostic, cargo resolved and built them on every platform. tray-icon dragged in libappindicator, gtk3-sys, glib-sys, and gdk-pixbuf-sys transitively.

Docker-based binary builds happened to install the GTK3 dev stack, so they worked by accident. The NixOS vulkan derivation only declares alsa-lib, openssl, and vulkan-loader, so cargo died on glib-sys when pkg-config could not find glib-2.0.pc.

Why use it: if you build from source on any non-Arch Linux distribution and do not have the full GTK3 development stack installed, v0.7.0 would not compile and v0.7.1 will. NixOS users can drop any glib or gdk-pixbuf workarounds from their flake derivations.

Reported by @bashfulrobot in Issue #352.

Nix flake: package OSD binaries

The flake now exposes osdNative and osdGtk4 as build targets, so NixOS users get the on-screen visualizer through the standard nix build path instead of needing to build from source manually.

nix build github:peteonrails/voxtype#packages.x86_64-linux.osdGtk4
nix build github:peteonrails/voxtype#packages.x86_64-linux.osdNative

Contributed by @andresilva in PR #354.

GTK4 OSD no longer stays visible on startup

The GTK4 visualizer's initial visibility flag did not match the real GTK state after window.present(), so the first idle check skipped the hide call and left the empty visualizer on screen until you started a recording. Subtle but visible if you launched voxtype-osd-gtk4 standalone before dictating for the first time.

Contributed by @andresilva in PR #355.

Full changelog and signed binaries: v0.7.1 on GitHub.

v0.7.0: Cohere, macOS, on-screen visualizer, configuration TUI

v0.7.0 is the largest release in the project's history. It adds Cohere Transcribe as the eighth ASR engine, brings full macOS support, ships a GTK4 on-screen visualizer that follows the swayosd convention, replaces the AMD GPU backend with MIGraphX, and introduces an interactive configuration TUI. 138 commits since v0.6.6.

Cohere Transcribe

The eighth transcription engine. Cohere's Transcribe model is a 1.6B-parameter multilingual ASR built on the Aya Expanse architecture. Voxtype ships int4 and fp16 quantizations downloadable via voxtype setup model.

Why use it: Best-in-class accuracy on accented English, code-switching between languages, and technical vocabulary. The int4 build is small enough to run on CPU; fp16 runs on GPU for sub-second latency.

config.toml
engine = "cohere"

[cohere]
model = "cohere-transcribe-q4f16"

On CUDA the encoder runs on GPU but the decoder is pinned to CPU pending an upstream ONNX Runtime fix. AMD MIGraphX users get full GPU acceleration end-to-end.

macOS support

Voxtype now runs on Apple Silicon. The release ships a signed .dmg with the daemon, a SwiftUI menubar app, and a setup wizard for accessibility permissions, model download, and the FN key hotkey.

Why use it: If you've been waiting for voxtype on a Mac, the wait is over. The macOS build uses Microsoft's prebuilt ONNX Runtime so all engines work, including Parakeet.

Install via Homebrew
brew install --cask voxtype

The macOS port was led by krystophny; thanks for the long arc of work to get it shipped.

On-screen visualizer

A new on-screen mic visualizer renders a live waveform aligned with the swayosd convention. Two frontends: GTK4 + gtk4-layer-shell, or native SCTK + wgpu + egui-wgpu. The daemon picks one based on [osd] frontend and spawns it automatically when the daemon starts.

Why use it: You know dictation is being captured without looking at Waybar. The waveform shows your voice level, fades when you stop talking, and disappears when transcription finishes.

config.toml
[osd]
enabled = true
frontend = "gtk4"          # or "native"
position = "bottom-center"
top_margin = 0.85          # swayosd-aligned by default

The daemon supervisor restarts the OSD on crash with exponential backoff so a misconfigured GTK theme or a Wayland reconnect doesn't lose the visualizer.
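
For readers unfamiliar with the pattern, exponential backoff doubles the wait between restart attempts up to a ceiling. The Python sketch below shows the delay schedule only; the base delay, factor, and cap are assumed values, not the daemon's actual settings.

```python
def backoff_delays(base_secs=0.5, factor=2.0, cap_secs=30.0, attempts=8):
    """Yield the delay to wait before each successive restart attempt."""
    delay = base_secs
    for _ in range(attempts):
        yield min(delay, cap_secs)
        delay *= factor
```

With these assumed defaults the schedule is 0.5, 1, 2, 4, 8, 16, 30, 30 seconds. The cap keeps a persistently failing OSD from being retried either too aggressively or too rarely.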

Configuration TUI

The new voxtype configure command opens an interactive terminal UI that edits every option in ~/.config/voxtype/config.toml for you. The editor preserves comments and unknown fields via toml_edit, validates the result against the daemon's parser before writing, and rolls back if a change would prevent startup.

Why use it: Every option is discoverable. The TUI shows live state, warns when a chosen engine's binary isn't installed, and offers to download missing models or swap binary variants without leaving the editor.

Search "Voxtype Configuration" in walker, fuzzel, rofi, KRunner, or GNOME Activities to launch it as a floating window. The launcher discovers an installed terminal emulator and applies a Hyprland windowrulev2 for centered floating placement.

CUDA 12 / CUDA 13 split

NVIDIA users now get two ONNX binaries: voxtype-onnx-cuda-12 (driver 525+, libcudart.so.12) and voxtype-onnx-cuda-13 (driver 580+, libcudart.so.13). The AUR post-install hook detects your CUDA runtime via ldconfig and points /usr/bin/voxtype at the matching binary automatically.

Why use it: Rolling-distro users on CUDA 13 stop hitting "forward compatibility was attempted on non supported HW" errors. LTS users on CUDA 12 keep working without manually pinning a driver.
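
The detection step can be approximated by parsing ldconfig -p output for the libcudart SONAME. The Python sketch below is an assumption about how such a hook might decide, not the hook itself. In particular, preferring CUDA 13 when both runtimes are present is a guess.

```python
import re

# Matches cache entries such as "libcudart.so.12 (libc6,x86-64) => /path/...".
SONAME = re.compile(r"^\s*libcudart\.so\.(\d+)\b", re.MULTILINE)


def pick_cuda_variant(ldconfig_output):
    """Return the binary name matching the installed CUDA runtime, or None."""
    majors = {int(major) for major in SONAME.findall(ldconfig_output)}
    for major in (13, 12):  # assumed preference order
        if major in majors:
            return f"voxtype-onnx-cuda-{major}"
    return None
```

On a real system the input would come from something like subprocess.run(["ldconfig", "-p"], capture_output=True, text=True).stdout. Returning None leaves room for a caller to fall back to a CPU binary.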

AMD: ROCm to MIGraphX

The AMD GPU execution provider changed. parakeet-rocm is now parakeet-migraphx, and voxtype-onnx-rocm is now voxtype-onnx-migraphx. The MIGraphX EP is built from source against ROCm 7.2.2, has better op coverage, and is the path nixpkgs is committing to.

Why use it: The legacy ROCm EP had upstream ORT compatibility issues that caused silent CPU fallback on Radeon RX 7000-class GPUs. MIGraphX runs Parakeet, Moonshine, SenseVoice, Cohere, and the rest of the ONNX stack with proper GPU acceleration.

Three things to know:

  • Requires ROCm 7.x. Older releases (5.x, 6.x) silently fall back to CPU.
  • First inference compiles the model graph (~30-60s on RX 7000-class). Subsequent runs are fast.
  • The voxtype-onnx-rocm name still works as a symlink for one release. It will be dropped in v0.8.0.

Filler-word filtering

Built-in filler-word removal, on by default. Strips "um", "uh", "like", "you know", "basically", and similar fillers from transcribed text before output.

Why use it: Cleaner dictation without an LLM cleanup step. Combine with post_process for full LLM grammar fixes when you want them.
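
Word-boundary matching is one way to remove fillers without damaging words that merely contain them (such as "drum" or "album"). The Python sketch below is illustrative, uses a three-item subset of the list above, and is not the filter Voxtype ships.

```python
import re

FILLERS = ("you know", "um", "uh")  # illustrative subset
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(filler) for filler in FILLERS) + r")\b,?\s*",
    re.IGNORECASE,
)


def strip_fillers(text):
    """Remove whole-word fillers, plus a trailing comma and spacing, from text."""
    cleaned = _PATTERN.sub("", text)
    # Collapse any doubled spaces left behind, then trim the ends.
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

The \b anchors are what stop "um" from matching inside "drum". A plain substring replace would turn that word into "dr".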

Eighth engine, eight binaries

Engines (8): Whisper, Parakeet, Moonshine, SenseVoice, Paraformer, Dolphin, Omnilingual, Cohere
Binaries (8): avx2, avx512, vulkan, onnx-avx2, onnx-avx512, onnx-cuda-12, onnx-cuda-13, onnx-migraphx, plus voxtype-osd and voxtype-osd-gtk4

The TUI's Engine section auto-swaps the binary when you pick an engine that the active variant doesn't support.

Other improvements

  • AppImage packaging. Three variants (CPU, Vulkan, ONNX) for distros without native packages.
  • Niri compositor support in the supported compositor list and the TUI hotkey suggestion logic.
  • Wrapper script for /usr/bin/voxtype. GPU and ONNX variants now go through a wrapper that sets the right argv[0] for ORT's provider .so lookup, fixing a class of silent CPU-fallback bugs.
  • Mouse support and live keymap overlay (?) in the TUI.
  • Configurable notification urgency per-event.
  • Live refresh of the General TUI section every 2 seconds while it's the active section.

Bug fixes

  • Status display now reports the real backend ("ONNX GPU (MIGraphX)") through the wrapper script. The Waybar tooltip matches reality.
  • TUI save no longer corrupts config files; save-on-exit prompt no longer infinite-loops on "discard".
  • gpu_isolation worker spawn moved off the async event loop so it doesn't stall hotkey detection during model load.
  • ALSA stderr silenced during cpal device enumeration; daemon startup no longer prints spurious "ALSA lib pcm" warnings.
  • Filler-word filter no longer strips legitimate words that contain filler substrings.
  • Notification urgency level is now actually applied (was always "normal").
  • TUI engine cycling correctly filters to engines the active binary supports.

Upgrading from 0.6.x

Most users on Arch (voxtype-bin), Debian (.deb), or Fedora (.rpm) will upgrade cleanly. Three things to know if you're a Parakeet user:

Binary renames. voxtype-parakeet becomes voxtype-onnx-avx2 (or avx512). voxtype-parakeet-cuda becomes voxtype-onnx-cuda-12 or -13. voxtype-parakeet-rocm becomes voxtype-onnx-migraphx. The AUR ships a voxtype-onnx-rocm compat symlink for one release. Custom user scripts hard-coding the old names will need updating.

MIGraphX requires ROCm 7.x. AMD users on ROCm 5.x or 6.x will silently fall back to CPU. Update to rocm-hip-runtime 7.x before upgrading.

CUDA 13 needs driver 580+. Rolling distros are usually fine. LTS users stay on cuda-12. The post-install hook picks the right binary automatically based on ldconfig's libcudart SONAME.

Contributors

This release built on work by Peter Jackson, krystophny (macOS port), Antonio Zugaldia, and André Silva. Issue reports and design feedback from the Omarchy and Hyprland communities shaped the OSD positioning and the CUDA-12/13 split. Thanks to everyone who tested rc1, rc2, and rc3.

Full changelog and signed binaries: v0.7.0 on GitHub.

v0.7.0-rc2: Cohere Transcribe, macOS port, MIGraphX, configure TUI

v0.7.0-rc2 is a community pre-release. It folds in three months of work across new ASR engines, an AMD GPU acceleration path, native macOS support, and a new interactive configuration TUI. Please install from the GitHub release and report issues at github.com/peteonrails/voxtype/issues.

Cohere Transcribe ASR engine

The Cohere Labs Transcribe model is now wired in as a first-class engine. It currently sits at #1 on the Open ASR Leaderboard. Whisper-style task tokens give it punctuation, capitalization, and inverse text normalization out of the box.

This release ships four HuggingFace Optimum ONNX quantization variants. The default (cohere-transcribe-q4f16) is 1.5 GB and runs at 9-11x realtime on a Zen 4 CPU.

Available models
cohere-transcribe-q4f16    1.5 GB   9-11x realtime   (default)
cohere-transcribe-q4       2.0 GB   9-11x realtime
cohere-transcribe-int8     2.9 GB   2-3x realtime
cohere-transcribe-fp16     3.9 GB   7-8x realtime

Why use it: Cohere is the new top-of-the-leaderboard ASR model and we ship a 1.5 GB quantized variant that runs faster than realtime on commodity hardware. CUDA acceleration on the encoder gives a meaningful boost on longer dictations.

Native macOS port

Apple Silicon support via Microsoft ONNX Runtime. The macOS package includes:

  • SwiftUI VoxtypeMenubar status bar app
  • VoxtypeSetup wizard with sidebar settings layout
  • Homebrew Cask + tap for prebuilt distribution
  • FN / Globe key hotkey support
  • App-bundle install path (supersedes the older launchd flow)

Linux remains the primary platform. macOS arm64 ships as an additional artifact.

Credit: the macOS port comes from krystophny, who carried it from a personal fork to a release-ready feature branch.

MIGraphX for AMD GPUs

MIGraphX 7.2 replaces the older ROCm execution-provider path for Parakeet on AMD GPUs. Validated on Radeon RX 7000-series cards. The compatibility symlink voxtype-onnx-rocm ships through this release for users with scripts that reference the old name; it goes away in v0.8.0.

Note: MIGraphX 7.2 requires ROCm 7.x. Users on ROCm 5.x or 6.x will silently fall back to CPU. Check rocm-smi before upgrading if you depend on AMD GPU acceleration.

Two CUDA binaries

The previous single CUDA binary is now split into voxtype-onnx-cuda-12 (NVIDIA driver 525+) and voxtype-onnx-cuda-13 (NVIDIA driver 580+). The AUR PKGBUILD and voxtype setup gpu --enable symlink voxtype-onnx-cuda to whichever variant matches your runtime CUDA.

Why two: CUDA 13's runtime requires a newer NVIDIA driver than CUDA 12 does. Shipping both gives users on older drivers a working binary while still offering the latest CUDA toolchain to users on current drivers.

Interactive configure TUI

The new voxtype configure command opens an interactive terminal UI for editing every option in config.toml. It surfaces in Walker, fuzzel, and rofi as "Voxtype Configuration" so you can launch it from your usual app launcher.

Behaviors that landed in this release:

  • Engine cycle filters to engines for which an installed binary actually exists. No more saving an engine that the active binary cannot load.
  • Save auto-swaps /usr/bin/voxtype to a binary that supports the chosen engine (via pkexec) and auto-downloads missing models via voxtype setup model.
  • Save-on-exit prompt with Save / Discard / Cancel options so field edits are not lost on q.
  • D on the General section starts or restarts the user systemd unit so configuration changes take effect without leaving the TUI.

Why use it: No more hand-editing TOML. Pick an engine, pick a model, save. The TUI handles the binary swap and download orchestration so you do not end up in a broken state.

swayosd-aligned floating OSD

The optional waveform OSD (voxtype-osd-gtk4) now positions itself using fractional top_margin semantics that match swayosd. Default 0.85 puts the OSD in the same vertical band where your volume, brightness, and media-key feedback already render. Works across multi-monitor setups with different screen heights without per-monitor pixel tweaks.
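
Fractional positioning is what makes one setting work across monitors of different heights. A small Python sketch of the arithmetic, assuming top_margin is the fraction of the monitor height measured from the top edge. The function name is illustrative, and the exact anchor point within the OSD window is not specified here.

```python
def osd_top_px(monitor_height_px, top_margin=0.85):
    """Convert a fractional top margin into a pixel offset for one monitor."""
    if not 0.0 <= top_margin <= 1.0:
        raise ValueError("top_margin is a fraction between 0 and 1")
    return round(monitor_height_px * top_margin)
```

At the default 0.85, a 1080-pixel-tall monitor places the OSD 918 pixels down and a 1440-pixel-tall monitor 1224 pixels down, so the OSD sits in the same relative band on both.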

Other changes

  • Cohere config table moved to the HuggingFace Optimum file layout. Pre-rc2 Cohere installs need to re-download via voxtype setup model.
  • The default Cohere model is now cohere-transcribe-q4f16 (was cohere-transcribe-int8).
  • Status command now reports the real backend (e.g. ONNX GPU (MIGraphX)) through the wrapper-aware inventory machinery, rather than always showing CPU (native).
  • Roadmap reviewed: voxtype-models CDN and streaming transcription confirmed for v0.7.1.

Caveats for testers

  • The CUDA companion .so files (libonnxruntime_providers_*.so) must be installed in the same directory as the binary. The AUR voxtype-bin package handles this; manual installs need the matching layout under /usr/lib/voxtype/cuda-12|cuda-13|migraphx/.
  • Cohere's CUDA decoder still falls through to CPU pending an ONNX Runtime kernel update. Encoder-only acceleration on this release.
  • Parakeet users upgrading from 0.6.x: please read the GitHub release notes for the binary-rename and ROCm-version details.

Full changelog and signed binaries: v0.7.0-rc2 on GitHub.

v0.6.6: Media Pause, Audio Feedback, KDE Support, 7 Bug Fixes

Voxtype 0.6.6 adds media player pausing during recording, audio feedback when transcription finishes, KDE Plasma compositor keybinding documentation, and finer control over post-processing behavior. This release also fixes seven bugs across output drivers, text processing, and the remote backend.

Media Player Pause

Voxtype can now pause your media player when you start recording and resume it when you stop. This prevents music or podcasts from bleeding into your transcription. It uses playerctl, which supports Spotify, Firefox, VLC, and any MPRIS-compatible player.

Why use it: If you listen to music while working and use push-to-talk dictation, you no longer need to manually pause before speaking.

config.toml
[audio]
pause_media = true

Also available as --pause-media on the CLI or VOXTYPE_PAUSE_MEDIA=true as an environment variable.

Audio Feedback

A new sound event plays when transcription finishes, so you know your text has been output without looking at the screen. Three built-in themes are available: default, subtle, and mechanical.

Why use it: When transcription takes a few seconds (especially on CPU), audio feedback tells you the text is ready. This is particularly helpful when dictating into a window that isn't visible.

config.toml
[audio]
feedback_theme = "default"    # default, subtle, or mechanical

The existing start/stop recording sounds continue to work as before. The new TranscriptionComplete event is an addition, not a replacement.

KDE Plasma Keybindings

KDE Plasma on Wayland is now documented as a supported compositor for keybinding setup. The README, User Manual, and Configuration guide include step-by-step instructions for configuring push-to-talk via KWin's custom shortcuts.

Post-Process Control

Two new options for the post-processing pipeline give you more control over LLM cleanup output.

trim strips leading and trailing whitespace from post-processed text. This is useful when LLM commands return text with extra newlines or spaces.

fallback_on_empty controls what happens when your post-process command returns empty output. When enabled (the default), voxtype falls back to the original transcription. When disabled, empty output is treated as intentional, which is useful for commands that deliberately filter out certain transcriptions.

config.toml
[output]
post_process_command = "ollama run llama3.2 'Clean up this dictation:'"
post_process_trim = true
post_process_fallback_on_empty = true
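
How the two options might interact can be sketched in a few lines. This is an illustration, not voxtype's implementation: the function name is hypothetical, and both the order of operations (trim first, then the empty check) and the treatment of whitespace-only output as empty are assumptions.

```python
def finalize_post_process(original: str, command_output: str,
                          trim: bool, fallback_on_empty: bool = True) -> str:
    """Apply the trim and fallback_on_empty options to a command's output."""
    text = command_output.strip() if trim else command_output
    # Assumption: whitespace-only output counts as empty for the fallback check.
    if fallback_on_empty and not text.strip():
        return original
    return text
```

With fallback_on_empty disabled, a command that deliberately prints nothing suppresses the transcription instead of restoring it.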

Meeting Mode Post-Processing

Meeting mode now supports post-processing with dictation context. When a post-process command is configured, meeting transcription chunks are processed through it before being added to the transcript. The command receives the raw transcription on stdin along with meeting context (title, speaker, timestamp) via environment variables.

Why use it: If you use LLM cleanup for regular dictation, this extends that pipeline to meeting mode. The additional context variables let the LLM make better decisions about formatting and cleanup.

CLI Help Reorganization

The CLI help output has been reorganized for progressive disclosure. Options are grouped by section (Audio, Whisper, Output, VAD, Meeting) with the most commonly used options appearing first. The --help output is shorter and more scannable, while subcommand help still shows full details.

Bug Fixes

  • ydotool socket detection: Fedora puts the ydotool socket at /tmp/.ydotool_socket instead of the standard path. Voxtype now searches common locations automatically. (#306)
  • Duplicate desktop notifications: The ydotool, dotool, clipboard, and xclip output drivers were sending their own notifications in addition to the daemon's. Notifications are now handled consistently by the daemon. (#268)
  • Remote backend initial_prompt: The remote transcription backend now forwards initial_prompt as the prompt field in the multipart form, matching the OpenAI API spec. (#278)
  • Text replacement ordering: Text replacements are now applied before spoken punctuation conversion. Patterns like "slash pr" now work correctly. (#172)
  • KDE Plasma paste support: eitype added to the Ctrl+V simulation chain in paste mode for KDE Plasma Wayland. (#259)
  • X11 clipboard fallback: xclip added as a fallback in clipboard mode for X11 sessions where wl-copy isn't available. (#256)
  • CJK first character drop in Discord: New wtype_shift_prefix option works around a Discord bug where the first CJK character is dropped when using wtype. (#208)
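
The text replacement ordering fix (#172) is easiest to see with a toy pipeline. The tables and helper names below are hypothetical, and the matching rules are a simplification; the point is only that a replacement such as "slash pr" can never match once "slash" has already been turned into "/".

```python
import re

# Hypothetical tables; in practice these would come from the user's config.
REPLACEMENTS = {"slash pr": "/pr"}
SPOKEN_PUNCTUATION = {"slash": "/", "period": ".", "comma": ","}

def apply_table(text: str, table: dict) -> str:
    """Replace each whole-word key with its value, ignoring case."""
    for spoken, written in table.items():
        text = re.sub(rf"\b{re.escape(spoken)}\b", written, text, flags=re.IGNORECASE)
    return text

def replacements_first(text: str) -> str:
    # Order after the fix: user replacements, then spoken punctuation.
    return apply_table(apply_table(text, REPLACEMENTS), SPOKEN_PUNCTUATION)

def punctuation_first(text: str) -> str:
    # Order before the fix: "slash" is consumed before "slash pr" can match.
    return apply_table(apply_table(text, SPOKEN_PUNCTUATION), REPLACEMENTS)
```

For the input "run slash pr now", replacements_first yields "run /pr now", while punctuation_first leaves "run / pr now".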

Contributors

Thanks to KaiStarkk for post-process trim and fallback_on_empty, graysky2 for fixing flash attention config wiring, and materemias for meeting mode post-processing with dictation context.

v0.6.5: Eager Processing, Multi-GPU Device Selection, Parakeet TDT English-Only Models

Voxtype 0.6.5 adds real-time transcription during recording, fixes a performance regression on multi-GPU systems, and brings new English-only Parakeet models. If you have both an integrated and discrete GPU, the gpu_device option gives you direct control over which one runs inference.

Eager Processing

Voxtype can now transcribe audio chunks while you are still recording. Instead of waiting until you release the hotkey and then processing all the audio at once, eager mode splits your recording into overlapping chunks and transcribes them in parallel. When you stop recording, the partial results are combined into the final output.

Why use it: On slower CPUs where transcription takes several seconds, eager processing reduces the wait after you stop speaking. On fast systems with GPU acceleration, the difference is negligible.

config.toml
[whisper]
eager_processing = true
eager_chunk_secs = 5.0       # Chunk size (seconds)
eager_overlap_secs = 0.5     # Overlap to avoid losing words at boundaries

Per-recording overrides work too: voxtype --eager-processing record start
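
As a rough illustration of how overlapping chunks line up, the sketch below computes chunk boundaries from the two config values. The function name is hypothetical, and how voxtype merges the overlapping partial transcripts is not shown here.

```python
def chunk_spans(total_secs: float, chunk_secs: float = 5.0, overlap_secs: float = 0.5):
    """Return (start, end) spans where each chunk overlaps the previous one."""
    if not 0 <= overlap_secs < chunk_secs:
        raise ValueError("overlap must be non-negative and smaller than the chunk size")
    stride = chunk_secs - overlap_secs
    spans = []
    start = 0.0
    while start < total_secs:
        end = min(start + chunk_secs, total_secs)
        spans.append((start, end))
        if end >= total_secs:
            break
        start += stride
    return spans
```

With the defaults above, a 12-second recording produces the spans (0.0, 5.0), (4.5, 9.5), and (9.0, 12.0), so each boundary is covered by two chunks.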

Multi-GPU Device Selection

whisper-rs 0.16.0 began enumerating integrated GPUs via Vulkan, which gave them priority over discrete GPUs. On systems with both (common on laptops and hybrid desktops), transcription silently ran on the integrated GPU, roughly 3x slower, with no obvious indication of the cause.

The new gpu_device option lets you specify the exact GPU device index for whisper.cpp:

config.toml
[whisper]
gpu_device = 1    # Use device index 1 (your discrete GPU)

Also available as --gpu-device 1 on the CLI or VOXTYPE_GPU_DEVICE=1 as an environment variable. Use voxtype setup gpu or vulkaninfo --summary to find your device indices.

This complements the existing VOXTYPE_VULKAN_DEVICE env var, which filters by vendor name ("amd", "nvidia", "intel"). Use gpu_device when you need precise index control, especially with same-vendor multi-GPU setups.

Parakeet TDT English-Only Models

Two new Parakeet model options for English-only transcription:

  • parakeet-tdt-0.6b-v2 (2.4 GB): Full-precision TDT model with best English accuracy
  • parakeet-tdt-0.6b-v2-int8 (640 MB): Quantized variant, about 20% faster and 75% smaller at the cost of a minor accuracy loss

Select them via voxtype setup model. These are alternatives to the existing v3 models, which are multilingual and produce better punctuation.

Contributors

Thanks to Sami Jawhar for wiring eager processing into the daemon, Rinor Maloku for the Parakeet TDT English-only model definitions, and jan Lemata (DuskyElf) for the multi-GPU device selection.

v0.6.4: Clang 22 Compatibility, Smart Auto-Submit, CUDA Safety

Voxtype 0.6.4 fixes source builds on systems with clang 22, adds voice-triggered auto-submit, and prevents CUDA version mismatch segfaults. If you build voxtype from source on Arch Linux and recently updated clang, this release unblocks you.

Clang 22 / Bindgen Compatibility

LLVM 22 changed how it represents certain C struct types in the AST, which broke bindgen 0.71.x during the whisper-rs build. The build would fail with type errors in the generated FFI bindings. This release bumps whisper-rs to 0.16.0, which includes bindgen 0.72.1 with the fix.

Why it matters: Arch Linux and other rolling distros ship clang 22 now. Without this fix, makepkg for the voxtype AUR package fails during compilation.

Smart Auto-Submit

You can now say "submit", "send", or "enter" at the end of your dictation to automatically press Enter after the text is output. This is useful for chat apps, terminal prompts, and search boxes where you want to dictate and send in one step.

Why use it: Dictate a Slack message and say "send" at the end. The text is typed and Enter is pressed, all from a single recording.

config.toml
[text]
smart_auto_submit = true

The trigger word is stripped from the output. Trigger words are case-insensitive and must appear as the last word in the transcription. Smart auto-submit also works alongside the --auto-submit per-recording override.
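
A minimal sketch of the trigger-word rule described above. The function name is hypothetical, and ignoring punctuation attached to the last word (for example "Send.") is an assumption of this sketch rather than documented behavior.

```python
import string

TRIGGER_WORDS = {"submit", "send", "enter"}

def split_auto_submit(text: str):
    """Return (text_without_trigger, press_enter) for one transcription."""
    words = text.split()
    if not words:
        return text, False
    # Assumption: punctuation around the final word is ignored when matching.
    last = words[-1].strip(string.punctuation).lower()
    if last in TRIGGER_WORDS:
        return " ".join(words[:-1]), True
    return text, False
```

A trigger word in the middle of a sentence ("please send the report") is left untouched because only the final word is checked.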

CUDA Version Probing

ONNX engine binaries ship with CUDA 12.x bundled. If your system has an older CUDA version (or none at all), the previous behavior was a segfault deep inside ONNX Runtime initialization. Voxtype now probes the system CUDA version via dlopen/dlsym before attempting to load ONNX Runtime with CUDA. If the version doesn't match, it falls back to CPU with a clear warning instead of crashing.

Why it matters: Users with NVIDIA GPUs but older CUDA drivers no longer get unexplained crashes when selecting an ONNX engine.
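
The probe-then-fall-back idea can be sketched with ctypes, which wraps dlopen/dlsym. Which library and symbol voxtype actually probes is not stated here, so the use of libcuda.so.1 and cuDriverGetVersion is an assumption, as are the function names and the "major version at least 12" rule.

```python
import ctypes

def decode_cuda_version(raw: int):
    """CUDA encodes versions as 1000 * major + 10 * minor (12040 means 12.4)."""
    return raw // 1000, (raw % 1000) // 10

def probe_cuda_driver_version(lib_name: str = "libcuda.so.1"):
    """Return (major, minor) reported by the driver library, or None if unavailable."""
    try:
        lib = ctypes.CDLL(lib_name)            # dlopen
        get_version = lib.cuDriverGetVersion   # dlsym
    except (OSError, AttributeError):
        return None
    get_version.argtypes = [ctypes.POINTER(ctypes.c_int)]
    get_version.restype = ctypes.c_int
    raw = ctypes.c_int(0)
    if get_version(ctypes.byref(raw)) != 0:    # 0 is CUDA_SUCCESS
        return None
    return decode_cuda_version(raw.value)

def pick_execution_provider(required_major: int = 12) -> str:
    """Choose "cuda" only when the probed version is new enough (assumed rule)."""
    version = probe_cuda_driver_version()
    if version is None or version[0] < required_major:
        return "cpu"
    return "cuda"
```

Because the probe happens before any ONNX Runtime CUDA code is loaded, a missing or outdated driver leads to a CPU fallback rather than a crash.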

Bug Fixes

  • Status command ONNX detection: voxtype status now correctly reports the model name and backend when using ONNX engines (Parakeet, SenseVoice, etc.) instead of showing the Whisper model path.
  • Meeting export directory handling: When --output points to a directory instead of a file, voxtype now auto-generates a timestamped filename instead of failing.
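
The directory handling in the meeting export fix could look roughly like this. The function name and the "meeting-<timestamp>" filename pattern are hypothetical; the actual naming scheme is not documented in this post.

```python
from datetime import datetime
from pathlib import Path

def resolve_export_path(output: str, extension: str, now: datetime = None) -> Path:
    """Return a file path, generating a timestamped name when output is a directory."""
    target = Path(output)
    if target.is_dir():
        stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
        # Hypothetical filename pattern, for illustration only.
        return target / f"meeting-{stamp}.{extension}"
    return target
```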

Other Changes

  • NixOS Home Manager module now merges user settings into defaults instead of replacing them.
  • Updated Parakeet model documentation with correct download links and file listings.

Contributors

Thanks to Michael Siebert for smart auto-submit, Sami Jawhar for wiring eager input processing into the daemon, materemias for meeting export improvements, Joakim Repomaa for the NixOS HM fix, benj9000 for Parakeet documentation updates, and Syed Fazil Basheer for diagnosing the ONNX status bug.

v0.6.3: Clipboard Restoration, Full Config Override, NixOS Improvements

Voxtype 0.6.3 restores your clipboard after paste mode output, closes the config override gap so every option is configurable via CLI flags and environment variables, and includes several NixOS packaging improvements.

Clipboard Restoration

When using paste mode, voxtype copies transcribed text to the clipboard and simulates Ctrl+V. Previously, this replaced whatever you had on your clipboard. Now voxtype can save your clipboard contents before pasting and restore them afterward.

Why use it: You can dictate text without losing the link, code snippet, or image you copied earlier.

Clipboard restoration supports both Wayland (wl-paste/wl-copy) and X11 (xclip), and preserves binary content and MIME types on Wayland. A 100 MB size cap prevents memory issues with large clipboard contents.

config.toml
[output]
mode = "paste"
restore_clipboard = true
restore_clipboard_delay_ms = 200

The delay gives applications time to process the paste before the clipboard is restored. The default 200ms works well for most apps including Electron-based editors.

Every Option Configurable Everywhere

Project principle #5 says every option should be configurable via CLI flag, environment variable, or config file. This release closes that gap. The daemon now accepts 30+ CLI flags organized by section (Hotkey, Whisper, Audio, Output, VAD), and every config option has a corresponding VOXTYPE_* environment variable.

CLI flags (examples)
# Override model and language via CLI
voxtype --model large-v3 --language en daemon

# Override via environment variables
VOXTYPE_MODEL=large-v3 VOXTYPE_LANGUAGE=en voxtype daemon
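
When the same option is set in more than one place, something has to win. The sketch below assumes the common precedence of CLI flag over environment variable over config file; that order, the function name, and the simple VOXTYPE_ plus upper-cased name mapping are assumptions, not documented behavior.

```python
import os

def resolve_option(name: str, cli_args: dict, config: dict, env=None):
    """Look up one option: CLI flag first, then VOXTYPE_* env var, then config file."""
    env = os.environ if env is None else env
    if cli_args.get(name) is not None:
        return cli_args[name]
    env_key = "VOXTYPE_" + name.upper()
    if env_key in env:
        return env[env_key]
    return config.get(name)
```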

The record start and record toggle commands also gained per-recording overrides for --auto-submit and --shift-enter-newlines, so compositor keybindings can toggle these per-dictation.

Hyprland example
# Normal dictation
bind = , F8, exec, voxtype record toggle

# Dictation that presses Enter after output
bind = SHIFT, F8, exec, voxtype record toggle --auto-submit

NixOS Packaging

Several improvements for NixOS users:

  • ONNX packages now include all six engines (Parakeet, Moonshine, SenseVoice, Paraformer, Dolphin, Omnilingual) instead of just Parakeet. The parakeet package names still work as aliases.
  • flake.nix derives version from Cargo.toml instead of a hard-coded string, so voxtype --version stays in sync.
  • dotool added to Nix runtime dependencies.
  • LIBCLANG_PATH set correctly for the devshell.

Bug Fixes

  • Post-process EPIPE errors: Commands that don't read stdin (like echo or head -1) could exit before voxtype finished writing to the pipe, causing a spurious fallback to the original text. The command's exit code and stdout now determine success, not whether it consumed all of stdin. This also fixed a deterministic test failure on aarch64 that was blocking NixOS packaging.
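
The principle behind the EPIPE fix, judging a command by its exit code and stdout rather than by whether it read all of stdin, can be sketched as follows. The helper name and the demo command are hypothetical; the sketch relies on Python's subprocess module tolerating a child that closes stdin early.

```python
import subprocess
import sys

# Demo command that never reads stdin, similar in spirit to `echo`.
IGNORES_STDIN = [sys.executable, "-c", "print('fixed')"]

def run_post_process(command, text: str, timeout: float = 30):
    """Return the command's stdout, or None if it failed or printed nothing."""
    result = subprocess.run(
        command,
        input=text,           # a broken pipe while writing stdin is not treated as failure
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0 or not result.stdout:
        return None           # caller falls back to the original transcription
    return result.stdout
```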

Contributors

Thanks to grota for clipboard restoration, digunix for the flake.nix version fix, ekisu for dotool in Nix deps, and DuskyElf for the devshell LIBCLANG_PATH fix and NixOS packaging co-ownership.

v0.6.2: Echo Cancellation and Meeting Bug Fixes

Voxtype 0.6.2 adds neural echo cancellation for meeting mode, rewrites loopback audio capture, and fixes three meeting bugs. If you use meeting mode with loopback enabled, remote participants' audio no longer bleeds into your mic transcript.

GTCRN Echo Cancellation

When meeting mode captures both your microphone and system audio (for remote participant transcription), the remote audio bleeds through your speakers into your mic. Previous approaches using signal subtraction and energy gating didn't work because the mic and loopback streams aren't sample-aligned.

This release uses GTCRN (Grouped Temporal Convolutional Recurrent Network), a 48K-parameter speech enhancement model that processes mic audio frame-by-frame through an STFT. It removes background noise and speaker bleed-through while preserving your voice. The model is 523 KB and has a real-time factor of about 0.02 on CPU, so processing 30 seconds of audio takes roughly 0.6 seconds.

Why use it: If you're in a video call and the other person's voice is showing up in your "You" segments, this fixes it. It's enabled by default when loopback capture is active.

The GTCRN model downloads automatically the first time you start a meeting. To disable it (if you already have system-level echo cancellation via PipeWire):

config.toml
[meeting.audio]
echo_cancel = "disabled"

Dual Audio Capture via parec

Loopback capture has been rewritten to use parec (PulseAudio recording client) instead of cpal. PipeWire's monitor sources aren't visible through ALSA, which is what cpal uses. The parec approach works with both PipeWire and PulseAudio and can access any monitor source listed by pactl list short sources.

Mic and loopback audio now go into separate buffers, so they're transcribed independently. Mic chunks pass through GTCRN before transcription. A phrase-level dedup pass strips any residual echoed words that survive enhancement.

Bug Fixes

  • Meeting stop/pause/resume not working: The daemon's event loop starved the meeting control channel when audio processing was active. Meeting commands now take priority.
  • Orphaned meetings on daemon restart: If the daemon crashed during a meeting, the meeting state file was never cleaned up, preventing new meetings from starting. The daemon now cleans up stale meeting state on startup.
  • ml-diarization compilation errors: Fixed type mismatches after the ort 2.0.0-rc.11 API change.

v0.6.0: Five New Transcription Engines, Meeting Mode

Voxtype 0.6.0 is a major release. Five new ONNX-based transcription engines bring support for Chinese, Japanese, Korean, Cantonese, and 1600+ additional languages. Meeting mode adds continuous transcription for meetings with speaker attribution, export to multiple formats, and AI-generated summaries. The ONNX binary variants now ship with every ONNX engine included.

Five New Transcription Engines

Voxtype previously offered two transcription engines: Whisper and Parakeet. This release adds five more, all running locally via ONNX Runtime. Each engine has different strengths and language coverage.

Engine | Languages | Best For
SenseVoice | Chinese, English, Japanese, Korean, Cantonese | CJK transcription with auto-detection
Paraformer | Chinese + English (bilingual), Chinese + Cantonese + English (trilingual) | Mixed Chinese/English speech
Dolphin | 40 languages + 22 Chinese dialects | Chinese dialects and Eastern languages
Omnilingual | 1600+ languages | Low-resource and rare languages
Moonshine | English (+ Japanese, Mandarin, Korean, Arabic) | Fast CPU inference

Why use them: If you speak a CJK language and want a model trained specifically for it, SenseVoice and Paraformer will outperform Whisper on that task. If you need a rare language that Whisper doesn't cover well, Omnilingual supports over 1600 languages via Meta's MMS model. Dolphin covers 40 languages plus 22 Chinese dialects, which is useful if your dialect isn't well-served by general-purpose models.

All five engines are CTC-based, meaning they run in a single forward pass with no autoregressive decoder loop. In practice, this makes them fast on CPU.
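
For readers unfamiliar with CTC, the decoding step after that single forward pass is simple: take the most likely label per frame, collapse repeats, and drop the blank symbol. The sketch below shows standard greedy CTC decoding; the token IDs and blank index are made up, and whether voxtype's shared decoder is greedy is an assumption.

```python
def ctc_greedy_decode(frame_ids, blank_id: int = 0):
    """Collapse consecutive repeated labels, then remove blanks."""
    decoded = []
    previous = None
    for token in frame_ids:
        # Emit a token only when it differs from the previous frame and is not blank.
        if token != previous and token != blank_id:
            decoded.append(token)
        previous = token
    return decoded
```

A blank between two identical labels is what allows a genuinely repeated token: frames [0, 3, 3, 0, 3, 5, 5, 0] decode to [3, 3, 5].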

config.toml
# The three examples below are alternatives; set only one engine per config file.

# SenseVoice for CJK + English
engine = "sensevoice"

[sensevoice]
model = "SenseVoiceSmall"
language = "auto"    # auto, zh, en, ja, ko, yue

# Paraformer for bilingual Chinese/English
engine = "paraformer"

[paraformer]
model = "paraformer-zh"

# Omnilingual for 1600+ languages
engine = "omnilingual"

[omnilingual]
model = "mms-1b-all"

Download models with voxtype setup model and select from the interactive menu. The new engines share infrastructure with Parakeet and Moonshine: shared Fbank feature extraction, shared CTC decoding, and shared ONNX Runtime session management.

All ONNX Engines in Every ONNX Binary

Previously, the ONNX release binaries only included Parakeet. Starting with v0.6.0, all four ONNX binary variants (onnx-avx2, onnx-avx512, onnx-cuda, onnx-rocm) include every ONNX engine: Parakeet, Moonshine, SenseVoice, Paraformer, Dolphin, and Omnilingual. The three Whisper-only binaries (avx2, avx512, vulkan) remain Whisper-only. Switch engines by changing a single config line.

The release binary naming also changed from voxtype-*-parakeet-* to voxtype-*-onnx-* to reflect this broader scope.

Meeting Mode

Meeting mode is a new way to use Voxtype: continuous transcription for meetings, calls, and lectures. Instead of push-to-talk, it records continuously in chunks and transcribes each chunk as it completes.

# Start a meeting
voxtype meeting start --title "Weekly standup"

# Pause and resume
voxtype meeting pause
voxtype meeting resume

# Stop when done
voxtype meeting stop

# Export the transcript
voxtype meeting export latest --format markdown
voxtype meeting export latest --format srt

Why use it: Push-to-talk is great for dictation, but meetings need continuous recording. Meeting mode handles that with automatic chunking (default 30-second chunks), so memory usage stays bounded even for long sessions up to 3 hours.

Speaker Attribution

Meeting mode can identify who is speaking. Two approaches are available:

  • Simple attribution: Uses dual audio capture (microphone + system loopback) to label segments as "You" or "Remote". Good for 1-on-1 calls where you just need to distinguish yourself from the other person.
  • ML diarization: Uses ONNX-based speaker embeddings to cluster speech by speaker and assign labels like SPEAKER_00, SPEAKER_01. Works for multi-person meetings. You can rename speakers after the fact with voxtype meeting label.

# Label auto-generated speaker IDs with real names
voxtype meeting label latest SPEAKER_00 "Alice"
voxtype meeting label latest SPEAKER_01 "Bob"

Meeting Export Formats

Meeting transcripts can be exported in five formats: plain text, Markdown, JSON, SRT subtitles, and VTT subtitles. The subtitle formats include timestamps, so you can use them with video recordings of the same meeting.

voxtype meeting export latest --format text
voxtype meeting export latest --format markdown
voxtype meeting export latest --format json
voxtype meeting export latest --format srt
voxtype meeting export latest --format vtt
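
To show what the timestamped subtitle output involves, here is a small sketch that renders segments as SRT. The segment tuple shape, the "Speaker: text" line layout, and the function names are hypothetical; only the SRT block structure itself (index, "HH:MM:SS,mmm --> HH:MM:SS,mmm", text, blank line) is standard.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{millis:03}"

def to_srt(segments) -> str:
    """segments: iterable of (start_secs, end_secs, speaker, text) tuples (assumed shape)."""
    blocks = []
    for index, (start, end, speaker, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{speaker}: {text}\n"
        )
    return "\n".join(blocks)
```

WebVTT differs mainly in starting with a WEBVTT header line and using a period instead of a comma before the milliseconds.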

AI Meeting Summaries

After a meeting, you can generate a summary with key points, action items, and decisions using a local LLM via Ollama or a remote API endpoint.

# Generate a summary using Ollama
voxtype meeting summarize latest

# Output as JSON for programmatic use
voxtype meeting summarize latest --format json

This requires Ollama running locally or a configured remote summarization endpoint. The summary extracts action items, key decisions, and a concise overview from the full transcript.

Setup ONNX Replaces Setup Parakeet

The voxtype setup parakeet command has been renamed to voxtype setup onnx to reflect that it now manages all ONNX-based engines, not just Parakeet. The old setup parakeet command still works as a hidden alias, so existing scripts and muscle memory won't break.

# New name
voxtype setup onnx

# Old name still works
voxtype setup parakeet

Other Changes

  • Shared Fbank feature extraction and CTC decoder modules reduce code duplication across ONNX engines
  • Multi-engine transcription smoke tests for automated regression testing
  • Fix Dolphin and Paraformer transcription backends after initial implementation
  • Fix clippy warnings across the codebase

v0.5.6: Moonshine Backend, Voice Activity Detection, Eager Transcription

Voxtype 0.5.6 adds the Moonshine transcription backend for fast CPU inference, voice activity detection to filter silence before transcription, and eager input processing that starts transcribing while you're still recording. It also fixes several regressions from the 0.5.x series.

Moonshine Transcription Backend (Experimental)

A third transcription engine is now available alongside Whisper and Parakeet. Moonshine is an encoder-decoder transformer optimized for fast CPU inference via ONNX Runtime.

On a Ryzen 9 9900X3D (CPU-only), Moonshine transcribes 4 seconds of audio in 0.09 seconds. Whisper large-v3-turbo takes 17.7 seconds for the same recording. Models come in two sizes: tiny (27M params, 100MB) and base (61M params, 237MB). English models are MIT-licensed, and multilingual models are available for Japanese, Mandarin, Korean, and Arabic.

Why use it: Choose Moonshine if transcription speed matters more than accuracy, especially on CPU-only setups. Moonshine outputs lowercase text without punctuation, so pair it with voxtype's spoken punctuation feature or a post-processing command if you need formatted output.

config.toml
engine = "moonshine"

[moonshine]
model = "base"

Download a model with voxtype setup model and select a Moonshine model from the interactive menu. Requires a Moonshine-enabled binary. See the Moonshine documentation for details.

Voice Activity Detection (VAD)

Whisper has a well-known problem: feed it silence, and it hallucinates. You'll get phantom words, repeated phrases, or garbage text from a recording that was mostly dead air. VAD solves this by detecting speech in your audio and stripping out silence before it reaches the transcription engine.

Why use it: If you sometimes press your hotkey and pause before speaking, or record in a noisy environment where Whisper picks up phantom words, VAD will clean that up. It also speeds up transcription by sending less audio to process.

config.toml
[vad]
enabled = true
backend = "auto"     # auto, whisper, or energy
threshold = 0.5      # 0.0 to 1.0, higher = stricter

Two backends are available. Whisper VAD uses the Silero neural network model via whisper-rs for the best accuracy. Energy is a simple volume threshold that works without any extra model downloads. The auto backend picks Whisper VAD when using Whisper for transcription, and Energy when using Parakeet or Moonshine.
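
The Energy backend's "simple volume threshold" can be pictured as a per-frame RMS check. This is a generic energy VAD sketch, not voxtype's code: the frame length, the RMS threshold value, and the function names are assumptions, and how the config's 0.0 to 1.0 threshold maps onto signal energy is not specified here.

```python
import math

def frame_rms(frame) -> float:
    """Root-mean-square amplitude of one frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0

def speech_frames(samples, frame_len: int = 480, rms_threshold: float = 0.02):
    """Keep only the frames whose RMS energy reaches the threshold."""
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if frame_rms(frame) >= rms_threshold:
            kept.extend(frame)
    return kept
```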

Eager Input Processing

Voxtype can now start transcribing audio while you're still recording. Instead of waiting until you release the hotkey, it processes audio in chunks as you speak. When you stop recording, there's less audio left to transcribe, so the result appears faster.

Why use it: For longer dictations, this noticeably reduces the delay between releasing the hotkey and seeing text. The transcription engine gets a head start instead of waiting for the full recording.

config.toml
[whisper]
eager_processing = true
eager_interval_secs = 3   # Process every 3 seconds

Recording Timeout Now Transcribes

Previously, when a recording hit max_duration_secs, the audio was silently discarded. Now it gets transcribed. If you forget to release your hotkey or intentionally record a long passage, the audio is processed instead of lost.

Why use it: You no longer lose audio when a recording times out. The default max duration is 30 seconds, and that audio now always produces output.

Append Text After Transcription

A new append_text option lets you add text after every transcription. The most common use is appending a trailing space so the cursor is ready for the next word.

config.toml
[output]
append_text = " "   # Trailing space after each transcription

Bug Fixes

  • Fix GPU isolation subprocess not working (regression since v0.5.0)
  • Fix state file default path not being set correctly
  • RPM packages now include pipewire-alsa dependency and use a wrapper script instead of a symlink for the binary, fixing installation on Fedora
  • Fix top-level --model flag not being forwarded to daemon via record command
  • Bump parakeet-rs for NixOS compatibility
  • Add KDE Plasma hotkey setup documentation

v0.5.5: Native GNOME/KDE Support, Media Keys

Voxtype 0.5.5 adds native text input for GNOME and KDE Wayland users via the libei protocol, plus support for media keys and numeric keycodes for hotkey configuration.

eitype Output Driver for GNOME and KDE

GNOME and KDE Wayland don't support the virtual-keyboard protocol that wtype uses. Previously, users on these desktops had to fall back to dotool or ydotool. Now voxtype can use eitype, which speaks the native libei/EI protocol supported by Mutter and KWin.

Why use it: If you're on GNOME or KDE Wayland and want typing to work without dotool/ydotool workarounds, install eitype and it will be used automatically.

Install eitype
cargo install eitype

The fallback chain is now: wtype → eitype → dotool → ydotool → clipboard. You can also prioritize eitype explicitly:

config.toml
[output]
driver_order = ["eitype", "dotool", "clipboard"]

Media Keys and Numeric Keycodes

You can now use media keys as your push-to-talk hotkey: MEDIA, RECORD, REWIND, and FASTFORWARD. For keys not in the built-in list, you can specify numeric keycodes with a prefix indicating the source tool.

Why use it: Some keyboards have dedicated media or macro keys that make great push-to-talk buttons. The numeric keycode support means any key your kernel recognizes can be used.

config.toml
[hotkey]
# Use a named media key
key = "MEDIA"

# Or use a numeric keycode from wev/xev (XKB keycode)
key = "WEV_234"

# Or from evtest (kernel keycode)
key = "EVTEST_226"

The prefix is required because wev/xev and evtest report different numbers for the same key (XKB keycodes are offset by 8 from kernel keycodes).
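
The offset explains why WEV_234 and EVTEST_226 in the example above name the same key. A small sketch of the conversion, with a hypothetical function name and covering only the two prefixes shown in this post:

```python
XKB_OFFSET = 8  # XKB keycode = kernel (evdev) keycode + 8

def kernel_keycode(key: str) -> int:
    """Convert a prefixed numeric hotkey such as "WEV_234" to a kernel keycode."""
    prefix, _, number = key.partition("_")
    if not number.isdigit():
        raise ValueError(f"expected PREFIX_NUMBER, got {key!r}")
    code = int(number)
    if prefix == "EVTEST":
        return code                  # already a kernel keycode
    if prefix == "WEV":
        if code < XKB_OFFSET:
            raise ValueError(f"XKB keycode {code} is below the offset of {XKB_OFFSET}")
        return code - XKB_OFFSET     # XKB keycode from wev/xev
    raise ValueError(f"unknown keycode prefix {prefix!r}")
```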

Contributor

Thanks to Loki Coyote for both features in this release.

v0.5.0: Multi-model, Parakeet Engine, Profiles, and More

Voxtype 0.5.0 is a major release with several new features for power users: switch between Whisper models on the fly, try the experimental Parakeet transcription engine, set up profiles for different contexts, and customize how text gets typed.

Multi-Model Support

You can now load multiple Whisper models and switch between them without restarting the daemon. Use a fast model for everyday dictation and a more accurate model when precision matters.

Why use it: Keep base.en loaded for quick notes, then hold Shift while pressing your hotkey to use large-v3-turbo for important dictation.

config.toml
[whisper]
model = "base.en"
secondary_model = "large-v3-turbo"

[hotkey]
model_modifier = "LEFTSHIFT"

Parakeet Transcription Engine (Experimental)

Parakeet is NVIDIA's FastConformer-based speech recognition model. It offers fast CPU inference (30x realtime on AVX-512) with proper punctuation and capitalization out of the box.

Why use it: If you only need English transcription and want speed without a GPU, Parakeet is worth trying. It's faster than Whisper on CPU and competitive with Whisper on GPU.

config.toml
engine = "parakeet"

[parakeet]
model = "parakeet-tdt-0.6b-v3"

Requires a Parakeet-enabled binary. See the Parakeet documentation for setup instructions.

Named Profiles for Post-Processing

Define named profiles with different post-processing settings for different contexts. Each profile can have its own LLM cleanup command and initial prompt.

Why use it: Use --profile slack for casual messages that get cleaned up one way, and --profile code for technical dictation with different formatting.

config.toml
[profiles.slack]
post_process_command = "llm-cleanup --casual"
initial_prompt = "Casual workplace chat message"

[profiles.code]
post_process_command = "llm-cleanup --technical"
shift_enter_newlines = true

Customizable Output Driver Order

Configure which output drivers voxtype tries and in what order. Force a specific driver or rearrange the fallback chain to match your system.

config.toml
[output]
driver_order = ["dotool", "ydotool", "clipboard"]

Initial Prompt for Transcription Hints

Provide context to Whisper to improve accuracy for domain-specific vocabulary, proper nouns, or technical terms.

config.toml
[whisper]
initial_prompt = "Technical discussion about Kubernetes and Terraform"

Other New Features

  • File output mode: Write transcriptions directly to a file with --file
  • Shift+Enter for newlines: For chat apps where Enter sends the message
  • xclip support: X11 clipboard fallback for non-Wayland environments
  • Constrained language detection: Limit auto-detection to specific languages

Bug Fixes

  • Fix apostrophes causing unexpected line breaks
  • Fix audio device selection for PipeWire loopback devices
  • Fix notification icon showing wrong engine type
  • Fix NixOS systemd service path issue

Thanks to @lypanov for contributions incorporated into this release.

v0.4.16: dotool Output Driver with Keyboard Layout Support

Voxtype 0.4.16 adds dotool as a new output driver in the fallback chain, bringing keyboard layout support for non-US keyboards and better compatibility with KDE Plasma and GNOME on Wayland.

New Output Driver: dotool

The output fallback chain is now wtype → dotool → ydotool → clipboard. dotool sits between wtype and ydotool, providing a daemon-free alternative that works on X11, Wayland, and TTY environments.

Why use it: If you use a non-US keyboard layout (German, French, etc.), dotool can type special characters correctly by setting the XKB layout. Unlike ydotool, dotool doesn't require a daemon to be running.

config.toml
[output]
dotool_xkb_layout = "de"          # German layout
dotool_xkb_variant = "nodeadkeys" # Optional variant

Better KDE Plasma and GNOME Support

KDE Plasma and GNOME on Wayland don't support the virtual keyboard protocol that wtype requires. Previously, users had to set up ydotool with its daemon. Now dotool provides a simpler alternative that just works if you're in the input group.

The fallback chain automatically tries each method:

  1. wtype: Wayland-native, best Unicode/CJK support (fails on KDE/GNOME)
  2. dotool: Uses uinput, supports keyboard layouts, no daemon needed
  3. ydotool: Uses uinput, requires ydotoold daemon running
  4. clipboard: Universal fallback via wl-copy
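
The fallback behavior in the list above amounts to trying each driver in order and moving on when one fails. In this sketch the drivers are plain callables supplied by the caller; the function name and the error handling are illustrative assumptions.

```python
def type_with_fallback(text: str, drivers):
    """drivers: ordered list of (name, send) pairs; send raises on failure.

    Returns the name of the first driver that succeeds.
    """
    errors = []
    for name, send in drivers:
        try:
            send(text)
            return name
        except Exception as exc:  # any failure means: try the next driver
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all output drivers failed: " + "; ".join(errors))
```

On KDE or GNOME, for example, the wtype entry would fail and the loop would continue to dotool.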

Installation

dotool is available in most distribution repositories and the AUR:

# Arch (AUR)
yay -S dotool

# From source
git clone https://git.sr.ht/~geb/dotool
cd dotool && make && sudo make install

Make sure your user is in the input group for uinput access:

sudo usermod -aG input $USER
# Log out and back in

Thanks to Zubair for contributing the dotool output driver.

v0.4.15: Context Optimization, Single-Instance, Delay Options

Voxtype 0.4.15 brings faster transcription for short recordings, single-instance safeguards, and consolidated typing delay options.

Context Window Optimization

Short recordings now use an optimized audio context window instead of the default 30-second fixed window. The context size is calculated dynamically based on audio length, speeding up transcription for clips under 22.5 seconds.

Why it matters: A 3-second recording no longer processes as if it were 30 seconds. You'll see faster transcription times without any configuration changes.

INFO Audio context optimization: using audio_ctx=204 for 2.82s clip

If you experience accuracy issues with short recordings (rare), disable the optimization by setting context_window_optimization = false in your config.

Single-Instance Safeguard

The daemon now prevents multiple instances from running simultaneously using PID file locking. If a previous instance crashed, the stale lock is automatically cleaned up on next start.

Why it matters: No more accidentally running two daemons and getting duplicate transcriptions or audio conflicts.
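
One common way to get this behavior on Linux is an advisory `flock` on the PID file. The sketch below is an assumption about the general technique, not Voxtype's implementation; `acquire_pid_lock` is a hypothetical helper name.

```python
import fcntl
import os


def acquire_pid_lock(path: str):
    """Return a locked file descriptor, or None if another instance holds the lock."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Non-blocking exclusive lock: fails immediately if someone else holds it.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    # We own the lock: replace any stale PID left behind by a crashed instance.
    os.ftruncate(fd, 0)
    os.write(fd, f"{os.getpid()}\n".encode())
    return fd  # keep this descriptor open for the lifetime of the daemon
```

The kernel drops the lock when the owning process exits, including after a crash, so a leftover PID file never blocks the next start.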

Thanks to @materemias for this contribution.

Unified Delay Options

Two delay options now work consistently across wtype, ydotool, and paste mode:

  • type_delay_ms: Delay between keystrokes (wtype -d, ydotool --key-delay)
  • pre_type_delay_ms: Delay before typing starts (wtype -s, sleep for others)

Why use it: Some applications drop the first character or miss keystrokes when text is typed too quickly. These options let you slow things down without switching output modes.

config.toml
[output]
type_delay_ms = 50        # Slow down inter-keystroke timing
pre_type_delay_ms = 200   # Wait before typing starts

The wtype_delay_ms option is deprecated but still works with a warning.
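
The flag mapping listed above can be sketched as two small argument builders. This is an illustration in Python, not Voxtype's code: `wtype_argv` and `ydotool_plan` are hypothetical names, and only the flags named in the list (`wtype -d`, `wtype -s`, `ydotool --key-delay`) are used.

```python
def wtype_argv(text, type_delay_ms=0, pre_type_delay_ms=0):
    """Build a wtype command line; wtype handles both delays itself."""
    argv = ["wtype"]
    if pre_type_delay_ms > 0:
        argv += ["-s", str(pre_type_delay_ms)]  # sleep before typing starts
    if type_delay_ms > 0:
        argv += ["-d", str(type_delay_ms)]      # delay between keystrokes
    argv.append(text)  # note: text starting with "-" would need extra care
    return argv


def ydotool_plan(text, type_delay_ms=0, pre_type_delay_ms=0):
    """Return (seconds to sleep first, argv); the caller sleeps, then runs argv."""
    argv = ["ydotool", "type"]
    if type_delay_ms > 0:
        argv += ["--key-delay", str(type_delay_ms)]
    argv.append(text)
    return pre_type_delay_ms / 1000, argv
```

With the config shown above, `wtype_argv("hi", 50, 200)` yields `wtype -s 200 -d 50 hi`, while the ydotool path sleeps 0.2 seconds before running `ydotool type --key-delay 50 hi`.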

v0.4.14: Configurable Paste Keystroke

Voxtype 0.4.14 adds configurable paste keystrokes and switches to wtype as the primary backend for keystroke simulation.

Configurable Paste Keystroke

Paste mode now supports custom keystrokes via the paste_keys config option. The default remains ctrl+v, but you can configure alternatives like shift+insert for environments where that works better.

Why use it: Some desktop environments use universal paste shortcuts that differ from the standard Ctrl+V. Hyprland and Omarchy users often prefer Shift+Insert. This lets you match your system's paste behavior.

config.toml
[output]
mode = "paste"
paste_keys = "shift+insert"  # Or "ctrl+shift+v" for terminals

wtype as Primary Keystroke Backend

Paste mode now prefers wtype over ydotool for simulating keystrokes. This means you no longer need ydotoold running as a daemon for paste mode to work.

Why it matters: ydotool requires its daemon (ydotoold) to be running, which adds complexity. wtype works without a daemon and handles the keystroke simulation directly. If wtype isn't available, Voxtype falls back to ydotool automatically.

Contributors

This release adds contributor credits to the project for those who have contributed code via co-authored commits: Dan Heuckeroth, Igor Warzocha, Julian Kaiser, Kevin Miller, konnsim, and reisset.

v0.4.13: GPU Memory Isolation

Voxtype 0.4.13 adds GPU memory isolation for laptop users with hybrid graphics.

Figure: GPU memory comparison. Without gpu_isolation the GPU stays at 1.6 GB; with gpu_isolation it drops to zero between transcriptions.

GPU Isolation Mode

When enabled, transcription runs in a subprocess that exits after each recording. This fully releases GPU memory between transcriptions, allowing your discrete GPU to power down when not in use.

Why use it: Laptops with NVIDIA Optimus or AMD switchable graphics keep the dGPU active when Voxtype holds GPU memory. GPU isolation lets the dGPU sleep between transcriptions, saving battery.

config.toml
[whisper]
gpu_isolation = true

Performance

The model loads while you speak, so there's minimal perceived latency. Benchmarks on an AMD Radeon RX 7800 XT with large-v3-turbo:

Mode            Transcription Latency   Idle RAM   Idle GPU Memory
Standard        0.49s avg               ~1.6 GB    409 MB
GPU Isolation   0.50s avg               0          0
GPU isolation adds about 10ms (2%) to transcription time while completely releasing memory between recordings.
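
The subprocess-per-recording idea, combined with starting the worker while you are still speaking, can be sketched in Python. This is an illustration only: the worker is a stand-in that counts bytes instead of loading Whisper, and `start_worker` and `finish` are hypothetical names.

```python
import subprocess
import sys

# Stand-in worker: "loads a model", reads all audio from stdin, prints a result, exits.
WORKER = r"""
import sys
model = "loaded"                 # placeholder for the expensive model load
audio = sys.stdin.buffer.read()  # blocks until the parent closes stdin
print(f"{len(audio)} bytes transcribed")
"""


def start_worker() -> subprocess.Popen:
    """Call when recording starts, so loading overlaps with speech."""
    return subprocess.Popen(
        [sys.executable, "-c", WORKER],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )


def finish(worker: subprocess.Popen, audio: bytes) -> str:
    """Send the recorded audio, collect the transcript, and let the worker exit."""
    out, _ = worker.communicate(audio)
    return out.decode().strip()
```

Because the worker process ends after every recording, anything it allocated is released by the operating system rather than by cleanup code in a long-running daemon.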

v0.4.12: Compositor Integration Fix

Voxtype 0.4.12 fixes compositor submap handling when recordings are cancelled or too short.

Pre-recording Hook

A new pre_recording_command hook runs when recording starts. This pairs with the existing pre_output_command and post_output_command hooks to provide complete control over compositor state throughout the recording lifecycle.

Why use it: Enter a compositor submap when recording starts so cancel bindings (like F12) work during both recording and transcription phases.

config.toml
[output]
pre_recording_command = "hyprctl dispatch submap voxtype_recording"
pre_output_command = "hyprctl dispatch submap voxtype_suppress"
post_output_command = "hyprctl dispatch submap reset"

Run voxtype setup compositor hyprland|sway|river to automatically configure these hooks.

Bug Fixes

  • Fixed compositor submap getting stuck when recording is too short or cancelled
  • The post_output_command hook now runs in all early-exit scenarios (recording too short, no audio, cancel, timeout, transcription errors)
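
The guarantee in the second fix maps naturally onto a try/finally shape. The sketch below is a Python illustration under stated assumptions, not the daemon's real control flow: `run_session` and its callable parameters are hypothetical, and only the three hook names come from the config above.

```python
def run_session(record, transcribe, emit, run_hook):
    """Run one recording; post_output_command fires on every exit path."""
    run_hook("pre_recording_command")
    try:
        audio = record()
        if not audio:
            return None  # too short, cancelled, or no audio
        text = transcribe(audio)  # may raise on transcription errors
        run_hook("pre_output_command")
        emit(text)
        return text
    finally:
        # Runs after success, early return, and exceptions alike,
        # so the compositor never stays stuck in a submap.
        run_hook("post_output_command")
```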

v0.4.11: Remote Whisper, Cancel Transcription, Output Mode Override

Voxtype 0.4.11 brings remote transcription servers, the ability to cancel mid-recording, and per-command output mode control.

Remote Whisper Server Support

Run transcription on a powerful server instead of your local machine. Voxtype now supports any OpenAI-compatible Whisper API, including self-hosted servers (whisper.cpp, faster-whisper, NVIDIA Parakeet) and cloud providers like OpenAI.

Why use it: Offload CPU/GPU work to a home server. Get ~1.5s transcription instead of 15-20s on CPU. Share one GPU across multiple machines.

Self-hosted server
[whisper]
backend = "remote"
remote_endpoint = "http://your-server:8080"

OpenAI cloud API
[whisper]
backend = "remote"
remote_endpoint = "https://api.openai.com"
remote_api_key = "sk-your-openai-key"
remote_model = "whisper-1"

Privacy note: Voxtype is designed to be local-first. Your voice data never leaves your machine by default. Using a remote endpoint outside your trusted local network means your audio is transmitted to that server. Only use cloud APIs if you understand and accept this tradeoff.
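
For readers wondering what "OpenAI-compatible" means on the wire, here is a minimal Python sketch of the request shape: a multipart POST to `/v1/audio/transcriptions` with `file` and `model` fields and an optional bearer token. That Voxtype appends exactly this path to remote_endpoint is an assumption, and `build_transcription_request` is a hypothetical helper. The request is only built, not sent.

```python
import urllib.request
import uuid


def build_transcription_request(endpoint, wav_bytes, model="whisper-1", api_key=None):
    """Build (but do not send) an OpenAI-style transcription request."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Form field: model name
        f'--{boundary}\r\nContent-Disposition: form-data; name="model"\r\n\r\n{model}\r\n'.encode(),
        # Form field: the audio file itself
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; filename="audio.wav"\r\n'
        f"Content-Type: audio/wav\r\n\r\n".encode() + wav_bytes + b"\r\n",
        # Closing boundary
        f"--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    url = endpoint.rstrip("/") + "/v1/audio/transcriptions"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Passing the result to `urllib.request.urlopen` would transmit the audio, which is exactly the step the privacy note above is about.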

Cancel Transcription

Changed your mind? Cancel recording or transcription with a single command.

Why use it: Stop a recording you don't want without waiting for transcription. No more unwanted text appearing at your cursor.

# Bind to Escape in your compositor
voxtype record cancel

The voxtype setup compositor command now adds an Escape binding to cancel automatically.

Output Mode Override

Override the output mode on a per-command basis. Type by default, but paste this one time.

Why use it: Different contexts need different output. Type into your terminal, but paste into a web form that fights with simulated keystrokes.

# Type (default)
voxtype record start

# Paste this time only
voxtype record start --paste

# Clipboard this time only
voxtype record stop --clipboard

Setup Improvements

The --model flag lets you specify which Whisper model to download non-interactively. Perfect for automated setups and Omarchy integration.

voxtype setup --download --model large-v3-turbo --quiet

Bug Fixes

  • Fixed stale cancel file causing next recording to cancel unexpectedly
  • Added 14 new tests for cancel and output mode override logic

v0.4.10: Compositor Integration, Auto-Submit

Voxtype 0.4.10 fixes the modifier key interference problem and adds auto-submit for LLM workflows.

Compositor Integration

If you use keybindings with modifiers (like Super+Ctrl+X), releasing keys slowly could cause modifiers to interfere with typed text. This release adds compositor-level integration to fix it.

Why use it: Eliminates "super+h" appearing instead of "h" when your fingers are slow to release modifier keys.

# Generate compositor config for Hyprland, Sway, or River
voxtype setup compositor hyprland

# Or let it auto-detect
voxtype setup compositor

This creates a submap that suppresses modifier keys during text output. Pre/post output hooks handle the switching automatically.

Auto-Submit

Send Enter automatically after transcription. Contributed by Rob Zolkos.

Why use it: Voice-driven LLM prompting. Dictate your question and have it submit immediately to Claude, ChatGPT, or your local Ollama.

config.toml
[output]
auto_submit = true

How to Update

Update via your package manager or download from GitHub Releases.

# Arch (AUR)
yay -S voxtype

# After updating, regenerate compositor config
voxtype setup compositor

Man Pages for Every Command

Real Unix tools have man pages.

Voxtype now generates man pages automatically from the CLI definitions using clap_mangen. 14 pages covering every command and subcommand.

man voxtype
man voxtype-status
man voxtype-setup-gpu

Why does this matter?

  • Offline documentation — No internet required
  • Consistent format — Works with apropos, whatis, man -k
  • Auto-updated — Generated from code, never out of sync
  • Unix convention — Expected by experienced users

The man pages are included in the deb/rpm packages and the AUR package. They install to /usr/share/man/man1/.

Small polish, but the details matter.

v0.4.4: Configurable Status Bar Icons

"The emoji icons look out of place in my setup."

This was the most common feedback about Voxtype's Waybar integration. Emoji icons work everywhere but don't match every aesthetic.

v0.4.4 ships with 10 built-in icon themes:

  • emoji — Universal, works everywhere (default)
  • nerd-font — Requires Nerd Font
  • material — Material Design Icons
  • phosphor — Phosphor Icons
  • codicons — VS Code icons
  • minimal — Simple Unicode (○ ● ◐ ×)
  • dots — Geometric shapes
  • arrows — Media player style
  • text — Plain text labels

config.toml
[status]
icon_theme = "nerd-font"

Or override at runtime:

voxtype status --icon-theme material

You can also create custom themes as TOML files with your own icons. This release also adds the alt field to JSON output, enabling Waybar's native format-icons feature.

See the configuration docs for details.

Post-Processing Command: LLM Integration

Voice-to-text tools give you what you said. But what if you want what you meant?

I just shipped post-processing support in Voxtype. You can now pipe transcriptions through any external command before they reach your cursor.

The obvious use case: LLM cleanup. Run your transcription through Claude or Ollama to fix grammar, expand shorthand, or translate technical jargon into prose.

config.toml
[output]
post_process_command = "ollama run llama3.2 'Fix grammar and punctuation, output only the corrected text:'"

But the feature is intentionally generic. Any command that reads stdin and writes stdout works:

  • Spell correction with aspell
  • Profanity filtering
  • Text expansion macros
  • Custom formatting scripts
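
A filter of that shape can be very small. The Python sketch below is a hypothetical example, not something shipped with Voxtype: `tidy` collapses whitespace and capitalizes the first letter, and `run` wires it to stdin and stdout.

```python
import re
import sys


def tidy(text: str) -> str:
    """Collapse runs of whitespace and capitalize the first character."""
    text = re.sub(r"\s+", " ", text).strip()
    return text[:1].upper() + text[1:]


def run(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Read the whole transcription from stdin and write the result to stdout."""
    stdout.write(tidy(stdin.read()))
```

Saved as an executable script that ends with a call to `run()`, it could be referenced from post_process_command by its path.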

I also shipped a Swedish Chef filter as a proof-of-concept. Because sometimes you need to bork bork bork.

The design philosophy: Voxtype handles voice-to-text well. LLMs handle text transformation well. Let each tool do what it's good at.

No More Input Group: Compositor Keybindings

When I launched Voxtype, you needed to add yourself to the input group to use global hotkeys. This required root access, a logout, and felt like a security compromise.

That's no longer the default.

The new recommended approach: use your compositor's native keybindings to run voxtype record-toggle. No special permissions needed.

Hyprland
bind = , F9, exec, voxtype record-toggle

Sway
bindsym F9 exec voxtype record-toggle

The evdev hotkey system is still there for users who need it (some setups, some edge cases), but it's now opt-in rather than the default.

This also means Voxtype works immediately after install for most tiling WM users. No group membership, no logout, no root.

River Compositor Support

Voxtype started as "for Hyprland and Sway users." That was a practical choice—those are the compositors I use.

But the architecture was always compositor-agnostic. evdev doesn't care what's drawing your windows.

v0.4 adds official River support. River is a dynamic tiling Wayland compositor that's been gaining traction, and users were asking for documentation.

River (~/.config/river/init)
riverctl map normal None F9 spawn "voxtype record-toggle"

If you're using a different Wayland compositor and want support added, open an issue. The code probably already works—it's just documentation and testing.

One-Command Waybar Setup

The most common question I get about Voxtype: "How do I add the Waybar module?"

Previously, this meant manually editing your Waybar config, adding CSS classes, and figuring out the JSON format.

Now:

voxtype setup waybar --install

That's it. The command adds the module to your config, includes sensible CSS, and backs up your existing files first.

To remove it:

voxtype setup waybar --uninstall

Sometimes the best feature is removing a manual step.

v0.4.2: GPU Acceleration on More Hardware

A user reported that GPU acceleration caused Voxtype to crash on their Ryzen 9 5950X. It was the same SIGILL error we fixed for CPU-only mode months ago.

The problem: the Vulkan build of whisper.cpp still contains CPU-side code compiled with AVX-512, even though inference runs on the GPU. My Zen 5 build machine has AVX-512. Their Zen 3 chip doesn't.

The fix was surgical: build the Vulkan binary with -mno-avx512f to prevent those instructions from being emitted. Now GPU acceleration works on:

  • Zen 3 (Ryzen 5000 series)
  • Zen 4 and Zen 5 (Ryzen 7000/9000 series)
  • Intel 11th gen and newer
  • Most discrete GPUs with Vulkan support

If you tried Vulkan mode before and it crashed, v0.4.2 should work.
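
If you are unsure which camp your CPU falls into, the kernel lists feature flags in /proc/cpuinfo, and `avx512f` is the one that matters here. A minimal Python sketch (the function names are hypothetical, and this is a diagnostic aid rather than anything Voxtype runs):

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags of the first CPU listed in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def supports_avx512(cpuinfo_text: str) -> bool:
    """True if the CPU advertises the AVX-512 Foundation instructions."""
    return "avx512f" in cpu_flags(cpuinfo_text)
```

On a Linux machine, `supports_avx512(open("/proc/cpuinfo").read())` answers the question for the local CPU.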

Device Hotplug Fix (v0.3.2)

I locked my screen, came back, and Voxtype stopped working. The service was running, no errors in the logs, but pressing the hotkey did nothing.

Restarting the service fixed it. That's not a solution.

The problem: Voxtype uses evdev to capture keyboard input at the kernel level. When you lock your screen on Hyprland (and probably other Wayland compositors), the compositor releases its grip on input devices. When you unlock, the devices get re-added to /dev/input. But Voxtype was still holding file descriptors to the old device nodes—handles that now pointed to nothing.

The fix was inotify. Voxtype now watches /dev/input for device changes. When devices are added or removed, it validates its existing handles by checking /proc/self/fd/ to see if they still point to real devices. If not, it re-enumerates.

There's a 150ms debounce because USB devices don't enumerate instantly. And it clears modifier state on device changes so you don't end up with phantom stuck keys.
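
Both pieces can be sketched in a few lines of Python. This illustrates the technique rather than the actual Rust code: `handle_is_live` and `Debouncer` are hypothetical names, and the inotify watch that would call `notify()` is left out.

```python
import os
import time


def handle_is_live(fd: int) -> bool:
    """Check via /proc/self/fd whether an open descriptor still points at a real node."""
    try:
        target = os.readlink(f"/proc/self/fd/{fd}")
    except OSError:
        return False
    # Linux appends " (deleted)" once the underlying node has been removed.
    return not target.endswith(" (deleted)") and os.path.exists(target)


class Debouncer:
    """Report readiness only after events have been quiet for delay_s seconds."""

    def __init__(self, delay_s=0.150, clock=time.monotonic):
        self.delay_s = delay_s
        self.clock = clock
        self.deadline = None

    def notify(self) -> None:
        # Every new device event pushes the deadline further out.
        self.deadline = self.clock() + self.delay_s

    def ready(self) -> bool:
        if self.deadline is not None and self.clock() >= self.deadline:
            self.deadline = None
            return True
        return False
```

A burst of add/remove events from one USB hub therefore triggers a single re-enumeration after the last event has settled.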

This is the kind of bug that only shows up in real usage. Unit tests don't lock your screen. I found it because I use the tool every day.

First External Contributors

When you build something solo, you get used to making all the decisions yourself. Then someone opens a PR and suddenly you're reviewing code you didn't write.

Mate Remias contributed three features to Voxtype in a single day:

  • Paste output mode for apps that don't work with virtual keyboards
  • On-demand model loading to reduce memory usage
  • Build script fixes for users compiling on different distros

Jean-Paul van Tillo contributed the initial GPU acceleration work that made Vulkan and CUDA support possible.

Going from solo project to something with external contributors feels like a milestone. Thanks to both of them.

Community Contributions & AVX-512 Fix (v0.3.1)

A user reported that Voxtype crashed immediately on their Ryzen 9 5950X. The error: SIGILL (illegal instruction). They couldn't even run voxtype --version.

I build on a Ryzen 9 9900X3D, which has AVX-512. Their Zen 3 chip doesn't. Simple enough—I was shipping binaries with instructions their CPU couldn't execute.

My first fix didn't work. I passed WHISPER_NO_AVX512=ON to the build, created separate AVX2 and AVX-512 binaries, and shipped a new release. The user tried it: still crashed.

They ran sha512sum on both binaries. Identical. I had shipped the same binary twice with different names.

The whisper.cpp version I'm using doesn't actually respect that flag. CMake's auto-detection finds AVX-512 on my build machine and enables it regardless. The flag gets passed, logged, and ignored.

The fix was to go lower: compiler flags. -mno-avx512f tells the compiler itself to never emit those instructions. Now the binaries are actually different.
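
A release check that would have caught the duplicate binaries is easy to sketch. The Python below is a hypothetical guard, not part of Voxtype's build scripts; `sha512_of` and `assert_builds_differ` are illustrative names.

```python
import hashlib


def sha512_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large binaries are not read into memory at once."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def assert_builds_differ(path_a: str, path_b: str) -> None:
    """Fail loudly if two supposedly different build variants are identical."""
    if sha512_of(path_a) == sha512_of(path_b):
        raise ValueError(f"{path_a} and {path_b} are byte-for-byte identical")
```

Differing hashes only prove the files differ, not that the AVX-512 instructions are gone, so this complements testing on the affected hardware rather than replacing it.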

This release also includes work from Mate Remias, who contributed three PRs in one day: paste output mode, on-demand model loading, and build script fixes.

CJK Support (v0.3.0)

One of Voxtype's first users reported that Korean speech-to-text wasn't working. They saw either garbled output or a vague message to the effect of "Korean isn't supported yet" (한국어는 지원되지 않습니다).

I spent a while debugging this. Whisper transcribed Korean fine, and it was clearly detecting Korean speech, so the model wasn't the problem.

The problem was ydotool, which I had selected for injecting keyboard input based on speech transcription. It simulates keystrokes, which works great until you need actual Unicode support.

Here's the thing: I had looked at every other Linux voice-to-text tool I could find before deciding to build Voxtype. They all use ydotool or xdotool, which is why I picked ydotool for my project.

None of them are going to handle Korean, Chinese, or Japanese properly either.

I built Voxtype with a pluggable output module structure, so solving this for my user was as simple as implementing a wtype driver. wtype injects text directly instead of faking keystrokes. As far as I've been able to find, Voxtype is the only speech-to-text tool using it for output right now.

I wasn't trying to corner the CJK market: I was just fixing a bug. But I'm genuinely not aware of another local voice-to-text tool on Linux that handles these languages correctly. If you know of one, tell me—I'd like to see how they solved it.

In the meantime, if you're a Linux user who wants speech-to-text and speaks Korean, Chinese, or Japanese, I think I've got the tool for you.

GPU Acceleration & Multilingual Support

I spent a couple of hours today adding GPU acceleration and multilingual support to Voxtype.

With GPU acceleration, the Whisper large-v3 model is within reach for most users, and that means better, more accurate transcription with multilingual and translation support. You can "speak French, type English" or "speak Korean, type Korean" with language auto-detection.

This work was supported by my first two external contributors!

Initial Launch

I've spent years contributing to open source projects, but it's been a long time since I built and shipped a complete piece of software solo.

This Thanksgiving, I built Voxtype, a push-to-talk voice transcription tool for Linux.

macOS and Windows have had solid voice typing for years, and even Google Docs has it built in. But on Linux, especially on Wayland, the options are limited. Most solutions are cloud-based, X11-only, or tied to specific desktop environments.

The few local options that exist are Python-based. No shade to Python, but having to install PyTorch, torchaudio, and a dozen other dependencies into a virtual environment, and then remembering to activate it every time you reboot, is cumbersome. Vocalinux is pretty good, but I got stuck in dependency hell trying to set it up, and I find it sluggish.

I wanted something simpler. Install a package, download a model, enable a service, done.

Voxtype is available from the AUR for Arch-based systems, and as a .deb or .rpm package. Run a few commands post-install, enable the systemd user service, and it just works every time you log in. No virtual environments, no dependency conflicts, no activation scripts (well, except systemd).

Hold a hotkey, speak, release, and your words appear at the cursor. It uses Whisper locally through whisper.cpp, so everything stays on your machine. Voxtype makes no network requests and uses no cloud transcription services.

The interesting technical challenge was making it work across all Wayland compositors. Since Wayland's security model blocks the usual approaches for global hotkeys and input simulation, Voxtype works at the kernel level using evdev and uinput. This means it runs the same on Sway, Hyprland, or any other compositor.

It's a little "omakase" in that it's meant for Hyprland or Sway with Waybar. That's where you'll have the best experience. But it works on any compositor, and doesn't require Waybar.

It works great for me, which, to be candid, is enough for me. If you find it useful or have constructive/helpful feedback, that's a bonus.

It's MIT licensed: https://voxtype.io