Skip to content

Model Selection Guide

This guide helps you choose the right transcription engine and model for voxtype v0.6.0. The choice depends on your language, hardware, and how you use dictation.

Voxtype has seven transcription engines. Two are bundled with the standard binary (Whisper and Remote Whisper). The other five require the ONNX binary variant.


Quick Comparison

EngineLanguagesArchitectureModel SizeSpeed (CPU)PunctuationBinary
Whisper99+Encoder-decoder75 MB - 3.1 GBModerateNo (use post-process)Standard
Parakeet25 EuropeanTDT/CTC670 MB - 2.6 GBFastYes (TDT)ONNX
Moonshineen, ja, zh, ko, arEncoder-decoder100 - 237 MBFastNoONNX
SenseVoicezh, en, ja, ko, yueEncoder-only CTC239 - 938 MBFastYes (ITN)ONNX
Paraformerzh, enEncoder-predictor-decoder220 - 487 MBFastNoONNX
Dolphin40+ langs, 22 Chinese dialectsCTC E-Branchformer198 MBFastNoONNX
Omnilingual1600+CTC wav2vec23.9 GBModerateNoONNX
Soniox (cloud)60+Cloud (WebSocket / REST)n/a (no local model)Cloud-boundYesSoniox feature

Soniox is different from the others — it's a paid cloud service over WebSocket / REST. No local model, no GPU. Sub-second partial latency. Strong for non-English languages where local Whisper-based engines struggle on lower-end hardware. Requires cargo build --features soniox and a SONIOX_API_KEY. See SONIOX.md for the full story.


Which Engine Should I Use?

Start here and follow the path that matches your situation.

What language(s) do you speak?

├─ English only
│   ├─ Want best accuracy + punctuation? → Parakeet TDT (ONNX binary)
│   ├─ Want smallest/fastest model?      → Moonshine tiny (ONNX binary)
│   ├─ Want simplest setup?              → Whisper small.en (standard binary)
│   └─ On a laptop / saving battery?     → Whisper small.en + on_demand_loading

├─ Chinese (Mandarin or Cantonese)
│   ├─ Chinese + English mixed?          → Paraformer zh or SenseVoice
│   ├─ Chinese + Japanese/Korean?        → SenseVoice
│   ├─ Cantonese specifically?           → SenseVoice (supports yue)
│   └─ Chinese dialects?                 → Dolphin

├─ Japanese or Korean
│   ├─ Want a small, fast model?         → Moonshine (tiny-ja/tiny-ko) or SenseVoice
│   └─ Want best quality?                → Whisper large-v3-turbo or SenseVoice

├─ European languages (German, French, Spanish, etc.)
│   ├─ Want punctuation + fast?          → Parakeet TDT v3
│   └─ Want broadest coverage?           → Whisper

├─ Arabic, Hindi, Thai, Vietnamese, other Asian languages
│   ├─ One of the 40 Dolphin languages?  → Dolphin (small, fast)
│   └─ Otherwise                         → Whisper large-v3-turbo

└─ Rare or uncommon language
    └─ → Omnilingual (1600+ languages) or Whisper

OR — independently of language:

├─ Want sub-second live partials at the cursor?
│   ├─ English + GPU available?           → Parakeet TDT (streaming, local)
│   └─ Any of 60+ languages, paid OK?     → Soniox (cloud, sub-100ms partials)

├─ Want highest accuracy and don't mind a few seconds of wait?
│   └─ Soniox async API (cloud, stt-async-v4)

└─ Cannot send audio off-device (privacy-sensitive)?
    └─ Pick any *local* engine above. Never Soniox.

Standard vs ONNX Binary

Voxtype ships two binary variants:

  • Standard binary -- includes Whisper (whisper.cpp). This is the default.
  • ONNX binary -- includes Whisper plus all ONNX-based engines (Parakeet, Moonshine, SenseVoice, Paraformer, Dolphin, Omnilingual).

The ONNX binary is larger because it bundles ONNX Runtime. If you only use Whisper, the standard binary is all you need.

To switch to the ONNX binary:

bash
sudo voxtype setup onnx --enable

To switch back to the standard binary:

bash
sudo voxtype setup onnx --disable

The setup onnx command swaps the /usr/local/bin/voxtype symlink between the two binaries. Both binaries are installed by the voxtype-bin AUR package and the deb/rpm packages.


Engine Details

1. Whisper

OpenAI's Whisper via whisper.cpp. The default engine, included in every voxtype binary.

Languages: 99+ (the widest coverage of any engine)

Available models:

ModelSizeSpeedLanguages
tiny75 MB~10x99+
tiny.en39 MB~10xEnglish only
base142 MB~7x99+
base.en142 MB~7xEnglish only
small466 MB~4x99+
small.en466 MB~4xEnglish only
medium1.5 GB~2x99+
medium.en1.5 GB~2xEnglish only
large-v33.1 GB1x99+
large-v3-turbo1.6 GB~8x99+

Speed is relative to large-v3 (1x baseline). Higher is faster.

The .en models are English-only but faster and more accurate for English. Models without .en support 99+ languages. There are no .en variants for large-v3 or large-v3-turbo.

Config example:

toml
# English-only, good CPU performance
[whisper]
model = "small.en"
language = "en"
toml
# Multilingual with GPU
[whisper]
model = "large-v3-turbo"
language = "auto"    # or "ja", "zh", "ko", "ar", etc.
toml
# Battery-saving laptop mode
[whisper]
model = "small.en"
language = "en"
on_demand_loading = true
gpu_isolation = true

Pros:

  • Widest language support (99+ languages)
  • Works with every voxtype binary (no ONNX needed)
  • Many model sizes to fit different hardware
  • GPU support via Vulkan, CUDA, or ROCm
  • Well-tested, mature codebase

Cons:

  • Slower than Parakeet for equivalent accuracy
  • No built-in punctuation (use [output.post_process] or [text] spoken_punctuation = true)
  • Larger models need significant VRAM

GPU options: Build with --features gpu-vulkan (AMD/Intel), --features gpu-cuda (NVIDIA), or --features gpu-hipblas (AMD ROCm).


2. Parakeet

NVIDIA's FastConformer model via the parakeet-rs crate. State-of-the-art accuracy for English and European languages, with built-in punctuation from the TDT decoder.

Languages: 25 European languages -- English, German, French, Spanish, Italian, Dutch, Polish, Portuguese, Romanian, Czech, Hungarian, Slovak, Slovenian, Danish, Norwegian, Swedish, Finnish, Greek, Turkish, Ukrainian, Russian, Catalan, Galician, Basque.

If your language is not in this list, Parakeet will not work for you.

Available models:

ModelSizeDescription
parakeet-tdt-0.6b-v32.6 GBTDT with punctuation (recommended)
parakeet-tdt-0.6b-v3-int8670 MBQuantized, smaller and faster

TDT vs CTC: The TDT model includes punctuation and capitalization. The CTC variant is slightly faster but outputs raw text without punctuation. Use TDT unless you have a specific reason not to.

Config example:

toml
engine = "parakeet"

[parakeet]
model = "parakeet-tdt-0.6b-v3"
toml
# Quantized for lower memory usage
engine = "parakeet"

[parakeet]
model = "parakeet-tdt-0.6b-v3-int8"

Pros:

  • Best accuracy for English (~6% WER, top of HuggingFace ASR leaderboard)
  • Built-in punctuation and capitalization (TDT)
  • Fast even on CPU thanks to efficient FastConformer architecture
  • GPU acceleration via CUDA, MIGraphX, or TensorRT

Cons:

  • Limited to 25 European languages (no CJK, Arabic, Hindi, etc.)
  • Requires ONNX binary
  • Only one model size (0.6B parameters)

GPU builds: The ONNX binary variants include GPU support. onnx-cuda for NVIDIA, onnx-migraphx for AMD.


3. Moonshine

Encoder-decoder transformer optimized for edge devices. Processes variable-length audio without the 30-second padding that Whisper requires.

Languages: English (MIT license), plus Japanese, Chinese, Korean, Arabic (Moonshine Community License, non-commercial only).

Available models:

ModelSizeLanguageLicense
base237 MBEnglishMIT
tiny100 MBEnglishMIT
base-ja237 MBJapaneseCommunity (non-commercial)
base-zh237 MBChineseCommunity (non-commercial)
tiny-ja100 MBJapaneseCommunity (non-commercial)
tiny-zh100 MBChineseCommunity (non-commercial)
tiny-ko100 MBKoreanCommunity (non-commercial)
tiny-ar100 MBArabicCommunity (non-commercial)

Config example:

toml
engine = "moonshine"

[moonshine]
model = "base"         # English, 237 MB
# quantized = true     # Use quantized variant if available (default: true)
# threads = 4          # CPU threads for ONNX Runtime
toml
# Japanese
engine = "moonshine"

[moonshine]
model = "base-ja"

Pros:

  • Small model sizes (100-237 MB)
  • Fast inference, designed for edge devices
  • No 30-second padding overhead
  • English models are MIT licensed

Cons:

  • No built-in punctuation
  • Non-English models are non-commercial only
  • Fewer language options than Whisper or SenseVoice
  • Requires ONNX binary

4. SenseVoice

Alibaba's FunAudioLLM encoder-only CTC model. Single forward pass (no autoregressive decoder loop), so inference is fast.

Languages: Chinese (zh), English (en), Japanese (ja), Korean (ko), Cantonese (yue). Automatic language detection or explicit selection.

Available models:

ModelSizeDescription
small239 MBQuantized int8 (recommended)
small-fp32938 MBFull precision, slightly better accuracy

Config example:

toml
engine = "sensevoice"

[sensevoice]
model = "sensevoice-small"
language = "auto"      # or "zh", "en", "ja", "ko", "yue"
use_itn = true         # Inverse text normalization (adds punctuation)
# threads = 4

Pros:

  • Good CJK support (Chinese, Japanese, Korean, Cantonese)
  • Fast single-pass inference (encoder-only, no decoder loop)
  • Inverse text normalization adds punctuation
  • Small quantized model (239 MB)
  • Automatic language detection

Cons:

  • Limited to 5 languages
  • CJK output is character-based with no spaces between words (see note below)
  • Requires ONNX binary

5. Paraformer

Alibaba's non-autoregressive encoder-predictor-decoder model. Generates all output tokens in a single pass.

Languages: Chinese + English (bilingual).

Available models:

ModelSizeLanguagesDescription
zh487 MBChinese + EnglishBilingual, recommended
en220 MBEnglishEnglish only

Config example:

toml
engine = "paraformer"

[paraformer]
model = "paraformer-zh"    # Chinese + English
# threads = 4

Pros:

  • Good Chinese + English bilingual support
  • Non-autoregressive: fast, single-pass decoding
  • Moderate model size

Cons:

  • Only Chinese and English
  • No built-in punctuation
  • Chinese output is character-based with no spaces (see note below)
  • Requires ONNX binary

6. Dolphin

DataoceanAI's CTC E-Branchformer model optimized for Eastern languages. Covers 40 languages plus 22 Chinese dialects.

Languages: Chinese (Mandarin + 22 dialects), Japanese, Korean, Thai, Vietnamese, Indonesian, Malay, Arabic, Hindi, Urdu, Bengali, Tamil, and more.

Available models:

ModelSizeLanguagesDescription
base198 MB40+ languagesDictation-optimized, int8 quantized

Config example:

toml
engine = "dolphin"

[dolphin]
model = "dolphin-base"
# threads = 4

Pros:

  • Broad Eastern language coverage (40 languages + 22 Chinese dialects)
  • Small model size (198 MB)
  • Optimized for dictation use cases

Cons:

  • CJK output is character-based with no spaces (see note below)
  • No built-in punctuation
  • Single model option
  • Requires ONNX binary

7. Omnilingual

Meta's Massively Multilingual Speech (MMS) wav2vec2 model. A single model that covers 1600+ languages with a character-level CTC tokenizer.

Languages: 1600+ (language-agnostic, no language selection needed).

Available models:

ModelSizeParametersDescription
300m3.9 GB300M1600+ languages

Config example:

toml
engine = "omnilingual"

[omnilingual]
model = "omnilingual-large"
# threads = 4

Pros:

  • Widest language coverage of any engine (1600+ languages)
  • Language-agnostic: no language selection needed, just speak
  • Single model covers everything

Cons:

  • Large model (3.9 GB)
  • Character-level output (no word boundaries for many languages)
  • No built-in punctuation
  • Accuracy varies by language; less accurate than specialized models for common languages
  • Requires ONNX binary

Notes

Character-Based CJK Output

SenseVoice, Paraformer, and Dolphin produce character-based output for Chinese, Japanese, and Korean. This means the transcribed text has no spaces between words:

English: "I went to the store"
Chinese: "我去了商店"          (no spaces, which is correct for Chinese)
Japanese: "お店に行きました"    (no spaces)

This is standard for CJK text and generally what you want. If your workflow requires word-segmented output, you would need external post-processing.

Downloading Models

Use the interactive model selector to browse and download models for any engine:

bash
voxtype setup model

This shows all available models across all engines, marks installed models, and handles downloads from HuggingFace.

On-Demand Loading

All engines support on_demand_loading = true, which loads the model only when you start recording and unloads it after transcription. This saves memory and battery on laptops at the cost of a short delay before the first transcription.

Common Config Options

Every ONNX engine supports these options:

OptionDefaultDescription
modelvariesModel name or path to model directory
threadsautoNumber of CPU threads for ONNX Runtime
on_demand_loadingfalseLoad model on demand to save memory

Whisper has additional options (language, gpu_isolation, mode, etc.) documented in the Configuration Guide.


Hardware Recommendations

VRAM Requirements

Engine / ModelMinimum VRAM
Whisper tiny/base1 GB
Whisper small2 GB
Whisper medium5 GB
Whisper large-v3-turbo6 GB
Whisper large-v310 GB
Parakeet TDT 0.6B4 GB (GPU), works on CPU
Parakeet TDT 0.6B int82 GB (GPU), works on CPU
Moonshine / SenseVoice / Paraformer / DolphinCPU-friendly, no GPU needed
Omnilingual 300M4+ GB (GPU), works on CPU

CPU Performance (10 seconds of audio, modern CPU)

Engine / ModelApprox Time
Whisper tiny1-2s
Whisper base2-3s
Whisper small4-6s
Whisper medium10-15s
Parakeet TDT (CPU)2-4s
Parakeet TDT (CUDA)<1s
Moonshine base1-2s
SenseVoice small1-2s
Paraformer zh1-3s
Dolphin base1-2s

Desktop with GPU

Use the largest model your GPU can hold. For English, Parakeet TDT with CUDA is the fastest option. For multilingual, Whisper large-v3-turbo with Vulkan or CUDA.

Laptop on Battery

toml
[whisper]
model = "small.en"
language = "en"
on_demand_loading = true
gpu_isolation = true

Or for an ONNX engine:

toml
engine = "moonshine"

[moonshine]
model = "tiny"
on_demand_loading = true

Troubleshooting

"Engine requested but voxtype was not compiled with --features ..."

You are running the standard (Whisper-only) binary and selected an ONNX engine. Switch to the ONNX binary:

bash
sudo voxtype setup onnx --enable

"My transcription is slow"

  1. Try a smaller model (base instead of small, tiny instead of base).
  2. Enable GPU acceleration if available.
  3. For English, Parakeet on CPU is often faster than Whisper at similar accuracy.
  4. For CJK, SenseVoice and Dolphin are fast single-pass models.

"My transcription has errors"

  1. Try a larger model.
  2. Use an engine specialized for your language (SenseVoice for CJK, Parakeet for European).
  3. For Whisper, use the .en model if you only speak English.
  4. Check that language is set correctly.
  5. Try [output.post_process] for LLM-based cleanup.

"My language isn't supported by [engine]"

Switch to an engine that supports your language. Whisper (99+ languages) and Omnilingual (1600+ languages) have the broadest coverage.

toml
# Switch back to Whisper for unsupported languages
engine = "whisper"

[whisper]
model = "large-v3-turbo"
language = "auto"

"I need punctuation"

  1. Use Parakeet TDT (European languages) or SenseVoice with use_itn = true (CJK + English).
  2. Enable spoken punctuation: [text] spoken_punctuation = true
  3. Configure LLM post-processing: [output.post_process]

Further Reading

Released under the MIT License