ASR Accuracy
Quantitative measure of transcription fidelity in automatic speech recognition (ASR) systems relative to ground-truth reference transcripts. Accuracy is typically reported through error rates, where lower is better, and is influenced by acoustic quality, language complexity, and model architecture.
Key Metrics
- Word Error Rate (WER): Standard metric for space-separated languages; calculated as $\mathrm{WER} = \frac{S + D + I}{N}$, where $S$ = substitutions, $D$ = deletions, $I$ = insertions, $N$ = total words in the reference. Lower values indicate higher accuracy.
- Character Error Rate (CER): Preferred for non-space-delimited languages (e.g., Chinese, Japanese) or when vocabulary coverage is limited.
- Real-Time Factor (RTF): Ratio of processing time to audio duration; values below 1 mean faster than real time. It measures efficiency rather than fidelity, but affects perceived accuracy in streaming applications through latency constraints.
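The metrics above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes whitespace tokenization for WER, ignores spaces for CER, and applies no text normalization (casing, punctuation), which real evaluations usually add. The function names `edit_distance`, `wer`, `cer`, and `rtf` are chosen here for illustration only. The sum S + D + I is obtained as the Levenshtein distance between reference and hypothesis.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (unit costs).

    The result equals the minimum S + D + I needed to turn ref into hyp.
    """
    # prev[j] is the distance between the ref prefix processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # aligning ref[:i] with an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra token in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / N over whitespace-separated words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate, with spaces removed (an assumption of this sketch)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sit on mat"
    # One substitution (sat -> sit) and one deletion (the): 2 errors / 6 words.
    print(f"WER: {wer(reference, hypothesis):.3f}")
    print(f"CER: {cer(reference, hypothesis):.3f}")
    print(f"RTF: {rtf(2.0, 10.0):.2f}")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference, so it should not be read as a simple complement of percentage accuracy.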
Influencing Factors
- Acoustic Conditions: Background noise, reverberation, and microphone quality degrade signal-to-noise ratio, increasing error rates. See Noise Robustness.
- Domain Mismatch: Performance drops when test data distribution differs significantly from training data. Mitigated via Domain Adaptation.
- OOV Tokens: Out-of-vocabulary (OOV) terms contribute to substitution errors; addressed by subword tokenization or end-to-end models.
- Speaker Variability: Accents, dialects, and speaking rate affect recognition consistency.
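One rough way to check whether out-of-vocabulary terms are a likely contributor to errors is to measure the share of reference words missing from the recognizer's vocabulary. The sketch below assumes a word-level vocabulary is available as a plain collection of strings; `oov_rate` and the sample data are hypothetical and not tied to any particular toolkit. For models built on subword units, this diagnostic is less meaningful, since most words can be composed from known pieces.

```python
from typing import Iterable


def oov_rate(reference_words: Iterable[str], vocabulary: Iterable[str]) -> float:
    """Fraction of reference words that do not appear in the vocabulary."""
    vocab = set(vocabulary)
    words = list(reference_words)
    if not words:
        raise ValueError("reference must contain at least one word")
    missing = [w for w in words if w not in vocab]
    return len(missing) / len(words)


if __name__ == "__main__":
    # Hypothetical vocabulary and transcript, for illustration only.
    vocabulary = {"please", "call", "doctor"}
    reference = "please call doctor okonkwo".split()
    print(f"OOV rate: {oov_rate(reference, vocabulary):.2f}")  # 1 of 4 words is OOV
```

A high OOV rate on a test set, paired with a high substitution count, suggests that vocabulary coverage rather than acoustics is the main problem.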
Recent Models & Developments
- IBM Granite Speech 4.1: Open model series spanning language, vision, speech, and embeddings; benchmarks relevant to enterprise-grade accuracy and latency.
- Evaluation of performance characteristics and enterprise applications: IBM Granite Speech 4.1 ASR Models: Features, Accuracy, and Enterprise Applications.
- Comparison contexts include speed-to-accuracy trade-offs highlighted in “Is This The Fastest ASR?” analysis by Sam Witteveen.
See Also
- Transcription Quality
- Speech-to-Text
- Model Benchmarking