Automated Speech Recognition (ASR) allows a computer to understand what callers are saying despite differences in accent, pronunciation, and voice characteristics that vary with age, gender, or region. At a high level, speech recognition works by applying the following processes:
- Capture and Digitization – What a caller says must first be converted into a digital (binary) signal. Locating where the utterance starts and stops within that signal is known as speech detection or end-pointing.
- Spectral Representation – Once digitized, the utterance is converted into waveforms, which statistically “map” the human voice.
- Modeling – Begins the process of applying meaning by statistically dividing each utterance into sections for individual processing.
- Phonetic Classification – Assignment of meaning starts with assigning sounds, or “phonemes”, to the captured waveforms (for example, “one” = /w/ /ah/ /n/ and “two” = /t/ /uw/). Accurately pairing a phoneme to a waveform requires a repository of acoustic models that represent a broad range of possible sounds.
- Search & Match – Vocabularies and grammars are the foundation of “understanding” for a computer system. A vocabulary defines words; a grammar defines how those words relate to a sentence or a phrase. Recognition succeeds when the utterance matches an entry with a sufficiently high “confidence score”.
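
Two of the steps above, end-pointing and Search & Match, can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not how a production recognizer works: the vocabulary, the grammar, the frame size, and both thresholds are hypothetical values chosen for illustration, end-pointing is reduced to a simple energy check, and the confidence score is approximated with a generic sequence-similarity ratio from the standard library rather than with acoustic or statistical models.

```python
from difflib import SequenceMatcher

# Hypothetical vocabulary (pronunciation lexicon): word -> phonemes.
VOCABULARY = {
    "one": ["w", "ah", "n"],
    "two": ["t", "uw"],
}

# Hypothetical grammar: the only phrases this toy application accepts.
GRAMMAR = [["one"], ["two"], ["one", "two"]]


def find_endpoints(samples, frame_size=4, threshold=0.1):
    """Energy-based end-pointing.

    Returns (start, end) sample indices bounding the frames whose mean
    energy exceeds the threshold, or None if no frame does.
    """
    active = []
    for i in range(0, len(samples), frame_size):
        frame = samples[i:i + frame_size]
        energy = sum(s * s for s in frame) / len(frame)
        if energy > threshold:
            active.append(i)
    if not active:
        return None
    return active[0], min(active[-1] + frame_size, len(samples))


def phrase_phonemes(phrase):
    """Expand a grammar phrase into its phoneme sequence via the vocabulary."""
    return [p for word in phrase for p in VOCABULARY[word]]


def recognize(phonemes, min_confidence=0.6):
    """Search the grammar for the phrase that best matches the phonemes.

    The similarity ratio (0.0 to 1.0) stands in for a confidence score.
    Returns (phrase, score), with phrase set to None on a rejection.
    """
    best_phrase, best_score = None, 0.0
    for phrase in GRAMMAR:
        score = SequenceMatcher(None, phonemes, phrase_phonemes(phrase)).ratio()
        if score > best_score:
            best_phrase, best_score = phrase, score
    if best_score < min_confidence:
        return None, best_score
    return " ".join(best_phrase), best_score


if __name__ == "__main__":
    # Silence, a burst of "speech", then silence again.
    signal = [0.0] * 8 + [0.5, -0.5, 0.5, -0.5] * 2 + [0.0] * 4
    print(find_endpoints(signal))
    print(recognize(["w", "ah", "n"]))       # exact match for "one"
    print(recognize(["w", "ah", "m"]))       # last phoneme misclassified
    print(recognize(["s", "ih", "k", "s"]))  # outside the grammar
```

In this sketch a misclassified final phoneme still resolves to “one” with a score of about 0.67, while a phoneme sequence outside the grammar scores 0.0 and is rejected. Raising `min_confidence` trades fewer false accepts for more rejections of valid speech.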