- Automatic Speech Recognition
  - Spoken words to machine-readable form
- Natural language understanding
  - High level cognitive interpretation
    - Structure
    - Meaning
    - Intention
 
Automatic Speech Recognition
Applications
- Business/desktop apps
  - Dictation
  - Voice commands
- Voice enabled services/apps
  - Siri
- Home automation
- Game & Entertainment
- Education
- Speech therapy/Rehab
- Hearing assistance
  - Live CC
 
Challenges
- Speaker dependency
  - Accent
  - Emotion
- Vocab size
  - Slang
- Isolated words vs Continuous speech
  - Hard to segment continuous speech
- Language constraints & Knowledge sources
  - Training source is critical
- Acoustic ambiguity
  - Similar sounding speech
- Noise robustness
  - Background noise
  - Reverberation
 
Speaker Diarisation
- Who speaks when?
- Split stream into homogeneous segments by speaker identity
- Structure stream into speaker turns
- Provide speaker identity
- Combination of
  - Speaker segmentation
    - Speaker changes in stream
  - Speaker clustering
    - Grouping segments together on basis of characteristics
- Gaussian mixture model
  - HMM
- Bottom-up
  - More popular
  - Succession of clusters
  - Merge redundant clusters
    - Remaining belong to speakers
- Top-down
  - Single cluster
  - Iteratively split until speaker clusters
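
The bottom-up approach can be illustrated with a minimal sketch. It is not a real diarisation system: it assumes each segment has already been reduced to a small feature vector, uses Euclidean distance between cluster centroids as the merge criterion, and stops at a hand-picked `threshold`. The function names and the toy data are hypothetical; real systems typically use GMM/HMM-based models and statistical stopping criteria instead.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def bottom_up_clusters(segments, threshold):
    """Start with one cluster per segment, then repeatedly merge the two
    closest clusters until no pair is closer than `threshold`.
    Returns a list of clusters, each a list of segment indices."""
    clusters = [[i] for i in range(len(segments))]
    while len(clusters) > 1:
        best = None  # (distance, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = centroid([segments[i] for i in clusters[a]])
                cb = centroid([segments[i] for i in clusters[b]])
                d = dist(ca, cb)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break  # remaining clusters are treated as distinct speakers
        clusters[a] = clusters[a] + clusters[b]  # merge redundant cluster
        del clusters[b]
    return clusters


# Toy feature vectors for five segments (assumed, for illustration only)
segments = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.1, 4.9), (0.1, 0.2)]
for speaker, members in enumerate(bottom_up_clusters(segments, threshold=1.0)):
    print(f"speaker {speaker}: segments {sorted(members)}")
```

With this toy input, segments 0, 2 and 4 end up in one cluster and segments 1 and 3 in another, so each remaining cluster stands for one speaker. A top-down variant would run the other way: start from a single cluster and split it until each cluster matches one speaker.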