Owing to their diverse range of applications, Text-To-Speech (TTS) and Speech-To-Text (STT) systems have been a central focus in the domain of audio processing. Using such systems to address medical problems, specifically those related to speech impairment, is of particular importance. In this regard, the objective of this project is to present a system that converts impaired speech into text and then synthesizes understandable speech from that text.
The problem addressed in this project is the inability of people with speech impairments to express themselves and communicate with other members of society. The aim is to bridge this communication gap using Deep Learning (DL), Natural Language Processing (NLP), and audio processing. The choice of Deep Learning as the core methodology followed from an extensive literature review of both conventional and state-of-the-art models developed for this task.
Although conventional models provided a vital foundation, recent Machine Learning (ML) and Deep Learning-based approaches outperform them in terms of accuracy and overall performance.
The dataset was obtained from the UA-Speech Database, which contains impaired speech samples from patients with dysarthria along with intelligible speech recordings that serve as control data. The data was preprocessed and normalized so that all speech samples have a uniform length. At this stage, a Deep Learning-based Convolutional Neural Network (CNN) with additional dense and dropout layers was implemented.
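To make this step concrete, the sketch below illustrates one plausible preprocessing and model-definition pipeline: audio samples are padded or trimmed to a uniform length, converted to MFCC features, and fed to a CNN with dense and dropout layers. The feature choice, sample rate, clip length, layer sizes, and number of word classes are illustrative assumptions, not the exact configuration used in this work.

```python
# Illustrative sketch only: uniform-length padding, feature extraction, and a
# CNN with dense and dropout layers. Sample rate, clip length, MFCC settings,
# layer sizes, and the number of word classes are assumptions.
import numpy as np
import librosa
import tensorflow as tf

SAMPLE_RATE = 16000              # assumed sampling rate
CLIP_SAMPLES = SAMPLE_RATE * 2   # pad/trim every sample to 2 seconds
NUM_WORDS = 10                   # hypothetical number of word classes

def preprocess(path: str) -> np.ndarray:
    """Load a speech sample, force a uniform length, and extract MFCC features."""
    audio, _ = librosa.load(path, sr=SAMPLE_RATE)
    audio = librosa.util.fix_length(audio, size=CLIP_SAMPLES)   # uniform length
    mfcc = librosa.feature.mfcc(y=audio, sr=SAMPLE_RATE, n_mfcc=40)
    return mfcc[..., np.newaxis]                                # channel axis for Conv2D

def build_word_classifier(input_shape, num_classes=NUM_WORDS):
    """Convolutional layers followed by dense and dropout layers and a softmax output."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=input_shape),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
```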
The model was trained on the impaired speech data of three patients and achieved 97% accuracy on the test set. Since the model has reached high accuracy on word-level speech, future work will focus on sentence-level impaired speech, further improving accuracy, and feeding the generated text into the text-to-speech model. For the TTS module, a zero-shot learning technique was employed.
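Reusing build_word_classifier from the sketch above, a minimal training and evaluation loop consistent with this setup might look as follows; the optimizer, loss, batch size, epoch count, and the placeholder data shapes are assumptions rather than the project's actual hyperparameters.

```python
import numpy as np

# Placeholder arrays standing in for preprocessed MFCC features and word labels;
# the shapes and label count are illustrative only.
x_train = np.random.rand(300, 40, 63, 1).astype("float32")
y_train = np.random.randint(0, 10, size=300)
x_test = np.random.rand(60, 40, 63, 1).astype("float32")
y_test = np.random.randint(0, 10, size=60)

model = build_word_classifier(input_shape=x_train.shape[1:])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, validation_split=0.1, epochs=30, batch_size=32)

test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test accuracy: {test_acc:.2%}")  # the reported result in this work is 97%
```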
The TTS architecture is based on three independently trained components: a speaker encoder, a sequence-to-sequence synthesis network, and a vocoder network. The proposed model achieves a Mean Opinion Score of 3.03 ± 0.09 for unseen speakers. In this way, impaired speech is first converted to text and then synthesized into understandable speech.
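A hedged sketch of how these three components might be chained with the speech-to-text stage is given below. The class and function names are hypothetical placeholders that mirror the described architecture (speaker encoder, sequence-to-sequence synthesizer, vocoder); they are not the actual interfaces used in this project.

```python
# Hypothetical interfaces for the three-stage zero-shot TTS pipeline and the
# overall impaired-speech-to-clear-speech flow described in the text.
import numpy as np

class SpeakerEncoder:
    def embed(self, reference_audio: np.ndarray) -> np.ndarray:
        """Map a short reference utterance to a fixed-dimensional speaker embedding."""
        raise NotImplementedError

class Synthesizer:
    def synthesize(self, text: str, speaker_embedding: np.ndarray) -> np.ndarray:
        """Sequence-to-sequence network: text plus speaker embedding -> mel spectrogram."""
        raise NotImplementedError

class Vocoder:
    def to_waveform(self, mel_spectrogram: np.ndarray) -> np.ndarray:
        """Convert the mel spectrogram into a time-domain waveform."""
        raise NotImplementedError

def impaired_speech_to_clear_speech(audio, recognizer, encoder, synthesizer, vocoder):
    """End-to-end flow: impaired speech -> text -> understandable synthesized speech."""
    text = recognizer(audio)                       # CNN-based word recognition (STT stage)
    embedding = encoder.embed(audio)               # speaker characteristics, zero-shot
    mel = synthesizer.synthesize(text, embedding)  # seq2seq synthesis network
    return vocoder.to_waveform(mel)                # vocoder produces the final audio
```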