
Training A Speech-to-Text Neural Network

Speech-To-Text Recurrent Neural Network (RNN)

Displaying the Data

To inspect sample data from the dataset and confirm its topology, I added a few arguments to the main function.

Running the Python script with the display argument prints a sample of our original data, including features like the transcription, the raw samples and their shape, the sample rate, duration, speaker ID, and more.
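As a minimal sketch of what the display argument reports, the helper below prints the same features for a synthetic one-second tone standing in for a real utterance; the function name, the transcription, and the speaker ID here are assumptions for illustration, not the project's actual code.

```python
import numpy as np

# Hypothetical sample record: a 1-second 440 Hz tone in place of real speech.
sample_rate = 16000
samples = np.sin(2 * np.pi * 440 * np.arange(sample_rate) / sample_rate).astype(np.float32)

def describe_sample(samples, sample_rate, transcription, speaker_id):
    """Collect the features the display argument would print for one sample."""
    return {
        "transcription": transcription,
        "shape": samples.shape,
        "sample_rate": sample_rate,
        "duration_s": samples.shape[0] / sample_rate,
        "speaker_id": speaker_id,
    }

info = describe_sample(samples, sample_rate, "hello world", 1234)
for key, value in info.items():
    print(f"{key}: {value}")
```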

I also added a few optional flags for confirming the original data visually and audibly.

  • --waveform will show a graph of the waveform, using Matplotlib
  • --spectrogram will show a graph of the spectrogram (computed from STFTs, not MFCCs), using Librosa
  • --mfcc will show a graph of the MFCCs, using Librosa
  • --play will play the audio file
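The flags above could be wired up with argparse along these lines; this is a sketch under the assumption that the script takes a positional command plus boolean flags, and the exact option wiring in the real script may differ.

```python
import argparse

# Hypothetical CLI wiring for the inspection flags described above.
parser = argparse.ArgumentParser(description="Inspect a sample from the dataset")
parser.add_argument("command", choices=["display", "read-mfcc"],
                    help="what to do with the dataset")
parser.add_argument("--waveform", action="store_true",
                    help="plot the waveform with Matplotlib")
parser.add_argument("--spectrogram", action="store_true",
                    help="plot the STFT spectrogram with Librosa")
parser.add_argument("--mfcc", action="store_true",
                    help="plot the MFCCs with Librosa")
parser.add_argument("--play", action="store_true",
                    help="play the audio file")

# Example invocation: python script.py display --waveform --play
args = parser.parse_args(["display", "--waveform", "--play"])
print(args.command, args.waveform, args.spectrogram, args.mfcc, args.play)
```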

After running this, we have our preprocessed data: the dataset has been transformed into usable MFCC data, stored alongside the extracted features in fast persistent storage.

Using the read-mfcc argument in the Python script, I can confirm that the processed data has been stored properly and is readable by our model in a useful topology.
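The read-mfcc check amounts to loading the stored coefficients back and verifying their shape. A minimal sketch, assuming the MFCCs were saved as .npy files (the project's actual storage format may differ):

```python
import os
import tempfile
import numpy as np

# Stand-in for one preprocessed sample: 13 MFCC coefficients over 200 frames.
n_mfcc, n_frames = 13, 200
mfccs = np.random.randn(n_mfcc, n_frames).astype(np.float32)

# Persist to disk, then read it back the way read-mfcc would.
path = os.path.join(tempfile.mkdtemp(), "sample_0.npy")
np.save(path, mfccs)
loaded = np.load(path)

# Confirm the topology survived the round trip.
print(loaded.shape, loaded.dtype)
```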

Architecture

Input shape

  • .

Layers

  • GRU Layer
  • GRU Layer
  • Dense Layer
  • Dropout Layer (to prevent overfitting)
  • Dense Layer (softmax output)
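The layer stack above can be sketched as a small PyTorch module (the original project may use a different framework); the hidden size, dropout rate, MFCC count, and the 29-class output (e.g. 26 letters plus space, apostrophe, and a blank token) are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class SpeechToTextRNN(nn.Module):
    """Two GRU layers, a dense layer, dropout, then a softmax output layer."""

    def __init__(self, n_mfcc=13, hidden=128, n_classes=29):
        super().__init__()
        self.gru1 = nn.GRU(n_mfcc, hidden, batch_first=True)   # GRU Layer
        self.gru2 = nn.GRU(hidden, hidden, batch_first=True)   # GRU Layer
        self.fc1 = nn.Linear(hidden, hidden)                   # Dense Layer
        self.dropout = nn.Dropout(0.5)                         # prevents overfitting
        self.fc2 = nn.Linear(hidden, n_classes)                # Dense Layer (softmax output)

    def forward(self, x):
        # x: (batch, time, n_mfcc) -> per-frame class probabilities
        x, _ = self.gru1(x)
        x, _ = self.gru2(x)
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)
        return torch.softmax(self.fc2(x), dim=-1)

model = SpeechToTextRNN()
out = model(torch.randn(2, 200, 13))  # batch of 2 clips, 200 frames of 13 MFCCs
print(out.shape)
```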

Output shape

  • .