# Audio Data Preprocessing for Deep Learning
*Posted on September 21, 2023*
## Introduction
I recently began working on an open source Speech-to-Text neural network. I thought this would be an interesting exercise, not only for deepening my knowledge of deep learning, but also because of my background in audio engineering. In university, I studied sound design and engineering for applications in both film and music production (and still do so as a hobby), so for me this is a great synthesis of my past and current fields of expertise. I wanted to write about my progress so far.

Hopefully, documenting my thought process and journey in this blog post can help anyone looking to start or further their own adventures in deep learning, and spark some ideas.

In this post, I'll just be covering the preparation and data preprocessing phase. My next post will be focused on tackling the neural network architecture and training of the model itself.
## The Theory
Over the years, I've become quite familiar with the world of digital audio. However, not every software engineer has had much exposure to it, so first I want to clarify some basic terms and concepts often used in this space, in order to provide a foundation for working with audio in deep learning.

### Introduction to Digital Audio
As we all know, sound is the perception created by our ears when they receive waves of varying air pressure. To capture these sound waves digitally, we take measurements or **samples** of the air pressure (the amplitude of the wave) at very short, regular intervals; the number of samples taken per second is called the **sample rate**.

Sound waves are often complex and oscillate at very high frequencies. The limit of human hearing is around 20kHz (20,000 cycles/second), and due to a concept called the [Nyquist limit](https://mathworld.wolfram.com/NyquistFrequency.html) (see also [Wikipedia](https://en.wikipedia.org/wiki/Nyquist_rate)), we need to sample at double that frequency in order to reproduce the highest frequencies accurately. After some padding, we end up slightly north of a minimum 40kHz sample rate. Common sample rates for most audio and music applications are 44.1kHz and 48kHz; higher rates are used in audio production environments and other special applications.

Every time we sample the wave, we store the amplitude value (as either an integer or a floating point number, depending on the format) in a process known as **quantization**. We can choose how many bits to use to store this value, which determines how many "slots" we have available, referred to as the **bit depth**. The bit depth determines the "resolution" of the stored signal: an audio signal quantized at a bit depth of 8 bits, for example, has far fewer options (2^8 = 256 "slots") than a signal quantized at 16 bits (2^16 = 65,536 "slots"). Because the bit depth is limited, we lose precision: each value is rounded to the nearest "slot", discarding some of the detail in the waveform. A bit depth of 16 bits is usually plenty for most audio files, but can be higher before production rendering (called "bouncing") or for applications requiring higher degrees of precision.
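To make bit depth concrete, here's a minimal numpy sketch (my own illustration, not code from the project) that quantizes a sine wave at two bit depths and compares the rounding error:

```python
import numpy as np

def quantize(signal, bits):
    """Round a [-1, 1] float signal to the nearest of 2**bits slots."""
    levels = 2 ** bits
    # Map [-1, 1] onto integer slots 0..levels-1, round, then map back.
    slots = np.round((signal + 1) / 2 * (levels - 1))
    return slots / (levels - 1) * 2 - 1

t = np.linspace(0, 1, 1000, endpoint=False)
sine = np.sin(2 * np.pi * 5 * t)

q8 = quantize(sine, 8)    # 256 slots
q16 = quantize(sine, 16)  # 65,536 slots

# Rounding error shrinks as the bit depth grows.
err8 = np.max(np.abs(sine - q8))
err16 = np.max(np.abs(sine - q16))
```

At 8 bits the worst-case error is half a slot (about 0.004 on a full-scale signal); at 16 bits it is roughly 256 times smaller, which is why 16-bit audio is transparent for most listening purposes.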
<div class="image-row">

### Image credit: [ResearchGate](https://www.researchgate.net/publication/236896587/figure/fig1/AS:341768278167573@1458495315947/Quantization-of-a-sine-wave-in-an-ideal-quantizer-with-quantum-size-1-and-dc-offset-0.png)
</div>
Understanding these concepts is essential for deep learning, as sample rates and bit depths determine the quality of the source audio and play an integral part in the data processing pipeline, covered in the next section.
With that introduction, we can move onto more relevant concepts for deep learning.
### Time vs. Frequency Domain
The samples stored in an audio file allow us to recreate the waveform, and give us access to time domain features of a sound, such as transients (volume spikes), envelopes (amplitude changes over time), and rhythm (measured in beats per minute or "bpm"). However, there is more information about a sound that we can access. **Instead of looking at amplitude over time as given in a waveform (the time domain), we can look at amplitude in relation to frequency (the frequency domain)** in order to glean more understanding from a sound, as commonly seen in EQs and spectrograms.
This is where the magic of the Fourier Transform comes in. Thanks to the [Fourier Theorem](http://hyperphysics.phy-astr.gsu.edu/hbase/Audio/fourier.html), we know that complex sound waves can be decomposed into pure sine waves of different frequencies, each with amplitude and phase coefficients that express its contribution to the composite signal.

<div class="image-row">

### Image credit: [NTI Audio - Fast Fourier Transform (FFT) Basics](https://www.nti-audio.com/en/support/know-how/fast-fourier-transform-fft)
</div>
Using [Fourier analysis](https://en.wikipedia.org/wiki/Fourier_analysis), we can get this information after applying an algorithm called the Fast Fourier Transform (FFT) that will convert our waveform from the time domain to the frequency domain. However, when we do this domain conversion, we lose the valuable information from the time domain mentioned earlier. Is there a hybrid approach in which we can access frequency information across time, giving us the best of both worlds?
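As a quick illustration (my own sketch, using numpy's FFT directly rather than Librosa), decomposing a synthetic two-tone signal recovers exactly the frequencies it was built from:

```python
import numpy as np

sr = 8000                       # sample rate in Hz
t = np.arange(sr) / sr          # one second of sample times
# Composite signal: a 440 Hz sine plus a quieter 1000 Hz sine.
y = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

spectrum = np.abs(np.fft.rfft(y))            # amplitude per frequency bin
freqs = np.fft.rfftfreq(len(y), d=1 / sr)    # bin center frequencies in Hz

# The two largest peaks sit exactly at the component frequencies.
top_two = sorted(freqs[np.argsort(spectrum)[-2:]])  # → [440.0, 1000.0]
```

With a one-second window the frequency bins are exactly 1 Hz apart, which is why the peaks land precisely on 440 and 1000 here.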
It turns out there is, and it's called the Short Time Fourier Transform (STFT). This algorithm cuts up an audio file into many frames and performs an FFT at a specified interval along the waveform. Using this output, we now have an array of mini FFTs, giving us valuable information about how the frequencies that make up an audio signal change over time.
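In practice I'll use `librosa.stft` for this, but the idea is simple enough to sketch by hand (a toy version for illustration, not Librosa's implementation): slice the signal into overlapping windowed frames and FFT each one.

```python
import numpy as np

def simple_stft(y, frame_len=1024, hop=256):
    """Toy STFT: overlapping Hann-windowed frames, one FFT per frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)   # shape: (n_frames, frame_len//2 + 1)

sr = 16000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t)          # one second of a 440 Hz tone

S = simple_stft(y)
# Each row is a "mini FFT"; the peak bin in the first frame sits near 440 Hz.
peak_hz = np.argmax(np.abs(S[0])) * sr / 1024
```

Note that `librosa.stft` returns the transposed layout, `(1 + n_fft // 2, n_frames)`, but the content is the same: frequency information per time frame.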
### Mel Frequency Cepstrum Coefficients
We now have valuable frequency information over time given to us by the STFT algorithm, but there is a slight issue.
The frequency response of the human ear is not linear. This means that the qualities a machine picks up from the frequency domain are not the same as what our ears pick up. For applications that are heavily tied to human perception, such as speech and music, it would be more useful to have data that is closer to the human experience.

This is where the [Mel Frequency Cepstrum (MFC)](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum) comes in. **The MFC provides a representation of a frequency spectrum that better matches human perception.** This is extremely useful when dealing with things such as speech or music, where our ears are especially sensitive to certain mid-range frequencies of the human voice, for example.

<div class="image-row">


### Image credit: [Imgur (left)](https://i.stack.imgur.com/GAoJs.jpg), [ResearchGate (right)](https://www.researchgate.net/publication/335398843/figure/fig1/AS:796124961058818@1566822390492/MFCC-mel-frequency-cepstral-coefficients-characteristic-vectors-extraction-flow.png)
</div>
Like the human ear, the MFC frequency spectrum is also non-linear, meaning not all frequency bands are the same size. Instead, the band sizes vary depending on the importance or distinctiveness that they play to our ears. This helps to create a more balanced series of coefficients, so that we don't end up trying to manually weight a mixture of coefficients of varying usefulness.
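To see those non-linear band sizes numerically, here's a small sketch using one common mel-scale formula (the HTK-style one; Librosa also supports a Slaney variant):

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel formula: equal mel steps ≈ equal perceived pitch steps.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Ten bands spaced evenly on the mel scale between 0 Hz and 8 kHz...
edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 11))

# ...come out narrow at low frequencies and wide at high frequencies,
# mirroring the ear's finer resolution in the low and mid range.
band_widths = np.diff(edges_hz)
```

The first bands span only a couple hundred Hz while the last span well over a kilohertz, which is exactly the "varying band sizes" described above.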
Using a complex algorithm, we can calculate these Mel Frequency Cepstrum Coefficients (MFCCs) and use the output as our final, perceptually accurate frequency data over time. This is the data on which we will train our Speech-to-Text neural network.

## The Implementation
With the foundational theory in place, I started diving into implementing a data pipeline for audio deep learning.
### The Dataset
After doing some research on commonly used audio datasets for Speech-to-Text applications, I've settled for now on the [LibriSpeech dataset](https://www.openslr.org/12), which contains around 1,000 hours of labelled audiobook speech.

Because this dataset is a collection of segments taken from audiobooks, it seemed useful for training on word morphology within a sentence context and for picking up on natural inflection and sentence flow. There are other datasets, such as [TIMIT](https://catalog.ldc.upenn.edu/LDC93s1), that focus more on phonemes than on natural language. This might be another option for speech-to-text, and I have yet to discover whether there is a good hybrid approach.

### Tools & Packages
I'm using a library called [Librosa](https://librosa.org/doc/latest/index.html) to perform all the audio analysis, such as MFCC extraction, spectrogram display, etc. It seems quite rigorous and has a good reputation in the community. They have a short paper describing their design principles and module implementation [here](https://conference.scipy.org/proceedings/scipy2015/pdfs/brian_mcfee.pdf).

For the network architecture and training, I'll be using PyTorch, which conveniently already includes the LibriSpeech dataset in its torchaudio datasets module. As mentioned earlier, I'll cover this in the next post.

To store the MFCC data, I'm using [h5py](https://www.h5py.org/) to write the NumPy array directly into an .hdf5 file as binary data, using the Hierarchical Data Format (HDF).
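As a sketch of what that looks like (the filename, attribute names, and values here are placeholders of my own, not the project's actual code):

```python
import os
import tempfile

import h5py
import numpy as np

mfcc = np.random.randn(13, 400).astype(np.float32)   # stand-in MFCC matrix

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "utterance_0001.hdf5")

    # Write the array as a dataset; small metadata rides along as attributes.
    with h5py.File(path, "w") as f:
        dset = f.create_dataset("mfcc", data=mfcc, compression="gzip")
        dset.attrs["transcript"] = "hello world"
        dset.attrs["speaker_id"] = 1089

    # Read it back to confirm a lossless round trip.
    with h5py.File(path, "r") as f:
        restored = f["mfcc"][:]
        transcript = f["mfcc"].attrs["transcript"]
```

Attributes are a good fit for per-utterance metadata like transcripts and speaker IDs, since they live right next to the array they describe.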
### Data Format
Speaking of HDF, I did some research on the benefits of using HDF over a text-based format such as a .csv or .json file.

Where HDF wins:
- Speed: hdf5 files are organized and optimized for fast data access, which is great for large datasets (apparently the speed gains are more pronounced on HDDs than on SSDs)
- Storage Size: hdf5 supports a flexible variety of data compression filters, which can be applied per dataset
- Chunking: allows you to load parts of a dataset into memory one chunk at a time, reducing memory usage and improving performance
- Flexibility: dataset size can be either fixed or flexible, so it's easy to append new data without creating a brand new copy
Where CSV or JSON wins:
- Readability: unlike hdf5, csv/json can be viewed directly in a text editor or terminal, without requiring specific software
- Toolkit agnostic: is usable across many platforms and languages
- Compression: a compressed csv file can (apparently) end up similar in size to an hdf5 file

With these factors in consideration: in my case I won't need to share the data externally, so my main concern is performance with large datasets, which makes HDF the clear winner here. I found an interesting article describing hdf5 in detail and comparing hdf5 and csv benchmarks for speed and compression [here](https://waterprogramming.wordpress.com/2023/06/22/intro-to-hdf5-h5py-and-comparison-to-csv-for-speed-compression/#:~:text=In%20general%2C%20the%20HDF5%20method,of%20the%20CSV%20times%2C%20respectively.); it's worth a look in my opinion.

### Preprocessing Workflow
Finally, here is my preprocessing workflow:
For each audio file in the subset:
1. Extract selected dataset features (transcript, speaker_id, etc.) to store along with MFCCs
2. Calculate MFCC data using Librosa
3. Store the data as a .hdf5 binary file using h5py:
- MFCCs as an int32 numpy array dataset
- Extracted features as dataset attributes
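The steps above can be sketched roughly like this (a simplified skeleton with made-up names; in the real pipeline, step 2 calls `librosa.feature.mfcc` on the loaded audio instead of taking precomputed arrays):

```python
import os
import tempfile

import h5py
import numpy as np

def preprocess(items, out_path):
    """Store one dataset per utterance: MFCCs as the array, features as attrs."""
    with h5py.File(out_path, "w") as out:
        for i, item in enumerate(items):
            # Step 2 in the real pipeline would be something like:
            #   mfcc = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=13)
            dset = out.create_dataset(f"utt_{i:04d}", data=item["mfcc"])
            # Steps 1 & 3: selected dataset features stored as attributes.
            dset.attrs["transcript"] = item["transcript"]
            dset.attrs["speaker_id"] = item["speaker_id"]

items = [{"mfcc": np.random.randn(13, 200).astype(np.float32),
          "transcript": "some words", "speaker_id": 42}]

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "preprocessed.hdf5")
    preprocess(items, out_path)
    with h5py.File(out_path, "r") as f:
        shape = f["utt_0000"].shape        # one MFCC matrix per utterance
```

Keeping each utterance in its own dataset makes it easy to append new ones later without rewriting the whole file.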
After running this, we now have our preprocessed data!
The nice thing is that with an audio data pipeline set up this way, we could extract other features or perform other kinds of analysis, without changing much of the code at all. This gives us the flexibility to adapt to whatever challenges we face in training.
### Final Thoughts
Writing this blog post was helpful in getting me to think critically about my process and justify the decisions made along the way. Having to explain why I chose to do what I did forced me to do my research and understand each step of the process, not just use the first answer I came across.
Looking forward to the next part, where I'll be diving into Recurrent Neural Network (RNN) architecture and training.

*James Vargo*
# Training A Speech-to-Text Neural Network
Speech-To-Text Recurrent Neural Network (RNN)
### Displaying the Data
In order to check out the sample data from the dataset and confirm its topology, I added a few arguments to the main function.
We can run the Python script with the `display` argument to get a sample output of our original data. This includes all the features like transcription, the raw samples and its shape, sample rate, duration, speaker ID, and more.
I also added a few optional flags for confirming the original data visually and audibly.
- `--waveform` will show a graph of the waveform, using Matplotlib
- `--spectrogram` will show a graph of the spectrogram (computed from STFTs, not MFCCs), using Librosa
- `--mfcc` will show a graph of the MFCCs, using Librosa
- `--play` will play the audio file
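A minimal argparse sketch of that interface (flag names taken from the list above; the actual script may differ):

```python
import argparse

parser = argparse.ArgumentParser(description="Inspect a dataset sample")
parser.add_argument("command", choices=["display", "read-mfcc"])
parser.add_argument("--waveform", action="store_true", help="plot the waveform")
parser.add_argument("--spectrogram", action="store_true",
                    help="plot the STFT spectrogram")
parser.add_argument("--mfcc", action="store_true", help="plot the MFCCs")
parser.add_argument("--play", action="store_true", help="play the audio file")

# e.g. `python inspect.py display --waveform --play`
args = parser.parse_args(["display", "--waveform", "--play"])
```

Using `store_true` flags keeps the visual and audible checks optional and independently combinable.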
After running this, we now have our preprocessed data! We've transformed the dataset into usable MFCC data, stored alongside the extracted features in performant persistent storage.

Using the `read-mfcc` argument in the Python script, I can confirm that the processed data has been stored properly and is readable by our model in a useful topology.

## Architecture
Input shape
- .
Layers
- GRU Layer
- GRU Layer
- Dense Layer
- Dropout Layer (to prevent overfitting)
- Dense Layer (softmax output)
Output shape
- .
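As a rough PyTorch sketch of that layer stack (all sizes are placeholders of my own: 13 MFCC coefficients in, a character-level softmax of 29 classes out; the real hyperparameters are still undecided at this draft stage):

```python
import torch
import torch.nn as nn

class SpeechToTextRNN(nn.Module):
    def __init__(self, n_mfcc=13, hidden=128, n_classes=29):
        super().__init__()
        # Two stacked GRU layers over the MFCC time series.
        self.gru = nn.GRU(n_mfcc, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, hidden)
        self.drop = nn.Dropout(0.3)        # to prevent overfitting
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x):                  # x: (batch, time, n_mfcc)
        h, _ = self.gru(x)
        h = self.drop(torch.relu(self.fc(h)))
        return torch.log_softmax(self.out(h), dim=-1)

model = SpeechToTextRNN()
log_probs = model(torch.randn(2, 100, 13))   # → shape (2, 100, 29)
```

Emitting per-frame log-probabilities like this is the usual input shape for a CTC-style loss, one common choice for speech-to-text training.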