Audio Data Preprocessing for Deep Learning
Posted on September 21, 2023
Introduction
I recently began the process of working on an open source Speech-to-Text neural network. I thought this would be an interesting exercise not only for deepening my knowledge of deep learning, but also because of my background in audio engineering. In university, I studied sound design and engineering for applications in both film and music production (and still do so as a hobby), so for me this is a great synthesis of my past and current fields of expertise. So I wanted to write about my progress so far.
Hopefully, documenting my thought process and journey here in this blog post can help anyone looking to start or further their own adventures in deep learning, and to stimulate some ideas.
In this post, I'll just be covering the preparation and data preprocessing phase. My next post will be focused on tackling the neural network architecture and training of the model itself.
The Theory
Over the years, I've become quite familiar with the world of digital audio. However, not every software engineer may have had much exposure, so first I want to clarify some basic terms and concepts often used in this space in order to provide a foundation for working with audio in deep learning.
Introduction to Digital Audio
As we all know, sound is the perception created by our ears when they receive waves of varying air pressure. To capture these sound waves digitally, we take measurements or samples of the air pressure (the amplitude of the wave) at very short intervals; the number of samples taken per second is called the sample rate.
Sound waves are often complex and oscillate at very high frequencies. The limit of human hearing is around 20kHz (20,000 cycles/second), and due to a concept called the Nyquist limit (see also Wikipedia), we need to sample at double that frequency in order to reproduce the highest audible frequencies accurately, putting the minimum at 40kHz. With some headroom added on top of that, common sample rates for most audio and music applications are 44.1kHz or 48kHz, and rates can be higher in audio production environments or in special applications.
Every time we sample the wave, we store the amplitude value (as either an integer or floating point number, depending on the format) in a process known as quantization. We can choose how many bits to use to store this value, which determines how many "slots" we have available, referred to as bit depth. The bit depth determines the "resolution" of the stored signal: an audio signal quantized at a bit depth of 8 bits, for example, has far fewer options (2^8 = 256 "slots") than a signal quantized at 16 bits (2^16 = 65,536 "slots"). Because the bit depth is limited, we lose precision as values are rounded to the nearest "slot", losing some of the detailed information in the waveform. A bit depth of 16 bits is usually plenty for most audio files, but higher depths are common during production, before the final render (called "bouncing"), or in applications requiring higher precision.
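To make that concrete, here's a minimal NumPy sketch (the quantize helper is my own, purely for demonstration) showing how the rounding error shrinks as bit depth grows:

```python
import numpy as np

def quantize(signal, bit_depth):
    """Round a [-1.0, 1.0] float signal to the nearest of 2**bit_depth levels."""
    scale = 2 ** (bit_depth - 1) - 1
    # Scale up to integer "slots", round, then scale back down
    return np.round(signal * scale) / scale

t = np.linspace(0, 1, 44100, endpoint=False)  # one second at 44.1kHz
sine = np.sin(2 * np.pi * 440 * t)            # a 440Hz test tone

coarse = quantize(sine, 8)   # 256 slots
fine = quantize(sine, 16)    # 65,536 slots

# The maximum rounding error (quantization noise) drops with bit depth
print(np.abs(sine - coarse).max())  # roughly 0.004
print(np.abs(sine - fine).max())    # roughly 0.000015
```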
Image credit: ResearchGate
Understanding these concepts is essential for deep learning, as sample rates and bit depths determine the quality of the source audio and play an integral part in the data processing pipeline, covered in the next section.
With that introduction, we can move on to more relevant concepts for deep learning.
Time vs. Frequency Domain
The samples stored in an audio file allow us to recreate the waveform, and give us access to time domain features of a sound, such as transients (volume spikes), envelopes (amplitude changes over time), and rhythm (measured in beats per minute or "bpm"). However, there is more information about a sound that we can access. Instead of looking at amplitude over time as given in a waveform (the time domain), we can look at amplitude in relation to frequency (the frequency domain) in order to glean more understanding from a sound, as commonly seen in EQs and spectrograms.
This is where the magic Fourier Transform comes in. Thanks to the Fourier Theorem, we know that complex sound waves can be decomposed into pure sine waves of different frequencies, and given amplitude and phase coefficients that express their contribution to the composite signal.
Image credit: NTI Audio - Fast Fourier Transform (FFT) Basics
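To see the theorem in action, here's a minimal NumPy sketch using np.fft.rfft (NumPy's Fast Fourier Transform routine, introduced more formally below) that recovers the two sine components of a composite signal:

```python
import numpy as np

sr = 44100                                      # sample rate in Hz
t = np.linspace(0, 1, sr, endpoint=False)       # one second of time steps

# A composite wave: a 440Hz sine plus a quieter 880Hz sine
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

spectrum = np.fft.rfft(signal)                  # time domain -> frequency domain
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)  # frequency of each bin in Hz
magnitudes = np.abs(spectrum)                   # amplitude contribution per bin

# The two strongest bins land exactly on the component frequencies
top_two = np.sort(freqs[np.argsort(magnitudes)[-2:]])
print(top_two)  # [440. 880.]
```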
Using Fourier analysis, we can get this information after applying an algorithm called the Fast Fourier Transform (FFT) that will convert our waveform from the time domain to the frequency domain. However, when we do this domain conversion, we lose the valuable information from the time domain mentioned earlier. Is there a hybrid approach in which we can access frequency information across time, giving us the best of both worlds?
It turns out there is, and it's called the Short Time Fourier Transform (STFT). This algorithm cuts up an audio file into many frames and performs an FFT at a specified interval along the waveform. Using this output, we now have an array of mini FFTs, giving us valuable information about how the frequencies that make up an audio signal change over time.
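Here's a rough sketch of what that looks like with Librosa (the file path is hypothetical, and the n_fft and hop_length values are just common choices for 16kHz speech):

```python
import numpy as np
import librosa

# Load a clip at 16kHz (the path here is hypothetical)
y, sr = librosa.load("speech_sample.flac", sr=16000)

# Slice the waveform into 512-sample frames, hopping 160 samples
# (10ms) between them, and run an FFT on each frame
stft = librosa.stft(y, n_fft=512, hop_length=160)
magnitudes = np.abs(stft)  # drop the phase, keep per-bin amplitudes

# Rows are frequency bins, columns are time frames: one mini FFT per frame
print(magnitudes.shape)  # (257, number_of_frames)
```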
Mel Frequency Cepstrum Coefficients
We now have valuable frequency information over time given to us by the STFT algorithm, but there is a slight issue.
The frequency response of the human ear is not linear. This means that the qualities a machine picks up from the frequency domain are not the same as what our ears pick up. For applications that are heavily tied to human perception, such as speech and music, it would be more useful to have data that more closely matches the human experience.
This is where the Mel Frequency Cepstrum (MFC) comes in. The MFC provides a representation of a frequency spectrum that better matches our human perception. This is extremely useful when dealing with things like speech or music, where our ears are more sensitive to certain mid-range frequencies of the human voice, for example.
Image credit: Imgur (left), ResearchGate (right)
Like the human ear, the MFC frequency spectrum is also non-linear, meaning not all frequency bands are the same size. Instead, the band sizes vary depending on how important or distinct those frequencies are to our ears. This helps create a more balanced series of coefficients, so that we don't end up trying to manually weight a mixture of coefficients of varying usefulness.
Using a complex algorithm, we can calculate these Mel Frequency Cepstrum Coefficients (MFCCs) and use the output as our final, perceptually weighted frequency data over time. This is the data on which we will train our Speech-to-Text neural network.
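In practice, Librosa wraps all of that complexity in a single call. A minimal sketch (again with a hypothetical file path):

```python
import librosa

# Load a clip at 16kHz (hypothetical path; LibriSpeech ships 16kHz FLAC)
y, sr = librosa.load("speech_sample.flac", sr=16000)

# 13 coefficients per frame is a common starting point for speech tasks
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Rows are coefficients, columns are time frames
print(mfccs.shape)  # (13, number_of_frames)
```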
The Implementation
With the foundational theory in place, I started diving into implementing a data pipeline for audio deep learning.
The Dataset
After doing some research on commonly used audio datasets for Speech-to-Text applications, for now I've settled on using the LibriSpeech dataset, which contains around 1,000 hours of labelled audiobook speech.
Because this dataset is a collection of segments taken from audiobooks, it seemed useful for training on word morphology within a sentence context and for picking up on natural inflection and sentence flow. There are other datasets, such as TIMIT, which focuses more on phonemes than on natural language. That might be another option for speech-to-text, and I have yet to discover if there is a good hybrid approach.
Tools & Packages
I'm using a library called Librosa to perform all the audio analysis, such as MFCC extraction, spectrogram display, etc. It seems quite rigorous and has a good reputation in the community. They have a short paper describing their design principles and module implementation here.
For the network architecture and training, I'll be using PyTorch, which conveniently already includes the LibriSpeech dataset in its torchaudio datasets module. I'll cover that in the next post, as mentioned previously.
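In the meantime, loading the 100-hour clean training split through torchaudio looks roughly like this (a sketch; the root directory is whatever you choose):

```python
import torchaudio

# Download the 100-hour "clean" training split to a local directory
dataset = torchaudio.datasets.LIBRISPEECH(
    "./data", url="train-clean-100", download=True
)

# Each item bundles the waveform together with its labels
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]
print(sample_rate, transcript)  # 16000 and the spoken text
```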
To store the MFCC data, I'm using h5py to write the NumPy arrays directly to a .hdf5 file as binary data; .hdf5 files use the Hierarchical Data Format (HDF).
Data Format
Speaking of HDF, I did some research on the benefits of using HDF over a text-based format such as a .csv or .json file.
Where HDF wins:
- Speed: hdf5 files are organized for fast data access, which is great for large datasets (apparently the speed gains are more noticeable on HDDs than on SSDs)
- Storage Size: hdf5 files can be compressed using a flexible variety of data compression filters, applied to each individual dataset
- Chunking: allows you to load parts of a dataset into memory one chunk at a time, reducing memory usage and improving read performance
- Flexibility: dataset size can be either fixed or flexible, so it's easy to append new data without creating a brand new copy
Where CSV or JSON wins:
- Readability: unlike hdf5, a csv or json file can be viewed directly in a text editor or terminal and doesn't require special software
- Toolkit agnostic: usable across many platforms and languages
- Size: a compressed csv file can (apparently) end up similar in size to an hdf5 file
With these factors in mind: in my case I won't need to share the data externally, so my main concern is performance with large datasets, which makes HDF the clear winner here. I found an interesting article describing hdf5 in detail and comparing hdf5 and csv benchmarks for speed and compression here; it's worth a look in my opinion.
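To tie those points to actual code, here's a minimal h5py sketch (the file and dataset names are my own placeholders) exercising the compression, chunking, and resizability features:

```python
import h5py
import numpy as np

mfccs = np.random.rand(13, 400).astype(np.float32)  # placeholder MFCC-shaped data

with h5py.File("features.hdf5", "w") as f:
    # maxshape=(13, None) keeps the time axis resizable, so new frames
    # can be appended later without rewriting the whole file
    f.create_dataset(
        "mfccs",
        data=mfccs,
        maxshape=(13, None),
        chunks=True,              # let h5py pick a chunk size
        compression="gzip",       # per-dataset compression filter
    )

with h5py.File("features.hdf5", "r") as f:
    # Chunked reads: only the requested slice is pulled into memory
    first_frames = f["mfccs"][:, :100]
    print(first_frames.shape)  # (13, 100)
```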
Preprocessing Workflow
Finally, here is my preprocessing workflow (a code sketch follows the list):
For each audio file in the subset:
- Extract selected dataset features (transcript, speaker_id, etc.) to store along with MFCCs
- Calculate MFCC data using Librosa
- Store the data as a .hdf5 binary file using h5py:
- MFCCs as a float32 NumPy array dataset
- Extracted features as dataset attributes
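Here's a condensed sketch of that workflow (the helper name and label lists are hypothetical; in practice the transcripts and speaker IDs come from the LibriSpeech metadata, and the coefficients are stored as float32 since Librosa returns floating point values):

```python
import glob
import h5py
import librosa

def preprocess(audio_paths, transcripts, speaker_ids, out_path):
    """Extract MFCCs for each clip and store them, with labels, in one HDF file."""
    with h5py.File(out_path, "w") as f:
        for i, path in enumerate(audio_paths):
            y, sr = librosa.load(path, sr=16000)
            mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

            # One dataset per clip, with the extracted features as attributes
            dset = f.create_dataset(f"clip_{i}", data=mfccs.astype("float32"))
            dset.attrs["transcript"] = transcripts[i]
            dset.attrs["speaker_id"] = speaker_ids[i]

# Usage sketch: the glob pattern assumes LibriSpeech's on-disk layout,
# and the label lists below are placeholders
paths = sorted(glob.glob("./LibriSpeech/train-clean-100/**/*.flac", recursive=True))[:10]
preprocess(paths, transcripts=["..."] * 10, speaker_ids=[0] * 10,
           out_path="librispeech_mfccs.hdf5")
```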
After running this, we now have our preprocessed data!
The nice thing is that with an audio data pipeline set up this way, we could extract other features or perform other kinds of analysis, without changing much of the code at all. This gives us the flexibility to adapt to whatever challenges we face in training.
Final Thoughts
Writing this blog post was helpful in getting me to think critically about my process and justify the decisions made along the way. Having to explain why I chose to do what I did forced me to do my research and understand each step of the process, not just use the first answer I came across.
Looking forward to the next part, where I'll be diving into Recurrent Neural Network (RNN) architecture and training.
James Vargo



