jimmy/events (#6)
Co-authored-by: Jimmy Vargo <james@ayo.tokyo> Reviewed-on: ayo/website#6
This commit is contained in:
parent
83a0903155
commit
e3ddda951f
76 changed files with 564 additions and 1 deletions
@ -0,0 +1,138 @@
# Deep Learning Framework Benchmark

*Posted on October 7, 2021*

## Preamble

There are only a few frameworks for building deep learning neural networks. I used to be very familiar with TensorFlow back before its second version was released. As someone with a software engineering background, the strictness and clarity of that first version of TensorFlow was a joy. The graphs output by TensorBoard were also amazing, to the point that I got into the habit of debugging my networks from TensorBoard most of the time.

<div class="image-row">

### Graphs shown in TensorBoard from TensorFlow version 1





</div>

Those days are gone. A new era of dynamic programming has come to the deep learning field, with PyTorch becoming increasingly popular (from what I have experienced), the second version of TensorFlow converging toward the same API, and Jax going a step further with a near-Python programming paradigm. The dynamic paradigm has some very nice points; it makes things much easier if you do reinforcement learning in particular.

There are also other frameworks I haven't tested yet, like [MXNet](https://mxnet.apache.org/versions/1.8.0/).

Most of the frameworks I have experienced now have nearly the same API, and ONNX offers a very nice way to export the final result of training independently of the framework. Thus choosing which one to use is less clear-cut than before.

Lately I have been trying out some RNN-like networks with different modifications to improve the infamous *long-term memory* problem (hopefully I will post something about that later). Using PyTorch, I was very frustrated that while the included [LSTM layer](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) ran very well, **an equivalent hand-written implementation would run several times slower** (around 1/3 of the speed), even when following the [official documentation on GPU optimizations](https://pytorch.org/blog/optimizing-cuda-rnn-with-torchscript/) (which seems deprecated on a few points), sometimes to the point of running at 10% of the initial speed. So if I want to do some research, I might as well choose a framework that isn't so slow that a training run doable in minutes would take hours. But would other frameworks really give me better performance?

I decided to see for myself how the different frameworks behave, starting from simple operations and hopefully testing up to whole network trainings.

I will use a naming convention for the frameworks (also called 'platforms' in my scripts) tested here:

* TF1: first version of TensorFlow (version 1.x); as of this writing the latest release is 1.15
* TF2: second version of TensorFlow (version 2.x); as of this writing the latest release is 2.6
* TF2_V1: second version of TensorFlow, but using the compatibility API to write code as in the first version, and with the dynamic behaviour disabled (I suspected performance would differ)
* Torch: PyTorch
* Jax

## Benchmarking implementation

### No Gradient

This is obvious, but while PyTorch is very nice for the majority of cases where you need to compute gradients, that is not the case here, as I started with the simplest operations first. The `requires_grad=False` argument on all tensors does the trick in PyTorch, while TensorFlow and Jax don't need any additional care as far as I know.
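As a minimal sketch of the idea (the tensor shapes here are arbitrary, not the benchmark's actual sizes):

```python
import torch

# Two operand tensors; requires_grad=False keeps autograd from tracking them
a = torch.ones((256, 256), requires_grad=False)
b = torch.ones((256, 256), requires_grad=False)

# Alternatively, wrap the whole experiment so nothing is recorded at all
with torch.no_grad():
    c = a @ b

print(c.requires_grad)  # False
```

The context-manager form is handy when an experiment chains several operations, since no intermediate result keeps a gradient history.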

### Warmup

I have noticed many times, on all frameworks so far, that the first run is always several times slower. This is expected for the dynamically allocated tensors of the modern frameworks, but I strongly remember it happening too when I was using TF1. To avoid the first runs skewing the benchmark, each experiment has a small warmup loop:

```
# warmup
for _ in range(20):
    self.experiment()
```

### Optimizations

I had to test with **random tensors at start and before each operation** to make sure the frameworks do not optimize away already-computed operations (e.g. through caching), especially since I disabled gradient computation. All my tests showed no difference, so I stuck with tensors initially filled with ones.
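The check could look something like this sketch (the `experiment` helper and its shapes are hypothetical, not the benchmark's actual code):

```python
import torch

# Refill the operands with fresh random values before each run so a
# framework cannot reuse a cached result from a previous iteration
def experiment(a, b, randomize=False):
    if randomize:
        a.normal_()  # in-place refill with normally distributed values
        b.normal_()
    return a @ b

a = torch.ones((512, 512))
b = torch.ones((512, 512))
result = experiment(a, b, randomize=True)
```

If the timings with and without `randomize=True` match, no result caching is taking place.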

### Benchmark time

Each operation is benchmarked during an "experiment". To get consistent benchmarking times, a first loop estimates the number of operations per second, then the benchmarked loop runs a fixed number of steps derived from that estimate. This makes it possible to set the time per experiment in a configuration file for statistical stability, and avoids unnecessary calls to the system clock (CPython not being known for its speed, I'd rather have a simple integer increment per loop iteration as overhead).
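A sketch of that calibration scheme, under the assumption that the real code reads the target time from the configuration file (all names here are hypothetical):

```python
import time

def run_fixed_count(experiment, target_seconds=1.0, calibration_runs=10):
    """Estimate the experiment's speed, then time a fixed number of runs.

    `target_seconds` stands in for the per-experiment time that the
    benchmark reads from its configuration file.
    """
    # Calibration loop: estimate the cost of a single run
    start = time.perf_counter()
    for _ in range(calibration_runs):
        experiment()
    per_run = max((time.perf_counter() - start) / calibration_runs, 1e-9)

    # Benchmarked loop: a fixed step count, so the only per-iteration
    # overhead is the loop's own integer increment, not a clock call
    count = max(1, int(target_seconds / per_run))
    start = time.perf_counter()
    for _ in range(count):
        experiment()
    elapsed = time.perf_counter() - start
    return elapsed, count
```

Only two clock reads wrap the benchmarked loop, so the measurement overhead stays constant regardless of how many steps run.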

Later this could also make a progress bar with an ETA possible, as the benchmarks can be quite extensive.

## Results

**The code is publicly available [here](https://gitlab.com/corentin-pro/dl_bench). It will output raw data as csv files along with their plots. All the data and plots from my machine (NVIDIA GeForce RTX 2060 SUPER) can be downloaded [here](/blog/2021-10-07_Deep%20Learning%20Framework%20Benchmarks/gpu_NVIDIA%20GeForce%20RTX%202060%20SUPER.zip).**

<div class="image-row">

### Experiment benchmark samples




</div>

As expected, the bigger the operations (experiments), the more [GFLOPS](https://en.wikipedia.org/wiki/FLOPS) (giga floating-point operations per second) the GPU can output. So far nothing surprising.

### Comparisons

Comparison plots are also generated from the experiment data. For now comparisons are only made between 'platforms' (i.e. frameworks), but data-type comparisons could be interesting in the future. Categories were made to plot subsets of comparisons in order to keep the y-axis scale linear; in the general case the script automatically switches to a logarithmic scale if needed. The categories are ranges of Mop (millions of operations) per experiment, like `MEDIUM = [20, 1000]` (there are SMALL, MEDIUM, LARGE and VERY_LARGE), and can be changed in the configuration files.

<div class="image-row">

### Comparison samples





</div>

**NOTE**: all operations with the `nn` prefix are run inside a 'module' (or equivalent); in Jax for instance I used `stax` and `jit` as intended by the library. As far as I tested, JIT is not needed for Torch.

Torch seems the best for simple and small operations, while TensorFlow in general seems to have big overheads. Jax does very well once we use the JIT. All frameworks tend to converge with bigger layers/operations, where the XLA-based TensorFlow and Jax seem to have slightly better performance. But for small operations Torch can be orders of magnitude faster!

The results for float32 and float16 are very similar, but float64 is different:

* For some reason TF2 didn't accept matmul on float64 inside a module; I should fix that later
* TF2 gets better results relative to the other platforms
* Except for element-wise operations, Torch loses its lead on small operations
* There is a weird behaviour for the matmul of 800x800 tensors in both Torch and TF2. After additional testing I couldn't figure out why the first runs (even after warmup) were way too fast.

In the data (see `run times (s)`), the specific behaviour of the 800x800 matmul looks like:

```
experiment run times (s) count ms/matmul Mop/matmul GFLOPS
300 800x800 @ 800x800 0.03258013725280762 60 0.5430022875467936 1022.72 1883.4543121733468
[...]
308 800x800 @ 800x800 0.032579898834228516 60 0.5429983139038086 1022.72 1883.4680952272229
309 800x800 @ 800x800 0.03258252143859863 60 0.5430420239766439 1022.72 1883.316492728723
310 800x800 @ 800x800 0.1323096752166748 60 2.2051612536112466 1022.72 463.7846771183555
311 800x800 @ 800x800 0.2970736026763916 60 4.951226711273193 1022.72 206.55891148579838
312 800x800 @ 800x800 0.29687929153442383 60 4.947988192240397 1022.72 206.6941068298959
[...]
329 800x800 @ 800x800 0.2968714237213135 60 4.947857062021892 1022.72 206.69958472528631
```

It is the only instance of such behaviour across all operations, even within the matmul benchmark. Because of this the result plot doesn't look great:



## Conclusion

The results so far comfort me in using Torch overall, as I usually design small networks, but Jax seems to be a very interesting contender. I am surprised the difference on small/medium operations can be that significant between Torch and TF2; I sometimes use my DL framework for GPU-accelerated math in other contexts, so this is good to know.

The code is not yet complete, and in the future I would like to test more:

* Convolutions: 1d, 2d, transpose
* Gradients
* Optimizers
* RNNs: which were the trigger that started all of this
* Data transfer? (CPU->GPU and GPU->CPU)

If you have questions or remarks, you can contact me or reply to the [reddit post](https://www.reddit.com/r/MachineLearning/comments/q2y9n5/d_deep_learning_framework_benchmark/).

Corentin Risselin.
BIN home/assets/blog/2021-10-07_Deep Learning Framework Benchmarks/result_add_float32.png (Stored with Git LFS) Normal file
BIN home/assets/blog/2021-10-07_Deep Learning Framework Benchmarks/result_jax_nn_dense_float32.png (Stored with Git LFS) Normal file
BIN home/assets/blog/2021-10-07_Deep Learning Framework Benchmarks/result_matmul_float64_LARGE.png (Stored with Git LFS) Normal file
BIN home/assets/blog/2021-10-07_Deep Learning Framework Benchmarks/result_nn_dense_float32_MEDIUM.png (Stored with Git LFS) Normal file
BIN home/assets/blog/2021-10-07_Deep Learning Framework Benchmarks/result_nn_dense_x5_float32_VERY_LARGE.png (Stored with Git LFS) Normal file
BIN home/assets/blog/2021-10-07_Deep Learning Framework Benchmarks/result_torch_matmul_float32.png (Stored with Git LFS) Normal file
BIN home/assets/blog/2021-10-07_Deep Learning Framework Benchmarks/tf1_graph.png (Stored with Git LFS) Normal file
BIN home/assets/blog/2021-10-07_Deep Learning Framework Benchmarks/tf1_graph_network.png (Stored with Git LFS) Normal file
BIN home/assets/blog/2021-10-07_Deep Learning Framework Benchmarks/tf1_graph_train.png (Stored with Git LFS) Normal file
@ -0,0 +1,78 @@
# Self-Hosted Web Infrastructure

*Posted on September 12, 2023*

## Preamble

This month I had to set up a new server to host this very website, as well as additional subdomains for a self-hosted Git service and a document/calendar sharing application. None of this requires big hardware, and it can all be handled by a small IoT device in my office.

Deploying a server, even a small and simple one, has some caveats that need to be taken into consideration. I will share the setup I just did for the latest server I am using to host my applications in-house.

## Hardware considerations

The first name that probably comes to everyone's mind is the Raspberry Pi. It's capable enough: this website was hosted on a roughly 8-year-old Raspberry Pi 2 until now. I have also worked on different projects with Raspberry Pis (magic mirror, robot-arm control with a multi-camera setup, etc.) and it was always very frustrating.

Here's a quick overview of the problems I know of:

* GPU capabilities were not great because the vendor, Broadcom, wasn't helpful with the FOSS (Free and Open Source Software) community and the driver was lacking. This means that outputting realtime images/video and overlays would be very slow unless you switched to the experimental driver, which had stability issues; I even got noisy output (bearable, only a few pixels here and there, but weird nonetheless).
* No real Ethernet, only Ethernet over USB (fixed from version 3 or 4). Coupled with a single not-so-powerful USB controller, this would lead to USB congestion. For instance you couldn't use 2 webcams at 720p resolution and 30 fps (I'm not even sure about 480p; I had to turn down the resolution and FPS a lot). During USB congestion, any SSH session to such a device would get stuck.
* The SD card reader was extremely slow (versions 1 and 2, probably 3 too)
* With time the value proposition got worse: the Raspberry Pi started as a cheap computing experience for education and prototyping, but now the price has quadrupled for performance that is not even above average compared to other devices.
* Version 4 opted for micro-HDMI instead of a full-sized port

There are many alternatives I looked at, and among them one caught my attention: Odroid. This device has been widely known for a long time, and the company behind it, based in South Korea, is now named Hardkernel. The proximity of the company is very nice and the catalog they offer is very interesting. Here are a few points that made me, over time, look mostly at them for my IoT needs:

* Very focused on making hardware, not on selling other things
* Great value for the price, especially for the accessories (power supplies, cases, etc.)
* Frequent hardware updates: you are not stuck buying a 5-year-old CPU
* Documented hardware: schematics and even benchmarks (performance and power consumption)!
* Amazing support: I once had a problem with a single component; reaching out by email I got a response immediately, and they shipped me a free replacement in the next order I made (I asked for that specifically, as I was buying a new device to build my NAS).

Long story short: I eventually ended up using the Odroid-N2+ multiple times and it is mostly flawless (using the Panfrost driver for the GPU is tricky). You can check [all the specifications on Hardkernel's website](https://www.hardkernel.com/shop/odroid-n2-with-4gbyte-ram-2).

<div class="image-row">

### Odroid N2+



</div>

## OS/Distribution

### Distribution Choice

Finding a good Linux distribution can be troublesome too. For me the choice is not too hard: I come from a Debian background, and most corporate-minded distributions such as CentOS (now discontinued)/RHEL have no appeal to me. I can fix my own issues and contribute back if needed; I don't need support.

[Debian](https://wiki.debian.org) is amazing at its job: being stable. The people maintaining it are probably the most professional and rigorous I've seen. The distribution has one drawback: everything is roughly 3 years old. Because of my bias toward deep learning engineering, which needs recent libraries, and issues I had in the past with too-old databases (namely MySQL) missing very important features, I am choosing another distribution. Also note that official Debian does not offer a lot of possibilities for [Arm ports](https://wiki.debian.org/Arm64Port).

For IoT devices, [Armbian](https://www.armbian.com/) is amazing and definitely a strong choice. I am using it on some of my devices; it has very interesting defaults that I will take inspiration from (like in-memory logs).

Being familiar with ArchLinux, [ArchLinuxArm](https://archlinuxarm.org/) is a perfect fit for me. I like the deliberate choice to leave configurations at the software's defaults: it takes some work off the maintainers and puts it onto the user, but a lot of software has improved its defaults in recent years anyway, and it gives more knowledge and control to the user.

**Note:** Another noticeable distribution is [Manjaro Arm](https://manjaro.org/download/#ARM). On the x86-64 platform, the differences between ArchLinux and Manjaro spark a lot of debate, but on Arm platforms ArchLinuxArm is a totally different project, and there Manjaro can bring a lot of value: packages can be better maintained (noticeable if you want the latest GPU driver).

### Installation

Compared to a PC installation, installing ArchLinux on Arm platforms is much more straightforward. The boot sequence of Arm devices is very different, and I have never seen UEFI on Arm. Arm devices boot thanks to device trees (declared in .dts files directly in the kernel source and compiled to .dtb files); these describe all the hardware and the initialization values needed to make the device work. This means the Linux project carries hundreds, if not thousands, of those files in order to run on any device (a single .dts can target a whole family of devices).

Installing Linux on an IoT device usually means copying a prepared image to an SD card or other storage, sometimes with the DTB files provided in a separate boot partition for Linux to boot with. Additional initialization values can also be set in the boot partition, in a `boot.ini` file. The boot sequence usually starts not from Linux but from a boot loader: on PC, GRUB is a common one; on IoT devices you may encounter `u-boot` and, more rarely, `petitboot`. These are usually already prepared, so nothing needs to be done on that part.

For ArchLinuxArm, preparing the partitions on the storage is the biggest part of the installation, but it only requires ~10 commands and is nicely documented. For the Odroid-N2 the procedure is described [here](https://archlinuxarm.org/platforms/armv8/amlogic/odroid-n2).

Additional commands you will probably want to run first, as root:

* **Edit the pacman (package manager) configuration**: `vim /etc/pacman.conf` (here using `vim`, but `nano` would also work fine). Uncommenting `Color`, `VerbosePkgLists` and `ParallelDownloads` is a must-have.
* **Update all packages**: `pacman -Syu`
* **Install sudo**: `pacman -S sudo`
* **Add your user**: `useradd -m myuser`. The `-m` option creates a home folder (`/home/myuser`)
* **Set your user's password**: `passwd myuser`
* **Create a sudo group and add your user to it**: `groupadd sudo` and `usermod -aG sudo myuser`
* **Allow the sudo group to run any command**: `EDITOR=vim visudo /etc/sudoers` (again, `nano` would work instead of `vim`). Make sure the line `%sudo ALL=(ALL:ALL) ALL` is uncommented.

## Conclusion

I hope this introduction to the hardware and software used for self-hosting was interesting, and maybe gave you some thoughts about trying it yourself. There are many topics to cover from here: security, server software installation and configuration, DNS, certificates, etc. Stay tuned for the next post.

Corentin Risselin.
BIN home/assets/blog/2023-09-12_Self-Hosted Web Infrastructure/odroidn2plus.jpg (Stored with Git LFS) Normal file
@ -0,0 +1,142 @@
# Audio Data Preprocessing for Deep Learning

*Posted on September 21, 2023*

## Introduction

I recently began working on an open source Speech-to-Text neural network. I thought this would be an interesting exercise, not only for deepening my knowledge of deep learning, but also because of my background in audio engineering. In university, I studied sound design and engineering for applications in both film and music production (and still do so as a hobby), so for me this is a great synthesis of my past and current fields of expertise. So I wanted to write about my progress so far.

Hopefully, documenting my thought process and journey in this blog post can help anyone looking to start or further their own adventures in deep learning, and stimulate some ideas.

In this post, I'll just be covering the preparation and data preprocessing phase. My next post will be focused on tackling the neural network architecture and training of the model itself.

## The Theory

Over the years, I've become quite familiar with the world of digital audio. However, not every software engineer may have had much exposure, so first I want to clarify some basic terms and concepts often used in this space in order to provide a foundation for working with audio in deep learning.

### Introduction to Digital Audio

As we all know, sound is the perception created by our ears when they receive waves of varying air pressure. To capture these sound waves digitally, we take measurements, or **samples**, of the air pressure (the amplitude of the wave) at very short intervals; the number of samples taken per second is called the **sample rate**.

Sound waves are often complex and oscillate at very high frequencies. The limit of human hearing is around 20kHz (20,000 cycles per second), and due to a concept called the [Nyquist limit](https://mathworld.wolfram.com/NyquistFrequency.html) (see also [Wikipedia](https://en.wikipedia.org/wiki/Nyquist_rate)), we need to take samples at double that frequency in order to reproduce the highest frequencies accurately. After some padding, we end up slightly north of a 40kHz minimum sample rate. For most audio and music applications, common sample rates are 44.1kHz or 48kHz; they can be higher in audio production environments or in special applications.

Every time we sample the wave, we store the amplitude value (as either an integer or a floating point number, depending on the format) in a process known as **quantization**. We can choose how many bits to use to store this value, which determines how many "slots" we have available, referred to as the **bit depth**. The bit depth determines the "resolution" of the stored signal: an audio signal quantized at a bit depth of 8 bits, for example, has far fewer options (2^8 = 256 "slots") than a signal quantized at 16 bits (2^16 = 65,536 "slots"). Due to the limited bit depth, we lose precision as the values are rounded to the nearest "slot", losing some of the detailed information in the waveform. A bit depth of 16 bits is usually plenty for most audio files, but it can be higher before production rendering (called "bouncing") or for applications requiring higher degrees of precision.
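As a small illustration (the tone frequency and sample rate here are arbitrary choices), quantizing a sine wave at different bit depths might look like:

```python
import numpy as np

# One second of a 440 Hz sine sampled at a 48 kHz sample rate
sample_rate = 48_000
t = np.arange(sample_rate) / sample_rate
wave = np.sin(2 * np.pi * 440 * t)        # amplitudes in [-1.0, 1.0]

def quantize(signal, bits):
    """Round each sample to the nearest of 2**bits signed integer levels."""
    levels = 2 ** (bits - 1)              # e.g. 8 bits -> roughly [-128, 127]
    return np.round(signal * (levels - 1)).astype(np.int32)

coarse = quantize(wave, 8)                # 256 slots: audible rounding error
fine = quantize(wave, 16)                 # 65,536 slots: standard for audio files
```

Dividing `coarse` back by its level count and subtracting from `wave` would expose the rounding error, i.e. the quantization noise.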

<div class="image-row">



### Image credit: [ResearchGate](https://www.researchgate.net/publication/236896587/figure/fig1/AS:341768278167573@1458495315947/Quantization-of-a-sine-wave-in-an-ideal-quantizer-with-quantum-size-1-and-dc-offset-0.png)

</div>

Understanding these concepts is essential for deep learning, as sample rates and bit depths determine the quality of the source audio and play an integral part in the data processing pipeline, covered in the next section.

With that introduction, we can move on to more relevant concepts for deep learning.

### Time vs. Frequency Domain

The samples stored in an audio file allow us to recreate the waveform, and give us access to time-domain features of a sound, such as transients (volume spikes), envelopes (amplitude changes over time), and rhythm (measured in beats per minute or "bpm"). However, there is more information about a sound that we can access. **Instead of looking at amplitude over time as given in a waveform (the time domain), we can look at amplitude in relation to frequency (the frequency domain)** in order to glean more understanding from a sound, as commonly seen in EQs and spectrograms.

This is where the magical Fourier Transform comes in. Thanks to the [Fourier Theorem](http://hyperphysics.phy-astr.gsu.edu/hbase/Audio/fourier.html), we know that complex sound waves can be decomposed into pure sine waves of different frequencies, with amplitude and phase coefficients that express each one's contribution to the composite signal.

<div class="image-row">



### Image credit: [NTI Audio - Fast Fourier Transform (FFT) Basics](https://www.nti-audio.com/en/support/know-how/fast-fourier-transform-fft)

</div>

Using [Fourier analysis](https://en.wikipedia.org/wiki/Fourier_analysis), we can get this information by applying an algorithm called the Fast Fourier Transform (FFT), which converts our waveform from the time domain to the frequency domain. However, when we do this domain conversion, we lose the valuable information from the time domain mentioned earlier. Is there a hybrid approach in which we can access frequency information across time, giving us the best of both worlds?

It turns out there is, and it's called the Short-Time Fourier Transform (STFT). This algorithm cuts an audio file into many frames and performs an FFT at specified intervals along the waveform. Using this output, we now have an array of mini FFTs, giving us valuable information about how the frequencies that make up an audio signal change over time.
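To make the idea concrete, here is a deliberately minimal STFT sketch in NumPy; a real implementation such as `librosa.stft` adds centering, padding, and more windowing options:

```python
import numpy as np

def stft(signal, frame_size=1024, hop=256):
    """Minimal short-time Fourier transform: window each frame, FFT each frame."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_size] * window
                       for i in range(n_frames)])
    # One FFT per frame: rows are time frames, columns are frequency bins
    return np.fft.rfft(frames, axis=1)

# A 440 Hz sine at 22,050 Hz: every frame shows one stable spectral peak
sr = 22_050
t = np.arange(sr) / sr
spectra = stft(np.sin(2 * np.pi * 440 * t))
```

Each row of `spectra` is the spectrum of one slice of time, which is exactly the "array of mini FFTs" described above; plotting the magnitudes gives a spectrogram.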

### Mel Frequency Cepstrum Coefficients

We now have valuable frequency information over time given to us by the STFT algorithm, but there is a slight issue.

The frequency response of the human ear is not linear. This means that the qualities a machine picks up from the frequency domain are not the same as what our ears pick up. For applications that are heavily tied to human perception, such as speech and music, it would be more useful to have data that is closer to the human experience.

This is where the [Mel Frequency Cepstrum (MFC)](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum) comes in. **The MFC provides a representation of a frequency spectrum that better matches our human perception.** This is extremely useful when dealing with things such as speech or music, where our ears are more sensitive to, for example, certain mid-range frequencies of the human voice.

<div class="image-row">




### Image credit: [Imgur (left)](https://i.stack.imgur.com/GAoJs.jpg), [ResearchGate (right)](https://www.researchgate.net/publication/335398843/figure/fig1/AS:796124961058818@1566822390492/MFCC-mel-frequency-cepstral-coefficients-characteristic-vectors-extraction-flow.png)

</div>

Like the human ear, the MFC frequency spectrum is also non-linear, meaning not all frequency bands are the same size. Instead, the band sizes vary depending on the importance or distinctiveness they carry for our ears. This helps to create a more balanced series of coefficients, so that we don't end up manually weighting a mixture of coefficients of varying usefulness.

Using a somewhat involved algorithm, we can calculate these Mel Frequency Cepstrum Coefficients (MFCCs) and use the output as our final, perceptually accurate frequency data over time. This is the data we will use to train our Speech-to-Text neural network.

## The Implementation

With the foundational theory in place, I started diving into implementing a data pipeline for audio deep learning.

### The Dataset

After doing some research on commonly used audio datasets for Speech-to-Text applications, for now I've settled on using the [LibriSpeech dataset](https://www.openslr.org/12), which contains around 1,000 hours of labelled audiobook speech.

Because this dataset is a collection of segments taken from audiobooks, it seemed useful for training on word morphology within a sentence context and for picking up on natural inflection and sentence flow. There are other datasets, such as [TIMIT](https://catalog.ldc.upenn.edu/LDC93s1), which focus more on phonemes than on language. This might be another option for speech-to-text, and I have yet to discover whether there is a good hybrid approach.

### Tools & Packages

I'm using a library called [Librosa](https://librosa.org/doc/latest/index.html) to perform all the audio analysis, such as MFCC extraction, spectrogram display, etc. It seems quite rigorous and has a good reputation in the community. There is a short paper describing its design principles and module implementation [here](https://conference.scipy.org/proceedings/scipy2015/pdfs/brian_mcfee.pdf).

For the network architecture and training, I'll be using PyTorch, which conveniently already includes the LibriSpeech dataset in its torchaudio datasets module. I'll cover that in the next post, as mentioned previously.

To store the MFCC data, I'm using [h5py](https://www.h5py.org/) to write the NumPy array directly into an .hdf5 file as binary data, using the Hierarchical Data Format (HDF).
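A minimal sketch of that storage step (the file name, dataset name, and attribute values here are hypothetical placeholders, not the pipeline's real ones):

```python
import numpy as np
import h5py

# Placeholder array standing in for a real MFCC output
mfccs = np.zeros((13, 44), dtype=np.int32)

with h5py.File("sample.hdf5", "w") as f:
    dset = f.create_dataset("mfccs", data=mfccs)
    # Extracted features ride along as attributes on the dataset
    dset.attrs["transcript"] = "hello world"
    dset.attrs["speaker_id"] = 1234

with h5py.File("sample.hdf5", "r") as f:
    restored = f["mfccs"][:]                       # array read back into memory
    transcript = f["mfccs"].attrs["transcript"]    # metadata read back too
```

Attributes keep each sample's metadata physically next to its array, so no separate index file is needed.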

### Data Format

Speaking of HDF, I did some research on the benefits of using HDF over a text-based format such as a .csv or .json file.

Where HDF wins:

- Speed: hdf5 files are organized for fast data access, which is great for large datasets (apparently the speed gains are more apparent on HDDs than on SSDs)
- Storage size: hdf5 files can be compressed with a flexible variety of data compression filters applied to each individual dataset
- Chunking: lets you load parts of a dataset into memory one at a time, reducing memory usage and improving performance
- Flexibility: dataset sizes can be fixed or resizable, so it's easy to append new data without creating a brand-new copy

Where CSV or JSON wins:

- Unlike hdf5, can be viewed directly in a text editor or terminal; no specific software is required
- Toolkit agnostic: usable across many platforms and languages
- Additionally, a compressed csv file can (apparently) end up similar in size to an hdf5 file
With these factors in consideration: in my case I won't need to share the data externally, so my main concern is performance with large datasets, which makes HDF the clear winner here. I found an interesting article describing hdf5 in detail and comparing hdf5 and csv benchmarks for speed and compression [here](https://waterprogramming.wordpress.com/2023/06/22/intro-to-hdf5-h5py-and-comparison-to-csv-for-speed-compression/#:~:text=In%20general%2C%20the%20HDF5%20method,of%20the%20CSV%20times%2C%20respectively.); it's worth a look in my opinion.
### Preprocessing Workflow
Finally, here is my preprocessing workflow:
For each audio file in the subset:
1. Extract selected dataset features (transcript, speaker_id, etc.) to store along with MFCCs
2. Calculate MFCC data using Librosa
3. Store data as a .hdf5 binary data file using h5py:
   - MFCCs as an int32 numpy array dataset
   - Extracted features as dataset attributes
After running this, we now have our preprocessed data!
The nice thing is that with an audio data pipeline set up this way, we could extract other features or perform other kinds of analysis, without changing much of the code at all. This gives us the flexibility to adapt to whatever challenges we face in training.
### Final Thoughts
Writing this blog post was helpful in getting me to think critically about my process and justify the decisions made along the way. Having to explain why I chose to do what I did forced me to do my research and understand each step of the process, not just use the first answer I came across.
Looking forward to the next part, where I'll be diving into Recurrent Neural Network (RNN) architecture and training.
*James Vargo*
## Security
## Offloading storage operations
IoT devices usually rely on an SD card or eMMC for storage. SD card I/O (input/output) operations can be slow depending on the card and the device, and both SD card and eMMC lifespans suffer from frequent writes. These points have always bothered me, so I like to leverage the computer's volatile memory (RAM) as much as possible.
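
One simple version of this on Linux is pointing scratch writes at a RAM-backed tmpfs mount such as `/dev/shm` (the path is platform-dependent, so the sketch below falls back to the normal temp directory when it's absent):

```python
import os
import tempfile

# Prefer a RAM-backed tmpfs mount when one exists
scratch_dir = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()

with tempfile.NamedTemporaryFile(dir=scratch_dir, delete=False) as f:
    f.write(b"sensor reading\n")  # on tmpfs, this write never touches flash
    scratch_path = f.name
```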
# Training A Speech-to-Text Neural Network
Speech-To-Text Recurrent Neural Network (RNN)
### Displaying the Data
In order to check out the sample data from the dataset and confirm its topology, I added a few arguments to the main function.
We can run the Python script with the `display` argument to get a sample output of our original data. This includes all the features: the transcription, the raw samples and their shape, the sample rate, duration, speaker ID, and more.
I also added a few optional flags for confirming the original data visually and audibly.
- `--waveform` will show a graph of the waveform, using Matplotlib
- `--spectrogram` will show a spectrogram (computed from STFTs, not MFCCs), using Librosa
- `--mfcc` will show a graph of the MFCCs, using Librosa
- `--play` will play the audio file
After running this, we now have our preprocessed data! We've transformed the dataset into usable MFCC data, stored alongside the extracted features in performant persistent storage.
Using the `read-mfcc` argument in the Python script, I can confirm that the processed data has been stored properly and is readable by our model in a useful topology.
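
Concretely, the property I want the read-back to confirm is orientation: Librosa emits `(n_mfcc, n_frames)`, while a recurrent layer wants time as the leading axis, so a transpose has to happen somewhere (file and dataset names below are made up):

```python
import numpy as np
import h5py

# A tiny stand-in file so the read-back sketch is self-contained
with h5py.File("check.hdf5", "w") as f:
    f.create_dataset("utt-0", data=np.zeros((13, 50), dtype=np.int32))

with h5py.File("check.hdf5", "r") as f:
    mfccs = f["utt-0"][...]         # pull the full array back into memory
    assert mfccs.shape == (13, 50)  # Librosa layout: (n_mfcc, n_frames)
    model_input = mfccs.T           # RNN-friendly layout: (n_frames, n_mfcc)
```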
## Architecture
Input shape
- .
Layers
- GRU Layer
- GRU Layer
- Dense Layer
- Dropout Layer (to prevent overfitting)
- Dense Layer (softmax output)
Output shape
- .
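
A PyTorch sketch of that layer stack (all sizes are placeholders; for CTC-style speech-to-text the softmax output is typically a per-frame log-softmax over a character vocabulary):

```python
import torch
import torch.nn as nn

class SpeechToText(nn.Module):
    def __init__(self, n_mfcc=13, hidden=128, n_classes=29):  # placeholder sizes
        super().__init__()
        # Two stacked GRU layers running over the MFCC frames
        self.gru = nn.GRU(n_mfcc, hidden, num_layers=2, batch_first=True)
        self.fc1 = nn.Linear(hidden, hidden)     # dense layer
        self.drop = nn.Dropout(0.3)              # dropout to prevent overfitting
        self.fc2 = nn.Linear(hidden, n_classes)  # dense output layer

    def forward(self, x):  # x: (batch, n_frames, n_mfcc)
        out, _ = self.gru(x)
        out = self.drop(torch.relu(self.fc1(out)))
        # log-softmax over the classes for every frame
        return torch.log_softmax(self.fc2(out), dim=-1)

model = SpeechToText()
frames = torch.randn(2, 50, 13)  # a batch of 2 clips, 50 frames each
log_probs = model(frames)
print(log_probs.shape)  # torch.Size([2, 50, 29])
```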