# Deep Learning Framework Benchmark
*Posted on October 7, 2021*
## Preamble
There are only a few frameworks for working on deep learning neural networks. I used to be very familiar with TensorFlow back in the days when the second version was not yet released. As someone with a software engineering background, I found the strictness and clarity of this first version of TensorFlow a joy. The graphs output by TensorBoard were also amazing, to the point that I got into the habit of debugging my networks from TensorBoard most of the time.
<div class="image-row">
### Graphs shown in TensorBoard from TensorFlow version 1
![Tensorboard graph from version 1](/blog/2021-10-07_Deep%20Learning%20Framework%20Benchmarks/tf1_graph.png)
![Network graph from version 1](/blog/2021-10-07_Deep%20Learning%20Framework%20Benchmarks/tf1_graph_network.png)
![Train modules graph from version 1](/blog/2021-10-07_Deep%20Learning%20Framework%20Benchmarks/tf1_graph_train.png)
</div>
Those days are gone: a new era of dynamic programming has come to the deep learning field, with PyTorch becoming increasingly popular (from what I have experienced), the second version of TensorFlow converging to the same API, and Jax going a step further with a near-Python programming paradigm. The dynamic paradigm has some very nice points; it makes things much easier, especially if you do reinforcement learning.
There are also other frameworks I haven't yet tested like [MXNet](https://mxnet.apache.org/versions/1.8.0/).
Now most of the frameworks I have experienced have nearly the same API, and ONNX brings a very nice way to export the final result of a training independently of the framework. Thus choosing which one to use is getting less clear than before.
Lately I have been trying out some RNN-like networks with different modifications to improve the infamous *long term memory* problem (hopefully I will post something about that later). Using PyTorch I was very frustrated that the included [LSTM layer](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) ran very well but **an equivalent hand-written version would run several times slower** (around a third of the speed), even when following the [official documentation on GPU optimizations](https://pytorch.org/blog/optimizing-cuda-rnn-with-torchscript/) (which seems deprecated on a few points), sometimes to the point of running at 10% of the initial speed. So if I want to do some research I might as well choose a framework that isn't so slow that I would need to wait hours for a training that could be performed in minutes. But would other frameworks really give me better performance?
I decided to see for myself how the different frameworks behave, starting from simple operations and hopefully testing up to whole network trainings.
I will use a naming convention for the frameworks (also called 'platforms' in my scripts) tested here:
* TF1 : first version of TensorFlow (version 1.x); as of this writing the latest version is 1.15
* TF2 : second version of TensorFlow (version 2.x); as of this writing the latest version is 2.6
* TF2_V1 : second version of TensorFlow, but using the compatibility API to write code as in the first version, also disabling the dynamic behaviour (I suspected different performance)
* Torch : PyTorch
* Jax
## Benchmarking implementation
### No Gradient
This is obvious, but while PyTorch is very nice for the majority of cases where you need to compute gradients, that is not needed here as I started with the most simple operations first. The `requires_grad=False` argument on all tensors does the trick in PyTorch, while TensorFlow and Jax don't need any additional care as far as I know.
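As a concrete sketch (tensor names and sizes are illustrative, not the benchmark code), disabling autograd in PyTorch can be done per tensor or for a whole region:

```python
import torch

# Per-tensor opt-out: no gradient will ever be tracked for these
a = torch.ones(64, 64, requires_grad=False)
b = torch.ones(64, 64, requires_grad=False)

# Region opt-out: nothing inside this block records an autograd graph
with torch.no_grad():
    c = a @ b

print(c.requires_grad)  # False: no graph was recorded for c
```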
### Warmup
I have experienced many times, on all frameworks so far, that the first run is always several times slower. This is expected for the dynamically allocated tensors of the modern frameworks, but I strongly remember this happening too when I was using TF1. To avoid the first run skewing the benchmark, each experiment starts with a small warmup loop:
```
# warmup
for _ in range(20):
    self.experiment()
```
### Optimizations
I had to test with **random tensors at the start and before each operation** to be sure that frameworks do not optimize out operations that were already computed (caching, for instance), especially since I disabled gradient computation. All my tests showed no difference, so I stuck with tensors initially filled with ones.
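A sketch of the check I mean, with NumPy standing in for any of the frameworks (sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
size = (256, 256)

# Variant 1: fresh random inputs before every operation, so a framework
# cannot reuse a cached result from a previous iteration
def run_random():
    a = rng.standard_normal(size, dtype=np.float32)
    b = rng.standard_normal(size, dtype=np.float32)
    return a @ b

# Variant 2: constant inputs filled with ones (what the benchmark settled
# on, since timing both variants showed no difference)
ones = np.ones(size, dtype=np.float32)
def run_ones():
    return ones @ ones
```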
### Benchmark time
Each operation is benchmarked during an "experiment". To get consistent benchmarking times, a first loop estimates the number of operations per second, then the loop being benchmarked is run with a fixed number of steps derived from that estimate. This allows setting the time per experiment in a configuration file for statistical stability, and avoids unnecessary calls to the system clock (CPython not being known for its speed, I'd rather have a simple integer increment per loop as overhead).
Later this could also make a progress bar with ETA possible, as the benchmarks can be quite exhaustive.
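The two-phase scheme above could look like this (a simplified sketch, not the actual benchmark code):

```python
import time

def run_experiment(experiment, target_seconds=1.0, probe_steps=50):
    """Benchmark `experiment` for roughly `target_seconds`."""
    # First loop: estimate the time per step with a short probe
    start = time.perf_counter()
    for _ in range(probe_steps):
        experiment()
    per_step = max((time.perf_counter() - start) / probe_steps, 1e-9)

    # Second loop: fixed step count, only two clock calls in total,
    # so the per-step overhead is just the loop counter increment
    steps = max(1, int(target_seconds / per_step))
    start = time.perf_counter()
    for _ in range(steps):
        experiment()
    elapsed = time.perf_counter() - start
    return steps, elapsed
```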
## Results
**The code is publicly available [here](https://gitlab.com/corentin-pro/dl_bench). It will output raw data as csv files and their plots. All the data and plot from my machine (NVIDIA GeForce RTX 2060 SUPER) can be downloaded [here](/blog/2021-10-07_Deep%20Learning%20Framework%20Benchmarks/gpu_NVIDIA%20GeForce%20RTX%202060%20SUPER.zip).**
<div class="image-row">
### Experiment benchmark samples
![Benchmark results for Torch with the matmul operation](/blog/2021-10-07_Deep%20Learning%20Framework%20Benchmarks/result_torch_matmul_float32.png)
![Benchmark results for Jax with the dense layer](/blog/2021-10-07_Deep%20Learning%20Framework%20Benchmarks/result_jax_nn_dense_float32.png)
</div>
As expected, the bigger the operations (experiments), the more [GFLOPS](https://en.wikipedia.org/wiki/FLOPS) (giga floating point operations per second) the GPU can output. So far nothing unexpected.
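For reference, a GFLOPS figure like those in the plots can be derived from the raw timings as follows (using the common `2*m*n*k` operation count for a matmul; the benchmark counts operations slightly differently, hence small deviations from the reported values):

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """Multiplications + additions for an m x n @ n x k matrix product."""
    return 2 * m * n * k

def gflops(total_ops: int, seconds: float) -> float:
    return total_ops / seconds / 1e9

# 60 runs of an 800x800 @ 800x800 matmul in ~0.0326 s
total = 60 * matmul_flops(800, 800, 800)
print(round(gflops(total, 0.03258), 1))  # 1885.8
```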
### Comparisons
Comparison plots are also generated from the experiment data. For now comparisons are only done between 'platforms' (aka frameworks), but data type comparisons could be interesting in the future. Categories were made to plot subsets of comparisons in order to keep the y-axis scale linear; the script will automatically switch to a logarithmic scale if needed in the general case. The categories are ranges of Mop (millions of operations) per experiment, like `MEDIUM = [20, 1000]` (there are SMALL, MEDIUM, LARGE and VERY_LARGE), and they can be changed in the configuration files.
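As an illustration, assigning an experiment to a category could be as simple as this (only the MEDIUM range is the real one quoted above; the other bounds are made up for the sketch):

```python
# Mop (millions of operations) ranges per category; MEDIUM matches the
# text above, the other bounds are illustrative placeholders
CATEGORIES = {
    "SMALL": (0, 20),
    "MEDIUM": (20, 1000),
    "LARGE": (1000, 20000),
    "VERY_LARGE": (20000, float("inf")),
}

def category(mop_per_experiment: float) -> str:
    for name, (low, high) in CATEGORIES.items():
        if low <= mop_per_experiment < high:
            return name
    raise ValueError(f"no category for {mop_per_experiment} Mop")

print(category(500.0))  # MEDIUM
```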
<div class="image-row">
### Comparison samples
![Comparison of the add operation](/blog/2021-10-07_Deep%20Learning%20Framework%20Benchmarks/result_add_float32.png)
![Comparison of the dense layer for the MEDIUM category](/blog/2021-10-07_Deep%20Learning%20Framework%20Benchmarks/result_nn_dense_float32_MEDIUM.png)
![Comparison of 5 dense layer in sequence for the VERY_LARGE category](/blog/2021-10-07_Deep%20Learning%20Framework%20Benchmarks/result_nn_dense_x5_float32_VERY_LARGE.png)
</div>
**NOTE** : operations with the `nn` prefix are run inside a 'module' (or equivalent); in Jax for instance I used `stax` and `jit` as intended by the library. As far as I tested, JIT is not needed for Torch.
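For instance, the Jax `nn` variant of a dense layer looks roughly like this (a sketch with illustrative shapes, not the exact benchmark code):

```python
import jax
from jax.example_libraries import stax  # lived under jax.experimental in 2021

# A stax layer is an (init, apply) pair; jit-compiling the apply function
# is what makes Jax competitive here
init_fn, apply_fn = stax.Dense(512)
out_shape, params = init_fn(jax.random.PRNGKey(0), (-1, 512))
apply_jit = jax.jit(apply_fn)

x = jax.numpy.ones((64, 512))
y = apply_jit(params, x)  # the first call triggers compilation (hence the warmup)
print(y.shape)  # (64, 512)
```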
Torch seems the best for simple and small operations, while TensorFlow in general seems to have big overheads. Jax does very well once we use the JIT. All frameworks tend to converge with bigger layers/operations, where the XLA-based TensorFlow and Jax seem to have slightly better performance. But for small operations Torch can be orders of magnitude faster!
The results between float32 and float16 are very similar, but float64 is different:
* For some reason TF2 didn't accept matmul on float64 inside a module; I should fix that later
* TF2 gets better results relative to the other platforms
* Except for element-wise operations, Torch loses its lead on small operations
* There is a weird behaviour for the matmul of 800x800 tensors, both in Torch and TF2. Even after additional testing, I couldn't figure out why the first runs (even after warmup) were way too fast.
The specific behaviour of the 800x800 matmul in the data (see `run times (s)`) looks like:
```
experiment run times (s) count ms/matmul Mop/matmul GFLOPS
300 800x800 @ 800x800 0.03258013725280762 60 0.5430022875467936 1022.72 1883.4543121733468
[...]
308 800x800 @ 800x800 0.032579898834228516 60 0.5429983139038086 1022.72 1883.4680952272229
309 800x800 @ 800x800 0.03258252143859863 60 0.5430420239766439 1022.72 1883.316492728723
310 800x800 @ 800x800 0.1323096752166748 60 2.2051612536112466 1022.72 463.7846771183555
311 800x800 @ 800x800 0.2970736026763916 60 4.951226711273193 1022.72 206.55891148579838
312 800x800 @ 800x800 0.29687929153442383 60 4.947988192240397 1022.72 206.6941068298959
[...]
329 800x800 @ 800x800 0.2968714237213135 60 4.947857062021892 1022.72 206.69958472528631
```
It is the only instance of such a behaviour across all operations, even within the matmul benchmark. Because of this the result plot doesn't look great:
![Comparison of the float64 matmul operation (LARGE category)](/blog/2021-10-07_Deep%20Learning%20Framework%20Benchmarks/result_matmul_float64_LARGE.png)
## Conclusion
The results so far comfort me in using Torch overall, as I usually design small networks, but Jax seems to be a very interesting contender. I am surprised the difference on small/medium operations could be that significant between Torch and TF2; I sometimes use my DL framework for GPU-accelerated math in other contexts, so this is interesting to know.
The code is not yet complete and in the future I would like to test for more:
* Convolutions : 1d, 2d, transpose
* Gradients
* Optimizers
* RNN : which was the trigger that started all of this
* Data transfer? (CPU->GPU and GPU->CPU)
If you have questions or remarks you can contact me or reply to the [reddit post](https://www.reddit.com/r/MachineLearning/comments/q2y9n5/d_deep_learning_framework_benchmark/).