diff --git a/.gitattributes b/.gitattributes index 3721121..2ad9c5a 100644 --- a/.gitattributes +++ b/.gitattributes @@ -1,4 +1,5 @@ -assets/blog/**/* filter=lfs diff=lfs merge=lfs -text +assets/blog/**/*.jpg filter=lfs diff=lfs merge=lfs -text +assets/blog/**/*.png filter=lfs diff=lfs merge=lfs -text assets/fonts/* filter=lfs diff=lfs merge=lfs -text assets/icons/* filter=lfs diff=lfs merge=lfs -text assets/images/* filter=lfs diff=lfs merge=lfs -text diff --git a/assets/blog/2021-10-07_Deep Learning Framework Benchmarks/Deep Learning Framework Benchmarks.md b/assets/blog/2021-10-07_Deep Learning Framework Benchmarks/Deep Learning Framework Benchmarks.md index 9d94ceb..966ff83 100644 --- a/assets/blog/2021-10-07_Deep Learning Framework Benchmarks/Deep Learning Framework Benchmarks.md +++ b/assets/blog/2021-10-07_Deep Learning Framework Benchmarks/Deep Learning Framework Benchmarks.md @@ -1,3 +1,138 @@ -version https://git-lfs.github.com/spec/v1 -oid sha256:55ac55dc3dee1a9da272d51dbce19ea52d409f697ac05d50743b9684f28de6b5 -size 10047 +# Deep Learning Framework Benchmark +*Posted on October 7, 2021* + +## Preamble + +There are a few frameworks for building deep learning neural networks. I used to be very familiar with Tensorflow back in the days when the second version was not yet released. As someone with a software engineering background, the strictness and clarity of this first version of Tensorflow were a joy. The graphs output by tensorboard were also amazing, to the point that I got into the habit of debugging my networks from tensorboard most of the time. + +<center>
+ +### Graphs shown in tensorboard from Tensorflow version 1 + +![Tensorboard graph from version 1](/blog/2021-10-07_Deep%20Learning%20Framework%20Benchmarks/tf1_graph.png) +![Network graph from version 1](/blog/2021-10-07_Deep%20Learning%20Framework%20Benchmarks/tf1_graph_network.png) +![Train modules graph from version 1](/blog/2021-10-07_Deep%20Learning%20Framework%20Benchmarks/tf1_graph_train.png) + +</center>
+ +Those days are gone: a new era of dynamic programming has come to the deep learning field, with PyTorch becoming increasingly popular (from what I have experienced), the second version of Tensorflow converging to the same API, and Jax going a step further with a near-python programming paradigm. The dynamic paradigm has some very nice points; especially if you do reinforcement learning, it makes things way easier. + +There are also other frameworks I haven't tested yet, like [MXNet](https://mxnet.apache.org/versions/1.8.0/). + +Now most of the frameworks I have experienced have nearly the same API, and ONNX brings a very nice way to export the final result of a training independently of the framework. Thus choosing which one to use is less clear-cut than before. + +Lately I have been trying out some RNN-like networks with different modifications to improve the infamous *long term memory* problem (hopefully I will post something about that later). Using PyTorch I was very frustrated that the included [LSTM layer](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) ran very well but **equivalent hand-written code would run several times slower** (around 1/3 of the speed), even when following the [official documentation on GPU optimizations](https://pytorch.org/blog/optimizing-cuda-rnn-with-torchscript/) (which seems deprecated on a few points), sometimes to the point of going at 10% of the initial speed. So if I want to do some research, I might as well choose a framework that wouldn't run so slowly that I would wait hours for a training that could be performed in minutes. But would other frameworks really give me better performance? + +I decided to see for myself how the different frameworks behave, starting from simple operations and hopefully testing up to whole network trainings.
+ +I will use a naming convention for the frameworks (also called platforms in my scripts) tested here: + +* TF1 : first version of Tensorflow (version 1.x); as of this writing the latest version is 1.15 +* TF2 : second version of Tensorflow (version 2.x); as of this writing the latest version is 2.6 +* TF2_V1 : second version of Tensorflow but using the compatibility API to write code as in the first version, also disabling the dynamic behaviour (I suspected different performance) +* Torch : PyTorch +* Jax + + +## Benchmarking implementation + +### No Gradient + +This is obvious, but PyTorch is very nice for the majority of cases where you need to compute gradients, which is not the case here, as I started with the simplest operations first. The `requires_grad=False` argument on all tensors does the trick in PyTorch, while Tensorflow and Jax don't need any additional care as far as I know. + +### Warmup + +I have experienced many times on all frameworks so far that the first run is always several times slower. This is expected with the dynamically allocated tensors of the modern frameworks, but I strongly remember it happening too when I was using TF1. To keep the first run from skewing the benchmark, each experiment has a small warmup loop: + +``` +# warmup +for _ in range(20): + self.experiment() +``` + +### Optimizations + +I had to test with a **random tensor at start and before each operation** to be sure that frameworks do not optimize out operations already performed (e.g. through caching), especially since I disabled gradient computation. All my tests showed no difference, so I stuck with tensors initially filled with ones. + +### Benchmark time + +Each operation is benchmarked during an "experiment". To get consistent benchmarking times, a first loop is run to estimate the number of operations per second, then the loop being benchmarked is run with a fixed number of steps derived from that estimation.
This allows setting the time per experiment in a configuration file for statistical stability and avoids unnecessary calls to the system clock (CPython not being known for its speed, I'd rather have a simple integer increment per loop iteration as overhead). + +Later this could also make a progress bar with an ETA possible, as the benchmarks can be quite exhaustive. + +## Results + +**The code is publicly available [here](https://gitlab.com/corentin-pro/dl_bench). It will output raw data as csv files along with their plots. All the data and plots from my machine (NVIDIA GeForce RTX 2060 SUPER) can be downloaded [here](/blog/2021-10-07_Deep%20Learning%20Framework%20Benchmarks/gpu_NVIDIA%20GeForce%20RTX%202060%20SUPER.zip).** + +<center>
+ +### Experiment benchmark samples + +![Benchmark results for Torch with the matmul operation](/blog/2021-10-07_Deep%20Learning%20Framework%20Benchmarks/result_torch_matmul_float32.png) +![Benchmark results for Jax with the dense layer](/blog/2021-10-07_Deep%20Learning%20Framework%20Benchmarks/result_jax_nn_dense_float32.png) + +
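
The timing scheme from the "Benchmark time" section above (a short probe loop to estimate the rate, then a single fixed-count timed loop) can be sketched in plain Python. The function and parameter names here are illustrative, not taken from the benchmark repository:

```python
import time

def calibrate_then_benchmark(operation, target_seconds=1.0, probe_seconds=0.05):
    """Probe the rate of `operation`, then time a fixed-count loop."""
    # warmup so the first (slower) runs don't skew anything
    for _ in range(20):
        operation()
    # probe loop: count how many runs fit in probe_seconds
    count = 0
    start = time.perf_counter()
    while time.perf_counter() - start < probe_seconds:
        operation()
        count += 1
    ops_per_second = count / probe_seconds
    # benchmarked loop: fixed step count, so only two clock calls
    # and one integer increment per iteration as overhead
    steps = max(1, int(ops_per_second * target_seconds))
    start = time.perf_counter()
    for _ in range(steps):
        operation()
    elapsed = time.perf_counter() - start
    return steps, elapsed
```

With `target_seconds` read from a configuration file, every experiment runs for roughly the same wall-clock time no matter how cheap the individual operation is.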
+ +As expected, the bigger the operations (experiments), the better the [GFLOPS](https://en.wikipedia.org/wiki/FLOPS) (giga floating point operations per second) the GPU can sustain. So far nothing unexpected. + +### Comparisons + +Comparison plots are also generated from the experiment data. For now the only comparisons are between 'platforms' (aka frameworks), but data type comparisons could be interesting in the future. Categories were made to plot subsets of comparisons in order to keep the scale of the y axis linear; the script will automatically switch to a logarithmic scale if needed in the general case. The categories are ranges of Mop (millions of operations) per experiment, like `MEDIUM = [20, 1000]` (there are SMALL, MEDIUM, LARGE and VERY_LARGE), and can be changed in the configuration files. + +<center>
+ +### Comparison samples + +![Comparison of the add operation](/blog/2021-10-07_Deep%20Learning%20Framework%20Benchmarks/result_add_float32.png) +![Comparison of the dense layer for the MEDIUM category](/blog/2021-10-07_Deep%20Learning%20Framework%20Benchmarks/result_nn_dense_float32_MEDIUM.png) +![Comparison of 5 dense layer in sequence for the VERY_LARGE category](/blog/2021-10-07_Deep%20Learning%20Framework%20Benchmarks/result_nn_dense_x5_float32_VERY_LARGE.png) + +
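
As a sanity check on how such figures are derived: the Mop and GFLOPS columns of the raw csv data can be reproduced from the tensor shapes alone. A small sketch for a square matmul, assuming the 2·n²·(n−1) operation count that appears to match the Mop column in the raw data (the function name is mine, not from the repository):

```python
def square_matmul_throughput(n, seconds_per_matmul):
    """Operation count and GFLOPS for an n x n @ n x n matmul."""
    # 2 * n^2 * (n - 1): the convention that reproduces the Mop column
    ops = 2 * n * n * (n - 1)
    mop = ops / 1e6                           # millions of operations
    gflops = ops / seconds_per_matmul / 1e9   # throughput
    return mop, gflops

# 800x800 matmul at ~0.543 ms per call (a run time seen in the raw data)
mop, gflops = square_matmul_throughput(800, 0.543e-3)
# -> roughly 1022.72 Mop and ~1883 GFLOPS
```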
+ +**NOTE** : the `nn` prefix on an operation means that it is run inside a 'module' (or equivalent); in Jax for instance I used `stax` and `jit` as intended by the library. JIT is not needed for Torch as far as I tested. + +Torch seems the best for simple and small operations, while tensorflow in general seems to have big overheads. Jax does very well once we use the JIT. All frameworks tend to converge with bigger layers/operations; the XLA-based Tensorflow and Jax seem to have slightly better performance there. But for small operations Torch can be orders of magnitude faster! + +The results between float32 and float16 are very similar, but float64 is different: + +* For some reason TF2 didn't accept matmul on float64 inside a module; I should fix that later +* TF2 gets better results relative to the other platforms +* Except for element-wise operations, Torch loses its lead on small operations +* There is a weird behaviour for the matmul of 800x800 tensors in both Torch and TF2. Even after additional testing I couldn't figure out why the first runs (even after warmup) were way too fast. + +The specific behaviour of the 800x800 matmul in the data (see `run times (s)`) looks like: + +``` + experiment run times (s) count ms/matmul Mop/matmul GFLOPS +300 800x800 @ 800x800 0.03258013725280762 60 0.5430022875467936 1022.72 1883.4543121733468 +[...] +308 800x800 @ 800x800 0.032579898834228516 60 0.5429983139038086 1022.72 1883.4680952272229 +309 800x800 @ 800x800 0.03258252143859863 60 0.5430420239766439 1022.72 1883.316492728723 +310 800x800 @ 800x800 0.1323096752166748 60 2.2051612536112466 1022.72 463.7846771183555 +311 800x800 @ 800x800 0.2970736026763916 60 4.951226711273193 1022.72 206.55891148579838 +312 800x800 @ 800x800 0.29687929153442383 60 4.947988192240397 1022.72 206.6941068298959 +[...]
+329 800x800 @ 800x800 0.2968714237213135 60 4.947857062021892 1022.72 206.69958472528631 +``` + +It is the only instance of such a behavior across all operations, and even within the matmul benchmark. Because of this the result plot doesn't look great: + +![Comparison of the float64 matmul operation (LARGE category)](/blog/2021-10-07_Deep%20Learning%20Framework%20Benchmarks/result_matmul_float64_LARGE.png) + + +## Conclusion + +The results so far comfort me in using Torch overall, as I usually design small networks, but Jax seems to be a very interesting contender. I am surprised the difference on small/medium operations could be that significant between Torch and TF2; I sometimes use my DL framework for GPU-accelerated math in other contexts, so it is interesting to know. + +The code is not yet complete and in the future I would like to test more: + +* Convolutions : 1d, 2d, transpose +* Gradient +* Optimizer +* RNN : which was the trigger that started all of this +* Data transfer? (CPU->GPU and GPU->CPU) + +If you have questions or remarks you can contact me or reply to the [reddit post](https://www.reddit.com/r/MachineLearning/comments/q2y9n5/d_deep_learning_framework_benchmark/). + + +Corentin Risselin.
diff --git a/assets/blog/2023-09-12_Self-Hosted Web Infrastructure/Self-Hosted Web Infrastructure.md b/assets/blog/2023-09-12_Self-Hosted Web Infrastructure/Self-Hosted Web Infrastructure.md index 712ea8a..ef3b80b 100644 --- a/assets/blog/2023-09-12_Self-Hosted Web Infrastructure/Self-Hosted Web Infrastructure.md +++ b/assets/blog/2023-09-12_Self-Hosted Web Infrastructure/Self-Hosted Web Infrastructure.md @@ -1,3 +1,78 @@ -version https://git-lfs.github.com/spec/v1 -oid sha256:69b2bbc5eefa95f9da335e3d62db52871b0a8af9041457a4d88992ba6fc06641 -size 7967 +# Self-Hosted Web Infrastructure +*Posted on September 12, 2023* + +## Preamble + +This month I had to set up a new server to host this very website as well as additional subdomains for a self-hosted Git and a Doc/Calendar sharing application. None of this requires any big hardware, and it can all be handled by a small IoT device in my office. + +Deploying a server, even a small and simple one, has some caveats that need to be taken into consideration. I will share the setup I just did for the latest server I am using to host my applications in-house. + +## Hardware consideration + +The first name that probably comes to everyone's mind is the Raspberry-pi. It's capable enough: this website was hosted on a ~8 year old raspberry-pi 2 until now. I have also worked on different projects with raspberry-pis (magic-mirror, robot-arm control with a multi-camera setup, etc) and it was always very frustrating. + +Here's a quick overview of the problems I know: + +* GPU capabilities were not great because the vendor Broadcom wasn't helpful to the FOSS (Free Open Source Software) community and the driver was lacking. This means that outputting realtime images/video and overlays would be very slow unless you switch to the experimental driver. The experimental driver has issues such as stability, and I even got noisy output (bearable, only a few pixels here and there, but weird nonetheless).
+* No real ethernet, but ethernet over USB (fixed from version 3 or 4), coupled with a single not-so-powerful USB controller; this would lead to USB congestion. For instance you couldn't use 2 webcams at 720p resolution and 30 fps (I am not even sure about 480p; I had to lower the resolution and FPS a lot). During USB congestion any SSH session would get stuck when working with such a device. +* The SD card reader was extremely slow (versions 1 and 2, probably 3 too) +* With time the value proposition gets worse for the price : the raspberry-pi started as a cheap computing experience for educational purposes and prototyping. But now the price has quadrupled for performance that is not even above average compared to other devices +* Version 4 opted for a micro-HDMI port instead of a full-sized one + +There are many alternatives I looked at; among them one caught my attention : Odroid. This device has been widely known for a long time and the company, based in South Korea, is now named hardkernel. The proximity of the company is very nice and the catalog they offer is very interesting. Here are a few points that made me, over time, look mostly at them for my IoT needs: + +* Very focused on making hardware and not selling other things +* Great value proposition for the price : especially for the accessories (power supply, cases, etc) +* Hardware is updated frequently : you are not stuck buying a 5 year old CPU +* Hardware is documented : schematics and even benchmarks (performance and power consumption)! +* Amazing support : I once had a problem with a single component; reaching out to them by email I got a response immediately and they shipped me a replacement for free in the next order I made (I asked for that specifically as I was buying a new device to build my NAS). + +Long story short : I eventually ended up using the Odroid-N2+ device multiple times and it is mostly flawless (using the panfrost driver for the GPU is tricky).
You can check [all the specifications from hardkernel's website](https://www.hardkernel.com/shop/odroid-n2-with-4gbyte-ram-2). + +
+ +### Odroid N2+ + +![Odroid N2+ photo](/blog/2023-09-12_Self-Hosted%20Web%20Infrastructure/odroidn2plus.jpg) + +
+ + +## OS/Distribution + +### Distribution Choice + +Finding a good Linux distribution can be troublesome too. For me the choice is not too hard to make : I come from a Debian background, and most of the corporate-minded distributions such as CentOS (discontinued now)/RHEL have no appeal to me. I can fix my issues and contribute back if needed; I don't need support. + +[Debian](https://wiki.debian.org) is amazing at its job : being stable. The people maintaining it are probably the most professional and rigorous I've seen. There is one drawback to the distribution : everything is ~3 years old. Because of my bias toward deep learning engineering, which needs more recent libraries, and issues I had in the past with too-old databases (namely MySQL) missing very important features, I am choosing another distribution. Also note that official Debian does not offer a lot of possibilities for [Arm ports](https://wiki.debian.org/Arm64Port). + +For IoTs [Armbian](https://www.armbian.com/) is amazing and definitely a strong choice. I am using it on some of my devices; it has very interesting defaults that I will take inspiration from (like in-memory logs). + +Being familiar with ArchLinux, [ArchLinuxArm](https://archlinuxarm.org/) is a perfect fit for me. I like the deliberate choice to leave configurations at the software's defaults. It takes some work off the maintainer and puts it onto the user. A lot of software has improved its defaults in recent years anyway, and it gives more knowledge/control to the user. + +**Note :** Another noticeable distribution is [Manjaro Arm](https://manjaro.org/download/#ARM). On the x86-64 platform there are differences between ArchLinux and Manjaro that spark a lot of debate. But on Arm platforms ArchLinuxArm is a totally different project, and there Manjaro can bring a lot of value : packages can be better maintained (noticeable if you want the latest GPU driver).
+ +### Installation + +Compared to a PC installation, installing ArchLinux on Arm platforms is way more straightforward. The boot sequence of Arm devices is very different and I have never seen UEFI on Arm. Arm devices boot thanks to a device tree (declared in .dts files directly in the kernel code and compiled to .dtb files); these describe all the hardware and the initialization values to make the device work. This means that the Linux project has hundreds, if not thousands, of those files to be able to run on any device (a dts can target a whole family of devices sharing hardware). + +Installing Linux on an IoT device usually means copying a prepared image onto an SD card or other storage, sometimes with the DTB files placed in a separate boot partition for linux to boot with. Additional initialization values can also be set in the boot partition in a `boot.ini` file. The boot sequence usually starts not from linux but from a boot loader : on PC Grub is a common one; on IoT you may encounter `u-boot` and, more rarely, `petitboot`. These are usually already prepared, so nothing needs to be done on that part. + +For ArchLinuxArm, preparing the partitions on the storage is the biggest part of the installation, but it only requires ~10 commands and is nicely documented. For the Odroid-N2 the procedure is described [here](https://archlinuxarm.org/platforms/armv8/amlogic/odroid-n2). + +Additional commands you probably want to run at first, as root, would be: + +* **Edit the pacman (package manager) configuration** : `vim /etc/pacman.conf` (here with `vim`, but `nano` would also work fine). Uncommenting `Color`, `VerbosePkgLists` and `ParallelDownloads` is a must-have. +* **Update all packages** : `pacman -Syu` +* **Install sudo** : `pacman -S sudo` +* **Create a sudo group and add your user to it** : `groupadd sudo` and `usermod -aG sudo myuser` (run the `usermod` part after creating the user, see below). +* **Allow the sudo group to run any command** : `EDITOR=vim visudo` (again, `nano` would work instead of `vim`).
Make sure the line `%sudo ALL=(ALL:ALL) ALL` is uncommented. +* **Add your user** : `useradd -m myuser`. The `-m` option creates a home folder (`/home/myuser`) +* **Set your user password** : `passwd myuser`. + + +## Conclusion + +I hope this introduction to the hardware and software used for self-hosting was interesting and maybe gave you some thoughts about trying it yourself. There are many topics from here that I will present : security, server software installation and configuration, DNS, certificates, etc. Stay tuned for the next post. + +Corentin Risselin. \ No newline at end of file diff --git a/assets/blog/draft_2023-09-XX-_Web_infrastructure_2/Web Infrastructure 2.md b/assets/blog/draft_2023-09-XX-_Web_infrastructure_2/Web Infrastructure 2.md index dd577a2..ce0e3f3 100644 --- a/assets/blog/draft_2023-09-XX-_Web_infrastructure_2/Web Infrastructure 2.md +++ b/assets/blog/draft_2023-09-XX-_Web_infrastructure_2/Web Infrastructure 2.md @@ -1,3 +1,6 @@ -version https://git-lfs.github.com/spec/v1 -oid sha256:8baa7d66f3c605acc7c6a9e1208adc4313aac8a13766a51fe1ca34ca748bd893 -size 374 + +## Security + +## Offloading storage operations + +IoTs usually rely on an SD card or eMMC for storage. SD card I/O (Input/Output) operations can be slow depending on the card and the device, and the lifespans of both SD cards and eMMC suffer from frequent writes. Those points have always bothered me, and I like to leverage the computer's "memory" (the volatile memory : RAM) as much as possible.
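
The usual remedy for this is keeping write-heavy paths in RAM with `tmpfs` mounts. A minimal sketch of what that can look like in `/etc/fstab`; the mount points and sizes are illustrative choices, not taken from the draft:

```
# keep frequently written paths in volatile memory to spare the SD card / eMMC
tmpfs  /tmp      tmpfs  defaults,noatime,size=256M  0  0
tmpfs  /var/log  tmpfs  defaults,noatime,size=64M   0  0
```

Anything mounted this way is lost on reboot, so logs that must survive a power cycle need to be flushed to persistent storage periodically (which is roughly what Armbian's on-memory log setup does).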