Deploy your Deep Learning algorithms on embedded systems using FPGAs
Quebec City, July 4th, 2022 — Machine learning (ML) occupies an important place within the large family of digital signal and image processing methods. Because ML algorithms are resource-intensive, they have historically been paired with cloud computing, i.e. run in high-performance computing environments. Recently, thanks to advances in edge computing, it has become possible, and even attractive, to deploy these algorithms on embedded systems. Even so, choosing the hardware configuration on which to deploy an ML method remains a challenge for embedded systems designers. Typically, when the algorithm requires more resources than the CPU can provide, it is common to call on technologies such as GPUs or FPGAs to accelerate the computations and thus improve processing throughput. In this document, we present an experiment comparing the deployment of artificial neural networks (ANN) on CPU, GPU, and FPGA. Our results show that FPGA deployment minimizes cost, maximizes processing throughput, and offers the best energy efficiency. Finally, we discuss a project in which Deep Learning applied to the mining field was deployed on an embedded FPGA.
The use of machine learning to solve, among other things, classification problems in digital vision is growing rapidly, driven by the expanding field of Deep Learning. So-called deep artificial neural networks (ANN) are now used every day to solve ever more elaborate classification problems, pushing past the limits of conventional computer vision algorithms in a wide range of scenarios.
In machine learning, it is common to use ANN with a supervised learning approach. This usually involves two stages:
Training: At this stage, the ML designer answers a number of questions related to the problem to be solved: which type of ANN to use, how to build and prepare the training and validation datasets, which operations to perform in pre-processing and post-processing, etc. Since this step requires the flexibility of an R&D environment, it is generally performed on one or more machines where resources (memory size, computing capacity, etc.) are not an issue.
Inference: Once the ANN has been trained, it can be used for inference, meaning that the network is confronted with entirely new data similar to the training data. If the training was effective, the network will make correct predictions. A typical example: an ANN trained to recognize cats and dogs is presented with an image of a dog it has never seen before, and it is nevertheless able to predict that the image contains a dog.
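The two stages can be illustrated with a toy stand-in for a trained network. The nearest-centroid "model" and the feature values below are purely hypothetical, chosen only to show the train/infer split:

```python
# Toy nearest-centroid "classifier": a hypothetical stand-in for a
# trained ANN, used only to illustrate the train/infer split.

def train(samples):
    """Training stage: average the feature vectors of each class."""
    sums, counts = {}, {}
    for label, x in samples:
        counts[label] = counts.get(label, 0) + 1
        s = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
    return {lbl: [v / counts[lbl] for v in s] for lbl, s in sums.items()}

def infer(model, x):
    """Inference stage: predict the class whose centroid is closest."""
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(c, x))
    return min(model, key=lambda lbl: dist(model[lbl]))

# Made-up features: [ear roundness, snout length]
model = train([("cat", [0.9, 0.2]), ("cat", [0.8, 0.3]),
               ("dog", [0.3, 0.9]), ("dog", [0.2, 0.8])])
print(infer(model, [0.25, 0.85]))  # an unseen sample -> dog
```

The point of the sketch is that training happens once, offline, while inference is the lightweight step that can later be moved to an embedded target.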
It is during this second step that the designer chooses whether or not to deploy the ANN on an embedded system so that the processing happens at the edge. The alternative is for a platform to capture the data and transmit it to the cloud, where an ANN performs the inferences. Including the ANN in the embedded system is the obvious choice when:
The portability of the system is a requirement and the environments are multiple, even unpredictable;
There is a need for low latency to obtain the information at the output of the ANN, for example, to feed other algorithms which are in the same embedded system;
There are constraints on wireless communication: bandwidth, network access, etc.
When an embedded system deployment is chosen, the designer faces many options regarding the type of computing technology. In many cases, an accelerator is necessary because the capacity of the embedded CPU to do the calculations on its own is quickly exceeded. Thus, it is not uncommon to use an on-board GPU or an FPGA alongside the CPU to accelerate the algorithms. As the next section shows, this is a necessary step when deploying an ANN.
Performance comparison of an on-board ANN
In order to compare the performance of an embedded ANN on CPU, GPU, and FPGA, Six Metrix has developed a testbench using a Jetson AGX Xavier dev kit (~USD$700) and a Kria KV260 board (~USD$200). On the Jetson, a framework that maximizes the use of the CUDA cores is used. On the Kria KV260 board, the ANN is deployed on a Xilinx DPU block, which requires quantizing the model, i.e. converting its coefficients from floating point to fixed point. On the CPU side, the Jetson's ARM processor (8-core ARM v8.2 64-bit CPU) serves as the baseline against the GPU and FPGA. Model accuracy is the same across the CPU, GPU, and FPGA implementations.
Two types of convolutional ANNs are deployed: a ResNet-50 and a MobileNet. These two networks are typically used in AI to solve classification problems in computer vision. The same datasets are used on each platform, and the processing always includes:
A pre-processing step that is specific to each platform;
The ANN inference step;
A post-processing step to format the output data.
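These three stages can be sketched as a simple pipeline. Every function below is a hypothetical placeholder for the platform-specific implementation, including the hard-coded scores standing in for a real ResNet-50 or MobileNet forward pass:

```python
# Sketch of the three benchmark stages; each function is a hypothetical
# placeholder for the platform-specific implementation.

def preprocess(image):
    # e.g. normalize pixel values to the range the network expects
    return [p / 255.0 for p in image]

def infer(tensor):
    # stand-in for the ResNet-50 / MobileNet forward pass
    return {"dog": 0.91, "cat": 0.09} if sum(tensor) > 1.0 else {"dog": 0.2, "cat": 0.8}

def postprocess(scores):
    # format the raw scores as a (label, confidence) prediction
    label = max(scores, key=scores.get)
    return label, scores[label]

image = [200, 180, 90, 60]  # toy 4-pixel "image"
print(postprocess(infer(preprocess(image))))
```

In the benchmark, all three stages are timed together, since pre- and post-processing costs differ from one platform to another.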
The figure below illustrates these steps for the FPGA and the GPU. Note that in the CPU-only case, it is the Jetson's ARM CPU that also performs the inference and post-processing steps.
Figure 1: Representation of benchmarking steps
The benchmark measures the time taken by each computing engine to perform a large number of inferences (thousands). The average FPS and the FPS per watt, taking into account the three steps above, are presented in Figure 2. The following conclusions can be drawn from these results:
Deploying an ANN on an embedded CPU alone is counterproductive; an accelerator (GPU or FPGA) is necessary;
For the ANN models used in this experiment, FPGA deployment achieves a processing throughput equivalent to or higher than that of an embedded GPU;
The energy efficiency of FPGAs is an order of magnitude better than that of GPUs. In other words, for the same throughput, the FPGA consumes 10x less power than the embedded GPU;
The cost of an FPGA is often less than the cost of an embedded GPU.
Figure 2: Performance comparison: CPU, GPU and FPGA
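For reference, throughput and energy-efficiency metrics of this kind can be computed as follows. The workload and the power figure below are illustrative placeholders, not our measurements; in practice the average power must come from an external power meter:

```python
import time

def benchmark(run_inference, n_frames, avg_power_w):
    """Run n_frames inferences and report (FPS, FPS per watt).

    avg_power_w must come from an external power measurement;
    it cannot be derived from timing alone.
    """
    start = time.perf_counter()
    for _ in range(n_frames):
        run_inference()
    elapsed = time.perf_counter() - start
    fps = n_frames / elapsed
    return fps, fps / avg_power_w

# Placeholder workload standing in for one pre-process + inference +
# post-process pass; 10.0 W is an illustrative power figure.
fps, fps_per_watt = benchmark(lambda: sum(range(1000)),
                              n_frames=1000, avg_power_w=10.0)
```

Averaging over thousands of frames, as in our benchmark, smooths out timer resolution and scheduling noise.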
Moreover, there are other arguments specific to the world of embedded systems that justify the use of an FPGA:
The number of I/Os: One of the biggest limitations of embedded GPUs is the number of I/Os available to interface with the sensors that often feed the AI. This limitation does not apply to FPGAs, since the FPGA model can be chosen specifically to meet this need.
Optimal resource utilization: Since an FPGA is a fully programmable digital fabric, the designer can size the programmable logic to minimize cost as well as maximize resource utilization. This approach also makes it possible, at a later stage, to migrate to an ASIC design, which yields an optimal production cost.
Unfortunately, although they are often the optimal choice, FPGAs are difficult to master, and specialized expertise is needed to ensure an effective deployment. Thanks to our extensive expertise in embedded digital processing systems, Six Metrix is positioned as a key player in the market for ANN deployment on FPGA accelerators.
Success Story: The deployment on FPGA of a bolt detection ANN
Nemesis Intelligence Inc, a company that innovates in the mining field through the development of cutting-edge algorithms, has trained an ANN to detect bolts on the walls of underground mines. This new approach makes it possible to extract critical information in the immediate environment of each vehicle in the mine. This information is then used to feed multiple algorithms that allow mine operators to make better decisions.
As Nemesis equipment is embedded in vehicles moving in underground galleries, wireless data transfer is an issue. The deployment of ANN on an embedded system is therefore required.
It is in this context that a partnership was established between Six Metrix and Nemesis Intelligence to deploy the trained ANN in an R&D environment on embedded FPGAs.
In order to proceed with the FPGA deployment, we:
1. familiarized ourselves with the ANN model as well as with the pre-processing and post-processing stages that had been created;
2. performed the ANN quantization, a step that sometimes requires pruning to make the ANN implementable;
3. validated the performance of the embedded ANN against the unquantized ANN;
4. created an embedded data processing application using the quantized ANN and including the pre/post-processing stages;
5. deployed the application on FPGA using a platform specially designed for it.
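The quantize-then-validate loop can be sketched as follows. This is a minimal illustration of linear 8-bit quantization with an accuracy check, not the actual Xilinx DPU tool flow, and the weight values are hypothetical:

```python
def quantize(weights, bits=8):
    """Map floating-point weights to signed fixed point (bits wide)."""
    scale = max(abs(w) for w in weights) / (2 ** (bits - 1) - 1)
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

# Hypothetical weights; a real flow would quantize whole layers.
weights = [0.51, -0.32, 0.08, -0.77]
q, scale = quantize(weights)
max_err = max(abs(a - b) for a, b in zip(weights, dequantize(q, scale)))
assert max_err < 0.01  # validation against the unquantized values
```

If the validation fails (accuracy degrades too much), the real flow goes back to the quantization step, possibly after pruning or retraining, which is why the process is iterative.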
Note that this process is iterative with respect to steps 2 and 3. In the case of Nemesis, the application developed is an image processing pipeline that performs the following operations:
High-definition images are first pre-processed on FPGA to adapt them to the expected ANN input;
The images are then passed on to the DPU which performs the ANN inferences;
The results are presented as data in text format and as annotated images, as shown in Figure 3.
Figure 3: An ANN identifies and classifies bolts on the walls of an underground mine (image property of Nemesis Intelligence Inc)
To contact Six Metrix: