Posts Tagged ‘computer vision’

Arm Research Summit 2017 Streamed Live on September 11-13

September 11th, 2017 2 comments

The Arm Research Summit is “an academic summit to discuss future trends and disruptive technologies across all sectors of computing”, with the second edition of the even taking place now in Cambridge, UK until September 13, 2017.

Click to Enlarge

The Agenda includes various subjects such as architecture and memory, IoT, HPC, computer vision, machine learning, security, servers, biotechnology and others. You can find the full detailed schedule for each day on Arm website, and the good news is that the talks are streamed live in YouTube, so you can follow the talks that interest you from the comfort of your home/office.

Note that you can switch between rooms in the stream above by clicking on <-> icon. Audio volume is a little low…

Thanks to Nobe for the tip.

Getting Started with OpenCV for Tegra on NVIDIA Tegra K1, CPU vs GPU Computer Vision Comparison

May 24th, 2017 No comments

This is a guest post by Leonardo Graboski Veiga, Field Application Engineer, Toradex Brasil


Computer vision (CV) is everywhere – from cars to surveillance and production lines, the need for efficient, low power consumption yet powerful embedded systems is nowadays one of the bleeding edge scenarios of technology development.

Since this is a very computationally intensive task, running computer vision algorithms in an embedded system CPU might not be enough for some applications. Developers and scientists have noticed that the use of dedicated hardware, such as co-processors and GPUs – the latter traditionally employed for graphics rendering – can greatly improve CV algorithms performance.

In the embedded scenario, things usually are not as simple as they look. Embedded GPUs tend to be different from desktop GPUs, thus requiring many workarounds to get extra performance from them. A good example of a drawback from embedded GPUs is that they are hardly supported by OpenCV – the de facto standard libraries for computer vision – thus requiring a big effort from the developer to achieve some performance gains.

The silicon manufacturers are paying attention to the growing need for graphics and CV-oriented embedded systems, and powerful processors are being released. This is the case with the NVIDIA Tegra K1, which has a built-in GPU using the NVIDIA Kepler architecture, with 192 cores and a processing power of 325 GFLOPS. In addition, this is one of the very few embedded GPUs in the market that supports CUDA, a parallel computing platform from NVIDIA. The good news is that OpenCV also supports CUDA.

And this is why Toradex has decided to develop a System on Module (aka Computer on Module) – the Apalis TK1 – using this processor. In it, the K1 SoC Quad Core ARM Cortex-A15 CPU runs at up to 2.2GHz, interfaced to 2GB DDR3L RAM memory and a 16GB 8-bit eMMC. The full specification of the CoM can be found here.

The purpose of this article is to install the NVIDIA JetPack on the Apalis TK1 System on Module, thus also installing OpenCV for Tegra, and trying to assess how much effort is required to code some simple CV application accelerated by CUDA. The public OpenCV is also tested using the same examples, to determine if it is a viable alternative to the closed-source version from NVIDIA.


The hardware employed in this article consists of the Apalis TK1 System on Module and the Apalis Evaluation Board. The main features of the Apalis TK1 have been presented in the introduction, and regarding the Apalis Evaluation Board, we will use the DVI output to connect to a display and the USB ports to interface a USB camera and a keyboard. The Apalis TK1 is presented in figure 1 and the Apalis Evaluation Board in figure 2:

Figure 1 – Apalis TK1 – Click to Enlarge

Figure 2 – Apalis Evaluation Board – Click to Enlarge

System Setup

NVIDIA already provides an SDK package – the NVIDIA JetPack – that comes with all tools that are supported for the TK1 architecture. It is an easy way to start developing applications with OpenCV for Tegra support. JetPack also provides many source code samples for CUDA, VisionWorks, and GameWorks. It also installs the NVIDIA Nsight, an IDE that is based on Eclipse and can be useful for debugging CPU and GPU applications.

OpenCV for Tegra is based on version 2.4.13 of the public OpenCV source code. It is closed-source but free to use and benefits from NEON and multicore optimizations that are not present in the open-source version; on the other hand, the non-free libraries are not included. If you want or need the open-source version, you can find more information on how to build OpenCV with CUDA support here – these instructions were followed and the public OpenCV 2.4.13 was also tested during this article’s development.

Toradex provides an article in the developer website with concise information describing how to install JetPack on the Apalis TK1.

Regarding hardware, it is recommended that you have an USB webcam connected to the Apalis Evaluation Board because samples tested in this article often need a video source as input.

OpenCV for Tegra

After you have finished installing the NVIDIA JetPack, OpenCV for Tegra will already be installed on the system, as well as the toolchain required for compilation on the target. You must have access to the serial terminal by means of an USB to RS-232 adapter or an SSH connection.

If you want to run Python code, an additional step on the target is required:

The easiest way to check that everything works as expected is to compile and run some samples from the public OpenCV repository since it already has the Cmake configuration files as well as some source code for applications that make use of CUDA:

We can begin testing a Python sample, for instance, the edge detector. The running application is displayed in figure 3.

Figure 3 – running Python edge detector sample – Click to Enlarge

After the samples are compiled, you can try some of them. A nice try is the “background/foreground segmentation” samples since they are available with and without GPU support. You can run them from the commands below, as well as see the results in figures 4 and 5.

Figure 4 – running bgfg_segm CPU sample – Click to Enlarge

Figure 5 – running bgfg_segm GPU sample – Click to Enlarge

By running both samples it is possible to subjectively notice the performance difference. The CPU version has more delay.

Playing Around

After having things setup, the question comes: how easy it is to port some application from CPU to GPU, or even start developing with GPU support? It was decided to play around a little with the Sobel application that is well described in the Sobel Derivatives tutorial.

The purpose is to check if it’s possible to benefit from CUDA out-of-the-box, therefore only the function getTickCount from OpenCV is employed to measure the execution time of the main loop of the Sobel implementations. You can use the NVIDIA Nsight for advanced remote debugging and profiling.

The Code

The first code is run completely on the CPU, while in the first attempt to port to GPU (the second code, which will be called CPU-GPU), the goal is to try to find functions analog to the CPU ones, but with GPU optimization. In the last attempt to port, some improvements are done, such as creating filter engines, which reduces buffer allocation, and finding a way to replace the CPU function convertScaleAbs into GPU accelerated functions.

A diagram describing the loop for the three examples is provided in figure 6.

Figure 6 – CPU / CPU-GPU / GPU main loop for Sobel implementations

The main loop for the three applications tested is presented below. You can find the full source code for them on Github:

  • CPU only code:
  • CPU-GPU code:
  • GPU code

The Tests

  • Each of the three examples is executed using a random picture in jpeg format as input.
  • The input pictures dimensions in pixels that were tested are: 3483×2642, 2122×1415, 845×450 and 460×290.
  • The main loop is being iterated 500 times for each run.
  • All of the steps described in figure 6 have their execution time measured. This section will present the results.
  • Therefore there are 12 runs total.
  • The numbers presented in the results are the average values of the 500 iterations for each run.

The Results

The results presented are the total time required to execute the main loop – with and without image capture and display time, available in tables 1 and 2 – and the time each task takes to be executed, which is described in figures 7, 8, 9 and 10. If you want to have a look at the raw data or reproduce the tests, everything is in the aforelinked GitHub repository.

Table 1 – Main loop execution time, in milliseconds

Table 2 – Main loop execution time, discarding read and display image times, in milliseconds

Figure 7 – execution time by task – larger image (3483×2642 pixels) – Click to Enlarge

Figure 8 – execution time by task – large image (2122×1415 pixels) – Click to Enlarge

Figure 9 – execution time by task – small image (845×450 pixels) – Click to Enlarge

Figure 10 – execution time by task – smaller image (460×290 pixels) – Click to Enlarge

The Analysis

Regarding OpenCV for Tegra in comparison to the public OpenCV, the results point out that OpenCV for Tegra has been optimized, mostly for some CPU functions. Even when discarding image read  – that takes a long time to be executed, and has approximately a 2x gain – and display frame execution times, OpenCV for Tegra still bests the open-source version.

When considering only OpenCV for Tegra, from the tables, it is possible to see that using GPU functions without care might even make the performance worse than using only the CPU. Also, it is possible to notice that, for these specific implementations, GPU is better for large images, while CPU is best for small images – when there is a tie, it would be nice to have a power consumption comparison, which hasn’t been done, or also consider the fact that this GPU code is not optimized as best as possible.

Looking at the figures 7 to 10, it can be seen that the Gaussian blur and scale conversion from 16 bits to 8 bits had a big boost when running on GPU, while conversion of the original image to grayscale and the Sobel derivatives had their performance degraded. Another point of interest is the fact that transferring data from/to the GPU has a high cost, and this is, in part, one of the reasons why the first GPU port was unsuccessful – it had more copies than needed.

Regarding image size, it can be noticed that the image read and display have an impact in overall performance that might be relevant depending on the complexity of the algorithm being implemented, or how the image capture is being done.

There are probably many ways to try and/or make this code more optimized, be it by only using OpenCV; by combining custom CUDA functions with OpenCV; by writing the application fully in CUDA or; by using another framework or tool such as VisionWorks.

Two points that might be of interest regarding optimization still in OpenCV are the use of streams – asynchronous execution of code on the CPU/GPU – and zero-copy or shared memory, since the Tegra K1 has CPU and GPU shared memory supported by CUDA (see this NVIDIA presentation from GPU Technology Conference and this NVIDIA blog post for reference).


In this article, the installation of the NVIDIA JetPack SDK and deployment on the Toradex Apalis TK1 have been presented. Having this tool installed, you are able to use OpenCV for Tegra, thus benefiting from all of the optimizations provided by NVIDIA. The JetPack SDK also provides many other useful contents, such as CUDA, VisionWorks and GameWorks samples, and the NVIDIA Nsight IDE.

In order to assess how easy it is for a developer freshly introduced to the CV and GPU concepts to take advantage of CUDA, purely using OpenCV optimized functions, a CPU to GPU port of a Sobel filter application was written and tested. From this experience, some interesting results were found, such as the facts that GPU indeed improves performance – and this improvement magnitude depends on a series of factors, such as size of the input image, quality of implementation – or developer experience, algorithms being used and complexity of the application.

Having a myriad of sample source code, it is easy to start developing your own applications, although care is required in order to make the Apalis TK1 System on Module yield its best performance. You can find more development information in the NVIDIA documentation, as well as the OpenCV documentation. Toradex also provides documentation about Linux usage in its developer website, and has a community forum. Hope this information was helpful, see you next time!

Open Source ARM Compute Library Released with NEON and OpenCL Accelerated Functions for Computer Vision, Machine Learning

April 3rd, 2017 11 comments

GPU compute promises to deliver much better performance compared to CPU compute for application such a computer vision and machine learning, but the problem is that many developers may not have the right skills or time to leverage APIs such as OpenCL. So ARM decided to write their own ARM Compute library and has now released it under an MIT license.

The functions found in the library include:

  • Basic arithmetic, mathematical, and binary operator functions
  • Color manipulation (conversion, channel extraction, and more)
  • Convolution filters (Sobel, Gaussian, and more)
  • Canny Edge, Harris corners, optical flow, and more
  • Pyramids (such as Laplacians)
  • HOG (Histogram of Oriented Gradients)
  • SVM (Support Vector Machines)
  • H/SGEMM (Half and Single precision General Matrix Multiply)
  • Convolutional Neural Networks building blocks (Activation, Convolution, Fully connected, Locally connected, Normalization, Pooling, Soft-max)

The library works on Linux, Android or bare metal on armv7a (32bit) or arm64-v8a (64bit) architecture, and makes use of  NEON, OpenCL, or  NEON + OpenCL. You’ll need an OpenCL capable GPU, so all Mali-4xx GPUs won’t be fully supported, and you need an SoC with Mali-T6xx, T-7xx, T-8xx, or G71 GPU to make use of the library, except for NEON only functions.

In order to showcase their new library, ARM compared its performance to OpenCV library on Huawei Mate 9 smartphone with HiSilicon Kirin 960 processor with an ARM Mali G71MP8  GPU.

ARM Compute Library vs OpenCV, single-threaded, CPU (NEON)

Even with some NEON acceleration in OpenCV, Convolutions and SGEMM functions are around 15 times faster with the ARM Compute library. Note that ARM selected a hardware platform with one of their best GPU, so while it should still be faster on other OpenCL capable ARM GPUs the difference will be lower, but should still be significantly, i.e. several times faster.

ARM Compute Library vs OpenCV, single-threaded, CPU (NEON)

The performance boost in other function is not quite as impressive, but the compute library is still 2x to 4x faster than OpenCV.

While the open source release was just about three weeks ago, the ARM Compute library has already been utilized by several embedded, consumer and mobile silicon vendors and OEMs better it was open sourced, for applications such as 360-degree camera panoramic stitching, computational camera, virtual and augmented reality, segmentation of images, feature detection and extraction, image processing, tracking, stereo and depth calculation, and several machine learning based algorithms.

Rockchip RV1108 Visual Processor is Designed for 1080p & 2K Camera Applications

January 9th, 2017 2 comments

Rockchip has joined other companies in developing camera SoCs with their RV1108 Visual Platform based on a single Cortex A7 core, a CEVA XM4 visual processing DSP, and capable of H.264 video encoding up to 1440p30 / 1080p60.

360 Camera Demo at CES 2017

360° Camera Demo at CES 2017 – Click to Enlarge

Rockchip RV1108 main features and specifications:

  • CPU – ARM Cortex A7 @ up to 1.0 GHz
  • DSP – embedded CEVA XM4 vision processor up to 600MHz
  • Video Encoder – 2K/H.264, high definition & low bit rate
  • Camera – MIPI CSI and DVP interfaces
  • Image processing – Low-light-level night vision imaging: 8MP professional image processing unit
  • Audio Processing – Audio Codec supporting up to 8-way MIC array, 3A? phonetic algorithms, such as echo cancellation, noise suppression;
  • Networking – 10/100 Ethernet PHY

The processor is expected to be used in drones, IP cameras, car dashcams, sports/action cameras, as well as other applications such as panoramic cameras, computer vision applications, or real-time WiFi video transmitter. The company also released some comparison pictures and videos against a competitor camera showing the wider dynamic range of its solution. The competitor is likely based on either Allwinner V-series SoC or Novatek SoC.


The text is much clearer with RV1108 in the top picture, the building has much better details with RV1108 in the second picture. I’m not sure how much of that is related to the sensor used however, as the company did not mention anything about that, and instead explained that their ” advanced graph and image processing technology, WDR(Wide Dynamic Range)technology directly influences the video shooting effect”.

RV1108 runs Linux OS with an optional MiniGUI, drivers for WiFi, Bluetooth, or GPS, and support for Android/iOS apps.

Rockchip was also at CES 2017 demonstrating RV1108 capabilities with a 360° camera demo pictures in the top picture, a streaming media rearview mirror, and a 2K VR camera. There’s no product page for RV1108 on Rockchip website yet.

$55 OpenMV Cam M7 Open Source Computer Vision Board is Powered by an STM32F7 Cortex-M7 MCU

January 2nd, 2017 6 comments

I wrote about Jevois-A33 computer vision camera based on Allwinner A33 quad core Cortex A7 processor last week, and today, I’ve come across OpenMV Cam M7 open source computer vision board based on a much less powerful STMicro STM32F7 ARM Cortex M7 micro-controller, but with the advantage of consuming less power, and exposing some extra I/Os.

openmv-cam-m7OpenMV Cam M7 board specifications & features:

  • MCU – STMicro STM32F765VI ARM Cortex M7 @ up to 216 MHz with 512KB RAM, 2 MB flash.
  • External Storage – micro SD slot
  • Camera
    • Omnivision OV7725 image sensor supporting 640×480 8-bit grayscale images or 320×240 16-bit RGB565 images at 30 FPS
    • 2.8mm lens on a standard M12 lens mount
  • USB – 1x micro USB port (Virtual COM Port and a USB Flash Drive)
  • Expansion – 2x 8-pin headers with SPI, I2C CAN bus, asynchronous serial bus (Tx/Rx), 12-bit ADC, 12-bit DAC, 3x I/Os for servo control; interrupts and PWM on all I/O pins; 3.3V (5V tolerant)
  • Misc – RGB LED and 2x 850nm IR LEDs
  • Power Supply – 5V via micro USB port, 3.6 to 5V via VIN pin
  • Power Consumption (@ 3.3V) – Idle: 110mA;  active no μSD Card: 190mA; active with μSD Card: 200mA
  • Dimensions – 45 x 36 x 30 (H) mm
  • Weight – 16 grams
Click to Enlarge

Click to Enlarge

The camera board supports frame differencing (motion detection), marker tracking, face detection, eye tracking, color tracking (up to 32 colors at the same time), optical flow, edge/line detection, template matching, image capture (BMP/JPG/PPM/PGM), and video recording (MJPEG/GIF). Programming is done in OpenMV IDE using MicroPython language. You’ll find more details in OpenMV Cam’s documentation, and watch a description of the board and a QR code detection demo in the video below.

The computer vision board can be pre-ordered now for $55 on the product page with shipping scheduled for March 2017.