Archive

Posts Tagged ‘nvidia’

NVIDIA DRIVE PX Pegasus Platform is Designed for Fully Autonomous Vehicles

October 11th, 2017 1 comment

Many companies are now involved in the quest to develop self-driving cars, and they are getting there step by step through the six levels of autonomous driving, defined below based on info from Wikipedia:

  • Level 0 – Automated system issues warnings but has no vehicle control.
  • Level 1 (“hands on”) – The driver and the automated system share control over the vehicle. Examples include Adaptive Cruise Control (ACC), Parking Assistance, and Lane Keeping Assistance (LKA) Type II.
  • Level 2 (“hands off”) – The automated system takes full control of the vehicle (accelerating, braking, and steering), but the driver is still expected to monitor the driving, and be prepared to intervene immediately at any time. You’ll actually have your hands on the steering wheel, just in case…
  • Level 3 (“eyes off”) – The driver can safely turn their attention away from the driving tasks, e.g. to text or watch a movie. The system may ask the driver to take over in some situations specified by the manufacturer, such as traffic jams. So no sleeping while driving 🙂 . The Audi A8 luxury sedan was the first commercial car to claim Level 3 self-driving capability.
  • Level 4 (“mind off”) – Similar to Level 3, but no driver attention is ever required. You could sleep while the car is driving, or even send the car somewhere without being in the driver’s seat. There’s a limitation at this level, as self-driving mode is restricted to certain areas or special circumstances; outside of these, the vehicle must be able to park itself safely if the driver does not retake control.
  • Level 5 (“steering wheel optional”) – Fully autonomous car with no human intervention required, and no other limitations.

So the goal is obviously to reach Level 5, which would allow robotaxis, or safely drive you home whatever your blood alcohol or THC levels. This however requires lots of redundant (for safety) computing power, and current autonomous vehicle prototypes have a trunk full of computing equipment.

NVIDIA has condensed the AI processing power required for Level 5 autonomous driving into the DRIVE PX Pegasus AI computer, which is roughly the size of a license plate, and capable of handling inputs from high-resolution 360-degree surround cameras and lidars, localizing the vehicle with centimeter accuracy, tracking vehicles and people around the car, and planning a safe and comfortable path to the destination.

The computer comes with four AI processors said to deliver 320 TOPS (trillion operations per second) of computing power, ten times faster than the NVIDIA DRIVE PX 2, or about the performance of a 100-server data center according to Jensen Huang, NVIDIA founder and CEO. Specifically, the board combines two NVIDIA Xavier SoCs and two “next generation” GPUs with hardware-accelerated deep learning and computer vision algorithms. Pegasus is designed for ASIL D certification with automotive inputs/outputs, including CAN bus, FlexRay, 16 dedicated high-speed sensor inputs for camera, radar, lidar and ultrasonics, plus multiple 10Gbit Ethernet ports.

Machine learning works in two steps: training on the most powerful hardware you can find, and inference done on cheaper hardware. For autonomous driving, data scientists train their deep neural networks on NVIDIA DGX-1 AI supercomputers, for example simulating 300,000 miles of driving in five hours by harnessing 8 NVIDIA DGX systems. Once training is completed, the models can be updated over the air to NVIDIA DRIVE PX platforms, where inference takes place. The process can be repeated regularly so that the system is always up to date.

NVIDIA DRIVE PX Pegasus will be available to NVIDIA automotive partners in H2 2018, together with NVIDIA DRIVE IX (intelligent experience) SDK, meaning level 5 autonomous driving cars, taxis and trucks based on the solution could become available in a few years.

NVIDIA Unveils Open Source Hardware NVDLA Deep Learning Accelerator

October 4th, 2017 2 comments

NVIDIA is not exactly known for their commitment to open source projects, but to be fair, things have improved since Linus Torvalds gave them the finger a few years ago, and while they don’t seem to help much with the Nouveau drivers, I’ve usually read positive feedback about Linux support on their Jetson boards.

So this morning I was quite surprised to read the company had launched NVDLA (NVIDIA Deep Learning Accelerator), a “free and open architecture that promotes a standard way to design deep learning inference accelerators”.

Comparison of two possible NVDLA systems – Click to Enlarge

The project is based on the Xavier hardware architecture designed for automotive products, is scalable from small to large systems, and is said to be a complete solution with Verilog and C-model for the chip, Linux drivers, test suites, kernel- and user-mode software, and software development tools, all available on GitHub’s NVDLA account. The project is not released under a standard open source license like MIT, BSD or GPL, but instead under NVIDIA’s own Open NVDLA license.

This is an ongoing project, and NVIDIA has published a roadmap until H1 2018, at which point we should get FPGA support for accelerating software development, as well as support for TensorRT and other frameworks.

Via Phoronix

Short Demo with 96Boards SynQuacer 64-bit ARM Developer Box

September 27th, 2017 17 comments

Even if you are working on ARM platforms, you are still likely using an Intel or AMD x86 build machine, since there’s not really a good alternative in the ARM world. Linaro talked about plans to change that at Linaro Connect Budapest 2017 in March, and a few days ago, the GIGABYTE SynQuacer software development platform was unveiled with a Socionext SynQuacer SC2A11 24-core Cortex-A53 processor, and everything you’d expect from a PC tower: bays for SATA drives, PCIe slots, memory slots, multiple USB 3.0 ports, and so on.

Click to Enlarge

The platform was just demonstrated at Linaro Connect San Francisco, right after the Linaro High Performance Computing keynote by Kanta Vekaria, Technology Strategist, Linaro, and Yasuo Nishiguchi, Socionext’s Chairman & CEO.

If you have never used a system with more than 14 cores, you’d sadly learn that the Tux logos shown at boot time only fill the first line of the display, skipping the remaining 10 cores of the 24-core system. It was hard to stomach, but I’m recovering… 🙂

The demo showed a system with an NVIDIA graphics card connected to the PCIe x16 port and leveraging the open-source Nouveau drivers, but it’s also possible to use it as a headless “developer box”. The demo system booted quickly into Debian with Linux 4.13. They then played a YouTube video, and ran top on the developer box, showing all 24 cores and 32GB RAM. That’s it. They also took questions from the audience. We learned that the system can build the Linux kernel in less than 10 minutes, that they are working on SBSA compliance, and that the system will be available through the 96Boards website, with a complete build with memory and storage expected to cost less than $1,000. The idea is to use any off-the-shelf peripherals typically found in x86 PC towers. We still don’t know if they take MasterCard though… The video below is the full keynote, with the demo starting at the 52:30 mark.

NVIDIA Jetson TX1 Developer Kit SE Offered for $199 (Promo)

August 23rd, 2017 11 comments

Launched in 2015, the NVIDIA Jetson TX1 developer kit packs some serious processing power with a Jetson TX1 module featuring a 256-core Maxwell GPU, four Cortex-A57 cores, 4GB RAM, and 16GB eMMC, plus plenty of ports and I/Os via a mini-ITX carrier board. The only problem is that it’s quite expensive, as it was launched with an official $599 price tag, and it’s still $579 on Amazon US. The good news is that NVIDIA has decided to launch a promotion for the Jetson TX1 Developer Kit SE, based on the same $500+ development kit minus the USB cable and camera module, and offered for just $199.

Click to Enlarge

Let’s refresh our memory with the board’s specifications:

  • Jetson TX1 module
    • NVIDIA Maxwell GPU with 256 NVIDIA CUDA Cores
    • Quad-core ARM Cortex-A57 MPCore Processor
    • 4 GB LPDDR4 Memory
    • 16 GB eMMC 5.1 Flash Storage
    • Connects to 802.11ac Wi-Fi and Bluetooth enabled devices
    • 10/100/1000BASE-T Ethernet
  • NVIDIA Jetson TX1 Carrier Board
    • USB – 1x USB 3.0 Type A, 1x USB 2.0 Micro AB (supports recovery and host mode)
    • HDMI
    • M.2 Key E
    • PCIe x4
    • Gigabit Ethernet
    • Full size SD card slot
    • SATA data and power
    • GPIOs, I2C, I2S, SPI
    • TTL UART with flow control
  • Power Supply – External 19V AC adapter and power cord

The kit includes the NVIDIA Jetson TX1 carrier board, an AC adapter and power cord, antennas to connect to Wi-Fi enabled devices, 4x rubber feet, a Quick Start Guide, and a Safety Booklet. Various other optional accessories can also be added to your purchase, such as an HDMI cable, USB camera, USB cable, and so on.

In order to qualify for the discount, you need to be part of the NVIDIA Developer Program (free registration), and while the promotion is only available in the US and Canada, the company intends to offer the kit in “other geographies starting this September”.

Getting Started with OpenCV for Tegra on NVIDIA Tegra K1, CPU vs GPU Computer Vision Comparison

May 24th, 2017 No comments

This is a guest post by Leonardo Graboski Veiga, Field Application Engineer, Toradex Brasil

Introduction

Computer vision (CV) is everywhere – from cars to surveillance and production lines, the need for efficient, low-power yet powerful embedded systems makes it one of the bleeding-edge areas of technology development today.

Since computer vision is a very computationally intensive task, running CV algorithms on an embedded system’s CPU might not be enough for some applications. Developers and scientists have noticed that the use of dedicated hardware, such as co-processors and GPUs – the latter traditionally employed for graphics rendering – can greatly improve the performance of CV algorithms.

In the embedded scenario, things usually are not as simple as they look. Embedded GPUs tend to be different from desktop GPUs, thus requiring many workarounds to get extra performance from them. A good example of a drawback of embedded GPUs is that they are hardly supported by OpenCV – the de facto standard library for computer vision – thus requiring a big effort from the developer to achieve some performance gains.

The silicon manufacturers are paying attention to the growing need for graphics- and CV-oriented embedded systems, and powerful processors are being released. This is the case with the NVIDIA Tegra K1, which has a built-in GPU based on the NVIDIA Kepler architecture, with 192 cores and a processing power of 325 GFLOPS. In addition, this is one of the very few embedded GPUs on the market that supports CUDA, a parallel computing platform from NVIDIA. The good news is that OpenCV also supports CUDA.

And this is why Toradex decided to develop a System on Module (aka Computer on Module) – the Apalis TK1 – using this processor. On it, the K1 SoC’s quad-core ARM Cortex-A15 CPU runs at up to 2.2GHz, interfaced to 2GB of DDR3L RAM and 16GB of 8-bit eMMC. The full specification of the CoM can be found here.

The purpose of this article is to install NVIDIA JetPack on the Apalis TK1 System on Module – thus also installing OpenCV for Tegra – and to assess how much effort is required to code a simple CV application accelerated by CUDA. The public OpenCV is also tested using the same examples, to determine whether it is a viable alternative to the closed-source version from NVIDIA.

Hardware

The hardware employed in this article consists of the Apalis TK1 System on Module and the Apalis Evaluation Board. The main features of the Apalis TK1 have been presented in the introduction, and regarding the Apalis Evaluation Board, we will use the DVI output to connect to a display and the USB ports to interface a USB camera and a keyboard. The Apalis TK1 is presented in figure 1 and the Apalis Evaluation Board in figure 2:

Figure 1 – Apalis TK1 – Click to Enlarge

Figure 2 – Apalis Evaluation Board – Click to Enlarge

System Setup

NVIDIA already provides an SDK package – the NVIDIA JetPack – that comes with all tools that are supported for the TK1 architecture. It is an easy way to start developing applications with OpenCV for Tegra support. JetPack also provides many source code samples for CUDA, VisionWorks, and GameWorks. It also installs the NVIDIA Nsight, an IDE that is based on Eclipse and can be useful for debugging CPU and GPU applications.

OpenCV for Tegra is based on version 2.4.13 of the public OpenCV source code. It is closed-source but free to use and benefits from NEON and multicore optimizations that are not present in the open-source version; on the other hand, the non-free libraries are not included. If you want or need the open-source version, you can find more information on how to build OpenCV with CUDA support here – these instructions were followed and the public OpenCV 2.4.13 was also tested during this article’s development.

Toradex provides an article in the developer website with concise information describing how to install JetPack on the Apalis TK1.

Regarding hardware, it is recommended that you have a USB webcam connected to the Apalis Evaluation Board, because the samples tested in this article often need a video source as input.

OpenCV for Tegra

After you have finished installing the NVIDIA JetPack, OpenCV for Tegra will already be installed on the system, as well as the toolchain required for compilation on the target. You must have access to the serial terminal by means of a USB-to-RS-232 adapter or an SSH connection.

If you want to run Python code, an additional step on the target is required:
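The original command is not preserved in this archived copy; on the Debian-based image used with JetPack, the additional step typically amounts to installing NumPy, which the OpenCV Python bindings depend on. A sketch — the exact package name is an assumption and may differ on your image:

```shell
# Assumed step: install NumPy for the OpenCV Python bindings
sudo apt-get update
sudo apt-get install python-numpy
```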

The easiest way to check that everything works as expected is to compile and run some samples from the public OpenCV repository, since it already has the CMake configuration files as well as some source code for applications that make use of CUDA:
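The build commands themselves are missing from this copy, but fetching and building the 2.4.13 samples roughly follows the usual OpenCV CMake flow — a sketch, with flags and paths that may need adjusting for your setup:

```shell
# Get the public OpenCV 2.4.13 sources, which include the sample applications
git clone https://github.com/opencv/opencv.git
cd opencv
git checkout 2.4.13
# Configure with CUDA support and samples enabled, then build
mkdir build && cd build
cmake -DWITH_CUDA=ON -DBUILD_EXAMPLES=ON ..
make -j4
```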

We can begin testing a Python sample, for instance, the edge detector. The running application is displayed in figure 3.
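The command itself is not preserved here; in the OpenCV 2.4 source tree the Python samples live in samples/python2, so running the edge detector looks roughly like this (a sketch, assuming the repository layout from the build step):

```shell
cd opencv/samples/python2
# Uses the first video capture device by default;
# an image or video file path can be passed as an argument instead
python edge.py
```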

Figure 3 – running Python edge detector sample – Click to Enlarge

After the samples are compiled, you can try some of them. A nice choice is the “background/foreground segmentation” sample, since it is available with and without GPU support. You can run both versions from the commands below, as well as see the results in figures 4 and 5.
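The commands are missing from this archived copy; with BUILD_EXAMPLES enabled, OpenCV 2.4 places the sample binaries in build/bin, named along these lines (the binary names are assumptions and may vary with the build configuration):

```shell
cd opencv/build/bin
./cpp-example-bgfg_segm   # CPU implementation
./gpu-example-bgfg_segm   # CUDA implementation
```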

Figure 4 – running bgfg_segm CPU sample – Click to Enlarge

Figure 5 – running bgfg_segm GPU sample – Click to Enlarge

By running both samples it is possible to subjectively notice the performance difference: the CPU version has more delay.

Playing Around

After having things set up, the question comes: how easy is it to port an application from CPU to GPU, or even to start developing with GPU support? It was decided to play around a little with the Sobel application that is well described in the Sobel Derivatives tutorial.

The purpose is to check if it’s possible to benefit from CUDA out of the box, therefore only the getTickCount function from OpenCV is employed to measure the execution time of the main loop of the Sobel implementations. You can use NVIDIA Nsight for advanced remote debugging and profiling.
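The timing approach boils down to wrapping the main loop with getTickCount and dividing by getTickFrequency — a minimal sketch using the OpenCV 2.4 C++ API, not the article’s exact code:

```cpp
#include <iostream>
#include <opencv2/core/core.hpp>

int main() {
    int64 start = cv::getTickCount();
    // ... main loop of the Sobel implementation goes here ...
    // getTickFrequency() returns ticks per second, so scale to milliseconds
    double elapsed_ms = 1000.0 * (cv::getTickCount() - start) / cv::getTickFrequency();
    std::cout << "Main loop: " << elapsed_ms << " ms" << std::endl;
    return 0;
}
```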

The Code

The first version of the code runs completely on the CPU. In the first attempt to port it to the GPU (the second version, which will be called CPU-GPU), the goal is to find functions analogous to the CPU ones, but with GPU optimization. In the last attempt to port (the GPU version), some improvements are made, such as creating filter engines, which reduces buffer allocation, and finding a way to replace the CPU function convertScaleAbs with GPU-accelerated functions.
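To give an idea of what such a port looks like, here is a hedged sketch of the CPU and CPU-GPU variants using the OpenCV 2.4 gpu module — the function names below are the public 2.4 API, not the article’s exact code:

```cpp
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/gpu/gpu.hpp>

// CPU variant: blur, horizontal Sobel derivative, then convert back to 8 bits
void sobel_cpu(const cv::Mat& gray, cv::Mat& out) {
    cv::Mat blurred, grad16;
    cv::GaussianBlur(gray, blurred, cv::Size(3, 3), 0);
    cv::Sobel(blurred, grad16, CV_16S, 1, 0);
    cv::convertScaleAbs(grad16, out);
}

// GPU variant: same pipeline on GpuMat; convertScaleAbs has no direct GPU
// equivalent in 2.4, so it is replaced by gpu::abs followed by convertTo
void sobel_gpu(const cv::gpu::GpuMat& gray, cv::gpu::GpuMat& out) {
    cv::gpu::GpuMat blurred, grad16;
    cv::gpu::GaussianBlur(gray, blurred, cv::Size(3, 3), 0);
    cv::gpu::Sobel(blurred, grad16, CV_16S, 1, 0);
    cv::gpu::abs(grad16, grad16);
    grad16.convertTo(out, CV_8U);
}
```

Note how the GPU version operates on GpuMat buffers, so the cost of uploading and downloading images between host and device memory sits outside these functions — which, as discussed later, matters a lot.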

A diagram describing the loop for the three examples is provided in figure 6.

Figure 6 – CPU / CPU-GPU / GPU main loop for Sobel implementations

The main loop for the three applications tested is presented below. You can find the full source code for them on Github:

  • CPU only code:
  • CPU-GPU code:
  • GPU code

The Tests

  • Each of the three examples is executed using a random picture in JPEG format as input.
  • The input picture dimensions tested, in pixels, are 3483×2642, 2122×1415, 845×450 and 460×290; therefore there are 12 runs in total.
  • The main loop is iterated 500 times for each run.
  • All of the steps described in figure 6 have their execution time measured; the numbers presented in the results are the average values over the 500 iterations of each run.

The Results

The results presented are the total time required to execute the main loop – with and without image capture and display time, available in tables 1 and 2 – and the time each task takes to be executed, which is described in figures 7, 8, 9 and 10. If you want to have a look at the raw data or reproduce the tests, everything is in the aforelinked GitHub repository.

Table 1 – Main loop execution time, in milliseconds

Table 2 – Main loop execution time, discarding read and display image times, in milliseconds

Figure 7 – execution time by task – larger image (3483×2642 pixels) – Click to Enlarge

Figure 8 – execution time by task – large image (2122×1415 pixels) – Click to Enlarge

Figure 9 – execution time by task – small image (845×450 pixels) – Click to Enlarge

Figure 10 – execution time by task – smaller image (460×290 pixels) – Click to Enlarge

The Analysis

Regarding OpenCV for Tegra in comparison to the public OpenCV, the results point out that OpenCV for Tegra has been optimized, mostly in some CPU functions. Even when discarding the image read – which takes a long time to execute, and shows approximately a 2x gain – and the display frame execution times, OpenCV for Tegra still bests the open-source version.

When considering only OpenCV for Tegra, the tables show that using GPU functions without care might even make performance worse than using only the CPU. It is also possible to notice that, for these specific implementations, the GPU is better for large images, while the CPU is best for small images. Where there is a tie, it would be nice to have a power consumption comparison (which hasn’t been done), and to keep in mind that this GPU code is not as optimized as it could be.

Looking at figures 7 to 10, it can be seen that the Gaussian blur and the 16-bit to 8-bit scale conversion got a big boost when running on the GPU, while the conversion of the original image to grayscale and the Sobel derivatives had their performance degraded. Another point of interest is the fact that transferring data from/to the GPU has a high cost, and this is, in part, one of the reasons why the first GPU port was unsuccessful – it had more copies than needed.

Regarding image size, it can be noticed that the image read and display have an impact in overall performance that might be relevant depending on the complexity of the algorithm being implemented, or how the image capture is being done.

There are probably many ways to make this code more optimized, be it by only using OpenCV; by combining custom CUDA functions with OpenCV; by writing the application fully in CUDA; or by using another framework or tool such as VisionWorks.

Two points that might be of interest regarding optimization still in OpenCV are the use of streams – asynchronous execution of code on the CPU/GPU – and zero-copy or shared memory, since the Tegra K1 has CPU and GPU shared memory supported by CUDA (see this NVIDIA presentation from GPU Technology Conference and this NVIDIA blog post for reference).
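As a hedged illustration of the stream idea, the OpenCV 2.4 gpu module lets you queue uploads, kernels, and downloads asynchronously on a CUDA stream — a sketch using the public 2.4 API names, not tested on the Apalis TK1:

```cpp
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/gpu/gpu.hpp>

void async_grayscale(const cv::Mat& h_frame, cv::Mat& h_gray) {
    cv::gpu::Stream stream;                  // one CUDA stream
    cv::gpu::GpuMat d_frame, d_gray;
    stream.enqueueUpload(h_frame, d_frame);  // asynchronous host-to-device copy
    cv::gpu::cvtColor(d_frame, d_gray, CV_BGR2GRAY, 0, stream); // queued on the stream
    stream.enqueueDownload(d_gray, h_gray);  // asynchronous device-to-host copy
    stream.waitForCompletion();              // block until the queued work is done
}
```

With several streams, copies for one frame can overlap with kernel execution for another, hiding part of the transfer cost identified in the analysis above.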

Conclusion

In this article, the installation of the NVIDIA JetPack SDK and deployment on the Toradex Apalis TK1 have been presented. Having this tool installed, you are able to use OpenCV for Tegra, thus benefiting from all of the optimizations provided by NVIDIA. The JetPack SDK also provides many other useful contents, such as CUDA, VisionWorks and GameWorks samples, and the NVIDIA Nsight IDE.

In order to assess how easy it is for a developer freshly introduced to CV and GPU concepts to take advantage of CUDA, purely using OpenCV optimized functions, a CPU-to-GPU port of a Sobel filter application was written and tested. From this experience, some interesting results were found, such as the fact that the GPU indeed improves performance, and that the magnitude of this improvement depends on a series of factors: the size of the input image, the quality of the implementation (or developer experience), the algorithms being used, and the complexity of the application.

Having a myriad of sample source code available, it is easy to start developing your own applications, although care is required to make the Apalis TK1 System on Module yield its best performance. You can find more development information in the NVIDIA documentation, as well as the OpenCV documentation. Toradex also provides documentation about Linux usage on its developer website, and has a community forum. Hope this information was helpful, see you next time!

NVIDIA Shield Android TV Gets Unofficial USB Tuner (ATSC/DVB) Support

March 9th, 2017 3 comments

The NVIDIA Shield Android TV may only be available in a limited number of countries, but if you happen to live in a country where it’s officially sold, it can be one of the best options due to its hard-to-beat price-to-performance ratio, and official Android TV software support from Google & NVIDIA. One feature it does not support out of the box is digital TV tuners, but linux4all has released an unofficial firmware image adding USB TV tuner support to Android TV (7.0) on the NVIDIA Shield Android TV 2015 and 2017 models.

You’ll first need a supported tuner: either the Hauppauge WinTV-dualHD (DVB-C, DVB-T and DVB-T2), Hauppauge WinTV-HVR-850 (ATSC), Hauppauge WinTV-HVR-955Q (ATSC, QAM, Analog), or Sony PlayTV dual tuner (DVB-T). More tuners may be supported in the future. Once you’ve got your tuner connected to the NVIDIA Shield Android TV, make sure you have the latest Android TV 7.0 OTA update, unlock the bootloader, and flash the specific bootloader as explained in the aforelinked forum post. Upon reboot you should see “USB TV Tuner Setup” in the interface. Go through it and scan for channels.

Finally, connect a USB 3.0 hard drive or micro SD card with at least 50GB, select “format as device storage”, and you should be able to watch free-to-air TV and record it as needed using Live Channels.

If you are interested in adding more tuners, fixing bugs, or possibly implementing this on another Android TV box, you’ll find the Linux source code with change history on GitHub.

Note that it’s not the first hack to use USB tuners on the Shield, as last year somebody used Kodi + TVheadend, so the real news here is probably the integration into Android TV’s Live Channels.

Via AndroidTv.News, and thanks to Harley for the tip.

NVIDIA Introduces Jetson TX2 Embedded Artificial Intelligence Computer

March 8th, 2017 9 comments

NVIDIA has just announced an upgrade to their Jetson TX1 module: the Jetson TX2 “Embedded AI Computer” with a Tegra X2 Parker SoC, which either doubles the performance of its predecessor, or runs at more than twice the power efficiency, while drawing less than 7.5 watts of power.

The company provided a comparison showing the differences between TX1 and TX2 modules.

  • GPU – TX2: NVIDIA Pascal, 256 CUDA cores; TX1: NVIDIA Maxwell, 256 CUDA cores
  • CPU – TX2: HMP dual Denver 2 (2MB L2) + quad ARM A57 (2MB L2); TX1: quad ARM A57 (2MB L2)
  • Video – TX2: 4K x 2K 60 Hz encode (HEVC), 4K x 2K 60 Hz decode (12-bit support); TX1: 4K x 2K 30 Hz encode (HEVC), 4K x 2K 60 Hz decode (10-bit support)
  • Memory – TX2: 8 GB 128-bit LPDDR4 @ 58.3 GB/s; TX1: 4 GB 64-bit LPDDR4 @ 25.6 GB/s
  • Display – TX2: 2x DSI, 2x DP 1.2 / HDMI 2.0 / eDP 1.4; TX1: 2x DSI, 1x eDP 1.4 / DP 1.2 / HDMI
  • CSI – TX2: up to 6 cameras (2-lane), CSI2 D-PHY 1.2 (2.5 Gbps/lane); TX1: up to 6 cameras (2-lane), CSI2 D-PHY 1.1 (1.5 Gbps/lane)
  • PCIe – TX2: Gen 2, 1×4 + 1×1 or 2×1 + 1×2; TX1: Gen 2, 1×4 + 1×1
  • Data Storage – TX2: 32 GB eMMC, SDIO, SATA; TX1: 16 GB eMMC, SDIO, SATA
  • Other – TX2: CAN, UART, SPI, I2C, I2S, GPIOs; TX1: UART, SPI, I2C, I2S, GPIOs
  • USB – both: USB 3.0 + USB 2.0
  • Connectivity – both: 1x Gigabit Ethernet, 802.11ac WLAN, Bluetooth
  • Mechanical – both: 50 mm x 87 mm (400-pin compatible board-to-board connector)

The module still supports Linux for Tegra, as well as JetPack 3.0 SDK for AI computing with the following:

  • TensorRT 1.0, a neural network inference engine for production deployment of deep learning applications
  • cuDNN 5.1, a GPU-accelerated library of primitives for deep neural networks
  • VisionWorks™ 1.6, a software development package for computer vision and image processing
  • The latest graphics drivers and APIs, including OpenGL 4.5, OpenGL ES 3.2, EGL 1.4 and Vulkan 1.0
  • CUDA 8, which turns the GPU into a general-purpose massively parallel processor, giving developers access to tremendous performance and power-efficiency

Just like with Jetson TX1 module, NVIDIA also provides Jetson TX2 Developer Kit, with a carrier board, Jetson TX2 module, and various accessories which can be preordered for $599 in the United States and Europe, and start shipping on March 14. The devkit will be launched in other regions in a few weeks. With the launch of the new TX2 devkit, NVIDIA also reduced the price of Jetson TX1 developer kit to $499.

You’ll find more details, and the pre-order link on NVIDIA’s Embedded Modules & Devkits page.

Second Generation NVIDIA Shield Android TV Box Photos Leaked Ahead of Launch

December 20th, 2016 15 comments

The NVIDIA Shield Android TV box may have launched in the first part of 2015, but even at the end of 2016 it’s still one of the best Android TV boxes, with a powerful Tegra X1 processor, 3GB RAM, 4K video support, HD audio pass-through, the fastest GPU found in TV boxes so far, Netflix HD & 4K certification, and more. The company is allegedly preparing to launch a new model, and some photos have been leaked to Android Police.

The design of the new box looks basically identical to the current model, but it comes in two different sizes, maybe because of extra ports and internal storage, and the game controller has been redesigned with a mix of triangular shapes.

What we don’t know are the specifications. The company may have done a simple refresh, keeping the Tegra X1 processor, increasing the memory and storage capacity, and possibly adding some extra interfaces, or they may have gone with one of their newer, more powerful SoCs initially targeting the automotive market: the Parker SoC with two Denver custom ARMv8 cores, four Cortex-A57 cores, and a 256-core Pascal GPU, or the Xavier SoC with 8x custom ARMv8 cores and a 512-core GPU. The latter has only been unveiled recently, and supports 8K video with HDR, so it’s probably way too early for that… Another possibility is that the company designed a new, unannounced SoC specifically for TV boxes.

We’ll hopefully find out more at CES 2017 in about two weeks’ time. This could also mean some good deals on the first generation hardware once the new box is officially unveiled.

Via Liliputing