Posts Tagged ‘arm’

Fedora 26 Supports Single “Unified” OS Images for Multiple ARM Platforms

August 14th, 2017 29 comments

The decision to use device tree in Linux was made several years ago, after Linus Torvalds complained that Linux on ARM was a mess, with the ultimate goal of providing a unified ARM kernel for all hardware. Most machine-specific board files in arch/arm/mach-xxx/ are now gone from the Linux kernel, replaced by device tree files, and in many cases you simply need to replace the DTB (Device Tree Blob) file in an operating system image to run it on a different hardware platform. However, it is not always that easy, as U-Boot still often differs between boards / devices, so it’s quite common to distribute a different firmware / OS image per board. Fedora has taken another approach, as the developers are instead distributing a single Fedora 26 ARMv7 OS image, together with an installation script.
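To make the DTB part a bit more concrete, here is a minimal Python sketch that checks whether a file looks like a flattened device tree by reading the start of its header. It relies on the standard DTB header layout (big-endian 32-bit fields starting with the magic number 0xd00dfeed); the function name and the returned dictionary are my own choices for illustration and are not part of any Fedora tool.

```python
import struct

FDT_MAGIC = 0xD00DFEED  # magic number at the start of a flattened device tree


def read_dtb_header(blob: bytes) -> dict:
    """Return total size and version from the first header fields of a DTB.

    The header starts with big-endian 32-bit fields: magic, totalsize,
    off_dt_struct, off_dt_strings, off_mem_rsvmap, version.
    """
    if len(blob) < 24:
        raise ValueError("too short to contain a DTB header")
    magic, totalsize, _struct_off, _strings_off, _rsvmap_off, version = \
        struct.unpack(">6I", blob[:24])
    if magic != FDT_MAGIC:
        raise ValueError("not a DTB: bad magic 0x%08x" % magic)
    return {"totalsize": totalsize, "version": version}


if __name__ == "__main__":
    # Synthetic header for demonstration only (not a usable device tree).
    fake = struct.pack(">6I", FDT_MAGIC, 1234, 56, 100, 40, 17)
    print(read_dtb_header(fake))
```

On a real system you would pass the contents of a .dtb file instead of the synthetic header used here.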

Images for 64-bit ARM (AArch64) are a little different since they are designed for SBSA-compliant servers, so a single image will work on any server with a UEFI / ACPI implementation. So what follows is specific to ARMv7 hard-float images, as explained in the Wiki.

You’ll need to install the Fedora ARM Installer after downloading one of the Fedora 26 images. This requires a Fedora machine, and since I’m running Ubuntu 16.04 and didn’t want to set up a Fedora virtual machine in VirtualBox, I used Docker instead, right inside Ubuntu, as it’s much faster to do:

The last line requires some explanation. /media/hdd is the mount point of the storage device on the host where I downloaded the Fedora image, and it will be accessible through /mnt in Docker; /dev/sdd is my micro SD card device, while /dev/sdd3 will be the rootfs partition. Note that it took me a while to get that right, and I’m not sure it works for all targets (other /dev/sddX partitions may also be needed), so using an actual Fedora 26 installation would be easier. The rest of the instructions below are not specific to Docker.

I could then install the Fedora ARM Installer and the required xz & file packages…

…and check the usage:

Let’s see how many boards are supported in the /usr/share/doc/fedora-arm-installer/SUPPORTED-BOARDS file:

AllWinner Devices:
A10-OLinuXino-Lime A10s-OLinuXino-M A13-OLinuXino A13-OLinuXinoM
A20-OLinuXino-Lime A20-OLinuXino-Lime2 A20-OLinuXino_MICRO
A20-Olimex-SOM-EVB Ampe_A76 Auxtek-T003 Auxtek-T004 Bananapi Bananapro CHIP
CSQ_CS908 Chuwi_V7_CW0825 Colombus Cubieboard Cubieboard2 Cubietruck
Cubietruck_plus Hummingbird_A31 Hyundai_A7HD Itead_Ibox_A20 Lamobo_R1
Linksprite_pcDuino Linksprite_pcDuino3 Linksprite_pcDuino3_Nano MK808C
MSI_Primo73 MSI_Primo81 Marsboard_A10 Mele_A1000 Mele_A1000G_quad Mele_I7
Mele_M3 Mele_M5 Mele_M9 Mini-X Orangepi Orangepi_mini Sinlinx_SinA31s
UTOO_P66 Wexler_TAB7200 Wits_Pro_A20_DKT Yones_Toptech_BS1078_V2 ba10_tv_box
colorfly_e708_q1 difrnce_dit4350 dserve_dsrv9703c i12-tvbox iNet_86VS
icnova-a20-swac inet86dz jesurun_q5 mk802 mk802_a10s mk802ii orangepi_2
orangepi_lite orangepi_pc orangepi_plus polaroid_mid2809pxe04
pov_protab2_ips9 q8_a13_tablet q8_a23_tablet_800x480 q8_a33_tablet_1024x600
q8_a33_tablet_800x480 r7-tv-dongle sunxi_Gemei_G9

MX6 Devices:
cm_fx6 mx6cuboxi novena riotboard wandboard

OMAP Devices:
am335x_boneblack am57xx_evm kc1 omap3_beagle omap4_panda omap5_uevm

MVEBU Devices:
clearfog

ST Devices:
stih410-b2260

Other Devices:
jetson-tk1 rpi2 rpi3 trimslice

So we’ve got a list of devices to choose from. For example, if you wanted to install Fedora 26 Server on a micro SD card for Raspberry Pi 3, you’d run something like:

You’ll then be asked to confirm:

The full process will take several minutes, and at the end you’ll get a “_/” rootfs partition, a “_/boot” partition, and a “30 MB volume” with U-Boot, config, etc.


I did not try the micro SD card in a Raspberry Pi 3 board myself, because Geek Till It Hertz has already done it successfully on both RPi 3 and Banana Pi boards as shown in the video below.

He also showed the boards running Linux 4.11.8, which can be upgraded to Linux 4.11.11 with dnf update, just as on his Fedora 26 installation on an x86-64 computer.

Movidius Neural Compute Stick Shown to Boost Deep Learning Performance by about 3 Times on Raspberry Pi 3 Board

August 9th, 2017 14 comments

Intel recently launched the Movidius Neural Compute Stick (MvNCS) for low-power USB-based deep learning applications such as object recognition, and after some initial confusion, we could confirm the stick could also be used on ARM-based platforms such as the Raspberry Pi 3. Kochi Nakamura, who wrote the code for GPU accelerated object recognition on the Raspberry Pi 3 board, got hold of one sample in order to compare the performance between GPU and MvNCS acceleration.

His first attempt was quite confusing: with GoogLeNet, Raspberry Pi 3 + MvNCS achieved an average inference time of about 560 ms, against 320 ms when using the VideoCore IV GPU in the RPi 3 board. But then it was discovered that the “stream_infer.py” demo would only use one of the 12 VLIW 128-bit vector SHAVE processors in Intel’s Movidius Myriad 2 VPU, and after enabling all 12 cores instead of just one, performance improved to around 108 ms average time per inference. That’s almost 3 times faster compared to using the GPU in the RPi 3 for this specific demo, and results may vary for other demos / applications.
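The “almost 3 times” figure follows directly from the average inference times quoted above. A quick Python check, using only those three numbers (the variable and function names are mine):

```python
def speedup(baseline_ms: float, new_ms: float) -> float:
    """How many times faster new_ms is compared to baseline_ms."""
    return baseline_ms / new_ms


# Average inference times quoted in the post, in milliseconds.
MVNCS_ONE_SHAVE_MS = 560
GPU_VIDEOCORE_MS = 320
MVNCS_TWELVE_SHAVES_MS = 108

if __name__ == "__main__":
    # MvNCS with 12 SHAVEs vs the VideoCore IV GPU: about 2.96x
    print("vs GPU: %.2fx" % speedup(GPU_VIDEOCORE_MS, MVNCS_TWELVE_SHAVES_MS))
    # MvNCS with 12 SHAVEs vs MvNCS with a single SHAVE: about 5.19x
    print("vs 1 SHAVE: %.2fx" % speedup(MVNCS_ONE_SHAVE_MS, MVNCS_TWELVE_SHAVES_MS))
```

Note that going from 1 to 12 SHAVE processors gives roughly a 5x gain rather than 12x in this demo.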

Here’s the description from YouTube:

Comparison of deep learning inference acceleration by Movidius’ Neural Compute Stick (MvNCS) and by Idein’s software which uses Raspberry Pi’s GPU (VideoCore IV) without any extra computing resources.

Movidius’ demo runs GoogLeNet with 16-bit floating point precision. Average inference time is 108ms.
We used MvNC SDK 1.07.07 and their official demo script without any changes. (ncapi/py_examples/stream_infer/stream_infer.py)
It seems something is wrong with the inference results.
We recompiled graph file with -s12 option to use 12 SHAVE vector processor simultaneously.

Idein’s demo also runs GoogLeNet with 32-bit floating point precision. Average inference time is 320ms.

It’s interesting to note the GPU demo used 32-bit floating point precision, against 16-bit floating point precision on the Neural Compute Stick, although it’s unclear to me how that may affect the performance of such algorithms. Intel recommends a USB 3.0 interface for MvNCS, and the Raspberry Pi 3 only comes with a USB 2.0 interface whose bandwidth is shared between the USB webcam and the MvNCS, so it’s possible that an ARM board with a USB 3.0 interface for the stick and a separate USB interface for the webcam could perform better. Has anybody tested it? A USB 3.0 interface and hub would also allow cascading several Neural Compute Sticks.
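As a small illustration of what the two precisions mean in practice, the Python sketch below compares storage size and rounding error for a single value encoded as 16-bit and 32-bit floats using the standard struct module. It says nothing about the actual inference speed of either demo; it only shows that fp16 halves the number of bytes per value at the cost of precision. The helper name and the test value 0.1 are arbitrary choices of mine.

```python
import struct


def roundtrip(value: float, fmt: str) -> float:
    """Encode value with the given struct format and decode it again."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


if __name__ == "__main__":
    x = 0.1
    for name, fmt in (("fp16", "e"), ("fp32", "f")):
        size = struct.calcsize(fmt)          # bytes needed per value
        error = abs(roundtrip(x, fmt) - x)   # rounding error for this value
        print("%s: %d bytes per value, error for 0.1 = %.3g" % (name, size, error))
```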

How ARM Nerfed NEON Permute Instructions in ARMv8

August 7th, 2017 29 comments

This is a guest post by blu about an issue he found with a specific instruction in ARMv8 NEON. He previously wrote an article about OpenGL ES development on Ubuntu Touch, and one or two other posts.

This is not a happy-ending story. But as with most unhappy-ending stories, this is a story with a certain moral for the reader. So read on if you appreciate a good moral.

Once upon a time there was a very well-devised SIMD instruction set. Its name was NEON, or formally — ARM Advanced SIMD — ASIMD for short (most people still called it NEON). It was so nice, that veteran coders versed in multiple SIMD ISAs often wished other SIMD ISAs were more like NEON.

NEON had originated as part of the larger ARM ISA version 7, or ARMv7, for short. After much success in the mobile and embedded domains, ARMv7 was superseded by what experts acknowledged as the next step in the evolution of modern ISAs – ARMv8. It was so good that it was praised by compiler writers as possibly the best ISA they could wish for. As part of all the enhancements in the new ISA, NEON too got its fair share of improvements – and so ASIMD2 superseded NEON (ARMv8’s SIMD ISA is called ASIMD2, but some call it NEON2).

Now, one of the many things the original NEON got right was the permute capabilities. Contrary to other ISAs whose architects kept releasing head-banging permute ops one after another, the architects of NEON got permutes right from the start. They did so by providing a compact-yet-powerful set of permutation ops, the most versatile of which, by far, being the tbl op and its sister op tbx; each of those provided a means to compose a SIMD vector from all the thinkable combinations of the individual byte-lanes of up to 4 source SIMD vectors. Neat. The closest thing on AMD64 is pshufb from SSSE3, but it takes a single vector as input (and the AVX2 256-bit vpshufb is borked even further).

Not only did NEON have those ops on an architectural level, but the actual implementations – the different μarchitectures that embodied NEON – did so quite efficiently. Second and third generation performant ARMv7 Cortex CPUs could issue up to two tbl ops per clock and return the results as soon as 3 clocks later.

So, with this fairy tale to jump-start our story, let’s teleport ourselves to present-day reality.

I was writing down an ingenious algorithm last week, one that was meant to filter elements from an input stream. Naturally, the algorithm relied heavily on integer SIMD vectors for maximum efficiency, and it so happened that I was writing the initial version on ARM64, with plans for later translation to AMD64. Now, as part of that algorithm, a vector-wise horizontal sort had to be carried out – something which is best left to a sorting network (see Ken Batcher’s sorting network algorithms). Sorting networks are characterized by doing a fixed number of steps to sort their input vector, and at each of those steps a good amount of permutations occur. As I was sorting a 16-lane vector (a rather wide one), its sorting network was a 10-deep one, and while some of the stages required trivial permutations, others called for the most versatile permutes of them all – the mighty tbl op. So I decided that for an initial implementation I’d use tbl throughout the sorting network.
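For readers unfamiliar with sorting networks, here is a minimal scalar Python sketch of one of Batcher’s constructions, the bitonic sorting network, which for 16 lanes has exactly 10 stages. This is only an illustration of the “fixed number of steps” property: it is not the NEON code from this post, and the post does not say which 10-deep network was actually used. All function names are hypothetical.

```python
import random


def bitonic_stages(n):
    """Build the stages of a bitonic sorting network for n lanes (n a power of 2).

    Each stage is a list of (lo, hi) lane pairs; after the compare-exchange,
    lane lo holds the smaller value and lane hi the larger one.
    """
    stages = []
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            stage = []
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    stage.append((i, partner) if ascending else (partner, i))
            stages.append(stage)
            j //= 2
        k *= 2
    return stages


def apply_network(stages, values):
    """Run the compare-exchange stages over a copy of values."""
    v = list(values)
    for stage in stages:
        # All pairs within one stage touch disjoint lanes, which is what
        # lets a SIMD implementation do a whole stage with a permute + min/max.
        for lo, hi in stage:
            if v[lo] > v[hi]:
                v[lo], v[hi] = v[hi], v[lo]
    return v


if __name__ == "__main__":
    stages = bitonic_stages(16)
    print("stages:", len(stages))
    data = [random.randrange(256) for _ in range(16)]
    print(apply_network(stages, data) == sorted(data))
```

In a SIMD version, each stage would map to a lane permutation followed by element-wise min and max, which is where the permute ops come in.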

As I was writing the algorithm away from home, I was using my trusty Ubuntu tablet (Cortex-A53, ARM64) as a workstation (yes, with a keyboard). I had a benchmark for a prima-vista version up and running off L1 cache, showing the algo performing in line with my work-per-clock expectations. It wasn’t until early on the following week that I was able to finally test it on my Cortex-A72 ARM64 workhorse desktop. And there things turned bizarre.

To my stupefaction, on the A72 the bench performed nothing like on the A53. It was effectively twice as slow, both in absolute times and in per-clock performance (tablet is 1.5GHz, desktop is 2.0GHz but I kept it at 1.3GHz when doing nothing taxing). I checked and double-checked that the compiler had not done anything stupid – it hadn’t – the disassembled code was exactly as expected, and yet, there was the ‘big’ A72, a 3-decode, 8-dispatch, potent-OoO design, getting owned by a ‘little’ tablet’s (or a toaster’s – A53s are so omnipresent these days) in-order, 2-decode design. Luckily for me, my ARM64 desktop is perf-clad (perf is the Linux profiler used by kernel developers), so seconds later I was staring at perf reports.

There was no room for guessing – there were some huge, nay, massive stalls clumped around the permute ops. The algo was spending the bulk of its time stalling on those permutes. Those beautiful, convenient tbl permutes – part of the reason I went to prototype the algo on ARM64 in the first place. The immediate take was that the A72 tbl op performed nothing like the A53 tbl op. Time to dust off the manual, buddy. What I saw in the A72 (and A57) optimization manual had me scratch my head more than I could’ve expected.

First off, in 32-bit mode (A32) tbl op performs as I’d expect it to, and as it appears to still do on the A53 in A64 mode (64-bit):

op                              | throughput, ops/clock | latency, clocks
tbl from 1 source,  64-bit-wide | 2                     | 3
tbl from 2 sources, 64-bit-wide | 2                     | 3
tbl from 3 sources, 64-bit-wide | 2                     | 6
tbl from 4 sources, 64-bit-wide | 2                     | 6

But in 64-bit mode (A64), that transforms into:

op                               | throughput, ops/clock | latency, clocks
tbl from 1 source,  64-bit-wide  | 2                     | 3 * 1 = 3
tbl from 2 sources, 64-bit-wide  | 2                     | 3 * 2 = 6
tbl from 3 sources, 64-bit-wide  | 2                     | 3 * 3 = 9
tbl from 4 sources, 64-bit-wide  | 2                     | 3 * 4 = 12
tbl from 1 source,  128-bit-wide | 2                     | 3 * 1 + 3 = 6
tbl from 2 sources, 128-bit-wide | 2                     | 3 * 2 + 3 = 9
tbl from 3 sources, 128-bit-wide | 2                     | 3 * 3 + 3 = 12
tbl from 4 sources, 128-bit-wide | 2                     | 3 * 4 + 3 = 15
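The two tables boil down to a simple rule, which can be written as a small Python helper. This is just a restatement of the numbers listed above for the A72; the function name and its parameters are made up for illustration, and it models nothing beyond those table entries.

```python
def a72_tbl_latency(sources: int, width_bits: int = 64, mode: str = "A64") -> int:
    """Latency in clocks of a tbl op on Cortex-A72, per the tables above."""
    if sources not in (1, 2, 3, 4):
        raise ValueError("tbl takes 1 to 4 source vectors")
    if mode == "A32":
        if width_bits != 64:
            raise ValueError("the A32 table only lists 64-bit-wide tbl")
        # 1 or 2 sources: 3 clocks; 3 or 4 sources: 6 clocks
        return 3 if sources <= 2 else 6
    if mode == "A64":
        if width_bits not in (64, 128):
            raise ValueError("width must be 64 or 128 bits")
        # 3 clocks per source vector, plus 3 more for the 128-bit-wide form
        return 3 * sources + (3 if width_bits == 128 else 0)
    raise ValueError("mode must be 'A32' or 'A64'")


if __name__ == "__main__":
    for width in (64, 128):
        for n in range(1, 5):
            print("A64 tbl, %d source(s), %d-bit-wide: %d clocks"
                  % (n, width, a72_tbl_latency(n, width)))
```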

That’s right – 64-bit-wide tbl is severely penalized in A64 mode on A72 vs A32 mode. In my case, I was using the 128-bit-wide versions of the op, with 2 source arguments. So on the A72 I ended up getting (snippet of relevant code timeline):

= 12 clocks of latency for the snippet

But on the A53 same snippet yielded:

= 6 clocks of latency for the snippet

As the performance of the entire algorithm was dominated by the network sort, and the entirety of the network sort was comprised of repetitions of the above snippet, all observations fell into place – the A53 was indeed twice as fast (per-clock) as the A72/A57 on this code, by design! So much for my elegant algorithm. Now I’d need to increase the data window enough to be able to amortize the massive pipeline bubbles with more non-dependent work. Anything less would penalize the ‘big’ ARMv8 designs.

But that’s not what gets me in this entire story – I have no issue rewriting prototype or any other code. What does put me in a contemplative mood is that code written for optimal work on the A53’s pipeline could choke its ‘big’ brothers, the A57 & A72, and code written for optimal utilization of the pipelines of those CPUs would not necessarily be the most efficient code on the A53. All it takes is some tbl permutes. That is only exacerbated by big.LITTLE setups. That raises the question of what ARM were thinking when they were designing A64 mode tbl on the ‘big’ cores.

Categories: Programming Tags: AArch64, arm, neon

Mediatek MT2625 NB-IoT SoC is Designed for Cellular IoT Devices working Worldwide

August 4th, 2017 1 comment

Mediatek has recently unveiled MT2625 SoC based on an ARM Cortex-M core, equipped with an NB-IoT “WorldMode” modem allowing for a single design worldwide, and supporting the latest 3GPP Release 14 (LTE Cat NB2) specification.

Mediatek MT2625 specifications:

  • CPU – ARM Cortex-M @ up to 104 MHz with FPU
  • Embedded Memory – 4MB PSRAM
  • Storage – 4MB NOR Flash
  • Connectivity
    • NB-IoT compatible with 3GPP Release 14
    • Full frequency band (450MHz to 2.1GHz) of 3GPP R13 (NB1) and R14 (NB2) standards
    • Integrated baseband, RF, and modem DSP
  • Peripherals – I2C,  I2S,  PCM,  SDIO,  UART
  • Power Supply – Integrated PMU

The solution will be found in products for worldwide transportation, municipal use, and consumer products, with a much longer battery life compared to existing devices relying on other 2G/3G/4G standards.

According to the press release, one of the first modules based on MT2625 has been designed in collaboration with China Mobile, integrates the company’s eSIM card, and supports the OneNET IoT open platform. You won’t find many details on Mediatek MT2625’s product page, but you could contact the company there if you plan to design and deploy such modules in large quantities.

Intel’s Movidius Neural Compute Stick Supports Raspberry Pi 3 Board

August 2nd, 2017 8 comments

Last month, Intel introduced the Movidius Neural Compute Stick to accelerate applications such as object recognition, and do so offline, i.e. without the cloud, and at low power. While there was not that much information available at the time, the minimum requirements for the host machine were that it had to be an x86_64 computer running Ubuntu 16.04, with at least 1GB RAM and 4GB storage.

So I understood the stick would only work attached to 64-bit Intel or AMD processors, and ARM development boards would not be an option. But today, I’ve found that Movidius had uploaded a new video showing a Python based object recognition demo with the Neural Compute Stick connected to the Raspberry Pi 3 board. You just need to add a USB camera, copy the ncapi directory from the SDK installed on your Ubuntu 16.04 development machine to the Debian Jessie installed on the RPi 3 board, install the relevant .deb packages from that directory as well as some required packages (e.g. Gstreamer), and run one of the demos such as stream_infer as explained in the video.

Since all computing is supposed to happen in the stick, I’d assume this should work on other ARM development boards with Debian and Gstreamer support. I understand you’ll need an Ubuntu PC to compile neural networks using the toolkit, but you can run inferencing on lower end ARM hardware.

Realtek RTL8195AM Ameba WiFi + NFC Module Sells for $9 Shipped

August 1st, 2017 1 comment

Last year, Realtek Ameba IoT SoCs and development kits launched with boards such as Ameba Arduino, and later, the family got some buzz thanks to $2 RTL8710AF modules like the Pine64 Padi IoT stamp, which looked competitively priced against the ESP8266 SoC and featured an ARM Cortex-M3 core. Back to 2017, ESP8266 appears to still be the preferred platform for makers, and the community around Realtek Ameba processors is relatively small, but maybe the solutions are being integrated into commercial products rather than hobbyist projects. Today, as I browsed the web, I noticed that there are also some Realtek RTL8195AM modules with WiFi and NFC, starting with a “Realtek Ameba-RTL8195AM WiFi & NFC Module” I first found on DFrobot for $15 per unit, but after spending a bit more time searching, I ended up finding what looks like the same model for $8.99 including shipping on IC Station.

RTL8195AM module (MJIOT-AMB-02) specifications:

  • SoC – Realtek RTL8195AM ARM Cortex-M3 processor @ 166 MHz with 1MB ROM, 2MB SDRAM, 512KB SRAM
  • Connectivity – 802.11 b/g/n 1×1 Wi-Fi up to 150 Mbps via u.FL antenna connector, NFC read/write
  • Interfaces via half-holes:
    • 10/100M Ethernet MII/ RMII/RGMII interface
    • 1x USB OTG
    • SDIO device/SD card controller
    • Up to 30x GPIO
    • 2x SPI master-slave, 3x UART (2x HS-UART, 1x log UART), 2x I2C, 4x PWM
    • 2x I2S/PCM
    • 2x ADC, 1x DAC
  • Security – Hardware SSL engine in Realtek SoC
  • Power Supply – 3.0V~3.3V
  • Dimensions – 24 x 19mm
  • Temperature range – -20℃~+85℃

The module can be programmed with the Arduino IDE and Micropython, and you can get access to the SDK via the Ameba IoT website. For evaluation, you may consider ordering the module with a breakout board instead, going for $9.74 shipped.

The module appears to be manufactured by Shenzhen Minjun IOT Technology, and you’ll find more technical details and information about the module on the product page. Other RTL8195AM modules include CC&C WM-8195AM, and Rayson WFM-250, none of which appear to support NFC.

MYiR Introduces Z-Turn Lite Board Powered by Xilinx Zynq-7007S/Zynq-7010 SoC for $69 and Up

July 20th, 2017 9 comments

Last year, Xilinx launched a cost-down version of their Zynq-7000 series with the Zynq-7000S series SoCs combining a single ARM Cortex A9 core with Artix FPGA fabric. We’ve already seen sub 100 Euros/Dollars boards based on the new SoCs with the ZynqBerry and MiniZed boards. MYiR Tech has now launched their own version, a cost-down version of their Z-Turn board, with the Z-Turn Lite board featuring either the new cost-down Zynq-7007S or the “good old” Zynq-7010 SoC.

Z-Turn Lite specifications:

  • SoC
    • Xilinx XC7Z007S-1CLG400C (Zynq-7007S) with a single ARM Cortex A9 core @ 667 MHz, Artix-7 FPGA fabric with 23K logic cells, 14,400 LUTs, 66 DSP slices OR
    • Xilinx XC7Z010-1CLG400C (Zynq-7010) with two ARM Cortex A9 cores @ 667 MHz, Artix-7 FPGA fabric with 28K logic cells, 17,600 LUTs, 80 DSP slices.
  • System Memory – 512 MB DDR3 SDRAM (2 x 256MB, 32-bit)
  • Storage – 4GB eMMC flash, 16MB QSPI flash, and a micro SD slot
  • Connectivity – 10/100/1000M Ethernet
  • USB – 1x mini USB 2.0 OTG port
  • Debugging – USB-UART debug interface, 14-pin JTAG interface
  • User I/O – 1x 0.5mm pitch 120-pin connector for expansion interface on the bottom of the board
  • Sensors – 3-axis acceleration sensor and temperature sensor
  • Misc – 2x buttons (reset and user), boot selection jumpers, 5x LEDs, 1x Buzzer
  • Power – 5V/2A  via power barrel
  • Dimensions – 91 x 63 mm (10-layer PCB design)

Compared to Z-Turn, Z-Turn Lite comes with less memory (512MB vs 1GB), adds a 4GB eMMC flash, removes HDMI, CAN bus, and motion / temperature sensors, and only comes with one expansion interface instead of two. Z-Turn Lite board runs Linux 3.15.0, and the company provides all drivers with source code, Sourcery GCC 6.1 toolchain, and a ramdisk image. Potential target applications include Zynq-7000S series evaluation, multi-axis motor control, machine vision, programmable logic controller (PLC), industrial automation, and test & measurement.

Z-Turn Lite board will start shipping on August 11th, but the company is already taking pre-orders for $69 for the Zynq-7007S version, and $75 with Zynq-7010, including a 4GB SD card and product disk with documentation and source code. Alternatively, you can also get a more complete kit with power supply and cables for $89 and up. You’ll find the purchase link and some hardware documentation like the PDF schematics on the product page.

Linux 4.12 Release – Main Changes, ARM & MIPS Architectures

July 3rd, 2017 6 comments

Linus Torvalds has just released Linux 4.12:

Things were quite calm this week, so I really didn’t have any real reason to delay the 4.12 release.

As mentioned over the various rc announcements, 4.12 is one of the bigger releases historically, and I think only 4.9 ends up having had more commits. And 4.9 was big at least partly because Greg announced it was an LTS kernel. But 4.12 is just plain big.

There’s also nothing particularly odd going on in the tree – it’s all just normal development, just more of it that usual. The shortlog below is obviously just the minor changes since rc7 – the whole 4.12 shortlog is much too large to post.

In the diff department, 4.12 is also very big, although the reason there isn’t just that there’s a lot of development, we have the added bulk of a lot of new  header files for the AMD Vega support. That’s almost exactly half the bulk of the patch, in fact, and partly as a result of that the driver side dominates  everything else at 85+% of the release patch (it’s not all the AMD Vega headers – the Intel IPU driver in staging is big too, for example).

But aside from just being large, and a blip in size around rc5, the rc’s stabilized pretty nicely, so I think we’re all good to go.

Go out and use it.

Oh, and obviously this means that the merge window for 4.13 is thus open. You know the drill.

Linus

Linux 4.11 provided various improvements for Intel Bay Trail and Cherry Trail targets, OPAL drive support, pluggable IO schedulers framework, and plenty of ARM and MIPS changes.

Some of the most notable changes in Linux 4.12 include:

  • Initial AMD Radeon RX Vega GPU support
  • BFQ (Budget Fair Queuing) and Kyber block I/O schedulers have been merged, meaning the kernel now has two multiqueue I/O schedulers suitable for various use cases that should improve the responsiveness of systems.
  • Added AnalyzeBoot tool to create a timeline of the kernel’s bootstrap process in HTML format.
  • Implemented “hybrid consistency model” for live kernel patching in order to enable the application of patchsets that change function or data semantics. See here for details.
  • Build of Open Sound System (OSS) audio drivers has been disabled, and will likely be removed in future Linux releases
  • AVR32 support has been removed
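If you want to check which I/O scheduler a block device is using once you run a kernel with BFQ and Kyber, the kernel exposes the list in /sys/block/<device>/queue/scheduler, with the active scheduler shown in square brackets. Below is a minimal Python sketch that parses such a line; the device name sda and the sample string are assumptions for illustration, and which schedulers actually show up depends on your kernel configuration.

```python
from pathlib import Path


def parse_schedulers(line: str):
    """Split a sysfs scheduler line into (available, active).

    The active scheduler is the one wrapped in square brackets.
    """
    available = []
    active = None
    for name in line.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return available, active


def read_schedulers(device: str = "sda"):
    """Read and parse the scheduler list for a block device (Linux only)."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_schedulers(path.read_text())


if __name__ == "__main__":
    # Sample line only; the real content depends on the kernel and device.
    print(parse_schedulers("mq-deadline kyber [bfq] none"))
```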

Some of the bug fixes and improvements for the ARM architecture include:

  • Allwinner:
    • Allwinner H3 –  USB OTG support
    • Allwinner H5 – pinctrl driver, CCU (sunxi-ng) driver, USB OTG support
    • Allwinner A31/H3 SPI driver – Support transfers larger than 64 bytes
    • AXP PMICs – AXP803 basic support, ACIN Power Supply driver, ADC IIO driver, Battery Power Supply driver
    • Added support for: FriendlyARM NanoPi NEO Air, Xunlong Orange Pi PC 2
  • Rockchip:
    • Updates to Rockchip clock drivers
    • Modification for Rockchip PCI driver
    • RK3328 pinctrl driver
    • Sound support for Radxa Rock2
    • USB 3.0 controllers for RK3399
    • Various changes for RK3368 (dma, i2s, disable mailbox per default, mmc-resets)
    • Added Samsung Chromebook Plus (Kevin) and the other RK3399 “Gru family” of ChromeOS devices.
    • Added Rockchip RK3288 support for ASUS Tinker board, Phytec phyCORE-RK3288 SoM and RDK; added Rockchip RK3328 evaluation board
  • Amlogic
    • New clock drivers for I2S and SPDIF audio, and Mali GPU
    • DRM/HDMI support for Amlogic GX SoC
    • Add GPIO reset to Ethernet driver
    • Enable PWM LEDs and LEDs default-on trigger
    • New boards: Khadas VIM, HwaCom AmazeTV
  • Samsung
    • Split building of the PMU driver between ARMv7 and ARMv8
    • Various Samsung pinctrl drivers updates
    • ARM DT updates:
      • Enhancements to PCIe nodes on Exynos5440.
      • Fix thermal values on some of Exynos5420 boards like Odroid XU3.
      • Add proper clock frequency properties to DSI nodes.
      • Fix watchdog reset on Exynos4412.
      • Fix watchdog infinite interrupt in soft mode on Exynos4210, Exynos5440, S3C64xx and S5Pv210.
      • Enable watchdog on Exynos4 and S3C SoCs.
      • Enable DYNAMIC_DEBUG because it is useful for debugging
      • Increase CMA memory region to allow handling H.264 1080p videos.
    • ARM64 DT updates:
      • Exynos power management drivers support now ARMv8 SoC – Exynos5433 – so select them in ARCH_EXYNOS
      • Enable few Exynos drivers (video, DRM and LPASS drivers) for supported ARMv8 SoCs (Exynos5433 and Exynos7)
      • Add IR, touchscreen and panel to TM2/TM2E boards
      • Add proper clock frequency properties to DSI nodes
  • Qualcomm
    • Enable options needed for QCom DB410c board in defconfig
    • Added new PHY driver for Qualcomm’s QMP PHY (used by PCIe, UFS and USB), and Qualcomm’s QUSB2 PHY
    • Qualcomm Device Tree Changes
      • Add Coresight components for MSM8974
      • Fixup MSM8974 ADSP XO clk and add RPMCC node
      • Fix typo in APQ8060
      • Add SDCs on MSM8660
      • Revert MSM8974 USB gadget change due to issues
      • Add SCM APIs for restore_sec_cfg and iommu secure page table
      • Enable QCOM remoteproc and related drivers
    • Qualcomm ARM64 Updates for v4.12
      • Fixup MSM8996 SMP2P and add ADSP PIL / SLPI SMP2P node
      • Replace PMU compatible w/ A53 specific one
      • Add APQ8016 ramoops
      • Update MSM8916 hexagon node
      • Add PM8994 RTC
  • Mediatek
    • New clock drivers for MT6797, and hi655x PMIC
    • Fix Mediatek SPI (flash) controller driver
    • Add DRM driver and thermal driver for Mediatek MT2701 SoC
    • Add support for MT8176 and MT817x to the Mediatek cpufreq driver
    • Add driver for hardware random generator on MT7623 SoC
    • Add DSA support to Mediatek MT7530 7-port GbE switch
    • Add v4l2 driver for Mediatek JPEG Decoder
  • Misc
    • Added ARM TEE framework to support trusted execution environments on processors with that capability (e.g. ARM CPUs with TrustZone)
    • ARM64 architecture now has kernel crash-dump functionality.
  • Other new ARM hardware platforms and SoCs:
    • NXP – NXP/Freescale LS2088A and LS1088A SoC, I2SE’s i.MX28 Duckbill-2 boards, Gateworks Ventana i.MX6 GW5903/GW5904, Zodiac Inflight Innovations RDU2 board, Engicam i.CoreM6 Quad/Dual OpenFrame modules, Boundary Device i.MX6 Quad Plus SoM.
    • Nvidia – Expanded support for Tegra186 and Jetson TX2
    • Spreadtrum – Device tree for SP9860G
    • Marvell – Crypto engine for Armada 8040/7040
    • Hisilicon – Device tree bindings for Hi3798CV200 and Poplar board
    • Texas Instruments – Motorola Droid4 (OMAP processor)
    • ST Micro – STM32H743 Cortex-M7 MCU support
    • Various Linksys platforms,  Synology DS116

The MIPS architecture also had its share of changes:

  • Fix misordered instructions in assembly code making kernel startup via UHB unreliable.
  • Fix special case of MADDF and MSUBF emulation.
  • Fix alignment issue in address calculation in pm-cps on 64 bit.
  • Fix IRQ tracing & lockdep when rescheduling
  • Systems with MAARs require post-DMA cache flushes.
  • Fix build with KVM, DYNAMIC_DEBUG and JUMP_LABEL
  • Three highmem fixes:
    • Fixed mapping initialization
    • Adjust the pkmap location
    • Ensure we use at most one page for PTEs
  • Fix makefile dependencies for .its targets to depend on vmlinux
  • Fix reversed condition in BNEZC and JIALC software branch emulation
  • Only flush initialized flush_insn_slot to avoid NULL pointer dereference
  • perf: Remove incorrect odd/even counter handling for I6400
  • ftrace: Fix init functions tracing
  • math-emu – Add missing clearing of BLTZALL and BGEZALL emulation counters; Fix BC1EQZ and BC1NEZ condition handling; Fix BLEZL and BGTZL identification
  • BPF – Add JIT support for SKF_AD_HATYPE;  use unsigned access for unsigned SKB fields; quit clobbering callee saved registers in JIT code; fix multiple problems in JIT skb access helpers
  • Loongson 3 – Select MIPS_L1_CACHE_SHIFT_6
  • Octeon – Remove vestiges of CONFIG_CAVIUM_OCTEON_2ND_KERNEL, as well as PCIERCX, L2C  & SLI types and macros;  Fix compile error when USB is not enabled; Clean up platform code.
  • SNI – Remove recursive include of cpu-feature-overrides.h
  • Sibyte – Export symbol periph_rev to sb1250-mac network driver; fix Kconfig warning.
  • Generic platform – Enable Root FS on NFS in generic_defconfig
  • SMP-MT – Use CPU interrupt controller IPI IRQ domain support
  • UASM – Add support for LHU for uasm; remove needless ISA abstraction
  • mm – Add 48-bit VA space and 4-level page tables for 4K pages.
  • PCI – Add controllers before the specified head
  • irqchip driver for MIPS CPU – Replace magic 0x100 with IE_SW0; prepare for non-legacy IRQ domains;  introduce IPI IRQ domain support
  • NET – sb1250-mac: Add missing MODULE_LICENSE()
  • CPUFREQ – Loongson2: drop set_cpus_allowed_ptr()
  • Other misc changes, and code cleanups…

For further details, you could read the full Linux 4.12 changelog – with comments only – generated using git log v4.11..v4.12 --stat. You may also want to read kernelnewbies’s Linux 4.12 changelog once it is up.