Baikal T1 MIPS Processor – The Last of the Mohicans?

CNXSoft: Guest post by Blu about Baikal T1 development board and SoC, potentially one of the last MIPS consumer grade platforms ever.

It took me a long time to start writing this article, even though I had been poking at the test subject for months, and I felt during that time that there were findings worth sharing with fellow embedded devs. What was holding me back was the thought that I might be seeing one of the last consumer-grade specimens of a paramount ISA that once turned the CPU world upside-down. That thought was giving me mixed feelings of part sadness, part hesitation ‒ to not do some injustice to a possibly last-of-its-kind device. So it was with these feelings that I took to writing this article. But first, a short personal story.

Two winters ago I was talking to a friend of mine over beers. We were discussing CPU architectures and hypothesizing on future CPU developments in the industry, when I mentioned to him that the latest Imagination Technologies’ MIPS P5600 ‒ a MIPS32r5 ‒ hosted an interesting SIMD extension ‒ a previously-unseen one in the MIPS world. I had just skimmed through the docs for that extension ‒ MIPS SIMD Architecture (MSA), and I was impressed with how clean and practical this new vector instruction set looked in comparison to the SIMD ISAs of the day, particularly to those by a very venerable CPU manufacturer. We discussed how the P5600 had found its way into a SoC by the Russian semiconductor vendor Baikal Electronics, and how they were releasing a devboard, which, thanks to limited-series manufacturing, would be well out-of-reach for mortal devs.

Fast forward to this summer, when I got a ping from my friend ‒ he was currently in St. Petersburg, Russia, and he was browsing the online store of a Moscow computer shop, and there was the Baikal T1 BFK 3.1 board, for the equivalent of 500 EUR, so if I ever wanted to get one, now was the time.

Did I want one? Last MIPS I had an encounter with was the Imagination CI20 board, hosting an Ingenic JZ4780 application SoC ‒ a dual-core MIPS32r2 implementation, and that was a mixed experience. I just had higher expectations of that SoC, as neither the SoC vendor nor Imagination did a good job setting the user expectations of what the XBurst MIPS cores actually were ‒ short in-order pipelines, with a non-pipelined scalar FPU, and an obscure integer-only SIMD specialized for video codecs. The one interesting part in that SoC, from my perspective, was the fully-fledged GLESv2/EGL stack for the aging SGX540. What I was looking for this time around was a “meatier” MIPS, one which was closer to the state of the art of this ISA, and the P5600 was precisely that.

So, yes, I very much wanted one. That price was very close to my threshold of ‘buy for science’, but I still had to keep in check my overgrown annual ‘scientific budget’ (as I refer to my devboard expenses in front of my wife), so I hesitated for a moment. To which my friend suggested ‘Listen, your birthday occurs annually, so how about I get you a birthday present, with some credit from future birthdays?’ [A huge thank you, Mitia, for your ingenuity, kindness and generosity!]

The BFK 3.1 is a sub-uATX board ‒ namely of the flexATX form factor ‒ a bit larger than mini-ITX, which means it’s compact ‒ not RPi compact, mind you, but still compact for a devboard. Baikal T1 itself is a compact SoC ‒ not much larger than the Ingenic JZ4780. The latter is 17x17mm BGA390 (40nm), vs 25x25mm BGA576 (28nm) for the T1. But the T1 is a proper SoC that contains everything needed for a small gen-purpose computer (sans a GPU), which is what the BFK 3.1 seeks to be. Combined with the versatile MCU STM32F205 (ARM Cortex-M3 @ 120MHz), the T1 allows for an essentially two-chip devboard. Aside from the SoC and its companion MCU, the BFK 3.1 hosts a PCIe x16 connector (x4 active lanes), a SO-DIMM slot, an ATX power connector, 2x 1Gb Ethernet and 2x SATA 3 connectors, a USB2.0, a UART (via mini-USB) and what appears to be a USB OTG, a couple of JTAGs and even a RPi GPIO connector ‒ the rest of the board’s top surface is nearly pristine clean. Ok, there’s one more connector ‒ a proprietary one for the optional 10Gb Ethernet add-on, but that comes more as a curiosity from my current perspective.

Getting the board live was practically uneventful. BFK 3.1 power delivery is via a 24-pin ATX connector ‒ no barrel connectors of any kind, which in my case made two large drawers worth of PSUs useless, but I also had a 20-pin ATX picoPSU at hand (80W DC-DC, 12V input) and a spare AC-DC 12V converter (60W) ‒ that improvised power delivery covered the board plus an SSD more than fine ‒ actually it was overkill, given the manufacturer’s TDP rating of the SoC of 5W. I also had a leftover 4GB DDR3 SO-DIMM from a decommissioned notebook, so I thought I had the RAM covered as well. A “minor” detail had escaped my attention ‒ that SO-DIMM was of the 1333MT/s (667MHz) variety, whereas the board took 1600MT/s (800MHz) sharp ‒ my first booting of the board took me only as far as RAM controller negotiations.

Board fitted with “wrong” SO-DIMM @ 667 MHz

One facepalm and a visit to the local store later, the board was hosting shiny-new 8GB of DDR3, to specs and all.

Yet another minor detail about the RAM had originally escaped my attention, but that detail was not crucial to the booting of the board, and I found it out only after the first boot: the SoC had a 32-bit RAM bus, so it was seeing half the capacity of the 64-bit DIMM. Perhaps it could be arranged for such a bus to see the full DIMM capacity ‒ I’m not a hw engineer to know such things, and the designers of the BFK 3.1 clearly did not arrange for that. Which is a bit unfortunate for a devboard. Oh well ‒ back to square ‘4GB of RAM’.


Apropos, as it turned out, I did really need RAM, since for exposing the full potential of the P5600 I had some compiler building ahead of me, and I always self-host builds when possible. But I’m getting ahead of myself.

The board arrives with a Busybox in SPI flash, and Baikal Electronics provide two revisions of Debian Stretch images with kernel 4.4 for day-to-day uses from a SATA drive. All available boot media are exposed via the cleanest U-Boot menu interface I’ve seen yet.

Footnote: aside from dd-ing the Debian image to the SSD, all interactions with the BFK 3.1 were done without involvement of PCs ‒ the above screengrab is from my trusty chromebook.

The obligatory dump of basic caps follows:

Whether the kernel saw this as a MIPS32r2 machine or it made use of the address extensions ‒ all that was beyond the scope of this first reconnaissance. I wanted to examine uarch performance, and as long as compilers were in the clear about the CPU’s true ISA capabilities I was set.

The VZ extension is a virtualization thing ‒ far from my interests. The EVA and XPA are addressing extensions ‒ Enhanced Virtual Address and Extended Physical Address, respectively. The former allows more efficient virtual-space mapping between kernel and userspace for the 32-bit/4GB process-addressable memory space. And the latter is, well, a physical address extension. From the P5600 manual:

Extended Physical Address (XPA) that allows the physical address to be extended from 32-bits to 40-bits.
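For a sense of scale, the address-width change in that quote works out as follows (assuming byte-addressable memory):

```latex
\begin{align*}
2^{32}\ \text{bytes} &= 4\ \text{GiB} \\
2^{40}\ \text{bytes} &= 2^{8} \cdot 2^{32}\ \text{bytes} = 1\ \text{TiB}
\end{align*}
```

In other words, XPA widens the physical address space by a factor of 256 over the plain 32-bit case.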

Clearly both addressing extensions could be of good use to kernel developers. Me, of the listed ISA extensions, MSA was the one I truly cared about.

How about FS performance?

As wise men say, ‘Have decent SATA performance ‒ will use for a build machine.’

And finally, an interrupts-related observation that might help me obtain cleaner benchmarking results:

Notice how all serial and SATA interrupts are serviced by the 1st core? We could put that to some use.

Now the actual fun could begin! Being the control freak that I am, I tend to run a couple of micro-benchmarks when testing new uarchitectures ‒ one on the ‘gen-purpose’ side of performance, and one on the ‘sustained fp’ side of performance. Both of them being single-threaded, and the CPU at hand not featuring SMT, that meant I could focus on the details of the uarch by isolating all tests to the relatively-uninterrupted 2nd core.
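One way to do that kind of isolation from inside a benchmark is the Linux affinity API. The following is a minimal sketch and not the harness used for the numbers in this article; the helper names `first_allowed_cpu` and `pin_to_cpu` are made up for illustration.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for the CPU_* macros; g++ normally predefines it
#endif
#include <sched.h>
#include <cassert>

// Hypothetical helper: lowest-numbered CPU the calling thread may run on, or -1 on error.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof set, &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

// Hypothetical helper: restrict the calling thread to a single CPU.
// Returns true on success, false if the kernel rejected the request.
bool pin_to_cpu(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set) == 0;  // pid 0 = calling thread
}
```

On a dual-core machine like this one, `pin_to_cpu(1)` at the top of `main()` would keep the benchmark on the 2nd core (CPU indices are zero-based); launching the binary via `taskset -c 1` achieves the same from the shell.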

Unfortunately, there was one last obstacle before me ‒ Debian Stretch comes with gcc-6.3 which does not know of the MSA extension in the P5600. For that I needed one major compiler revision later ‒ gcc-7.3 was fully aware of the novel instruction set, and so my next step was building gcc-7.3 for the platform. Easy-peasy. Or so I thought.

A short rant: I have difficulties understanding why a compiler’s default-settings self-hosted build would fail with an ‘illegal instruction’ in the bootstrap phase. But that’s the case with g++-7.3 on Debian Stretch when doing a self-hosted --target=mipsel-linux-gnu build on the BFK 3.1, and that’s what made me approach the gcc-dev mailing list with the wrong kind of support question, to which, luckily, I still got helpful responses.

Back to the BFK 3.1, where I eventually got a good g++-7.3 build via the following config, largely copied over from Debian’s g++-6.3:

Which gave me:

Yay, got MSA compiler support! Now I could do all the fp32 (and not only) SIMD I wanted.

But first I stumbled upon a surprise coming from the non-SIMD micro-benchmark ‒ a Mandelbrot plot written in the language Brainfuck, and run through a home-grown Brainfuck interpreter.
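For readers unfamiliar with the language, a bare-bones Brainfuck interpreter fits in a few dozen lines. The sketch below is not the interpreter benchmarked here; it is a plain, unoptimized illustration under stated assumptions: 30000 wrapping 8-bit cells, a wrapping tape pointer, and `,` leaving the cell untouched at end of input. The name `run_brainfuck` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Runs a Brainfuck program and returns everything it printed.
std::string run_brainfuck(const std::string& prog, const std::string& input = "") {
    // Pre-match brackets so that each loop jump is a single table lookup.
    std::vector<std::size_t> jump(prog.size(), 0);
    std::vector<std::size_t> open;
    for (std::size_t i = 0; i < prog.size(); ++i) {
        if (prog[i] == '[') {
            open.push_back(i);
        } else if (prog[i] == ']') {
            assert(!open.empty() && "unbalanced ']'");
            jump[i] = open.back();
            jump[open.back()] = i;
            open.pop_back();
        }
    }
    assert(open.empty() && "unbalanced '['");

    std::vector<std::uint8_t> tape(30000, 0);  // assumed tape size
    std::size_t ptr = 0, in_pos = 0;
    std::string out;
    for (std::size_t pc = 0; pc < prog.size(); ++pc) {
        switch (prog[pc]) {
        case '>': ptr = (ptr + 1) % tape.size(); break;
        case '<': ptr = (ptr + tape.size() - 1) % tape.size(); break;
        case '+': ++tape[ptr]; break;  // uint8_t wraps 255 -> 0
        case '-': --tape[ptr]; break;  // uint8_t wraps 0 -> 255
        case '.': out.push_back(static_cast<char>(tape[ptr])); break;
        case ',':
            if (in_pos < input.size())
                tape[ptr] = static_cast<std::uint8_t>(input[in_pos++]);
            break;
        case '[': if (tape[ptr] == 0) pc = jump[pc]; break;  // skip the loop body
        case ']': if (tape[ptr] != 0) pc = jump[pc]; break;  // repeat the loop body
        default: break;  // every other character is a comment
        }
    }
    return out;
}
```

The dispatch loop is almost nothing but data-dependent branches and small memory updates, which is why a workload like this is a reasonable probe of a core's gen-purpose behavior and of the compiler's control-flow code quality.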

Running that before and after upgrading the compiler showed the following results:

Brainstorm Mandelbrot ‒ three versions of the code, across two compilers:
g++-6.3.0: 0m43.539s (vanilla)
g++-6.3.0: 0m38.176s (alt)
g++-6.3.0: 0m38.176s (alt^2)

g++-7.3.0: 0m36.003s (vanilla)
g++-7.3.0: 0m36.561s (alt)
g++-7.3.0: 0m31.852s (alt^2)

Notice how for the exact-same code and the exact-same optimization flags the two compilers produced performance delta for the resulting binary as large as 20% in favor of the newer g++? That was not due to some new, smarter P5600 instructions utilized by the newer compiler ‒ nope, the generated codes in both cases used the same ISA. It’s just that the newer compiler produced notably better-quality code ‒ fewer branches, more linear control flow. Yay for better compilers!
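Taking the ‘vanilla’ timings listed above as a worked example, the delta can be expressed either as a speedup or as a reduction in run time:

```latex
\begin{align*}
\text{speedup} &= \frac{t_{\text{g++-6.3}}}{t_{\text{g++-7.3}}} = \frac{43.539\ \text{s}}{36.003\ \text{s}} \approx 1.21 \\
\text{time saved} &= 1 - \frac{36.003\ \text{s}}{43.539\ \text{s}} \approx 0.17
\end{align*}
```

So the ~20% figure corresponds to the speedup reading; in terms of wall-clock time the newer binary finishes about 17% sooner.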

Those g++-7.3 results positioned the P5600 firmly between the AMD A8-7600 and the Intel Core2 Duo P8600 in the clock-normalized Mandelbrot performance charts (where the Penryn also takes advantage of the custom Apple clang compiler, which generally outperforms gcc at this combination of CPU and task).

Per-clock, the P5600 also scored ahead of the Cortex-A15, which I believe is the closest competitor in the category of the P5600. Where the P5600, or perhaps its incarnation in the Baikal T1, fell short, was in absolute performance due to low clocks. Should that core reach clocks closer to 2GHz, we’d be seeing much more interesting absolute-performance results.

Ok, it was time to see how the P5600 did at fp32 SIMD. For that an SGEMM matrix multiplier was to be used. Making use of the novel MSA ISA took minimal effort, partially thanks to gcc’s support for generic vectors, partially thanks to the simplicity of the MSA ISA. The MSA version of the matmul code, dubbed ‘ALT=8’, took less than an hour to code and tune, and resulted in ~3.9 flop/clock for the small, cache-fitting dataset (64×64 matrices), and 2.1 flop/clock for the large dataset (512×512 matrices). Those results placed the P5600 firmly between Intel Merom and Intel Penryn for the small dataset, and slightly below the level of ARM Cortex-A72 and Intel Merom for the large dataset. The large dataset, though, exhibited a rather erratic behavior ‒ run-times varied considerably even when pinned to the 2nd core. It was as if the memory subsystem, past L2D, was behaving inconsistently doing 128-bit-wide accesses. That warranted further investigation, which would happen on a better day.
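To illustrate what ‘generic vectors’ means in practice, here is a minimal sketch of an SGEMM inner kernel written with the GCC/Clang `vector_size` extension, plus the usual flop/clock arithmetic. This is not the ‘ALT=8’ code: it has none of its tuning, and the names `v4f`, `sgemm_generic` and `flops_per_clock` are hypothetical. The assumption is that a compiler targeting MSA maps the 16-byte vector type onto the 128-bit MSA registers, while on other targets it falls back to whatever SIMD (or scalar code) is available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Generic vector of four packed fp32 lanes (128 bits).
typedef float v4f __attribute__((vector_size(16)));

// C += A * B for square, row-major n x n matrices; n must be a multiple of 4.
void sgemm_generic(const float* a, const float* b, float* c, std::size_t n) {
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; j += 4) {
            v4f acc;
            std::memcpy(&acc, c + i * n + j, sizeof acc);  // unaligned-safe load
            for (std::size_t k = 0; k < n; ++k) {
                const float s = a[i * n + k];
                const v4f splat = { s, s, s, s };          // broadcast A[i][k]
                v4f row;
                std::memcpy(&row, b + k * n + j, sizeof row);
                acc += splat * row;                        // 4 multiplies + 4 adds
            }
            std::memcpy(c + i * n + j, &acc, sizeof acc);
        }
    }
}

// An n x n matmul is conventionally counted as 2*n^3 floating-point operations.
double flops_per_clock(std::size_t n, double seconds, double clock_hz) {
    assert(seconds > 0.0 && clock_hz > 0.0);
    const double flops = 2.0 * double(n) * double(n) * double(n);
    return flops / (seconds * clock_hz);
}
```

A flop/clock figure like the ones quoted above is then obtained by timing the multiplication of two 64×64 (or 512×512) matrices and passing the elapsed time and the core clock to `flops_per_clock`.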

But let me finish my BFK 3.1 story here, and give my subjective, not-guaranteed-impartial opinion of the test subject.

My impressions of the P5600 in the Baikal T1 are largely positive. Using my limited micro-benchmark set as a basis, that uarchitecture does largely deliver on its promises of good gen-purpose IPC and good SIMD throughput per clock, and could be considered a direct competitor to the best of 32-bit ARM Cortex designs. That said, Baikal T1 could use higher clocks, which would position it in absolute-performance terms right in the group of the Core2 lineup by Intel and the Cortex-A12/15/17 lineup by ARM. Which, if one thinks of it in the grand scheme of things, would be nothing short of a great achievement for the Baikal Warrior (Imagination aptly named the P-series MIPS designs ‘Warrior’ ‒ they’d have to fight for the survival of their ISA). If we ever live to see another Baikal T-series, that is ‒ Baikal Electronics are also developing their Baikal M-series ‒ ARM Cortex-A57 designs.

MIPS once turned the CPU world around. Can it survive its darkest hour (at least in the West ‒ in the East the Chinese have their Loongson) and step into a renaissance, or will it perish into oblivion? I, for one, would love to see the former, but I’m just an old coder, and old coders don’t get much say these days.


24 Replies to “Baikal T1 MIPS Processor – The Last of the Mohicans?”

  1. Who will make one outside Russia. Via and China engineers are working on a Via chip. Subor of copy games consoles fame, is working with AMD CPU, GPU also Fujitsu’s got A64FX post K supercomputer.
    So where are Wave Computing taking MIPS since purchase?

    1. No idea. AFAIK, Baikal sell the SoC for ~50 EUR (I guess for bulk quantities), so if one really wanted to, they could design their own board with the SoC.

    2. Both and offer RISC-V cores in Russia while, according to HiFive’s CEO, there’s already 300+ Chinese companies working on RISC-V products. So, in and out of Russia, MIPS’ future is looking rather bleak to me.

      1. RISC-V is a good initiative, but announcing it as the next MIPS is a tad premature, I think. For a start, RISC-V needs to trim that excessive ISA fragmentation via consolidation. Otherwise bad things might happen — for an example of ISA fragmentation gone terribly wrong see x86 SIMD extensions.

    1. Thank you, Jota — my exact thoughts when I was using the U-Boot menu — ‘Why can’t board vendors do that as a standard?’

    2. Because it’s apparently very time consuming to make it look nice like that, which equals developer hours that could be put to better use. Unfortunately this is the world we live in.

  2. I was curious so I looked it up. Supposedly this chip is fabricated in 28nm. Since they’re using an A57 for another design, it seems they’re moving away from that. Does anyone else know much about Baikal and what fab capabilities they have?

    1. AFAIK, Baikal currently outsource all their 28nm devices to TSMC, with plans to eventually move that in-house. When — no idea.

      1. Ahh, okay, thank you. So, they have access to modern fab capacity–if they can pay like everyone else at least. A57 just struck me as strange. Everyone but a few companies avoided it. Those who did use it used it like *once* and quickly moved on. Designing something with it *now*? Strange.

    1. Nice to see Baikal supporting the BFK 3.1 properly. If I had checked their support pages past the initial Stretch download that could’ve saved me the gcc-7.3 builds. And I should check the newer compiler too.

  3. You may get better (and more stable) results on MIPS/MSA (and likely other architectures), if you use Eigen (currently only the default branch has MSA support) or OpenBLAS.

    1. Last time I checked, HPL produced results in agreement with my SGEMM on the Cortex A72. I’ll check Eigen, though (I’ve been meaning to anyway), as my current MSA SGEMM is a prima-vista code.

      1. BTW, I just ran a check out of curiosity: on a SNB my SGEMM sits within 0.7% from Intel Math Kernel Library (29.9GFLOPS for mine, vs 30.1GFLOPS for MKL, for 2048×2048 matrices), and according to Eigen’s own benchmarks it performs effectively on par to MKL (within a few percent), so take that as you wish. But as I said, I will check Eigen’s MSA code next. And in case I left somebody with the wrong impression — the SGEMM test-results stability issue manifests exclusively on the Baikal T1 — all other measurements from all other platforms are fairly stable.

  4. If they open source it back up again, it has a shot… otherwise, no. Heck, I’d rather use open source PowerPC and get some 7nm love for that in ultra low power scenarios.
