The Taalas HC1 is an AI accelerator with Llama-3.1 8B hardwired into the chip (i.e., implemented directly in hardware), delivering close to 17,000 tokens/s with the model and outperforming datacenter accelerators such as the NVIDIA B200 or Cerebras chips.
The Taalas HC1 is about 10x faster than the Cerebras chip, costs 20x less to build, and consumes 10x less power. The main downside is that it only works with the model hardwired into the hardware, currently Llama-3.1 8B, although we’re told it “retains flexibility through configurable context window size and support for fine-tuning via low-rank adapters (LoRAs)”.
Hardware accelerators usually come with memory on one side and compute on the other. The two operate at different speeds, and memory bandwidth is usually the bottleneck for Large Language Models. Taalas technology unifies storage and compute on a single chip, at DRAM-level density, to massively increase performance and reduce power consumption.
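As a back-of-the-envelope illustration of why bandwidth dominates: during single-stream decoding, each generated token requires reading roughly the entire set of model weights, so tokens/s is capped at bandwidth divided by the weight footprint. The bandwidth and bit-width figures below are illustrative assumptions, not Taalas specifications.

```python
# Rough roofline sketch: tokens/s for single-stream decoding is capped by
# memory bandwidth / bytes read per token (about one full pass over the weights).
# Bandwidth and bit-width values are illustrative assumptions, not Taalas specs.

def max_tokens_per_second(params_billion, bits_per_weight, bandwidth_gb_per_s):
    bytes_per_token = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_per_s * 1e9 / bytes_per_token

# Llama-3.1 8B with 4-bit weights over a ~1 TB/s external memory link
print(max_tokens_per_second(8, 4, 1_000))    # ~250 tokens/s
# Same model with weights kept on-die, assuming ~100 TB/s effective bandwidth
print(max_tokens_per_second(8, 4, 100_000))  # ~25,000 tokens/s
```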
Ultra-fast inference can be useful on servers where multiple users access the accelerators, and in robots using voice interaction. I noticed the latter when reviewing the SunFounder Fusion HAT+, where the prompt was sent to an LLM service (Gemini AI) that replies at a specific tokens/s rate before the text-to-speech engine takes over. This creates delays, and the conversation does not feel natural due to the lag. When I first started writing this post, I assumed the Taalas HC1 could be used for robotics, but considering it’s designed for 2.5kW servers, we’re not quite there yet… The HC1 chip is manufactured using TSMC’s 6nm process, measures 815 mm², and features 53 billion transistors.
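To put the voice-interaction lag mentioned above into perspective, here is a toy latency estimate; the reply length and generation rates are assumed figures, not measurements from the SunFounder review.

```python
# Toy estimate of how long the user waits for the LLM reply before TTS can speak.
# The reply length and tokens/s rates below are assumptions for illustration only.
reply_tokens = 150  # a short spoken answer

for label, tokens_per_s in [("cloud LLM service", 60), ("HC1-class accelerator", 15_000)]:
    wait_s = reply_tokens / tokens_per_s
    print(f"{label}: ~{wait_s:.2f} s before text-to-speech can start")
# cloud LLM service: ~2.50 s, HC1-class accelerator: ~0.01 s
```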

The company set up an online chatbot demo, so anybody can try it, and it’s indeed super fast. It reported 19,997 tokens/s when I asked “what is 2+2?”, but more typical questions like “Why is the sky blue?” or “what do you know about CNX Software?” were processed at about 15K–16K tokens/s. I tried to abuse it a little bit by asking it to write a 100-page book about the meaning of life, but instead, I got the outline of a 14-chapter book, generated in 0.064s at 15,651 tokens/s. Note that it’s an 8-billion-parameter model, so answers are not always correct.
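For scale, the quoted demo figures work out to roughly a thousand tokens for that book outline:

```python
# 0.064 s at 15,651 tokens/s:
print(round(0.064 * 15651))  # ~1002 tokens for the 14-chapter outline
```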
The company is now working on a second mid-sized reasoning LLM, still based on the HC1 silicon, that will launch in Q2. Further down the road, the second-generation silicon platform (HC2) will enable higher density and even faster execution, and deployments should start by the end of the year. More details can be found in the announcement.






This is an insane tradeoff, but I could see how it could be worth it. It’s also a rather large chip near the reticle limit, although it’s nothing compared to the Cerebras Wafer Scale Engine.
A hard-wired image generation model with LoRA support in a PCIe card could be pretty interesting. I think those are rapidly reaching a quality plateau.
Ohh, the reticle limit. I saw some reports and comments on this hardware talking about the size of the chip, and I thought it didn’t seem all that large. That makes sense.
I mean, it’s an 815mm^2 chip on TSMC N6 with 53 billion transistors.
By comparison, the RTX 5090 is 750mm^2 on TSMC 4N (slight custom N4 variant) with 92.2 billion transistors. The RTX 4090 is 608.5mm^2 and 76.3 billion transistors on the same node.
So that’s a big chip, about as big as you can get without using multiple dies. Solely for running an 8B parameter Llama3. At least TSMC N6 is a mature budget node in comparison to the 2/3nm class nodes being used today.
[ “815mm^2” is about 29x29mm (thx) ]
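Putting the die figures from these comments side by side (published numbers, with transistor density and square-die edge length derived from them):

```python
import math

# Published die area (mm^2) and transistor counts quoted in the comments above.
chips = {
    "Taalas HC1 (TSMC N6)": (815.0, 53.0e9),
    "RTX 5090 (TSMC 4N)":   (750.0, 92.2e9),
    "RTX 4090 (TSMC 4N)":   (608.5, 76.3e9),
}

for name, (area_mm2, transistors) in chips.items():
    density = transistors / area_mm2 / 1e6  # million transistors per mm^2
    edge = math.sqrt(area_mm2)              # edge length if the die were square
    print(f"{name}: ~{density:.0f} MTr/mm^2, ~{edge:.1f} x {edge:.1f} mm")
# The HC1 lands around 65 MTr/mm^2 vs ~123-125 MTr/mm^2 for the RTX parts,
# and sqrt(815) is ~28.5 mm, i.e. the "about 29x29mm" mentioned above.
```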
This is super interesting. I’ve been waiting for analog implementations of LLMs for the last 3 years: once you start to hard-code a model, you can easily implement weights using resistors and replace multipliers with a transistor. It’s very likely something comparable is being done here. We’re already seeing limits though: the quality of responses compared to a traditional Llama-3.1-8B suggests that weights were quantized to 4 bits or so, which is a bit too small for such small models. But it’s a great demonstration of their ability to do things right. For example, implementing reasoning models with this technology would finally make them usable for practical things and compensate for the inaccuracy.
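For what it’s worth, here is a toy sketch of what low-bit weight quantization does to a weight tensor; this is a generic symmetric quantizer for illustration only, not the scheme Taalas actually uses.

```python
import numpy as np

# Generic symmetric per-tensor quantization, just to illustrate how error grows
# as the bit width shrinks. This is NOT the Taalas quantization scheme.
def quantize(weights, bits):
    levels = 2 ** (bits - 1) - 1                      # signed integer range
    scale = np.abs(weights).max() / levels
    q = np.clip(np.round(weights / scale), -levels, levels)
    return q * scale                                  # dequantized approximation

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=100_000)               # toy "weight tensor"
for bits in (3, 4, 6, 8):
    err = np.abs(quantize(w, bits) - w).mean()
    print(f"{bits}-bit: mean absolute error {err:.6f}")
```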
They indeed talk about custom 3-bit and 6-bit weights in the announcement:
There were a few talks about it five years ago or so. The issue with “analog” DL circuits is that they accumulate noise way too fast. A lot of special additional machinery is needed to fight this noise in very complicated ways, which made these designs impractical. The “AI” term was not used back then, just Deep Learning.
But I think that working on addressing this noise is much more promising than trying to fit the smallest 4- or 8-bit MAC operators on silicon and associating each of them with DRAM. Beyond the MAC operators themselves, flipping that many bits is also very inefficient energy-wise. And by the way, regarding analog noise, we’ve made progress on this; it’s already used in flash (QLC is 4 bits per cell, exactly what we need here to start having a roughly correct LLM).
Another thing to think about: a human brain consumes around 20W, runs with low signal propagation speeds, and occupies about one liter. So it is physically possible to achieve this level of processing in this thermal envelope. LLM chips run at very high frequencies (very sensitive to noise) and are tiny, but once you add the necessary cooling and motherboard, they’re much bigger than a human brain. So maybe we should get prepared to use way more silicon at lower frequencies and without heatsinks in the same volume. That’s what is already being done with flash, where some chips employ up to 8 silicon dies stacked on top of each other, and many such chips are soldered next to each other on a board for a single SSD device.
Yeah, quite possible, or China will do photonics, given how good they are at attracting fundamental scientists now.
About efficiency, my understanding is that modern LLMs mostly struggle because of the abysmal training data quality; there was a talk where Andrej Karpathy explains it, and I heard the same from a few other people. I assume new generations of LLM models are trained with AI-assisted data cleanup, since I see huge progress with every new generation in the 8-12-16Gb size range. There are even 1-2Gb models that can run on a phone and perform better than 16Gb models from just a year ago.
[ static: the weights of the trained model, same for post-training/fine-tuning;
dynamic updates with:
A) Periodic Retraining (static but refreshed),
B) Retrieval-Augmented Generation (RAG): pulling fresh documents that are inserted into the prompt so the model can reason over them (rough sketch after this comment),
C) Tool Use (runtime dynamic information): modern systems can call web search, databases, calculators, code execution, APIs, ‘calling experts’, etc.
How is the hardware updated for the ‘main static part’, the trained model weights? (thx) ]
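A minimal sketch of option B (RAG) around a fixed, hardwired model; `search_documents` and `generate` are hypothetical placeholders for illustration, not the Taalas API.

```python
# Minimal RAG sketch: the model weights stay fixed (e.g. hardwired in silicon),
# while fresh documents are retrieved per request and pasted into the prompt.
# `search_documents` and `generate` are hypothetical placeholders, not a real API.

def answer_with_rag(question, search_documents, generate, top_k=3):
    docs = search_documents(question)[:top_k]        # dynamic, updatable knowledge
    context = "\n\n".join(docs)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)                          # static, hardwired weights
```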
Insane performance. It reminds me of the time we had neural accelerators for convolutions; many had a YOLO model implemented in hardware. I have no doubt that the future of LLMs is accelerators and not GPUs, though I’m not sure it will be this kind, as every metric keeps evolving day by day.
If it could at least make the GPU and DRAM markets collapse, and make Nvidia calm down and get back to what they used to do before (i.e. GPUs with a G standing for Graphics), it would help the whole IT industry. But they’re probably not prepared to see their stock divided by 5 yet.
[ just summarizing the energy consumption implications for ‘data centers’: 10x less power combined with 10x faster processing means ~100x less energy per ‘comparable/equal’ request/answer(?) (worked out just below)
Seems like a suitable investment for ‘FPGA’ chiplets(?) (thx) ]
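The arithmetic in that comment holds under its own assumptions, since energy per answer is power multiplied by time:

```python
# Energy per answer = power x time (illustrative baseline numbers only).
baseline_power_w, baseline_time_s = 1000.0, 1.0   # hypothetical reference accelerator
hc1_power_w = baseline_power_w / 10               # "10x less power"
hc1_time_s = baseline_time_s / 10                 # "10x faster"

print(baseline_power_w * baseline_time_s)         # 1000 J per answer
print(hc1_power_w * hc1_time_s)                   # 10 J per answer -> ~100x less energy
```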
[ probably no for a (common) FPGA’s advantage:
“A hardwired LLM ASIC can be 10× to 50× more energy-efficient
and 5× to 20× higher throughput per watt than FPGA implementations.”
From that, an FPGA AI implementation in ‘hardware’ would be at about the same efficiency/performance level as software AI inference, maybe (optimized) getting closer to GPU-based inference performance.
ASIC AI hardware (like the Taalas HC1) could/would lose part of its performance advantage with extensive use of network-related tools that add latency to the inference/reasoning process, in exchange for the flexibility of reasoning over updated data on top of the hardware-embedded weights of the trained model(?)
Unexpected, from my POV. (thx) ]
[ What are min. hardware requirements for (useful, local?) AI LLM training and inference?
for inference: ‘7B–13B parameter models like LLaMA-class, Mistral Mini, Alpaca, Qwen, etc.’, that’s a capable 4-core CPU (Ryzen), about 32GB memory, a ~12GB GPU, and ~500GB-1TB storage(?) (rough estimate sketched after this comment)
for fine-tuning: about double the above
for training: at least about 20x the inference hardware, and ‘weeks’ of compute
fine-tuning (compute demand), 7B model: ~$0-500
training, 7B model from scratch: ~$200k
training, GPT-4 class: Tens–hundreds of millions (thx) ]
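The ~12GB GPU figure above roughly lines up with a simple weight-memory estimate (rule-of-thumb arithmetic that ignores the KV cache and activations):

```python
# Rule-of-thumb weight memory for local inference (ignores KV cache and activations).
def weight_memory_gb(params_billion, bits_per_weight):
    return params_billion * bits_per_weight / 8

for params in (7, 13):
    for bits in (16, 8, 4):
        print(f"{params}B @ {bits}-bit: ~{weight_memory_gb(params, bits):.1f} GB")
# 7B: 14 / 7 / 3.5 GB and 13B: 26 / 13 / 6.5 GB, so a ~12GB GPU fits
# a 7B model at 8-bit or a 13B model at 4-bit, matching the rough figures above.
```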
It’s 250 W per card; the 2.5 kW figure is for 10 of them in a server rack (10 × 250 W = 2.5 kW).