Raspberry Pi sent me a sample of their AI HAT+ 2 generative AI accelerator, based on the Hailo-10H, for review. The 40 TOPS AI accelerator is advertised as being suitable for LLMs (Large Language Models) and VLMs (Vision Language Models), while delivering about the same performance as the first-generation AI HAT+ (Hailo-8) for AI vision/computer vision models.
After going through an unboxing, I’ll assemble the AI HAT+ 2 onto a Raspberry Pi 5 with 2GB of RAM fitted with a Raspberry Pi Camera Module 3, before quickly checking whether AI vision models work as expected, and then spending more time testing LLM and VLM samples.
Raspberry Pi AI HAT+ 2 unboxing
My sample had a somewhat long and rough trip from the UK to Thailand, and the package did not look that good when DHL delivered it.
But luckily, nothing was damaged, and I got the AI HAT+ 2 with a heatsink, a 40-pin GPIO extension header, plastic standoffs and screws, and a sheet explaining how to assemble the heatsink.
The Raspberry Pi AI HAT+ 2 has two main ICs: the Hailo-10H AI accelerator and an 8GB memory chip, which should allow us to run LLMs and VLMs on a Raspberry Pi 5 with limited memory. This differs from the Hailo-8-based AI HAT+ that relies on the memory on the Raspberry Pi 5, and is only suitable for computer vision applications.

There’s nothing much on the bottom apart from passive components and the PCIe flat cable.
Raspberry Pi AI HAT+ 2 assembly with Raspberry Pi 5
The assembly is pretty straightforward. First, I secured the heatsink on top of the HAT after peeling the protective film and pressing the two spring clips attached to the heatsink. I also installed four plastic standoffs on the Raspberry Pi 5 and inserted the GPIO extension header.
At that point, you’ll likely want to insert the PCIe flat cable into the 16-pin PCIe FFC connector of the Pi 5, before placing the AI HAT+ 2 on top and securing it with the four remaining screws.
If you plan to use another HAT+ on top, you’ll have to make sure you don’t fully push the GPIO extension header since the GPIO pins won’t be accessible otherwise.
Note that you can’t use an NVMe SSD with the AI HAT+ 2 unless you add another HAT with a PCIe switch, like the HatBRICK! Commander.
Install Raspberry Pi OS Trixie 64-bit and Hailo package
My board was still running Raspberry Pi OS Bookworm based on Debian 12, but the AI HAT+ 2 requires the latest Raspberry Pi OS Trixie 64-bit, so I removed the 32GB microSD card to flash it with the latest OS. If Trixie is already installed on your board, you may want to make sure it’s up-to-date:
sudo apt update
sudo apt full-upgrade -y
sudo rpi-eeprom-update -a
sudo reboot
We can now install the package required by the Hailo-10H accelerator and reboot:
sudo apt install dkms hailo-h10-all
sudo reboot
We can confirm the Hailo-10H accelerator is properly detected:
pi@raspberrypi:~ $ hailortcli fw-control identify
Executing on device: 0001:01:00.0
Identifying board
Control Protocol Version: 2
Firmware Version: 5.1.1 (release,app)
Logger Version: 0
Device Architecture: HAILO10H
Computer Vision samples with rpicam-apps
As I understand it, the Hailo-10H has no benefit over the Hailo-8 when it comes to computer vision processing, but it’s still important to check that those workloads work. So I installed rpicam-apps to repeat at least one of the tests I did with the Raspberry Pi AI HAT+.
sudo apt install rpicam-apps
As mentioned in the introduction, I also connected a Raspberry Pi Camera Module 3 to the single board computer so that I could run the YOLOv8 model on the Hailo-10H HAT:
pi@raspberrypi:~ $ DISPLAY=:0 rpicam-hello -t 0 --post-process-file /usr/share/rpi-camera-assets/hailo_yolov8_inference.json --lores-width 640 --lores-height 640 --rotation 180
[0:04:21.660771575] [1685]  INFO Camera camera_manager.cpp:340 libcamera v0.6.0+rpt20251202
[0:04:21.674084325] [1688]  INFO RPI pisp.cpp:720 libpisp version 1.3.0
[0:04:21.678205885] [1688]  INFO IPAProxy ipa_proxy.cpp:180 Using tuning file /usr/share/libcamera/ipa/rpi/pisp/imx708.json
[0:04:21.689349513] [1688]  INFO Camera camera_manager.cpp:223 Adding camera '/base/axi/pcie@1000120000/rp1/i2c@88000/imx708@1a' for pipeline handler rpi/pisp
[0:04:21.689382901] [1688]  INFO RPI pisp.cpp:1181 Registered camera /base/axi/pcie@1000120000/rp1/i2c@88000/imx708@1a to CFE device /dev/media0 and ISP device /dev/media1 using PiSP variant BCM2712_D0
Made X/EGL preview window
Postprocessing requested lores: 640x640 BGR888
Reading post processing stage "hailo_yolo_inference"
Reading post processing stage "object_detect_draw_cv"
Mode selection for 2304:1296:12:P
    SRGGB10_CSI2P,1536x864/0 - Score: 3400
    SRGGB10_CSI2P,2304x1296/0 - Score: 1000
    SRGGB10_CSI2P,4608x2592/0 - Score: 1900
Stream configuration adjusted
[0:04:21.813844172] [1685]  INFO Camera camera.cpp:1215 configuring streams: (0) 2304x1296-YUV420/sYCC (1) 640x640-BGR888/sRGB (2) 2304x1296-RGGB_PISP_COMP1/RAW
[0:04:21.813939227] [1688]  INFO RPI pisp.cpp:1485 Sensor: /base/axi/pcie@1000120000/rp1/i2c@88000/imx708@1a - Selected sensor format: 2304x1296-SRGGB10_1X10/RAW - Selected CFE format: 2304x1296-PC1R/RAW
Hailo device: HAILO10H
The HAILO10H device is detected, and everything works as smoothly as it did with the Hailo-8, at a higher FPS than relying on the Raspberry Pi 5’s CPU alone. I won’t spend time testing other CV samples, and will instead focus on LLMs and VLMs for the rest of the review.
Testing LLMs on the Raspberry Pi AI HAT+ 2
Running LLMs on the command line with Hailo Ollama server
We’ll mostly follow the instructions posted on the Raspberry Pi website. The first step is to install the Hailo Ollama server (version 5.1.1):
wget https://dev-public.hailo.ai/2025_12/Hailo10/hailo_gen_ai_model_zoo_5.1.1_arm64.deb
sudo dpkg -i hailo_gen_ai_model_zoo_5.1.1_arm64.deb
Let’s start the server in one Terminal window:
pi@raspberrypi:~ $ hailo-ollama
I |2026-01-17 14:27:48 1768634868341366| MyApp:Server running on port 8000
The server runs on port 8000 and exposes a REST API for model inference. We can now open another Terminal window to list the available models:
pi@raspberrypi:~ $ curl --silent http://localhost:8000/hailo/v1/list
{"models":["deepseek_r1_distill_qwen:1.5b","llama3.2:3b","qwen2.5-coder:1.5b","qwen2.5-instruct:1.5b","qwen2:1.5b"]}
That would be five models. Let’s start by downloading the DeepSeek model:
pi@raspberrypi:~ $ curl --silent http://localhost:8000/api/pull \
  -H 'Content-Type: application/json' \
  -d '{ "model": "deepseek_r1_distill_qwen:1.5b", "stream" : true }'
We can send a request to translate some text from English to French:
pi@raspberrypi:~ $ curl --silent http://localhost:8000/api/chat \
  -H 'Content-Type: application/json' \
  -d '{"model": "deepseek_r1_distill_qwen:1.5b", "messages": [{"role": "user", "content": "Translate to French: The cat is on the table."}]}'
After some wait, it will start to output data:
{"model":"deepseek_r1_distill_qwen:1.5b","created_at":"2026-01-17T07:35:19.710187637Z","message":{"role":"assistant","content":"Alright"},"done":false}
{"model":"deepseek_r1_distill_qwen:1.5b","created_at":"2026-01-17T07:35:19.857711571Z","message":{"role":"assistant","content":","},"done":false}
{"model":"deepseek_r1_distill_qwen:1.5b","created_at":"2026-01-17T07:35:20.001829975Z","message":{"role":"assistant","content":" let"},"done":false}
{"model":"deepseek_r1_distill_qwen:1.5b","created_at":"2026-01-17T07:35:20.146264490Z","message":{"role":"assistant","content":" me"},"done":false}
{"model":"deepseek_r1_distill_qwen:1.5b","created_at":"2026-01-17T07:35:20.291402397Z","message":{"role":"assistant","content":" figure"},"done":false}
...
{"model":"deepseek_r1_distill_qwen:1.5b","created_at":"2026-01-17T07:35:45.133824310Z","message":{"role":"assistant","content":"<\/think>"},"done":false}
{"model":"deepseek_r1_distill_qwen:1.5b","created_at":"2026-01-17T07:35:45.277909447Z","message":{"role":"assistant","content":"\n\n"},"done":false}
{"model":"deepseek_r1_distill_qwen:1.5b","created_at":"2026-01-17T07:35:45.422546419Z","message":{"role":"assistant","content":"\""},"done":false}
{"model":"deepseek_r1_distill_qwen:1.5b","created_at":"2026-01-17T07:35:45.566536907Z","message":{"role":"assistant","content":"Le"},"done":false}
{"model":"deepseek_r1_distill_qwen:1.5b","created_at":"2026-01-17T07:35:45.710771008Z","message":{"role":"assistant","content":" chat"},"done":false}
{"model":"deepseek_r1_distill_qwen:1.5b","created_at":"2026-01-17T07:35:45.855492221Z","message":{"role":"assistant","content":" est"},"done":false}
{"model":"deepseek_r1_distill_qwen:1.5b","created_at":"2026-01-17T07:35:45.999638229Z","message":{"role":"assistant","content":" sur"},"done":false}
{"model":"deepseek_r1_distill_qwen:1.5b","created_at":"2026-01-17T07:35:46.143869366Z","message":{"role":"assistant","content":" le"},"done":false}
{"model":"deepseek_r1_distill_qwen:1.5b","created_at":"2026-01-17T07:35:46.288438597Z","message":{"role":"assistant","content":" tableau"},"done":false}
{"model":"deepseek_r1_distill_qwen:1.5b","created_at":"2026-01-17T07:35:46.432704846Z","message":{"role":"assistant","content":".\""},"done":false}
{"model":"deepseek_r1_distill_qwen:1.5b","created_at":"2026-01-17T07:35:46.585714626Z","message":{"role":"assistant","content":""},"done":true,"done_reason":"stop","total_duration":28366846153,"eval_count":186}
It works, although the translation is not quite accurate… It generated 186 tokens in about 28.4 seconds, or roughly 6.6 tokens/s. We can download and play around with the other models on the command line, but I won’t go into detail for each here. These rather small models with 1.5 to 3 billion parameters are fine for testing, but most people will probably build custom models optimized for their application.
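These streaming responses are also easy to consume programmatically. As a minimal sketch, assuming the hailo-ollama server is running on localhost:8000 and the model has already been pulled, the following Python helper (my own, standard library only) streams a chat completion and derives the tokens/s figure from the eval_count and total_duration (nanoseconds) fields of the final message:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/api/chat"  # hailo-ollama chat endpoint

def tokens_per_second(final_msg):
    """Derive tokens/s from the last streamed message.
    total_duration is reported in nanoseconds."""
    return final_msg["eval_count"] / (final_msg["total_duration"] / 1e9)

def chat(prompt, model="deepseek_r1_distill_qwen:1.5b"):
    """Stream a chat completion and return (answer, tokens/s)."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        API_URL, data=payload, headers={"Content-Type": "application/json"})
    pieces, final_msg = [], {}
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # one JSON object per line
            msg = json.loads(line)
            pieces.append(msg["message"]["content"])
            if msg.get("done"):
                final_msg = msg
    return "".join(pieces), tokens_per_second(final_msg)

# Example (requires the server to be running):
# answer, tps = chat("Translate to French: The cat is on the table.")
# print(f"{tps:.1f} tokens/s")
```

Applied to the final message of the transcript above (186 tokens in 28,366,846,153 ns), this works out to about 6.6 tokens/s.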
Running models in a web browser with Open WebUI
Instead, I’m going to show how to run the models in a web browser. As we’ve seen in our review of the UP Squared Pro TWL AI Dev Kit with a Hailo-8 AI accelerator, the Hailo SDK is very finicky about the Python version used. Since the Python version in Raspberry Pi OS Trixie is not compatible, we’ll use Docker instead.
You may first want to remove old Docker packages (skip if you’ve just installed Raspberry Pi OS Trixie):
sudo apt remove $(dpkg --get-selections docker.io docker-compose docker-doc podman-docker containerd runc | cut -f1)
Now add the Docker PGP key:
sudo apt update
sudo apt install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
Create the file /etc/apt/sources.list.d/docker.sources as a superuser with:
Types: deb
URIs: https://download.docker.com/linux/debian
Suites: trixie
Components: stable
Signed-By: /etc/apt/keyrings/docker.asc
Install and run Docker:
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo systemctl start docker
Add the current user to the docker group:
sudo usermod -aG docker $USER
Exit the terminal, and log in again to test Docker:
pi@raspberrypi:~ $ docker run hello-world
Output:
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
198f93fd5094: Pull complete
95ce02e4a4f1: Download complete
Digest: sha256:05813aedc15fb7b4d732e1be879d3252c1c9c25d885824f6295cab4538cb85cd
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (arm64v8)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/
All good. Now we can install and use Open WebUI, while the hailo-ollama server is running in another Terminal window:
docker pull ghcr.io/open-webui/open-webui:main
docker run -d -e OLLAMA_BASE_URL=http://127.0.0.1:8000 -v open-webui:/app/backend/data --name open-webui --network=host --restart always ghcr.io/open-webui/open-webui:main
It will take a little while to start, but then you can access Open WebUI at http://localhost:8080 on the Raspberry Pi, or at http://<Raspberry Pi IP address>:8080 from another device on the local network.
Clicking on Get Started will bring you to an interface to create an admin account. I entered a valid email address, but it’s not even needed, as I never received an email.
After that, you’ll be brought to the Open WebUI dashboard, where you can select one of the models and chat with it.
I used the same request as in the command line: “Translate to French: The cat is on the table.”
There’s no built-in benchmark data, and while I tried to install the Time Token Tracker and Chat Metrics functions to get a tokens/s value, neither worked for me.
Benchmarking LLM performance of the AI HAT+ 2
One way to get performance data is to hover over the Information icon.
Here, we can get “total_tokens” and “total_duration” values to calculate the performance, but it’s not ideal. So instead, I did that in the command line with the jq utility.
Here are the results for the English-to-French translation requests.
DeepSeek R1 1.5B:
pi@raspberrypi:~ $ sudo apt install jq
pi@raspberrypi:~ $ curl --silent http://localhost:8000/api/chat -H 'Content-Type: application/json' -d '{"model": "deepseek_r1_distill_qwen:1.5b", "messages": [{"role": "user", "content": "Translate to French: The cat is on the table."}], "stream": false }' | jq -r '"\(.eval_count) tokens in \(.total_duration/1e9) seconds = \(.eval_count/(.total_duration/1e9)) tokens/s"'
380 tokens in 56.508142192 seconds = 6.724694623809408 tokens/s
Qwen2 1.5B:
pi@raspberrypi:~ $ curl --silent http://localhost:8000/api/chat -H 'Content-Type: application/json' -d '{"model": "qwen2:1.5b", "messages": [{"role": "user", "content": "Translate to French: The cat is on the table."}], "stream": false }' | jq -r '"\(.eval_count) tokens in \(.total_duration/1e9) seconds = \(.eval_count/(.total_duration/1e9)) tokens/s"'
7 tokens in 1.187233065 seconds = 5.89606220241179 tokens/s
Llama3.2 3B:
pi@raspberrypi:~ $ curl --silent http://localhost:8000/api/chat -H 'Content-Type: application/json' -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Translate to French: The cat is on the table."}], "stream": false }' | jq -r '"\(.eval_count) tokens in \(.total_duration/1e9) seconds = \(.eval_count/(.total_duration/1e9)) tokens/s"'
180 tokens in 69.092325239 seconds = 2.6052097592222414 tokens/s
Qwen 2.5 instruct 1.5B:
pi@raspberrypi:~ $ curl --silent http://localhost:8000/api/chat -H 'Content-Type: application/json' -d '{"model": "qwen2.5-instruct:1.5b", "messages": [{"role": "user", "content": "Translate to French: The cat is on the table."}], "stream": false }' | jq -r '"\(.eval_count) tokens in \(.total_duration/1e9) seconds = \(.eval_count/(.total_duration/1e9)) tokens/s"'
102 tokens in 15.127930596 seconds = 6.742495237714138 tokens/s
I also benchmarked Qwen 2.5 Coder 1.5B, asking it to write a function in Python:
pi@raspberrypi:~ $ curl --silent http://localhost:8000/api/chat -H 'Content-Type: application/json' -d '{"model": "qwen2.5-coder:1.5b", "messages": [{"role": "user", "content": "Write an FFT function in Python"}], "stream": false }' | jq -r '"\(.eval_count) tokens in \(.total_duration/1e9) seconds = \(.eval_count/(.total_duration/1e9)) tokens/s"'
666 tokens in 82.620543785 seconds = 8.060949123417828 tokens/s
Some answers take over one minute, and it is possible to get an answer faster by limiting the number of generated tokens with an "options" field in the request body:

"options": {"num_predict": 64}
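To show where that options object slots into a complete request, here is a small Python sketch (the chat_request helper name is mine) that builds a /api/chat request body with a num_predict cap, using the same JSON fields as the curl commands above:

```python
import json

def chat_request(model, prompt, max_tokens=64):
    """Build a /api/chat request body capped at max_tokens generated tokens."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"num_predict": max_tokens},  # limit answer length
    })

body = chat_request("qwen2.5-coder:1.5b", "Write an FFT function in Python")
# Pass `body` as the -d payload of the curl commands above.
```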
Note that all these little tests take some space:
pi@raspberrypi:/usr/local $ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            962M     0  962M   0% /dev
tmpfs           402M   24M  378M   6% /run
/dev/mmcblk0p2   29G   26G  1.4G  96% /
tmpfs          1004M  544K 1003M   1% /dev/shm
tmpfs           5.0M   48K  5.0M   1% /run/lock
tmpfs           1.0M     0  1.0M   0% /run/credentials/systemd-journald.service
tmpfs          1004M   16K 1004M   1% /tmp
/dev/mmcblk0p1  510M   78M  433M  16% /boot/firmware
tmpfs           201M  256K  201M   1% /run/user/1000
tmpfs           1.0M     0  1.0M   0% /run/credentials/getty@tty1.service
tmpfs           1.0M     0  1.0M   0% /run/credentials/serial-getty@ttyAMA10.service
I used a 32GB microSD card, but a 64GB microSD card would give more breathing room. If you need to delete models to save space, you’ll find them in /usr/share/hailo-ollama/models/blob/:
pi@raspberrypi:~ $ ls -lh /usr/share/hailo-ollama/models/blob/
total 11G
-rw-rw-r-- 1 pi pi 3.2G Jan 17 22:04 sha256_1129f5f8384e4e45c5890104dc4ec1aee77e800ce1484ddc3aa942399aada425
-rw-rw-r-- 1 pi pi 2.2G Jan 18 17:35 sha256_5310176848638505fbc28add04ba60c97abe345cdb0ec7e3b8ffaa4b0a8c65dd
-rw-rw-r-- 1 pi pi 1.7G Jan 17 21:24 sha256_88aa7633ebe3385452430ae19f2b459b5a00791cab035576a3262a41ec1350f5
-rw-rw-r-- 1 pi pi 2.3G Jan 17 14:32 sha256_9c4506dda44d0a1730d939d4049a3cbf72d5179a88762ca551363db087adb38f
-rw-rw-r-- 1 pi pi 1.6G Jan 17 21:20 sha256_ab056548c60945cdf4fb30ca43fc7aeed2b9ffc751ad8d4c201dc4c4ab31e86a
At that point, I was convinced I had been using a Raspberry Pi 5 8GB. I thought I would install Ollama to run the models above on the CPU, but with the lack of storage space, and the realization that my Pi only had 2GB of RAM, I had to find another Raspberry Pi 5. It turns out the board I’m using is my only one, as I don’t own the other models anymore. So instead, I ran the same tests on the Raspberry Pi Development Kit for CM5, using the Broadcom BCM2712 CPU cores, 4GB of RAM, and a 32GB eMMC flash.
The first step is to install ollama and check that it’s running fine.
curl -fsSL https://ollama.com/install.sh | sh
pi@raspberrypi:~ $ ollama --version
ollama version is 0.14.2
I can now run ollama with the deepseek-r1:1.5b model and the --verbose flag to enable metrics:
pi@raspberrypi:~ $ ollama run deepseek-r1:1.5b --verbose
>>> Translate to French: The cat is on the table.
Thinking...
Alright, so I've got this translation task here: "The cat is on the table." I need to translate it into French. Hmm, where do I start?

First off, the main subject here is a cat. In French, I think 'cat' translates to 'chien.' That seems straightforward enough. Next, the object is "on the table." In everyday language, that's clear, but in a formal or academic context, maybe I should consider if there's an alternate term? I'm not sure, so probably stick with "par le tablier."

Putting it together, the sentence would be "La chat est sur le tablier." Let me check that. "Chat" is indeed French for cat, and "tablier" means table in French.

Wait, does "tablier" always mean 'on the table'? In some contexts, people use "arracher" instead of "par." But I think for just saying something's on a surface or a place like a table, "par le tablier" is correct. Maybe it varies based on the situation.

I should also consider if there are any other nuances. Sometimes, when talking about animals in a formal setting, they might use different terms, but since this seems pretty straightforward, I think "par le tablier" is spot on.

Is there a chance I'm missing something? Perhaps the grammatical structure or the phrasing. Let me see: "La chat" is definite singular, and "sur" means 'on.' So yeah, that makes sense together.

I could also consider if "tablier" has any regional variations in French. Sometimes, especially in European countries, "table" might be referred to as "le tablais" or something similar. But again, in everyday language, just saying "par le tablier" is sufficient.

Another thought: Is there a more precise word than "par"? Maybe using "à la table." Would that work? "La chat à la table." Yes, but the direct translation would be "la chat par le tablier," which sounds more natural and concise.

I think I've covered all bases here. The key points are correctly translating both the noun and the verb, ensuring clarity in context, and sticking to the most common phrases.
...done thinking.

La chat est sur le tablier.

total duration:       53.656290673s
load duration:        235.07706ms
prompt eval count:    14 token(s)
prompt eval duration: 105.627889ms
prompt eval rate:     132.54 tokens/s
eval count:           476 token(s)
eval duration:        52.666363919s
eval rate:            9.04 tokens/s
An eval rate of 9.04 tokens/s is quite a bit higher than the 6.7 tokens/s reported using the Hailo-10H accelerator. That’s disappointing for the AI HAT+ 2. Nevertheless, I repeated all the tests as follows:
Qwen2:1.5b:
pi@raspberrypi:~ $ ollama run qwen2:1.5b --verbose
>>> Translate to French: The cat is on the table.
Le chat est sur la table.

total duration:       1.89893998s
load duration:        208.040356ms
prompt eval count:    19 token(s)
prompt eval duration: 969.034931ms
prompt eval rate:     19.61 tokens/s
eval count:           8 token(s)
eval duration:        708.57683ms
eval rate:            11.29 tokens/s
llama 3.2:3b:
pi@raspberrypi:~ $ ollama run llama3.2:3b --verbose
success
>>> Translate to French: The cat is on the table.
La chatte est sur la table.

(Note: "chatte" is a more common and informal way of saying "chat" in French, especially when referring to a female cat.)

total duration:       14.694186151s
load duration:        265.720196ms
prompt eval count:    36 token(s)
prompt eval duration: 6.648299231s
prompt eval rate:     5.41 tokens/s
eval count:           37 token(s)
eval duration:        7.734155392s
eval rate:            4.78 tokens/s
qwen2.5:1.5b-instruct:
pi@raspberrypi:~ $ ollama run qwen2.5:1.5b-instruct --verbose
>>> Translate to French: The cat is on the table.
Le chat est sur la table.

total duration:       4.02122163s
load duration:        1.450590735s
prompt eval count:    40 token(s)
prompt eval duration: 1.869869038s
prompt eval rate:     21.39 tokens/s
eval count:           8 token(s)
eval duration:        681.880437ms
eval rate:            11.73 tokens/s
qwen2.5-coder:1.5b:
pi@raspberrypi:~ $ ollama run qwen2.5-coder:1.5b --verbose
>>> Write an FFT function in Python
Certainly! Below is an example of how to implement the Fast Fourier Transform (FFT) in Python using NumPy:
...
You can adjust the input array as needed to test different sequences.

total duration:       44.247390636s
load duration:        230.700317ms
prompt eval count:    35 token(s)
prompt eval duration: 1.635518696s
prompt eval rate:     21.40 tokens/s
eval count:           429 token(s)
eval duration:        41.82226033s
eval rate:            10.26 tokens/s
It should be noted that the results on the Hailo-10H might include the prompt evaluation time (usually quite fast compared to generation) on top of the answer time, so actual tokens/s might be slightly higher than reported. Nevertheless, the table below gives us an idea of the LLM performance of the Raspberry Pi AI HAT+ 2 against the Raspberry Pi 5/CM5 CPU.
| Model | Raspberry Pi 5/CM5 CPU | Raspberry Pi 5 + AI HAT+ 2 |
|---|---|---|
| Deepseek R1 1.5B | 9.04 tokens/s | 6.72 tokens/s |
| Qwen2 1.5B | 11.29 tokens/s | 5.89 tokens/s |
| Llama3.2 3B | 4.78 tokens/s | 2.60 tokens/s |
| Qwen 2.5 instruct 1.5B | 11.73 tokens/s | 6.74 tokens/s |
| Qwen 2.5 Coder 1.5B | 10.26 tokens/s | 8.06 tokens/s |
Just looking at this table, the Raspberry Pi AI HAT+ 2 feels more like an AI decelerator than an AI accelerator! I was expecting the Raspberry Pi AI HAT+ 2 and Hailo-10H AI accelerator to compete against the Rockchip RK1820/RK1828 AI accelerators, but those are in a different league, as shown in the table below (and the price will probably be a few hundred dollars).
However, there are still some benefits to using the Hailo-10H board for LLMs. First, the host does not need to have a lot of RAM, and I could run all models on a Raspberry Pi 5 2GB, since the models are loaded into the 8GB RAM chip on the HAT. Besides the low RAM usage, the CPU stays mostly idle, as shown in the htop screenshot below, taken while running a DeepSeek R1 1.5B request.
It looks like it could even run on one of the new Raspberry Pi 5 1GB RAM boards. Now compare this to running the same task with ollama on the Raspberry Pi developer kit for CM5.
There’s high CPU usage, and the memory used is much higher since the model is loaded in the Pi’s RAM, and these resources could be used for other tasks on the Raspberry Pi 5. In theory, you could also offload your LLM to another Raspberry Pi 5 8GB SBC with better performance and a cheaper price tag than the AI HAT+ 2. However, the solution would be larger/heavier and consume more power, which may be important for battery-powered robots, for instance.
- DeepSeek-R1 1.5B Ollama on Raspberry Pi CM5 devkit: 10.2-10.6 Watts.
- DeepSeek-R1 1.5B Hailo-ollama on Raspberry Pi 5 2GB with AI HAT+ 2: 7.2 to 7.6 Watts.
I also contacted Eben Upton about the results, and he explained that it’s not surprising that the tokens-per-second figure is similar, as this is memory-bandwidth limited, and Raspberry Pi 5 and Hailo-10 have similar memory subsystems (LPDDR4X-4267). He also replied that the key goals of the AI HAT+ 2 product were:
- To deliver a much improved time-to-first-token
- To unload the host Raspberry Pi CPU (and memory) to execute other tasks
I don’t see any obvious data/parameters in the Hailo tools to check the time-to-first-token (TTFT), so I haven’t tested it. Having said that, we should expect some TTFT benchmarks from Raspberry Pi in the near future, as well as larger models for the AI HAT+ 2.
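TTFT can still be approximated from the client side by timing how long a streaming request takes to deliver its first chunk. The sketch below is my own rough workaround rather than an official Hailo metric; it assumes the hailo-ollama server from earlier is running on localhost:8000, and the measurement includes network and HTTP overhead:

```python
import json
import time
import urllib.request

API_URL = "http://localhost:8000/api/chat"  # hailo-ollama endpoint used earlier

def measure_ttft(open_stream):
    """Time from issuing the request to the first streamed line.
    open_stream() must return an iterable of response lines, so
    opening the connection is counted as part of the latency."""
    start = time.monotonic()
    for _first_line in open_stream():
        return time.monotonic() - start
    return float("inf")  # stream ended without any output

def open_chat_stream(prompt, model="deepseek_r1_distill_qwen:1.5b"):
    """Open a streaming /api/chat request against hailo-ollama."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        API_URL, data=payload, headers={"Content-Type": "application/json"})
    return urllib.request.urlopen(req)

# Example (requires the server to be running):
# ttft = measure_ttft(lambda: open_chat_stream("Hello"))
# print(f"TTFT: {ttft:.2f} s")
```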
VLM (Vision Language Model) testing with the Raspberry Pi AI HAT+ 2
We’re not done yet, as we haven’t tested VLMs (Vision Language Models), which might be the optimal workloads for the Hailo-10H since they leverage both its computer vision and large language model capabilities. We can test this using the Hailo apps:
pi@raspberrypi:~ $ git clone https://github.com/hailo-ai/hailo-apps
pi@raspberrypi:~ $ cd hailo-apps
pi@raspberrypi:~/hailo-apps $ sudo ./install.sh
This will take a while, and if it ends successfully, it should show something like:
════════════════════════════════════════════════════════════════
Installation Summary
════════════════════════════════════════════════════════════════
✅ User Detection        User: pi, Group: pi
✅ Prerequisites Check   All required components found
✅ System Packages       Packages installed
✅ Resources Setup       Resources at /usr/local/hailo/resources
✅ Virtual Environment   venv: /home/pi/hailo-apps/venv_hailo_apps
✅ Python Packages       Packages installed
✅ Post-Installation     Post-install done

✅ Installation completed successfully!

Virtual environment: /home/pi/hailo-apps/venv_hailo_apps
To activate: source /home/pi/hailo-apps/setup_env.sh
Log file: /home/pi/hailo-apps/logs/install_20260119_212642.log
I had to run it twice because my microSD card was full, and I deleted the LLMs to free some space in order to complete the installation.
Let’s start the VLM chat demo that relies on a Raspberry Pi camera module to capture an image and the AI accelerator to describe it.
pi@raspberrypi:~/hailo-apps $ source /home/pi/hailo-apps/setup_env.sh
(venv_hailo_apps) pi@raspberrypi:~/hailo-apps $ cd hailo_apps/python/gen_ai_apps/vlm_chat/
(venv_hailo_apps) pi@raspberrypi:~/hailo-apps/hailo_apps/python/gen_ai_apps/vlm_chat $ python vlm_chat.py --input rpi
⚠️ WARNING: Default model 'Qwen2-VL-2B-Instruct' is not downloaded.
Downloading model for vlm_chat/hailo10h... This may take a while depending on your internet connection.
Downloading model: Qwen2-VL-2B-Instruct for hailo10h...
...
[0:16:29.905550842] [2683]  INFO Camera camera_manager.cpp:340 libcamera v0.6.0+rpt20251202
[0:16:29.918273881] [2686]  INFO RPI pisp.cpp:720 libpisp version 1.3.0
[0:16:29.931851942] [2686]  INFO IPAProxy ipa_proxy.cpp:180 Using tuning file /usr/share/libcamera/ipa/rpi/pisp/imx708.json
[0:16:29.941043094] [2686]  INFO Camera camera_manager.cpp:223 Adding camera '/base/axi/pcie@1000120000/rp1/i2c@88000/imx708@1a' for pipeline handler rpi/pisp
[0:16:29.941081631] [2686]  INFO RPI pisp.cpp:1181 Registered camera /base/axi/pcie@1000120000/rp1/i2c@88000/imx708@1a to CFE device /dev/media0 and ISP device /dev/media1 using PiSP variant BCM2712_D0
[0:16:29.944612573] [2683]  INFO Camera camera.cpp:1215 configuring streams: (0) 640x480-RGB888/sRGB (1) 1536x864-BGGR_PISP_COMP1/RAW
[0:16:29.944770055] [2686]  INFO RPI pisp.cpp:1485 Sensor: /base/axi/pcie@1000120000/rp1/i2c@88000/imx708@1a - Selected sensor format: 1536x864-SBGGR10_1X10/RAW - Selected CFE format: 1536x864-PC1B/RAW
================================================================================
🎥 LIVE VIDEO | Press Enter to CAPTURE image ('q' to quit)
================================================================================
After a while, a window with the camera output will show up, and we can press Enter to capture an image, and then we can ask a question about it, or press Enter to use the default prompt (“Describe the image”).
Here’s one of the outputs I got:
================================================================================
📷 IMAGE CAPTURED | Type question (Enter='Describe the image', 'q' to Cancel)
================================================================================
Question:
Using default prompt: 'Describe the image'
================================================================================
⏳ PROCESSING... | Please wait
================================================================================
The image depicts a toy that resembles a small, anthropomorphic creature. The toy has a round head with a small, smiling face, and it is wearing a hat that has a pattern resembling a forest or woodland scene. The toy has a blue and white pattern on its body, and it appears to be made of a soft, plush material. The toy has a small, button-like nose and a small, button-like mouth, giving it a friendly and cheerful appearance. The toy has a small, button-like hand and a small, button-like foot, which are also attached to the toy. The toy has a small, button-like ear on its head, which is a characteristic feature of the toy. The toy is sitting upright, and the background appears to be a light-colored surface, possibly a table or a wall.
================================================================================
✅ RESULT READY | Press Enter to continue
================================================================================
The camera colors are off, and there doesn’t seem to be any parameter to fix that:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 |
```
$ python vlm_chat.py --help
usage: vlm_chat.py [-h] [--log-level {critical,error,warning,info,debug}] [--debug]
                   [--log-file LOG_FILE] [--input INPUT] [--hef-path HEF_PATH]
                   [--list-models] [--batch-size BATCH_SIZE] [--width WIDTH]
                   [--height HEIGHT] [--arch {hailo8,hailo8l,hailo10h}] [--show-fps]
                   [--frame-rate FRAME_RATE] [--labels LABELS] [--track]
                   [--list-inputs] [--list-nets] [--resolution {sd,hd,fhd}]
                   [--output-dir OUTPUT_DIR] [--save-output]

Hailo Standalone Processing Application

options:
  -h, --help            show this help message and exit
  --input, -i INPUT     Input source for processing. Can be a file path (image or
                        video), camera index (integer), folder path containing
                        images, or RTSP URL. For USB cameras, use 'usb' to
                        auto-detect or '/dev/video<X>' for a specific device. For
                        Raspberry Pi camera, use 'rpi'. If not specified, defaults
                        to application-specific source.
  --hef-path, -n HEF_PATH
                        Path or name of Hailo Executable Format (HEF) model file.
                        Can be: (1) full path to .hef file, (2) model name (will
                        search in resources), or (3) model name from available
                        models (will auto-download if not found). If not
                        specified, uses the default model for this application.
  --list-models         List all available models for this application and exit.
                        Shows default and extra models that can be used with
                        --hef-path.
  --batch-size, -b BATCH_SIZE
                        Number of frames or images to process in parallel during
                        inference. Higher batch sizes can improve throughput but
                        require more memory. Default is 1 (sequential processing).
  --width, -W WIDTH     Custom output width in pixels for video or image output.
                        If specified, the output will be resized to this width
                        while maintaining aspect ratio. If not specified, uses the
                        input resolution or model default.
  --height, -H HEIGHT   Custom output height in pixels for video or image output.
                        If specified, the output will be resized to this height
                        while maintaining aspect ratio. If not specified, uses the
                        input resolution or model default.
  --arch, -a {hailo8,hailo8l,hailo10h}
                        Target Hailo architecture for model execution. Options:
                        'hailo8' (Hailo-8 processor), 'hailo8l' (Hailo-8L
                        processor), 'hailo10h' (Hailo-10H processor). If not
                        specified, the architecture will be auto-detected from the
                        connected device.
  --show-fps            Enable FPS (frames per second) counter display. When
                        enabled, the application will display real-time
                        performance metrics showing the current processing rate.
                        Useful for performance monitoring and optimization.
  --frame-rate, -f FRAME_RATE
                        Target frame rate for video processing in frames per
                        second. Controls the playback speed and processing rate
                        for video sources. Default is 30 FPS. Lower values reduce
                        processing load, higher values increase throughput.
  --labels, -l LABELS   Path to a text file containing class labels, one per
                        line. Used for mapping model output indices to
                        human-readable class names. If not specified, default
                        labels for the model will be used (e.g., COCO labels for
                        detection models).
  --track               Enable object tracking for detections. When enabled,
                        detected objects will be tracked across frames using a
                        tracking algorithm (e.g., ByteTrack). This assigns
                        consistent IDs to objects over time, enabling temporal
                        analysis, trajectory visualization, and multi-frame
                        association. Useful for video processing applications.
  --list-inputs         List available demo inputs for this application and exit.
                        This uses the shared resources catalog (images/videos)
                        defined in resources_config.yaml.
  --list-nets           List available models for this application and exit.
                        Alias for --list-models to align with legacy app flags.
  --resolution, -r {sd,hd,fhd}
                        Predefined resolution for camera input sources. Options:
                        'sd' (640x480, Standard Definition), 'hd' (1280x720, High
                        Definition), 'fhd' (1920x1080, Full High Definition).
                        Default is 'sd'. This flag is only applicable when using
                        camera input sources.
  --output-dir, -o OUTPUT_DIR
                        Directory where output files will be saved. When
                        --save-output is enabled, processed images, videos, or
                        result files will be written to this directory. If not
                        specified, outputs are saved to a default location or the
                        current working directory. The directory will be created
                        if it does not exist.
  --save-output, -s     Enable output file saving. When enabled, processed images
                        or videos will be saved to disk. The output location is
                        determined by the --output-dir flag. Without this flag,
                        output is only displayed (if applicable).
```
I did try the --show-fps option, but no FPS counter ever showed up.
Conclusion
The Raspberry Pi AI HAT+ 2 did not quite meet my expectations as an LLM accelerator, as I had wrongly assumed it would improve token generation speed (tokens/s). Instead, it delivers computer vision performance similar to that of the Hailo-8-based Raspberry Pi AI HAT+ launched a couple of years ago, and adds support for LLMs and VLMs.
The performance of Large Language Models is actually somewhat lower than running them on the Broadcom BCM2712 CPU found in the Raspberry Pi 5. The main benefits of the AI HAT+ 2 here are that it offloads processing from the SBC itself, using very little of the board's RAM and CPU, and that it consumes less power, which may be important for battery-powered applications. I used it with a Raspberry Pi 5 2GB, but it should also work with a Raspberry Pi 5 1GB, thanks to the 8GB of RAM built into the HAT itself. The Time to First Token (TTFT) should also be much faster, but I didn't find a way to test this with the provided tools. The best use case for the AI HAT+ 2 is probably Vision Language Models, since they can leverage both the computer vision and language model capabilities of the Hailo-10H AI accelerator.
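Since the provided tools don't report TTFT directly, one way to approximate it is to wrap a timer around any streaming token generator. The sketch below is generic and makes no assumptions about the Hailo APIs: `measure_ttft()` works on any iterator that yields tokens as they are produced, and `fake_stream()` is a stand-in used only to show the measurement shape.

```python
import time

def measure_ttft(stream):
    """Return (time-to-first-token, tokens/s after the first token) for any
    iterator that yields tokens as they are generated. This is a generic
    measurement helper, not part of the Hailo tooling."""
    start = time.perf_counter()
    first = None
    count = 0
    for _tok in stream:
        now = time.perf_counter()
        if first is None:
            first = now - start  # TTFT: prompt processing + first token
        count += 1
    total = time.perf_counter() - start
    # Generation rate excludes the first token so TTFT doesn't skew it
    rate = (count - 1) / (total - first) if count > 1 else 0.0
    return first, rate

def fake_stream():
    """Dummy generator standing in for a real streaming LLM call."""
    time.sleep(0.05)           # simulated prompt processing delay
    for word in "the quick brown fox".split():
        time.sleep(0.01)       # simulated per-token latency
        yield word

ttft, tps = measure_ttft(fake_stream())
print(f"TTFT: {ttft*1000:.1f} ms, {tps:.1f} tokens/s")
```

Swapping `fake_stream()` for a real streaming call (if the Hailo examples expose one) would give comparable TTFT numbers between the CPU and the AI HAT+ 2.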
The Raspberry Pi AI HAT+ 2 sells for $130, compared to $110 for the first-generation Raspberry Pi AI HAT+. Since both offer about the same AI vision performance, and the main benefits of the AI HAT+ 2 over running LLMs on the Pi 5 directly are offloading (since Hailo-10H and CPU performance are about the same) and lower power consumption, I think the new Hailo-10H HAT+ is probably best suited to security camera systems like Frigate, or to robots equipped with cameras that need to understand their environment through VLMs.

Jean-Luc started CNX Software in 2010 as a part-time endeavor, before quitting his job as a software engineering manager and starting to write daily news and reviews full-time later in 2011.