Edge AI Prototyping Kit: From Raspberry Pi 5 to a Deployed Micro-Service


2026-02-06 12:00:00
11 min read

End-to-end guide to build, containerize, secure, and monitor an Edge AI micro-service on Raspberry Pi 5 with AI HAT+ 2.

Stop wasting days on edge prototypes: build, secure, and monitor an AI micro-service on Pi 5 in hours

If your team is still spending too much time assembling ad-hoc scripts, juggling incompatible toolchains, and struggling to push an AI proof-of-concept from bench to field, this guide is for you. In 2026 the expectation is clear: prototypes must be production-minded. This end-to-end walkthrough shows how to go from a Raspberry Pi 5 + AI HAT+ 2 to a deployed, containerized micro-service with a secure API, remote monitoring, and production-grade best practices.

Why this matters in 2026

  • Edge-first architectures are mainstream — teams push models to edge devices to lower latency and data egress costs.
  • Hardware accelerators such as the AI HAT+ 2 make small-form factor generative and inferencing workloads feasible on Pi-class devices.
  • Operational expectations now include containerization, observability, TLS by default, and secure update mechanisms, even for micro prototypes.

“Micro-services at the edge are no longer experiments — they must be secure, observable, and automatable from day one.”

What you’ll build

By the end of this tutorial you will have:

  1. Prepared a Raspberry Pi 5 with AI HAT+ 2.
  2. Installed vendor runtime and a quantized ONNX model suitable for edge inferencing.
  3. Developed a tiny FastAPI micro-service that exposes a secure inference API.
  4. Containerized the service for ARM64 using Docker Buildx (or Podman), deployed it as a controlled systemd service, and proxied it with a TLS reverse proxy.
  5. Added Prometheus metrics and remote monitoring to a cloud backend for performance alerts.

What you'll need

  • Raspberry Pi 5 (64-bit OS recommended, up to date with the 2026-01 releases).
  • AI HAT+ 2 (vendor SDK + runtime). The HAT provides a hardware accelerator and a vendor-supplied runtime/driver — install those per the vendor guide.
  • 16–32 GB microSD or NVMe boot for reliable IO.
  • Network access (Ethernet recommended for deployments), power supply, and a development machine with Docker Buildx for cross-builds.
  • Accounts for remote monitoring (Grafana Cloud/Prometheus remote write) — optional but recommended.

Step 1 — Prepare the OS and vendor runtime

1.1 Flash a 64-bit Raspberry Pi OS or Ubuntu Server (2026 updated image)

Use a 64-bit image. For reliability in 2026, many teams prefer Ubuntu Server LTS or the Raspberry Pi OS 64-bit updated images. After flashing, perform an update:

sudo apt update && sudo apt upgrade -y
sudo reboot

1.2 Install AI HAT+ 2 drivers and runtime

Follow the vendor's SDK install instructions. Typically this includes:

  • Installing a kernel module or firmware for the HAT.
  • Installing a runtime (e.g., an ONNX/TF runtime with an NPU execution provider).

Example (replace with vendor commands):

# vendor install script (example placeholder)
curl -sSL https://vendor.example/install-ai-hat2.sh | sudo bash
# verify: vendor runtime exposes an execution provider
/opt/vendor/bin/runtime --list-providers

1.3 System hardening

  • Disable password SSH logins; use key auth only.
  • Enable automatic security updates for packages.
  • Run a minimal firewall: allow SSH and HTTPS; the app port stays bound to localhost behind the reverse proxy.

sudo sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config
sudo sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
sudo systemctl restart ssh
sudo apt install -y unattended-upgrades ufw
sudo ufw allow OpenSSH
sudo ufw allow 443/tcp
sudo ufw enable

Step 2 — Choose and prepare a tiny model (edge best practices)

In 2026 the dominant patterns for edge models are quantization and runtime-specific acceleration. Use a small vision model (MobileNet, EfficientNet-lite) or a distilled language model if you need text. For this tutorial we'll use a quantized MobileNet ONNX model.

2.1 Quantize for performance

Quantize to INT8/FP16 depending on vendor support. Quantization dramatically reduces memory and inference latency on NPUs.

# example: dynamic INT8 quantization with onnxruntime (run on your dev machine)
python -c "from onnxruntime.quantization import quantize_dynamic, QuantType; quantize_dynamic('model.onnx', 'model_quant.onnx', weight_type=QuantType.QInt8)"
# push model_quant.onnx to the Pi (scp or artifact storage)
scp model_quant.onnx pi@raspberrypi:/home/pi/models/

2.2 Verify model runs with the vendor provider

python -c "import onnxruntime as ort; sess = ort.InferenceSession('model_quant.onnx', providers=['VendorNPU','CPUExecutionProvider']); print(sess.get_providers())"
# run a sample inference to confirm latency and correctness
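
To confirm latency and correctness before wiring the model into the service, a short standalone check like the one below can be run on the Pi. The provider name is a placeholder for whatever the vendor runtime registers, and the 1x3x224x224 input shape assumes the MobileNet-class model from above.

# quick_latency_check.py (run on the Pi; provider name is a vendor placeholder)
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession('model_quant.onnx',
                            providers=['VendorNPUExecutionProvider', 'CPUExecutionProvider'])
inp = {sess.get_inputs()[0].name: np.random.rand(1, 3, 224, 224).astype('float32')}

sess.run(None, inp)  # warm-up pass so one-time compilation/transfer costs are not counted
t0 = time.perf_counter()
for _ in range(50):
    sess.run(None, inp)
print(f"mean latency: {(time.perf_counter() - t0) / 50 * 1000:.1f} ms")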

Step 3 — Build the micro-service (FastAPI example)

A small, production-minded API includes:

  • /health and /metrics endpoints
  • /predict endpoint accepting multipart image uploads or base64 payloads
  • JWT or mTLS for authentication

3.1 App outline (key files)

  • app/main.py (FastAPI app exposing /health, /metrics, and /predict)
  • app/model.py (model loader and inference wrapper)
  • requirements.txt (Python dependencies)
  • Dockerfile (ARM64 container image)
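
The Dockerfile in Step 4 installs dependencies from requirements.txt. A minimal one for this stack looks like the following; versions are intentionally left unpinned here, so pin whatever you validate against the vendor runtime (python-multipart is required by FastAPI for multipart uploads):

# requirements.txt
fastapi
uvicorn[standard]
onnxruntime
pillow
numpy
prometheus-client
python-multipart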

3.2 Example model wrapper (app/model.py)

from PIL import Image
import numpy as np
import onnxruntime as ort

class EdgeModel:
    def __init__(self, model_path):
        providers = ['VendorNPUExecutionProvider', 'CPUExecutionProvider']
        self.sess = ort.InferenceSession(model_path, providers=providers)
        self.input_name = self.sess.get_inputs()[0].name

    def preprocess(self, pil_image: Image.Image):
        img = pil_image.resize((224,224)).convert('RGB')
        arr = np.array(img).astype('float32') / 255.0
        arr = np.transpose(arr, (2,0,1))
        arr = np.expand_dims(arr, 0)
        return arr

    def predict(self, pil_image: Image.Image):
        inp = self.preprocess(pil_image)
        out = self.sess.run(None, {self.input_name: inp})
        return out[0].tolist()

3.3 Example FastAPI app (app/main.py)

from fastapi import FastAPI, File, UploadFile, HTTPException
from fastapi.responses import Response
from prometheus_client import Summary, Counter, generate_latest, CONTENT_TYPE_LATEST
from PIL import Image
import io
from .model import EdgeModel

app = FastAPI()
model = EdgeModel('/models/model_quant.onnx')

REQUEST_TIME = Summary('request_processing_seconds', 'Time spent processing request')
REQUESTS = Counter('requests_total', 'Total requests')

@app.get('/health')
def health():
    return {'status': 'ok'}

@app.get('/metrics')
def metrics():
    # Prometheus text exposition format
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

@app.post('/predict')
async def predict(file: UploadFile = File(...)):
    REQUESTS.inc()
    with REQUEST_TIME.time():
        if file.content_type.split('/')[0] != 'image':
            raise HTTPException(status_code=400, detail='Invalid file type')
        data = await file.read()
        img = Image.open(io.BytesIO(data))
        preds = model.predict(img)
    return {'predictions': preds}
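
The outline above calls for JWT or mTLS. A minimal bearer-token dependency using PyJWT is sketched below; the app/auth.py file name, the JWT_SECRET environment variable, and HS256 are illustrative assumptions, so adapt them to however your fleet issues tokens.

# app/auth.py (sketch: verify a bearer token on protected routes)
import os
import jwt  # PyJWT
from fastapi import Header, HTTPException

SECRET = os.environ['JWT_SECRET']  # assumed to be injected into the container environment

def require_token(authorization: str = Header(...)):
    scheme, _, token = authorization.partition(' ')
    if scheme.lower() != 'bearer':
        raise HTTPException(status_code=401, detail='Expected bearer token')
    try:
        return jwt.decode(token, SECRET, algorithms=['HS256'])
    except jwt.InvalidTokenError:
        raise HTTPException(status_code=401, detail='Invalid or expired token')

# in app/main.py: @app.post('/predict', dependencies=[Depends(require_token)])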

Step 4 — Containerize for ARM64 and reproducible builds

Use Docker Buildx to produce an ARM64 image from your development machine. Keep images minimal and avoid running as root.

4.1 Dockerfile (production-minded)

FROM --platform=linux/arm64 python:3.11-slim

# Create non-root user
RUN useradd --create-home appuser
WORKDIR /home/appuser

# Install runtime deps (adjust for vendor runtime requirements)
RUN apt-get update && apt-get install -y libsndfile1 libjpeg62-turbo --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

COPY app ./app
COPY models /models

USER appuser
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]

4.2 Build and push for ARM

# enable buildx and create a builder
docker buildx create --use --name edge-builder
docker buildx build --platform linux/arm64 -t myregistry/edge-ai-pi5:latest --push .

Step 5 — Deploy on Pi 5 (container runtime) and run under supervision

On the Pi, use Docker or Podman. For security and rootless operation, Podman is a strong choice. Below shows a systemd unit that runs a container and ensures restart on failure.

5.1 Pull & run container (Podman example)

podman pull myregistry/edge-ai-pi5:latest
podman run -d --name edge-ai \
  --device /dev/vendor-npu0:/dev/vendor-npu0 \
  --restart=always \
  -p 127.0.0.1:8000:8000 \
  myregistry/edge-ai-pi5:latest

5.2 systemd unit (for Podman)

[Unit]
Description=Edge AI Service (Podman)
After=network.target

[Service]
Restart=always
ExecStart=/usr/bin/podman run --rm --name edge-ai --device /dev/vendor-npu0:/dev/vendor-npu0 -p 127.0.0.1:8000:8000 myregistry/edge-ai-pi5:latest
ExecStop=/usr/bin/podman stop -t 10 edge-ai

[Install]
WantedBy=multi-user.target
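
Assuming the unit is saved as /etc/systemd/system/edge-ai.service (the path is illustrative), reload systemd, enable it, and confirm the API answers locally:

sudo systemctl daemon-reload
sudo systemctl enable --now edge-ai.service
curl -s http://127.0.0.1:8000/health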

Step 6 — Secure the API (TLS, auth, least privilege)

Best practice in 2026: terminate TLS at a strong reverse proxy (Caddy/Traefik) that supports automatic TLS and mTLS. Avoid exposing the app port directly to the network.

6.1 Caddyfile example (automatic TLS & optional mTLS)

edge.example.com {
    reverse_proxy 127.0.0.1:8000
    tls /etc/certs/edge.crt /etc/certs/edge.key {
      client_auth {
        mode require_and_verify
        trusted_ca_cert_file /etc/certs/ca.pem
      }
    }
}
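
To verify the mTLS setup end to end, a client presenting a certificate signed by the trusted CA should succeed while an anonymous client is rejected during the handshake (file names and the test image are placeholders):

# should succeed: client certificate signed by the CA Caddy trusts
curl --cert client.crt --key client.key --cacert ca.pem \
  -F "file=@sample.jpg" https://edge.example.com/predict
# should fail during the TLS handshake: no client certificate presented
curl --cacert ca.pem https://edge.example.com/predict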

Alternatively, use JWTs verified at the proxy or within the app. In either case, enforce mutual authentication or short-lived tokens for device-level access.

Step 7 — Observability: metrics, logs, and remote monitoring

Visibility into edge devices is non-negotiable. Ship lightweight metrics and logs off-device for central dashboards and alerting.

7.1 App metrics (Prometheus)

Expose application metrics via prometheus_client (the /metrics route added in Step 3). On the Pi, also run node_exporter for system metrics and a vendor exporter for NPU utilization if available.

# run node_exporter for host-level metrics (binary from the prometheus.io downloads page)
./node_exporter --web.listen-address=127.0.0.1:9100 &

7.2 Remote write to Grafana / hosted backend

Edge devices often cannot be scraped from the cloud due to NAT. Use Prometheus remote_write or a push gateway pattern to forward metrics to Grafana Cloud, Cortex, or VictoriaMetrics. In 2026, hosted telemetry with remote-write plus compression and sampling is the recommended pattern for fleets of Pi devices. See how on-device AI is reshaping data visualization for patterns that pair well with remote_write.
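
A minimal Prometheus configuration on the Pi, run in agent mode so nothing has to be scraped from outside the network, might look like this; the remote-write URL and credentials are placeholders for your hosted backend:

# prometheus.yml (run with: prometheus --enable-feature=agent --config.file=prometheus.yml)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: edge-ai
    static_configs:
      - targets: ['127.0.0.1:8000']   # FastAPI /metrics route
  - job_name: node
    static_configs:
      - targets: ['127.0.0.1:9100']   # node_exporter

remote_write:
  - url: https://prometheus.example.com/api/v1/write
    basic_auth:
      username: "<remote-write-user>"
      password: "<api-token>"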

7.3 Logs and traces

  • Use Vector or Fluent Bit to forward logs to a central sink (Elasticsearch, Loki, Datadog); a minimal Vector example follows this list.
  • Instrument critical paths with OpenTelemetry (traces) for request latency profiling — tie trace data back to model decisions where possible and consider explainability tools for complex models.
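
A minimal Vector configuration that ships the service's journald logs to a Loki-compatible sink could look like the following; the endpoint and device label are placeholders:

# /etc/vector/vector.toml (sketch)
[sources.edge_journal]
type = "journald"
include_units = ["edge-ai.service"]

[sinks.loki]
type = "loki"
inputs = ["edge_journal"]
endpoint = "https://loki.example.com"
encoding.codec = "json"
labels.device = "pi5-edge-01"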

Step 8 — Performance tuning & hardware profiling

Measure these core metrics:

  • Inference latency (p50/p90/p99)
  • CPU and memory utilization
  • NPU utilization and temperature
  • Network egress and request rate

Use the vendor's profiling tools to identify NPU bottlenecks. In many cases, batching lightly (batch size 2–4) or running asynchronous workers will improve throughput without increasing tail latency.
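
Thermal throttling is a frequent cause of those tail-latency spikes on compact devices. On Raspberry Pi OS the firmware tools make it easy to watch for it while a load test runs (vcgencmd may need to be installed separately on Ubuntu):

# watch SoC temperature and throttle flags during a load test
watch -n 2 'vcgencmd measure_temp; vcgencmd get_throttled'
# get_throttled=0x0 means no throttling; non-zero flags indicate under-voltage or thermal limits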

Step 9 — Upgrade strategy and CI/CD for edge containers

For reproducible, auditable updates:

  1. Build images via CI (GitHub Actions/GitLab) with buildx and sign images (cosign).
  2. Push to a restricted container registry, then pull by the Pi using a short-lived token.
  3. Use an update agent (e.g., balena, Mender, or a simple systemd timer that does a safe pull+restart) with healthchecks to rollback on bad releases.
# simplified update script
podman pull myregistry/edge-ai-pi5:latest && podman stop edge-ai && podman rm edge-ai && podman run -d --name edge-ai ... myregistry/edge-ai-pi5:latest
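
A slightly safer variant gates the restart on the /health endpoint and re-tags the previous image on failure. This is a sketch that assumes the systemd unit from Step 5.2 and the localhost port mapping used earlier:

#!/usr/bin/env bash
# pull the new image, restart via systemd, roll back if the health check fails
set -euo pipefail
IMAGE=myregistry/edge-ai-pi5:latest

old_id=$(podman image inspect --format '{{.Id}}' "$IMAGE" 2>/dev/null || true)
podman pull "$IMAGE"
systemctl restart edge-ai.service
sleep 5

if ! curl -fsS http://127.0.0.1:8000/health >/dev/null; then
    echo "health check failed, rolling back" >&2
    if [ -n "$old_id" ]; then
        podman tag "$old_id" "$IMAGE"
        systemctl restart edge-ai.service
    fi
    exit 1
fi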

Security checklist (must-dos for production prototypes)

  • Use non-root containers and limited Linux capabilities.
  • Sign and verify container images (cosign) before running.
  • Terminate TLS at a trusted proxy; enforce mTLS or short-lived JWTs.
  • Encrypt sensitive model artifacts at rest and restrict access to /models.
  • Keep vendor runtimes and kernel drivers up to date — the HAT vendor will release critical fixes through 2026.
  • Scan images for vulnerabilities during CI.

Emerging patterns worth watching

  • WASM for lightweight inference: Wasm runtimes (WasmEdge/Wasmtime) are maturing as an alternative to containers for constrained devices — consider Wasm modules for tiny models where portability matters. See WASM and edge patterns.
  • Model shards & split execution: Offload parts of a model to cloud for heavy tasks and run lightweight heads on device.
  • Federated telemetry: Aggregate per-device metrics at the edge and forward summaries to cloud to reduce egress costs.
  • Secure attestation: Hardware-backed attestation and signed boot chains are becoming expected for fleets in regulated industries — pair this with edge observability and attestation flows.

Troubleshooting & common gotchas

Issue: Model runs locally but fails in container on the Pi

Causes: missing vendor runtime/device mapping or wrong user permissions. Ensure you map the device into the container and install the vendor runtime inside the container or rely on host drivers with correct permissions.

Issue: Slow tail latency under load

Profile CPU and NPU. Lower worker count, enable asynchronous queues, or add an input buffer and rate-limiter. Sometimes the kernel I/O or thermal throttling causes spikes — monitor temperature and limit concurrency.

Issue: Cannot remotely scrape Prometheus

Use Prometheus remote_write or a push gateway. For NAT-limited devices use a secure reverse tunnel or an edge proxy that batches metrics. See on-device visualization patterns for telemetry-friendly architectures.
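
If a central Prometheus must pull from the device instead, a persistent SSH reverse tunnel is a simple stop-gap; the gateway host, user, and ports below are placeholders, and autossh must be installed on the Pi:

# expose the Pi's node_exporter on the gateway so a central Prometheus can scrape it
autossh -M 0 -N \
  -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" \
  -R 19100:127.0.0.1:9100 tunnel@metrics-gateway.example.com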

Example: Minimal CI snippet (GitHub Actions) for ARM build + cosign sign

name: Build and Sign
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Set up QEMU
      uses: docker/setup-qemu-action@v2
    - name: Set up Docker Buildx
      uses: docker/setup-buildx-action@v3
    - name: Login to registry
      uses: docker/login-action@v3
      with:
        registry: myregistry
        username: ${{ secrets.REG_USER }}
        password: ${{ secrets.REG_PASS }}
    - name: Build and push
      run: |
        docker buildx build --platform linux/arm64 -t myregistry/edge-ai-pi5:${{ github.sha }} --push .
    - name: Install cosign
      uses: sigstore/cosign-installer@v3
    - name: Sign image
      env:
        COSIGN_PRIVATE_KEY: ${{ secrets.COSIGN_KEY }}
      run: cosign sign --yes --key env://COSIGN_PRIVATE_KEY myregistry/edge-ai-pi5:${{ github.sha }}

Actionable takeaways (quick checklist)

  • Quantize models and verify the vendor NPU provider on Pi 5 before containerizing. See on-device AI patterns for recommended validation steps.
  • Run the micro-service as a non-root container and terminate TLS at a reverse proxy; pair with edge proxy best practices.
  • Expose /metrics and forward to a remote backend using Prometheus remote_write.
  • Automate builds with buildx and sign images; deploy with a managed update agent or systemd with healthchecks.
  • Instrument for temperature and NPU utilization — thermal issues are a frequent production pain point on compact devices.

Conclusion & next steps

Edge AI in 2026 expects more than a script that works on your desk: it requires repeatable builds, secure deployment, and centralized observability. The Raspberry Pi 5 paired with AI HAT+ 2 is now capable of running meaningful inferencing workloads — but success depends on operational rigor. Use the patterns in this guide to build a micro-service that’s prototype-fast and production-ready.

Call to action

Ready to try this on your Pi 5? Clone the companion repo (sample FastAPI code, Dockerfile, systemd unit, and Prometheus configs) and follow the README to complete a hands-on run. If you want a tailored checklist or an enterprise rollout plan for fleets of Pi 5 devices with AI HAT+ 2, contact our team for an audit and deployment template.


Related Topics

#edge-ai #deployment #tutorial
