Minimal
RPi Zero 2W · RPi 3 / 3B+ · PINE64+ 1–2 GB · Star64 (RISC-V) · <2 GB boards
Proof-of-concept, not practical. The smallest models (Qwen3 0.6B, TinyLlama 1.1B) technically run at ~0.5–1 tok/s. Educational value: high. Usability for real tasks: very low. If you're here for actual AI work, consider an RPi 5 or RK3588 board.
Qwen3 0.6B · 0.4 GB
TinyLlama 1.1B · 0.7 GB
Llama 3.2 1B ⚠ very slow
CPU Entry
RPi 4 4 GB · RPi 5 2–4 GB · PINE64 4 GB · Pinebook Pro · RK3399 boards 4 GB
Llama 3.2 1B is your daily driver — 8–15 tok/s on a well-cooled board. Fast enough for patient interactive use. 3B models technically run but expect 1–3 tok/s. Active cooling and SSD storage matter enormously here.
Llama 3.2 1B · 0.7 GB ★
SmolLM2 1.7B · ~1 GB ★
Llama 3.2 3B ⚠ slow
CPU Capable
RPi 5 8 GB+ · RPi 4 8 GB · PINE64 8 GB · RK3399 boards 8 GB
RPi 5 changes the picture. Cortex-A76 cores run 3B models at 10–15 tok/s — genuinely interactive. With 8 GB, 7B models load and run at ~3–5 tok/s (usable for low-urgency tasks). NVMe via HAT transforms the experience. RPi 4 8 GB is capable but ~2–3× slower than RPi 5 at the same RAM.
Llama 3.2 3B · 2 GB ★
Phi-4 mini · 2.2 GB ★
Qwen3 4B · 2.5 GB
Llama 3.1 8B ⚠ slow on 8 GB
NPU Accelerated
Orange Pi 5 / 5 Plus · Radxa ROCK 5B · Khadas Edge2 (RK3588, 4–32 GB)
The RK3588's 6 TOPS NPU changes the SBC AI equation. RKLLM + NPU hybrid inference runs 3B models at 20–30 tok/s and 7B at 8–14 tok/s — competitive with a MacBook Air 8 GB at the same model size. Models must be converted to RKNN format to use the NPU; llama.cpp falls back to the fast A76 CPU cores.
Qwen3 4B via RKLLM ★ NPU
Llama 3.2 3B via RKLLM ★ NPU
Llama 3.1 8B · CPU only
13B+ ⚠ CPU only, slow
Jetson Entry
Jetson Orin Nano 4 GB / 8 GB (1024 CUDA cores, 40 TOPS)
A real CUDA GPU in a small board. 7B models run at 20–30 tok/s with GPU offloading — faster than most CPU-only laptop configurations. On 8 GB, 13B models are possible at reduced speed. TensorRT and llama.cpp CUDA are both supported. This is where SBC AI becomes genuinely fast.
Llama 3.1 8B · CUDA ★
Phi-4 mini · CUDA ★
Qwen3 4B · CUDA
13B ⚠ tight on 8 GB
Jetson Pro
Jetson Orin NX 8/16 GB · Jetson AGX Orin 32/64 GB (2048 CUDA cores, 275 TOPS)
Production-grade edge AI. The AGX Orin is what robotics platforms and inference servers run. 13B models at 30–50 tok/s; 70B Q4 is possible on 64 GB at 5–10 tok/s. TensorRT-LLM gives another 2–3× over standard llama.cpp. Capable of serving inference to other devices on the network.
Llama 3.1 8B fast ★ CUDA
Qwen 2.5 14B · CUDA
Llama 3.1 70B Q4 · 64 GB only
Network inference server
x86 Linux
Intel NUC · Beelink · Minisforum · Generic x86 Desktop/Laptop (8–64 GB RAM)
x86 CPUs run llama.cpp with AVX2/AVX-512 acceleration — noticeably faster than ARM at the same clock speed. 7B at 15–25 tok/s on a Ryzen 7 or Core i7 without any GPU. Integrated Intel/AMD graphics can offload some layers via Vulkan — results vary widely. Discrete NVIDIA GPU changes everything (but that's the full desktop, not mini PC territory).
Llama 3.1 8B · 15–25 tok/s ★
Phi-4 mini · fast
Qwen 2.5 14B · 32 GB+
iGPU Vulkan offload ⚠ varies