// Linux Board Check

Can your Linux board run AI?

Raspberry Pi, Orange Pi, Jetson, Pine64 — the answer depends on more than RAM. The compute unit (CPU / NPU / GPU) changes everything. Pick your board for an honest breakdown.

> On Linux SBCs, board architecture beats RAM — an Orange Pi 5 beats an RPi 5 at the same memory because of its NPU.

Not sure which board? Check the label on the board or the box, or run cat /proc/cpuinfo | head -5 in a terminal. On many ARM boards, cat /proc/device-tree/model prints the board name directly.

Not sure how much RAM? Run free -h in a terminal and look at the "total" column on the Mem row, or check your purchase receipt or the board spec sheet.
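
Both checks can be wrapped in two small functions. This is a minimal sketch, assuming the kernel exposes /proc/device-tree/model (common on ARM SBCs, usually absent on x86) and that /proc/meminfo has the usual MemTotal line. The names board_model and mem_total_gb are illustrative, not part of any tool.

```shell
# board_model [FILE]: print the board name from the device tree, if exposed.
# The file is NUL-terminated, so NUL bytes are stripped before printing.
board_model() {
  f="${1:-/proc/device-tree/model}"
  if [ -r "$f" ]; then
    tr -d '\0' < "$f"
    echo
  else
    echo "unknown (no device-tree model; check the board label)"
  fi
}

# mem_total_gb [FILE]: total RAM in GB, from the MemTotal line (reported in kB).
mem_total_gb() {
  awk '/^MemTotal:/ { printf "%.1f\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# Usage:
#   board_model
#   mem_total_gb
```

The reported total is normally a little below the nominal figure (an 8 GB board shows less than 8.0) because some memory is reserved before Linux sees it.
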

Unlike Mac, Linux SBC tiers are driven by compute unit first, RAM second. Two boards with 8 GB RAM can be in completely different tiers if one has a Neural Processing Unit (NPU) or CUDA GPU and the other doesn't.

What nobody tells you about Linux SBC AI

The board spec is just the beginning. These six factors will determine whether your experience is satisfying or frustrating.

// Storage type
SD card vs SSD — 10–15× difference
Loading a 4 GB model from an SD card can take 3–5 minutes. From NVMe or USB3 SSD: 15–30 seconds. Once loaded, the model runs from RAM either way; on an SD card the problem is that it barely loads at all.
→ Use SSD if at all possible
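
To see which kind of storage a model file actually sits on, the device name is usually enough. This is a rough sketch, assuming the standard Linux naming where SD and eMMC devices appear as mmcblk*, NVMe drives as nvme*, and USB or SATA disks as sd*. storage_kind and model_storage are hypothetical helpers, the model path is a placeholder, and findmnt comes from util-linux.

```shell
# storage_kind DEVICE: classify a block device path by its kernel name.
storage_kind() {
  case "${1##*/}" in
    mmcblk*) echo "SD card or eMMC" ;;
    nvme*)   echo "NVMe SSD" ;;
    sd*)     echo "USB or SATA disk" ;;
    *)       echo "unknown" ;;
  esac
}

# model_storage PATH: find the device backing PATH, then classify it.
model_storage() {
  storage_kind "$(findmnt -n -o SOURCE -T "$1")"
}

# Usage:
#   model_storage ~/models/model.gguf
```
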
// Thermal throttling
No cooling = 40–60% performance
AI inference is a sustained max-CPU workload. Without active cooling, most SBCs throttle within 30–60 seconds. A Raspberry Pi 5 with no heatsink drops from 2.4 GHz to 1.5 GHz fast.
→ Active cooling is not optional
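
Throttling can be watched directly while a model is generating. This sketch assumes the SoC temperature is at /sys/class/thermal/thermal_zone0/temp and the clock at /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq. Those are common sysfs paths, but zone numbering varies by board, and the function names are illustrative.

```shell
# cpu_temp_c [FILE]: SoC temperature in degrees C (sysfs reports millidegrees).
cpu_temp_c() {
  awk '{ printf "%.1f\n", $1 / 1000 }' "${1:-/sys/class/thermal/thermal_zone0/temp}"
}

# cpu_freq_mhz [FILE]: current CPU clock in MHz (sysfs reports kHz).
cpu_freq_mhz() {
  awk '{ printf "%d\n", $1 / 1000 }' "${1:-/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq}"
}

# Usage: sample once per second in a second terminal during inference.
#   while sleep 1; do echo "$(cpu_temp_c) C  $(cpu_freq_mhz) MHz"; done
```

A clock that starts near the rated maximum and then falls as the temperature climbs is the throttling described above.
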
// Power supply
Underpowered = throttled before you start
An RPi 5 running AI needs a 5V/5A (25W) USB-C supply. Generic chargers often droop to 4.8V under load, triggering throttling. The board runs — but at a fraction of rated speed.
→ Use the official supply or equivalent
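
On Raspberry Pi OS, vcgencmd get_throttled reports under-voltage and throttling as a hex bitmask. The sketch below decodes four of the documented bits (bit 0 and bit 2 for current state, bit 16 and bit 18 for events since boot). decode_throttled is a hypothetical helper, and vcgencmd is specific to Raspberry Pi boards.

```shell
# decode_throttled VALUE: explain a value such as "throttled=0x50005".
decode_throttled() {
  v=$(( ${1#throttled=} ))          # strip the prefix, parse the hex number
  [ $(( v & 0x1 ))     -ne 0 ] && echo "under-voltage now"
  [ $(( v & 0x4 ))     -ne 0 ] && echo "throttled now"
  [ $(( v & 0x10000 )) -ne 0 ] && echo "under-voltage has occurred since boot"
  [ $(( v & 0x40000 )) -ne 0 ] && echo "throttling has occurred since boot"
  [ "$v" -eq 0 ] && echo "no power or thermal events"
  return 0
}

# Usage (Raspberry Pi only):
#   decode_throttled "$(vcgencmd get_throttled)"
```

If an under-voltage line shows up after an inference run, the power supply is the first thing to swap.
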
// Quantization
INT4/INT8 is mandatory, not optional
FP16 (the "clean" format) uses 2 bytes per weight, twice the memory of 8-bit. A 7B model in FP16 needs ~14 GB, so it won't load on an 8 GB board. In INT4/Q4_K_M it needs ~4.5 GB. llama.cpp loads pre-quantized GGUF files directly, so there is nothing to convert on the board. Always download Q4 or Q8 GGUF files.
→ Download Q4_K_M or Q8_0 GGUF format
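
The arithmetic behind those figures is parameters × bits per weight ÷ 8. A small sketch: weights_gb is an illustrative name, and the 4.8 bits used for Q4_K_M is an assumed approximate average, since that format mixes block types.

```shell
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT: size of the weights alone, in GB.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Usage:
#   weights_gb 7 16    # FP16
#   weights_gb 7 8     # 8-bit
#   weights_gb 7 4.8   # assumed average for Q4_K_M
```

For a 7B model this prints 14.0, 7.0 and 4.2. The context cache and runtime buffers come on top of the weights, so real memory use is somewhat higher than these numbers.
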
// Swap on SSD
Swap is a safety net, not a solution
Models slightly larger than physical RAM can use swap on SSD — but SSD swap is 10–50× slower than RAM for random access. A 7B model with 2 GB of swap might run at 0.5 tok/s. On SD card swap: don't.
→ Stay within physical RAM for interactive use
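
A quick way to tell whether a model will spill into swap is to compare the file size with the memory currently available. This is a rough sketch only: it ignores the context cache and other runtime buffers, fits_in_ram is a hypothetical helper, and model.gguf is a placeholder file name.

```shell
# fits_in_ram MODEL_FILE [MEMINFO]: succeed if the file is smaller than
# MemAvailable, i.e. the weights alone could load without touching swap.
fits_in_ram() {
  model_bytes=$(( $(wc -c < "$1") ))
  avail_kb=$(awk '/^MemAvailable:/ { print $2 }' "${2:-/proc/meminfo}")
  [ "$model_bytes" -lt $(( avail_kb * 1024 )) ]
}

# Usage:
#   fits_in_ram model.gguf && echo "fits in RAM" || echo "would spill into swap"
#   swapon --show    # lists active swap devices, if any
```
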
// Runtime matters
The right runtime multiplies performance
llama.cpp works everywhere. But RKNN-Toolkit2 on RK3588 boards unlocks the NPU and can double throughput. TensorRT on Jetson unlocks the GPU and multiplies it further. Wrong runtime = leaving 2–5× speed on the table.
→ See the runtime guide below
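
As a rough illustration of matching runtime to hardware, the device tree's compatible string can be mapped to a first choice. This is a sketch under assumptions: it expects Jetson boards to carry an nvidia, vendor prefix and RK3588 boards a rockchip,rk3588 entry, and suggest_runtime is a hypothetical helper.

```shell
# suggest_runtime COMPATIBLE: map a device-tree "compatible" string to the
# runtime worth trying first. Matching on vendor prefixes is an assumption.
suggest_runtime() {
  case "$1" in
    *nvidia,*)         echo "llama.cpp with CUDA, or TensorRT-LLM" ;;
    *rockchip,rk3588*) echo "RKLLM / RKNN-Toolkit2 (NPU), llama.cpp as fallback" ;;
    *)                 echo "llama.cpp or Ollama (CPU)" ;;
  esac
}

# Usage: the file holds NUL-separated entries, so turn them into spaces first.
# /proc/device-tree does not exist on most x86 machines.
#   suggest_runtime "$(tr '\0' ' ' < /proc/device-tree/compatible)"
```
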

Four runtimes, four use cases

llama.cpp
All boards · ARM64 CPU inference
The universal baseline. Runs on every board listed here via CPU. Supports ARM NEON acceleration, thread pinning, and all GGUF-format models. Slower than hardware-specific runtimes but always works. Best starting point on any new board.
./llama-cli -m model.gguf -ngl 0 -t 4 -p "Hello"
Ollama
All boards · Easy wrapper for llama.cpp
Manages model downloads, layers, and a REST API automatically. Runs on ARM64. Slightly slower than hand-tuned llama.cpp but far easier to set up. Good for anything where you want a local API without configuration overhead.
curl -fsSL https://ollama.com/install.sh | sh
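
Once Ollama is installed, the local REST API can be called with curl. A minimal sketch, assuming the server is listening on its default port 11434 and that a model has already been pulled. The function names are illustrative, and the payload builder does no JSON escaping.

```shell
# ollama_payload MODEL PROMPT: build the JSON body for /api/generate.
# The prompt is inserted as-is, so it must not contain double quotes
# or backslashes.
ollama_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}' "$1" "$2"
}

# ollama_generate MODEL PROMPT: send one non-streaming request.
ollama_generate() {
  curl -s http://localhost:11434/api/generate -d "$(ollama_payload "$1" "$2")"
}

# Usage (after `ollama pull llama3.2:1b`):
#   ollama_generate llama3.2:1b "Why is the sky blue?"
```
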
RKNN-Toolkit2
RK3588 boards only (Orange Pi 5, Rock 5B, Khadas Edge2)
Unlocks the RK3588's 6 TOPS NPU for INT8 inference. Models must be converted to RKNN format first (ONNX → RKNN). Significantly faster than CPU-only for compatible models. The RKLLM project specifically targets LLM inference via NPU+CPU hybrid.
pip install rknn-toolkit2 # then convert + run
TensorRT / llama.cpp CUDA
NVIDIA Jetson only
Unlocks the Jetson's GPU for inference, the biggest performance jump on any SBC. Compile llama.cpp with CUDA enabled or use TensorRT-LLM for production-grade throughput. The key flag is -ngl 999, which offloads every layer to the GPU.
./llama-cli -m model.gguf -ngl 999 -t 4

What your board unlocks

Minimal
RPi Zero 2W · RPi 3 / 3B+ · PINE64+ 1–2 GB · Star64 (RISC-V) · <2 GB boards
Proof-of-concept, not practical. The smallest models (Qwen3 0.6B, TinyLlama 1.1B) technically run at ~0.5–1 tok/s. Educational value: high. Usability for real tasks: very low. If you're here for actual AI work, consider an RPi 5 or RK3588 board.
Qwen3 0.6B · 0.4 GB; TinyLlama 1.1B · 0.7 GB; Llama 3.2 1B ⚠ very slow
CPU Entry
RPi 4 4 GB · RPi 5 2–4 GB · PINE64 4 GB · Pinebook Pro · RK3399 boards 4 GB
Llama 3.2 1B is your daily driver — 8–15 tok/s on a well-cooled board. Fast enough for patient interactive use. 3B models technically run but expect 1–3 tok/s. Active cooling and SSD storage matter enormously here.
Llama 3.2 1B · 0.7 GB ★; SmolLM2 1.7B · ~1 GB ★; Llama 3.2 3B ⚠ slow
CPU Capable
RPi 5 8 GB+ · RPi 4 8 GB · PINE64 8 GB · RK3399 boards 8 GB
RPi 5 changes the picture. Cortex-A76 cores run 3B models at 10–15 tok/s — genuinely interactive. With 8 GB, 7B models load and run at ~3–5 tok/s (usable for low-urgency tasks). NVMe via HAT transforms the experience. RPi 4 8 GB is capable but ~2–3× slower than RPi 5 at the same RAM.
Llama 3.2 3B · 2 GB ★; Phi-4 mini · 2.2 GB ★; Qwen3 4B · 2.5 GB; Llama 3.1 8B ⚠ slow on 8 GB
NPU Accelerated
Orange Pi 5 / 5 Plus · Radxa ROCK 5B · Khadas Edge2 (RK3588, 4–32 GB)
The RK3588's 6 TOPS NPU changes the SBC AI equation. RKLLM + NPU hybrid inference runs 3B models at 20–30 tok/s and 7B at 8–14 tok/s, competitive with a MacBook Air 8 GB at the same model size. Models must be converted to RKNN format to use the NPU; llama.cpp does not use the NPU and runs on the fast A76 CPU cores instead.
Qwen3 4B via RKLLM ★ NPU; Llama 3.2 3B via RKLLM ★ NPU; Llama 3.1 8B · CPU only; 13B+ ⚠ CPU only, slow
Jetson Entry
Jetson Orin Nano 4 GB / 8 GB (up to 1024 CUDA cores, 40 TOPS)
A real CUDA GPU in a small board. 7B models run at 20–30 tok/s with GPU offloading — faster than most CPU-only laptop configurations. On 8 GB, 13B models are possible at reduced speed. TensorRT and llama.cpp CUDA are both supported. This is where SBC AI becomes genuinely fast.
Llama 3.1 8B · CUDA ★; Phi-4 mini · CUDA ★; Qwen3 4B · CUDA; 13B ⚠ tight on 8 GB
Jetson Pro
Jetson Orin NX 8/16 GB · Jetson AGX Orin 32/64 GB (up to 2048 CUDA cores, 275 TOPS)
Production-grade edge AI. The AGX Orin is what robotics platforms and inference servers run. 13B models at 30–50 tok/s; 70B Q4 is possible on 64 GB at 5–10 tok/s. TensorRT-LLM gives another 2–3× over standard llama.cpp. Capable of serving inference to other devices on the network.
Llama 3.1 8B fast ★ CUDA; Qwen 2.5 14B · CUDA; Llama 3.1 70B Q4 · 64 GB only; Network inference server
x86 Linux
Intel NUC · Beelink · Minisforum · Generic x86 Desktop/Laptop (8–64 GB RAM)
x86 CPUs run llama.cpp with AVX2/AVX-512 acceleration, noticeably faster than ARM at the same clock speed. 7B runs at 15–25 tok/s on a Ryzen 7 or Core i7 without any GPU. Integrated Intel/AMD graphics can offload some layers via Vulkan, but results vary widely. A discrete NVIDIA GPU changes everything, but that is full-desktop territory, not mini PC.
Llama 3.1 8B · 15–25 tok/s ★; Phi-4 mini · fast; Qwen 2.5 14B · 32 GB+; iGPU Vulkan offload ⚠ varies
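
On an x86 machine, the CPU flags show whether the AVX2 or AVX-512 paths mentioned above are available at all. A small sketch, assuming the usual flags line in /proc/cpuinfo; simd_flags is an illustrative name, and avx512f (AVX-512 Foundation) is used here as the marker for AVX-512 support.

```shell
# simd_flags [FILE]: list which of the AVX2 / AVX-512 Foundation flags the
# CPU advertises. Prints nothing on CPUs (or architectures) without them.
simd_flags() {
  grep -o -w -E 'avx2|avx512f' "${1:-/proc/cpuinfo}" | sort -u
}

# Usage:
#   simd_flags
```
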

Local AI on Linux is private by default. llama.cpp, Ollama, RKLLM, and TensorRT all run entirely offline. No API key, no account, no telemetry — your prompts stay on your hardware.