MULBERRY_IDE

// Linux Board Check

Can your Linux board
run AI?

Raspberry Pi, Orange Pi, Jetson, Pine64 — the answer depends on more than RAM. The compute unit (CPU / NPU / GPU) changes everything. Pick your board for an honest breakdown.

> On Linux SBCs, board architecture beats RAM — an Orange Pi 5 beats an RPi 5 at the same memory because of its NPU.


01 Which board?

Not sure which board? Check the label on the board or the box, or run cat /proc/device-tree/model in a terminal (ARM boards; on x86, lscpu shows the CPU model).

02 How much RAM?

Check in a terminal with free -h and read the "total" column of the Mem row. Or check your purchase receipt or the board spec sheet.
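Both checks can be scripted. The following is a minimal sketch, not part of any tool named on this page: board_model and mem_total_gib are hypothetical helper names, and the script assumes an ARM board that exposes /proc/device-tree/model (most x86 machines do not, so it prints "unknown" there).

```shell
#!/bin/sh
# Sketch: report the board model and total RAM.
# File paths are optional arguments so the helpers can be pointed at other files.

# Print the board model string, or "unknown" if the device-tree node is absent.
board_model() {
    model_file="${1:-/proc/device-tree/model}"
    if [ -r "$model_file" ]; then
        # The device-tree string is NUL-terminated; strip the NUL byte.
        tr -d '\0' < "$model_file"
        echo
    else
        echo "unknown"
    fi
}

# Print total RAM in GiB with one decimal, from the MemTotal line (given in kB).
mem_total_gib() {
    meminfo_file="${1:-/proc/meminfo}"
    awk '/^MemTotal:/ { printf "%.1f\n", $2 / 1024 / 1024 }' "$meminfo_file"
}

echo "Board: $(board_model)"
echo "RAM:   $(mem_total_gib) GiB"
```

The reported RAM is usually a little below the advertised size because the kernel and firmware reserve some memory.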

Unlike Mac, Linux SBC tiers are driven by compute unit first, RAM second. Two boards with 8 GB RAM can be in completely different tiers if one has a Neural Processing Unit (NPU) or CUDA GPU and the other doesn't.

// Before you start

What nobody tells you about Linux SBC AI

The board spec is just the beginning. These six factors will determine whether your experience is satisfying or frustrating.

// Storage type

SD card vs SSD — 10–15× difference

Loading a 4 GB model from an SD card can take 3–5 minutes. From an NVMe or USB 3 SSD it takes 15–30 seconds. On an SD card the problem is not slow generation: the model barely loads at all.

→ Use SSD if at all possible
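The load times above follow from simple arithmetic: model size divided by sustained read speed. A rough sketch, where load_seconds is a hypothetical helper and the 20 MB/s and 200 MB/s figures are assumed round numbers chosen to land inside the ranges quoted above, not measurements:

```shell
#!/bin/sh
# load_seconds SIZE_GB MB_PER_S
# Estimated seconds to read SIZE_GB (GiB) at a sustained MB_PER_S.
load_seconds() {
    awk -v gb="$1" -v mbps="$2" 'BEGIN { printf "%.0f\n", gb * 1024 / mbps }'
}

load_seconds 4 20     # SD card at an assumed 20 MB/s  -> prints 205 (about 3.4 minutes)
load_seconds 4 200    # SSD at an assumed 200 MB/s     -> prints 20
```

To plug in a real number for your own storage, time a read of a large file that has not been read since boot, so the page cache does not flatter the result.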

// Thermal throttling

No cooling = 40–60% performance

AI inference is a sustained workload that keeps every CPU core at full load. Without active cooling, most SBCs throttle within 30–60 seconds. A Raspberry Pi 5 with no heatsink quickly drops from 2.4 GHz to 1.5 GHz.

→ Active cooling is not optional
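One way to see throttling as it happens is to watch temperature and clock speed while a model is generating. A minimal sketch using the standard Linux sysfs files; cpu_temp_c and cpu_freq_mhz are hypothetical helper names, and the assumption that thermal_zone0 is the CPU sensor holds on many boards but not all:

```shell
#!/bin/sh
# Print the CPU temperature in degrees C (sysfs reports millidegrees).
cpu_temp_c() {
    f="${1:-/sys/class/thermal/thermal_zone0/temp}"
    [ -r "$f" ] || { echo "n/a"; return 0; }
    awk '{ printf "%.1f\n", $1 / 1000 }' "$f"
}

# Print the current clock of CPU 0 in MHz (sysfs reports kHz).
cpu_freq_mhz() {
    f="${1:-/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq}"
    [ -r "$f" ] || { echo "n/a"; return 0; }
    awk '{ printf "%d\n", $1 / 1000 }' "$f"
}

echo "temp: $(cpu_temp_c) C, cpu0: $(cpu_freq_mhz) MHz"
```

Run it in a loop (for example under watch -n 1) during inference. If the clock falls while the temperature climbs, the board is throttling. On Raspberry Pi OS, vcgencmd get_throttled also reports throttling and under-voltage flags.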

// Power supply

Underpowered = throttled before you start

An RPi 5 running AI needs a 5 V / 5 A USB-C supply; the official one is rated 27 W at 5.1 V / 5 A. Generic chargers often droop to 4.8 V under load, triggering under-voltage throttling. The board still runs, but at a fraction of its rated speed.

→ Use the official supply or equivalent

// Quantization

INT4/INT8 is mandatory, not optional

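FP16 (the unquantized 16-bit format) needs about 2 bytes per parameter, so a 7B model needs ~14 GB and won't load on an 8 GB board. In INT4 (Q4_K_M) the same model needs ~4.5 GB. llama.cpp does not quantize at load time: download a pre-quantized Q4 or Q8 GGUF file, or quantize an FP16 GGUF yourself with the llama-quantize tool.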

→ Download Q4_K_M or Q8_0 GGUF format
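The memory figures above come from parameters times bits per weight. A back-of-envelope sketch, where weights_gb is a hypothetical helper and the ~4.85 bits per weight for Q4_K_M is an assumed average (the real value varies by model):

```shell
#!/bin/sh
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Approximate size of the model weights in GB (weights only, no context cache).
weights_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 7 16      # FP16                          -> prints 14.0
weights_gb 7 4.85    # Q4_K_M, assumed ~4.85 bits    -> prints 4.2
```

The estimate covers weights only. The context (KV) cache and runtime buffers come on top, which is roughly where the gap between 4.2 GB and the ~4.5 GB quoted above would come from.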

// Swap on SSD

Swap is a safety net, not a solution

Models slightly larger than physical RAM can use swap on SSD — but SSD swap is 10–50× slower than RAM for random access. A 7B model with 2 GB of swap might run at 0.5 tok/s. On SD card swap: don't.

→ Stay within physical RAM for interactive use
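A quick pre-flight check can tell you whether a model is likely to spill into swap. This is a rough sketch: model_fits is a hypothetical helper, and it compares only the file size with MemAvailable, ignoring the context cache, so treat a narrow pass as a fail.

```shell
#!/bin/sh
# model_fits MODEL_FILE [MEMINFO_FILE]
# Compare the model file size with the kernel's MemAvailable estimate.
model_fits() {
    model_kb=$(( $(wc -c < "$1") / 1024 ))
    avail_kb=$(awk '/^MemAvailable:/ { print $2 }' "${2:-/proc/meminfo}")
    if [ "$model_kb" -lt "$avail_kb" ]; then
        echo "fits in RAM"
    else
        echo "would spill into swap"
    fi
}

# Usage (model.gguf is a placeholder path):
#   model_fits model.gguf
```

swapon --show lists any active swap and the device it lives on, which is worth checking before relying on it.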

// Runtime matters

The right runtime multiplies performance

llama.cpp works everywhere. But RKNN-Toolkit2 on RK3588 boards unlocks the NPU and can double throughput. TensorRT on Jetson unlocks the GPU and multiplies it further. Wrong runtime = leaving 2–5× speed on the table.

→ See the runtime guide below

// Which runtime to use

Four runtimes, four use cases

llama.cpp

All boards · ARM64 CPU inference

The universal baseline. Runs on every board listed here via CPU. Supports ARM NEON acceleration, thread pinning, and all GGUF-format models. Slower than hardware-specific runtimes but always works. Best starting point on any new board.

./llama-cli -m model.gguf -ngl 0 -t 4 -p "Hello"

Ollama

All boards · Easy wrapper for llama.cpp

Manages model downloads, layers, and a REST API automatically. Runs on ARM64. Slightly slower than hand-tuned llama.cpp but far easier to set up. Good for anything where you want a local API without configuration overhead.

curl -fsSL https://ollama.com/install.sh | sh
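Once Ollama is installed, the REST API can be called with plain curl. A minimal sketch, assuming the default port 11434 and using llama3.2:1b as an example model tag; ollama_payload is a hypothetical helper that only works for prompts without quotes or backslashes:

```shell
#!/bin/sh
# ollama_payload MODEL PROMPT
# Build the JSON body for a non-streaming /api/generate request.
# No JSON escaping is done, so keep MODEL and PROMPT free of quotes.
ollama_payload() {
    printf '{"model":"%s","prompt":"%s","stream":false}\n' "$1" "$2"
}

# Usage (needs a running Ollama server and a pulled model):
#   ollama pull llama3.2:1b
#   curl -s http://localhost:11434/api/generate \
#        -d "$(ollama_payload llama3.2:1b 'Hello')"
```

With "stream" set to false the server returns one JSON object containing the full response instead of a stream of partial tokens.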

RKNN-Toolkit2

RK3588 boards only (Orange Pi 5, Rock 5B, Khadas Edge2)

Unlocks the RK3588's 6 TOPS NPU for INT8 inference. Models must be converted to RKNN format first (ONNX → RKNN). Significantly faster than CPU-only for compatible models. The RKLLM project specifically targets LLM inference via NPU+CPU hybrid.

pip install rknn-toolkit2 # then convert + run

TensorRT / llama.cpp CUDA

NVIDIA Jetson only

Unlocks the Jetson's GPU for inference, the biggest jump in performance on any SBC. Compile llama.cpp with CUDA enabled or use TensorRT-LLM for production-grade throughput. The key flag is -ngl 999, which offloads all model layers to the GPU.

./llama-cli -m model.gguf -ngl 999 -t 4
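Compiling llama.cpp with CUDA enabled could look like the sketch below. It assumes JetPack is installed (it ships the CUDA toolkit), that nvcc is on the PATH, and that git and cmake are available. The wrapper name build_llama_cuda is hypothetical and only keeps the steps together.

```shell
#!/bin/sh
# Clone llama.cpp and build it with the CUDA backend enabled.
# Each step runs only if the previous one succeeded.
build_llama_cuda() {
    git clone https://github.com/ggml-org/llama.cpp &&
    cd llama.cpp &&
    cmake -B build -DGGML_CUDA=ON &&
    cmake --build build --config Release -j "$(nproc)"
}

# On the Jetson:
#   build_llama_cuda
#   ./build/bin/llama-cli -m model.gguf -ngl 999 -t 4 -p "Hello"
```

If cmake cannot find the CUDA compiler, adding /usr/local/cuda/bin to PATH is the usual first thing to try.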

// The seven tiers

What your board unlocks

Minimal

RPi Zero 2W · RPi 3 / 3B+ · PINE64+ 1–2 GB · Star64 (RISC-V) · <2 GB boards

Proof-of-concept, not practical. The smallest models (Qwen3 0.6B, TinyLlama 1.1B) technically run at ~0.5–1 tok/s. Educational value: high. Usability for real tasks: very low. If you're here for actual AI work, consider an RPi 5 or RK3588 board.

Qwen3 0.6B · 0.4 GB
TinyLlama 1.1B · 0.7 GB
Llama 3.2 1B ⚠ very slow

CPU Entry

RPi 4 4 GB · RPi 5 2–4 GB · PINE64 4 GB · Pinebook Pro · RK3399 boards 4 GB

Llama 3.2 1B is your daily driver — 8–15 tok/s on a well-cooled board. Fast enough for patient interactive use. 3B models technically run but expect 1–3 tok/s. Active cooling and SSD storage matter enormously here.

Llama 3.2 1B · 0.7 GB ★
SmolLM2 1.7B · ~1 GB ★
Llama 3.2 3B ⚠ slow

CPU Capable

RPi 5 8 GB+ · RPi 4 8 GB · PINE64 8 GB · RK3399 boards 8 GB

RPi 5 changes the picture. Cortex-A76 cores run 3B models at 10–15 tok/s — genuinely interactive. With 8 GB, 7B models load and run at ~3–5 tok/s (usable for low-urgency tasks). NVMe via HAT transforms the experience. RPi 4 8 GB is capable but ~2–3× slower than RPi 5 at the same RAM.

Llama 3.2 3B · 2 GB ★
Phi-4 mini · 2.2 GB ★
Qwen3 4B · 2.5 GB
Llama 3.1 8B ⚠ slow on 8 GB

NPU Accelerated

Orange Pi 5 / 5 Plus · Radxa ROCK 5B · Khadas Edge2 (RK3588, 4–32 GB)

The RK3588's 6 TOPS NPU changes the SBC AI equation. RKLLM + NPU hybrid inference runs 3B models at 20–30 tok/s and 7B at 8–14 tok/s, competitive with a MacBook Air 8 GB at the same model size. Models must be converted to RKNN format to use the NPU; llama.cpp does not use the NPU and runs on the fast A76 CPU cores instead.

Qwen3 4B via RKLLM ★ NPU
Llama 3.2 3B via RKLLM ★ NPU
Llama 3.1 8B · CPU only
13B+ ⚠ CPU only, slow

Jetson Entry

Jetson Orin Nano 4 GB / 8 GB (8 GB model: 1024 CUDA cores, 40 TOPS)

A real CUDA GPU in a small board. 7B models run at 20–30 tok/s with GPU offloading — faster than most CPU-only laptop configurations. On 8 GB, 13B models are possible at reduced speed. TensorRT and llama.cpp CUDA are both supported. This is where SBC AI becomes genuinely fast.

Llama 3.1 8B · CUDA ★
Phi-4 mini · CUDA ★
Qwen3 4B · CUDA
13B ⚠ tight on 8 GB

Jetson Pro

Jetson Orin NX 8/16 GB · Jetson AGX Orin 32/64 GB (AGX Orin 64 GB: 2048 CUDA cores, 275 TOPS)

Production-grade edge AI. The AGX Orin is what robotics platforms and inference servers run. 13B models at 30–50 tok/s; 70B Q4 is possible on 64 GB at 5–10 tok/s. TensorRT-LLM gives another 2–3× over standard llama.cpp. Capable of serving inference to other devices on the network.

Llama 3.1 8B fast ★ CUDA
Qwen 2.5 14B · CUDA
Llama 3.1 70B Q4 · 64 GB only
Network inference server

x86 Linux

Intel NUC · Beelink · Minisforum · Generic x86 Desktop/Laptop (8–64 GB RAM)

x86 CPUs run llama.cpp with AVX2/AVX-512 acceleration — noticeably faster than ARM at the same clock speed. 7B at 15–25 tok/s on a Ryzen 7 or Core i7 without any GPU. Integrated Intel/AMD graphics can offload some layers via Vulkan — results vary widely. Discrete NVIDIA GPU changes everything (but that's the full desktop, not mini PC territory).

Llama 3.1 8B · 15–25 tok/s ★
Phi-4 mini · fast
Qwen 2.5 14B · 32 GB+
iGPU Vulkan offload ⚠ varies


Local AI on Linux is private by default. llama.cpp, Ollama, RKLLM, and TensorRT all run entirely offline. No API key, no account, no telemetry — your prompts stay on your hardware.
