Characterising LLM Inference on a Low-Cost RISC-V Matrix-Accelerated Platform
Supervisor: Dr Nikela Papadopoulou
School: Computing Science
Description:
Large language models are increasingly being deployed outside cloud data centres, but for emerging low-power RISC-V platforms it is still unclear when hardware AI acceleration translates into useful gains in throughput (tokens per second), latency, and energy efficiency.
This project will investigate local LLM inference on the Milk-V Jupiter, a very low-cost and low-power RISC-V development board that nevertheless offers unusually powerful architectural features, including vector support and integer matrix acceleration.
The project asks a focused systems question: how much of the board’s advertised compute capability is visible in end-to-end LLM inference? To answer it, the project will evaluate upstream and SpacemiT-enabled llama.cpp builds, measuring prompt prefill separately from token generation and comparing performance across model sizes and quantization formats. The work will focus on small, open models suitable for edge deployment, such as TinyLlama, Qwen-family models, and Llama-family 1B/3B models.
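As a minimal sketch of the prefill/decode split described above, the Python below derives per-phase throughput and a simple energy figure from raw timings. The `RunTiming` record, its field names, and the numbers in the usage example are hypothetical placeholders, not llama.cpp types or measurements from the board; in practice the timings might come from a benchmarking tool such as llama.cpp's `llama-bench`, and the power figure from an external meter.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    """Timings for one inference run (hypothetical record, not a llama.cpp type)."""
    prompt_tokens: int         # tokens processed during prefill
    prefill_seconds: float     # wall-clock time of the prefill phase
    generated_tokens: int      # tokens produced during decode
    decode_seconds: float      # wall-clock time of the decode phase
    decode_power_watts: float  # mean board power during decode (assumed measured externally)


def prefill_tps(t: RunTiming) -> float:
    """Prompt-processing throughput in tokens per second."""
    return t.prompt_tokens / t.prefill_seconds


def decode_tps(t: RunTiming) -> float:
    """Token-generation throughput in tokens per second."""
    return t.generated_tokens / t.decode_seconds


def decode_tokens_per_joule(t: RunTiming) -> float:
    """Power-normalised decode throughput: (tokens/s) / W = tokens/J."""
    return decode_tps(t) / t.decode_power_watts


if __name__ == "__main__":
    # Illustrative values only; these are not results from the Milk-V Jupiter.
    run = RunTiming(prompt_tokens=512, prefill_seconds=16.0,
                    generated_tokens=128, decode_seconds=32.0,
                    decode_power_watts=8.0)
    print(f"prefill: {prefill_tps(run):.2f} tok/s")
    print(f"decode:  {decode_tps(run):.2f} tok/s")
    print(f"decode:  {decode_tokens_per_joule(run):.3f} tok/J")
```

Keeping the two phases apart matters because prefill processes many tokens at once and tends to be compute-bound, whereas decode produces one token at a time and tends to be limited by memory bandwidth, so an accelerator may help one phase far more than the other.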
The central research contribution will be an evaluation of how well the existing compiler and runtime stack can exploit the board’s two main acceleration features: the RISC-V vector unit and the integer matrix unit. Three questions follow: whether current compilers can generate useful code for both units, whether llama.cpp/ggml can route vector-friendly and integer-matrix-friendly operations to the appropriate execution paths, and where the computation falls back to generic CPU code. The student will compare upstream and SpacemiT-enabled builds, measuring prompt-processing throughput, decode throughput, memory use, and power-normalised performance (such as tokens per second per watt), and will relate these end-to-end results to operator-level behaviour. A particular focus will be the interaction between compiler support, framework dispatch, and quantization layout: for example, whether formats such as Q4_0 or Q8_0 map naturally to the matrix unit, while more complex formats require unpacking or data-layout changes that limit acceleration.
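To illustrate why a format like Q8_0 could be a natural fit for an integer matrix unit, the sketch below quantizes blocks of 32 floats to 8-bit integers with one scale per block, then computes a dot product whose inner loop is pure integer multiply-accumulate. It is modelled loosely on ggml's Q8_0 layout as an assumption: the real format stores the scale in half precision and its rounding differs slightly, and the function names here are hypothetical rather than part of any library.

```python
BLOCK_SIZE = 32  # values per block, following ggml's Q8_0 (assumed here)


def quantize_block(xs):
    """Quantize one block of floats to (scale, list of ints in [-127, 127])."""
    amax = max(abs(x) for x in xs)
    scale = amax / 127.0
    if scale == 0.0:
        # All-zero block: any scale works, so store zeros.
        return 0.0, [0] * len(xs)
    # Python's round() uses round-half-to-even; good enough for a sketch.
    return scale, [round(x / scale) for x in xs]


def dot_block(block_a, block_b):
    """Dot product of two quantized blocks.

    The sum is an integer multiply-accumulate, which is the part an
    integer matrix unit could execute; the two float scales are applied
    once per block afterwards.
    """
    scale_a, qa = block_a
    scale_b, qb = block_b
    acc = sum(x * y for x, y in zip(qa, qb))  # integer accumulate
    return scale_a * scale_b * acc


def quantized_dot(a, b):
    """Dot product of two equal-length float vectors via per-block quantization."""
    assert len(a) == len(b) and len(a) % BLOCK_SIZE == 0
    total = 0.0
    for i in range(0, len(a), BLOCK_SIZE):
        total += dot_block(quantize_block(a[i:i + BLOCK_SIZE]),
                           quantize_block(b[i:i + BLOCK_SIZE]))
    return total


if __name__ == "__main__":
    a = [float(i - 16) for i in range(64)]
    b = [0.5] * 64
    exact = sum(x * y for x, y in zip(a, b))
    print("exact:", exact, "quantized:", quantized_dot(a, b))
```

A 4-bit format such as Q4_0 packs two values per byte, so the integers must be unpacked before a multiply-accumulate of this kind; formats with nested scales or mixed bit widths need more preparation still, which is the kind of overhead the project would set out to measure.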
The expected outcome is a practical assessment of what the current software stack already supports, what opportunities remain, and what limitations arise at the compiler, framework, and hardware-interface levels. The broader value is to inform hardware designers, compiler/runtime developers, and users of edge AI systems about the software abstractions and quantization formats that small matrix accelerators need in order to deliver useful LLM performance.