Umm… they don’t directly claim it, but it actually is faster! I tested on Colab without a GPU (two weeks ago), and llamafile was faster. This might help: https://github.com/Mozilla-Ocho/llamafile/issues/363 llamafile just optimizes llama.cpp.
Question: Radxa X4 vs Rock 5B plus for LLMs
No, I kinda hate Docker; I'm just forced to use it because all the developers seem to put their apps inside Docker. Anyway, I will test the performance of the same model on llamafile and ollama and we will see.
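For what it's worth, here's the kind of quick check I have in mind, a minimal sketch assuming a local Ollama server on its default port and a placeholder model name; it pulls the prompt-eval and generation timings that Ollama reports in its API response, so you can compare each of them separately against whatever llamafile prints for the same prompt:

```python
# Rough benchmark sketch, not a definitive harness: query a locally running
# Ollama server and derive prompt-eval vs. generation speed from the timing
# fields in its /api/generate response. "llama3" is a placeholder model name.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3"   # placeholder: use whatever model you actually pulled
PROMPT = "Explain the difference between prompt evaluation and token generation."

payload = json.dumps({"model": MODEL, "prompt": PROMPT, "stream": False}).encode()
req = urllib.request.Request(OLLAMA_URL, data=payload,
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    stats = json.load(resp)

# Ollama reports durations in nanoseconds.
prompt_tps = stats["prompt_eval_count"] / (stats["prompt_eval_duration"] / 1e9)
gen_tps = stats["eval_count"] / (stats["eval_duration"] / 1e9)
print(f"prompt eval: {prompt_tps:.1f} tok/s")
print(f"generation:  {gen_tps:.1f} tok/s")
```

Running the same prompt through llamafile's llama.cpp-based CLI/server and noting its timing output should give comparable numbers for the two phases.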
You can skip it if you want. I was just saying…
I remember some of these contributions from almost a year ago; they were merged. But they affect the compute part, hence the prompt evaluation speed (processing the request), not the token generation speed (producing the response), which is RAM-bandwidth-limited and is the slowest and most annoying part on small machines. Results were even published here: https://justine.lol/matmul/
That’s not at all to dismiss Justine’s awesome work; I’m just trying to make sure you don’t compare apples and oranges and end up disappointed. I.e. if you need to process large documents and get just a short verdict, you can benefit from such optimizations, because in your case prompt processing will be the dominant cost. But if you’re interested in a chat server to replace Google by asking it short questions and getting long answers, you need faster RAM.
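To make the RAM point concrete, here's a back-of-envelope sketch (the bandwidth and model-size numbers are illustrative assumptions, not measurements): generating each token of a dense model streams roughly the whole weight file through RAM, so generation speed is capped at about memory bandwidth divided by model size, no matter how fast the matmuls are.

```python
# Back-of-envelope bound: tokens/sec <= memory bandwidth / model size,
# since each generated token reads roughly all the weights from RAM.
# The numbers below are assumed for illustration only.
model_size_gb = 4.0    # e.g. a ~7B model quantized to ~4 bits
bandwidth_gbs = 10.0   # assumed RAM bandwidth of a small board

max_gen_tps = bandwidth_gbs / model_size_gb
print(f"upper bound on generation speed: ~{max_gen_tps:.1f} tok/s")
# Faster matmul kernels speed up the compute-bound prompt evaluation,
# but this generation bound only moves with faster RAM or a smaller model.
```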