Use Llama.cpp in Xojo

For MBS Xojo Plugins 26.1 we include new Llama classes to use local LLMs on your computer. Instead of paying for a web service to run the LLM on someone else’s computer, you can run it locally on yours.

Llama chat example showing what the LLM knows about Xojo.

About Llama

The Llama.cpp project allows you to run efficient Large Language Model inference in pure C/C++. You can run many powerful artificial intelligence models, including all LLaMA models, Falcon and RefinedWeb, Mistral models, Gemma from Google, Phi, Qwen, Yi, Solar 10.7B and Alpaca.

You do not need to pay to use Llama.cpp or buy a subscription. It is completely free, open source, constantly updated, and available under the MIT license. And Monkeybread Software provides an interface for Xojo as part of the MBS Plugin.

Get Models

You can find various models on the Internet, e.g. on huggingface.co. Llama.cpp needs a model in the GGUF format; other formats currently need to be converted first.

You can search Hugging Face for models compatible with llama.cpp: huggingface.co/models?apps=llama.cpp.

We ran our tests with the gemma-3-1b-it.Q8_0.gguf model (1 GB) from Google and gpt-oss-20b-Q4_K_M.gguf (11.6 GB) from OpenAI. Bigger models usually have better knowledge, but please be aware that the model needs to fit into memory.
Links to specific models tend to break over time as new models get uploaded and old ones disappear.

Get Libraries

To install llama.cpp, you can use a package manager like Homebrew to install a copy: brew.sh/llama.cpp.

Or you can download binaries from the llama.cpp website or directly from the GitHub releases page.

There you find builds for macOS (Apple Silicon or Intel), Linux, and Windows. For Windows there are various builds that use either Vulkan, CUDA, or the CPU for performing inference.

You may include the libraries with your application. On Windows, just put them next to the DLLs from the Xojo runtime. On macOS, you can include them within the app bundle in the Frameworks folder.

The libraries can stay in whatever folder you or your installer chooses; just tell our plugin where to find them.

Load the library

Now you may want to open our example project: LLama Simple Chat.xojo_binary_project.

There you need to modify the code to load the llama libraries. Go to the Opening event and look for the calls to LoadLibrary. On Windows you also need to call SetCurrentWorkingDirectoryMBS to set the current folder, so Windows actually finds the dependent DLLs.
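As a rough sketch, the Windows part of the Opening event could look like this. The folder location and the LoadLibrary call shown here are assumptions for illustration; the exact class and signature may differ, so please check the example project for the real calls:

	// assumed: the llama DLLs sit in a "llama" folder next to the app
	Var libFolder As FolderItem = App.ExecutableFile.Parent.Child("llama")
	
	// set the current directory so Windows finds the dependent DLLs
	Call SetCurrentWorkingDirectoryMBS(libFolder.NativePath)
	
	// hypothetical LoadLibrary call; see the Opening event in the example for the exact one
	If Not LlamaModelMBS.LoadLibrary(libFolder.Child("llama.dll")) Then
		MessageBox "Failed to load the llama library."
	End If
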

Next, put the file path of the model into the modelPath variable. This should be a native path to the model like “C:\Users\User\Models\gemma-3-1b-it.Q8_0.gguf”.

After loading the libraries, the code calls BackendLoadAll to load all the backends. You may then check which backends are available.
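For illustration, the backend step could look like this. BackendLoadAll is the call described above; the loop listing the loaded backends is an assumption and may be named differently in your plugin version:

	// load all available backends (CPU, Metal, Vulkan, CUDA, ...)
	LlamaModelMBS.BackendLoadAll
	
	// hypothetical: log the backends that were found
	For Each backendName As String In LlamaModelMBS.Backends
		System.DebugLog "Backend: " + backendName
	Next
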

Use the model

Next you can use the LlamaModelMBS constructor to load a model. Use the LlamaContextMBS class to start a session with its own context memory. You can of course have multiple sessions in parallel.

Here is a sample:

	// initialize the model
	Var ModelParams As New LlamaModelParametersMBS
	
	model = New LlamaModelMBS(modelPath, ModelParams)
	
	// initialize the context
	Var contextParams As New LlamaContextParametersMBS
	
	// n_ctx is the context size
	contextParams.n_ctx = 20480
	
	// n_batch is the maximum number of tokens that can be processed in a single call to llama_decode
	contextParams.n_batch = 20480
	
	context = New LlamaContextMBS(model, contextParams)
	
	// initialize the sampler
	Var SampleParameters As New LlamaSamplerChainParametersMBS
	
	sampler = New LlamaSamplerMBS(SampleParameters)
	sampler.AddToChain(LlamaSamplerMBS.InitMinP(0.05, 1))
	sampler.AddToChain(LlamaSamplerMBS.InitTemp(0.8))
	sampler.AddToChain(LlamaSamplerMBS.InitDist(LlamaSamplerMBS.DefaultSeed))
	
	// wait for input
	InputArea.SetFocus

On the session, you can use the Ask function to ask the LLM a question; the function returns the output of the model. For a chat, we use the LlamaChatMessageMBS class to collect the questions and answers and apply a chat template to pass the conversation to the LLM as JSON.
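Put together, one chat turn could look roughly like this. The LlamaChatMessageMBS constructor arguments and the exact Ask signature are assumptions based on the description above, so check the example project for the real calls:

	// collect the conversation so far (roles and constructor are assumed)
	Var messages() As LlamaChatMessageMBS
	messages.Add New LlamaChatMessageMBS("system", "You are a helpful assistant.")
	messages.Add New LlamaChatMessageMBS("user", "What is Xojo?")
	
	// hypothetical: apply the chat template and query the model
	Var answer As String = context.Ask(messages, sampler)
	
	// remember the answer so the next question has the full history
	messages.Add New LlamaChatMessageMBS("assistant", answer)
	OutputArea.AddText answer
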

There are a lot of parameters, like the context size or the settings for the samplers used. Different sampler or context parameters may produce very different outputs.
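For example, a lower temperature makes answers more deterministic, while a higher temperature gives more varied, creative output. A sketch using the same sampler calls as above, with values picked for illustration:

	// conservative: stricter token filter and low temperature
	Var conservative As New LlamaSamplerMBS(SampleParameters)
	conservative.AddToChain(LlamaSamplerMBS.InitMinP(0.1, 1))
	conservative.AddToChain(LlamaSamplerMBS.InitTemp(0.2))
	conservative.AddToChain(LlamaSamplerMBS.InitDist(LlamaSamplerMBS.DefaultSeed))
	
	// creative: wider token pool and higher temperature
	Var creative As New LlamaSamplerMBS(SampleParameters)
	creative.AddToChain(LlamaSamplerMBS.InitMinP(0.02, 1))
	creative.AddToChain(LlamaSamplerMBS.InitTemp(1.2))
	creative.AddToChain(LlamaSamplerMBS.InitDist(LlamaSamplerMBS.DefaultSeed))
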

Please try and let us know.
