Using llama.cpp in Xojo

Running a local LLM directly from Xojo is now easier than ever thanks to the MBS Xojo Plugins and their integration with llama.cpp. With just a few lines of code, you can load a model, create a context, and generate text—all on-device, with optional GPU acceleration.

In this article, we’ll walk through the basics of setting up llama.cpp with Xojo and the MBS Plugin, and then examine a complete example that loads a model and asks it simple questions.

What Is llama.cpp?

llama.cpp is a high-performance C/C++ implementation for running LLaMA-family language models locally, optimized for CPUs and GPUs (Metal, CUDA, etc.). It is lightweight, fast, and ideal for on-device inference with small to medium-sized models.

The MBS Xojo Plugins provide a direct bridge between Xojo and llama.cpp, exposing model loading, context creation, sampling, and inference capabilities through the LlamaMBS, LlamaModelMBS, LlamaContextMBS, and related classes.

Requirements

To follow along, you will need:

  • Xojo 2006r4 or newer (the code examples use the Var keyword, which requires Xojo 2019r2 or later)
  • Latest MBS Xojo Tools Plugin with llama.cpp support
  • A compiled llama.cpp library:
    • libllama.dylib on macOS
    • libllama.dll on Windows
    • libllama.so on Linux
  • A GGUF model file (.gguf format)

For many platforms, you can find prebuilt downloads on the llama.cpp releases page.

Installation with Homebrew

On macOS, you can install Homebrew from its website. Then use brew to install the llama.cpp package:

brew install llama.cpp

This provides libllama.dylib inside your Homebrew cellar, e.g. at this path:

/opt/homebrew/Cellar/llama.cpp/6710/lib/libllama.dylib

If you have a newer version, the version number in the path will differ, but luckily you can use the path in Homebrew's lib folder instead:

/opt/homebrew/lib/libllama.dylib

Step 1 — Loading the llama.cpp Library

Before interacting with any model, you must load the llama.cpp dynamic library:

	If Not LlamaMBS.LoadLibrary("/opt/homebrew/Cellar/llama.cpp/6710/lib/libllama.dylib") Then
		System.DebugLog LlamaMBS.LoadErrorMessage
		Return 2
	End If

On macOS, please pass the full path to the dylib. On Linux you may pass just the file name if the package manager installed the library properly; otherwise, pass the full path. On Windows, pass the name of the DLL. You may want to use the SetDllDirectoryMBS function to set the folder containing the DLL, so Windows can find all the related DLL files.

If the path is wrong or dependencies are missing, you’ll get a detailed error message from LoadErrorMessage. On Windows you may see error 193 if the architecture of the DLL doesn’t match the application or error 126 if either the path to the DLL is invalid or some dependency is not found.
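On Windows, the load sequence could look like this sketch. The DLL folder path here is a hypothetical example, and we assume the SetDllDirectoryMBS function mentioned above is available:

```xojo
// Hypothetical Windows example; the folder path is an assumption.
// Point Windows to the folder containing libllama.dll and its dependencies.
Call SetDllDirectoryMBS("C:\llama\bin")

// Load the DLL by name; Windows now searches the folder set above.
If Not LlamaMBS.LoadLibrary("libllama.dll") Then
	System.DebugLog LlamaMBS.LoadErrorMessage
	Return 2
End If
```

With the DLL directory set this way, dependent DLLs shipped next to libllama.dll can be resolved, which avoids the error 126 case described above.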

Step 2 — Initialize the Backend

llama.cpp supports multiple compute backends: CPU, Metal (macOS/iOS), CUDA for Nvidia GPUs, ROCm/HIP for AMD GPUs, and Vulkan.

The MBS plugin can load all available ones:

	// load dynamic backends
	LlamaMBS.BackendLoadAll

This ensures GPU-accelerated layers are enabled if available.

Step 3 — Load the Model

You specify the path to your .gguf model file and configure parameters such as the number of GPU layers:

	// path to the model gguf file
	Var modelPath As String = "/Users/cs/Temp/test.gguf"
		
	// number of layers to offload to the GPU
	Var ngl As Integer = 99

	// initialize the model
	Var ModelParams As New LlamaModelParametersMBS
	ModelParams.n_gpu_layers = ngl
	
	Var model As New LlamaModelMBS(modelPath, ModelParams)

If your GPU supports it, offloading 20–100 layers can dramatically speed up inference. Otherwise, just set it to 0 for CPU-only execution.
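If no supported GPU is available, the same setup runs entirely on the CPU; a minimal sketch using the classes from above:

```xojo
// CPU-only variant: offload no layers to the GPU
Var ModelParams As New LlamaModelParametersMBS
ModelParams.n_gpu_layers = 0

Var model As New LlamaModelMBS(modelPath, ModelParams)
```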

Step 4 — Create the Context

The context manages the state of a conversation and the token buffer.

	// initialize the context
	Var contextParams As New LlamaContextParametersMBS
	
	Var context As New LlamaContextMBS(model, contextParams)
	
	If context.Handle = 0 Then
		System.DebugLog "Failed to create context."
		Return 3
	End If

Each context is independent, so you can have multiple simultaneous sessions with the same model.
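As a sketch using the classes from above, two independent chat sessions could share one loaded model:

```xojo
// Two independent conversation contexts sharing the same loaded model
Var paramsA As New LlamaContextParametersMBS
Var paramsB As New LlamaContextParametersMBS

Var chatA As New LlamaContextMBS(model, paramsA)
Var chatB As New LlamaContextMBS(model, paramsB)

// Each context keeps its own conversation state,
// so prompts sent to chatA do not affect chatB.
```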

You may set properties of the LlamaContextParametersMBS class before calling the LlamaContextMBS constructor to pass the parameters. For example, n_ctx defines the context size.
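For instance, a larger context window could be requested like this; the value 4096 is only an illustrative choice:

```xojo
// Request a larger context window before creating the context.
// 4096 tokens is an illustrative value; pick one your model supports.
Var contextParams As New LlamaContextParametersMBS
contextParams.n_ctx = 4096

Var context As New LlamaContextMBS(model, contextParams)
```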

Step 5 — Set Up a Sampler

In llama.cpp, samplers determine how tokens are selected.

For simple deterministic output, we can use a Greedy sampler:

	// initialize the sampler
	Var SampleParameters As New LlamaSamplerChainParametersMBS
	SampleParameters.no_perf = True
	
	Var smpl As New LlamaSamplerMBS(SampleParameters)
	
	smpl.AddToChain( LlamaSamplerMBS.InitGreedy )

You could also add temperature sampling, top-p sampling, or multiple samplers chained together.
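For randomized output, a chain might combine several samplers. The constructor names InitTopP, InitTemp, and InitDist below mirror llama.cpp's own sampler initializers (llama_sampler_init_top_p, llama_sampler_init_temp, llama_sampler_init_dist) and are assumptions about the plugin's naming:

```xojo
// Hypothetical sampler chain; InitTopP, InitTemp and InitDist are
// assumed names mirroring llama.cpp's top-p, temperature and
// random-pick sampler constructors.
Var chainParams As New LlamaSamplerChainParametersMBS
Var chain As New LlamaSamplerMBS(chainParams)

chain.AddToChain(LlamaSamplerMBS.InitTopP(0.9, 1)) // keep the top 90% probability mass
chain.AddToChain(LlamaSamplerMBS.InitTemp(0.8))    // soften the distribution slightly
chain.AddToChain(LlamaSamplerMBS.InitDist(1234))   // pick randomly with a fixed seed
```

The chain is applied in order for each token, so temperature and top-p filtering happen before the final random pick.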

Step 6 — Ask the Model a Question

Once everything is initialized, generating text is as simple as:

	System.DebugLog context.Ask(smpl, "Can you add 5 and 3 together?")
	System.DebugLog context.Ask(smpl, "And now double?")

Each call feeds your prompt into the model, runs inference, and returns the generated completion as a string. The output depends on the settings applied above and on what the model was trained on.

You may also use the LlamaSamplerMBS class to implement the Ask logic yourself. We include that as an alternative in the example project.

Complete sample code

Here is the complete, ready-to-run example:

	// path to the model gguf file
	Var modelPath As String = "/Users/cs/Temp/test.gguf"
	
	// number of layers to offload to the GPU
	Var ngl As Integer = 99
	
	If Not LlamaMBS.LoadLibrary("/opt/homebrew/Cellar/llama.cpp/6710/lib/libllama.dylib") Then
		System.DebugLog LlamaMBS.LoadErrorMessage
		Return 2
	End If
	
	// load dynamic backends
	LlamaMBS.BackendLoadAll
	
	
	// initialize the model
	Var ModelParams As New LlamaModelParametersMBS
	ModelParams.n_gpu_layers = ngl
	
	Var model As New LlamaModelMBS(modelPath, ModelParams)
	
	// initialize the context
	Var contextParams As New LlamaContextParametersMBS
	
	Var context As New LlamaContextMBS(model, contextParams)
	
	If context.Handle = 0 Then
		System.DebugLog "Failed to create context."
		Return 3
	End If
	
	
	// initialize the sampler
	Var SampleParameters As New LlamaSamplerChainParametersMBS
	SampleParameters.no_perf = True
	
	Var smpl As New LlamaSamplerMBS(SampleParameters)
	
	smpl.AddToChain( LlamaSamplerMBS.InitGreedy )
	
	System.DebugLog context.Ask(smpl, "Can you add 5 and 3 together?")
	System.DebugLog context.Ask(smpl, "And now double?")

Conclusion

With only a handful of API calls, the MBS Xojo Plugins let you load llama.cpp models, run inference, and build fully local AI features directly into your Xojo applications. Whether you’re building chatbots, reasoning tools, or creative assistants, this integration gives you full control and zero cloud dependency.

Please try and let us know how well this works.

Example projects: Llama.zip


Thanks Christian, this looks really helpful.

Could you explain the model file a little bit please? How is this created and how could I add my own data (pdfs) etc to expand the knowledge for my specific use case?

Getting started and appreciate the help!

Well, the plugin only does the inference part.
To train a model, you need to use other tools.

Which ones ? Any example or any documentation of how to train using these tools ?

Did you search for this on the web?

e.g. read here

or maybe this article:
https://medium.com/@ujwalkaka/fine-tuning-llama-3-1-on-your-custom-dataset-a-comprehensive-guide-e5944dd6b5ef

Thanks for this clarification Christian. This is such a fantastic plugin.

I’ve been neck-deep building out Xojo classes to interact with Ollama recently and building a custom Xojo web app to interact with it but I’m now turning my attention to using this plugin to do inference myself within Xojo. Mostly because there are several models that llama.cpp supports but Ollama does not (and may not ever).

I have a few questions Christian, I wonder if you know the answer?

  1. How do chat templates work? My understanding is that the raw tokens that come out of the model are bracketed by special tokens like "|channel", "|start_stop|", etc. I think chat templates are a way to instruct the inference engine how to interpret these tokens. Does your plugin support this, or are they somehow included in the GGUF metadata?

  2. Can you elaborate a bit more about Samplers and what they are?

  3. Is there a way to tell llama.cpp to offload as much as possible to the GPU? For instance, how do I tell it to offload the maximum number of layers?

  4. Is there a way to know if the model is loadable (i.e. whether it is too big to fit in GPU memory)? Does this plugin support loading the model from disk rather than VRAM?


Well, for most of the questions, I need to google myself. I don’t know more than you do!
I just implemented a plugin based on the library.

  1. You can read here: Templates supported by llama_chat_apply_template
  2. The sampler picks the next token from a list of candidates. There are different samplers, such as one that picks a random token, or one that always picks the first.
  3. You need a llama.cpp build with GPU support. You set the n_gpu_layers property to the number of transformer layers to offload to the GPU.
  4. If I read correctly, llama.cpp will load the parts that fit into VRAM.

Maybe ask your question to a chatbot 🙂