Understanding and leveraging Semantic Kernel - Using Semantic Kernel with Ollama

In this post, we'll explore what Ollama is and how it allows us to run language models locally, without relying on any cloud platform.

In the previous post, we deployed a basic "Hello World" Semantic Kernel project by connecting to Azure OpenAI. While this approach is recommended and easy to set up, it can become costly since it requires an active Azure subscription.

That’s why several tools have emerged to let us run language models locally. Among the most popular are Ollama and LM Studio. In this post, we’ll focus on Ollama, but the concepts and approach we cover here apply equally well to LM Studio or similar tools.

What is Ollama?

Imagine a scenario where we decide to use the latest language model released by a well-established, reputable company. We've heard that this LLM is highly performant, and we're eager to try it out immediately. But how exactly can we do that?
In practice, the model itself is just a large set of trained weights: it doesn't come ready to use out of the box. So how can we actually leverage it? This is where certain tools come into play: tools that make it easy to work with these models by allowing us to load them and interact with them through simple queries.

Ollama is one of these tools: a lightweight, user-friendly application that allows us to run large language models locally on our machine, without needing to rely on cloud services. It simplifies the setup and execution of models like LLaMA, Mistral, or Gemma, making it easy for developers to experiment with generative AI in a private, offline environment.

In short, Ollama is ideal when we want to:

  • avoid cloud costs or API limits
  • run models offline for privacy or compliance
  • quickly prototype and test AI functionality locally
Information 1

Despite the similarity in name, Ollama should not be confused with LLaMA, the family of language models developed by Meta (formerly Facebook).

Ollama is a tool—a platform—that allows us to run various language models locally, including but not limited to LLaMA. Think of Ollama as a convenient runtime and interface, while LLaMA is just one of many models it can execute.

Information 2

How does Ollama load and query language models?

Under the hood, Ollama relies on models packaged in the GGUF format (a binary format designed for efficient inference). GGUF is specifically optimized for inference engines such as llama.cpp, which Ollama uses internally.

In practice, models are converted into GGUF format, which bundles the model weights, tokenizer, configuration, and metadata into a single compact, efficient file. This single-file format makes models easy to distribute and run.
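For instance, if we already have a GGUF file on disk, we can register it with Ollama through a small Modelfile and then run it like any other model. The file name and model name below are purely hypothetical.

# Modelfile: points Ollama at a local GGUF file
FROM ./mistral-7b-instruct.Q4_K_M.gguf

PS >> ollama create my-local-mistral -f Modelfile
PS >> ollama run my-local-mistral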

Information 3

At first glance, running models locally may seem ideal—it helps save money and avoids cloud dependencies. However, large language models are, as the name suggests, very large. They often require machines with at least 32 GB of RAM, and in some cases even more.

In the end, depending on our use case and hardware, it might actually be more practical—and even more cost-effective—to use models deployed in the cloud.

Installing Ollama

Installing Ollama is a straightforward process, although it varies slightly depending on the operating system. Simply visit the Ollama website (https://ollama.com) and download the installer for our platform.

After that, we can open a PowerShell terminal (on Windows) and install the desired model. In our case, to keep things simple, we’ll use the phi-3 model, which is relatively lightweight at "only" 2 GB in size.

PS >> ollama run phi3

This command will download the model, and once the download is complete, it will present a prompt where we can start interacting with it directly.
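As a side note, a few other Ollama commands are handy for managing models locally:

PS >> ollama list          # list the models already downloaded
PS >> ollama pull llama3   # download a model without opening a prompt
PS >> ollama rm phi3       # remove a model to free disk space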

There are a few important observations to make about this result. We asked the model a question about the 1988 Tour de France, and contrary to Microsoft's enthusiastic claims, phi-3 performs poorly here, at least when it comes to sports-related knowledge.

  • First of all, the 1988 Tour de France was won by Pedro Delgado, not Greg LeMond.
  • Secondly, Greg LeMond is American, not French.
  • Thirdly, Steve Bauer is actually Canadian, not American.
  • And the list of errors continues.

With such a high number of factual inaccuracies, it becomes difficult to trust the model's responses, particularly for tasks that require reliable knowledge retrieval. As a result, we'll now try a much larger LLM, LLaMA 3, and ask it exactly the same question to see how it compares in terms of accuracy and reliability.

PS >> ollama run llama3

This time, we can see that the answer is accurate, demonstrating the improved reliability and knowledge depth of a more advanced model like LLaMA 3.

Information

This simple experiment reinforces what we previously said about Ollama: while it’s certainly easy to use for running models locally, the lightweight models—those that are easy to download and run—often offer limited performance and reliability.

On the other hand, high-quality models typically require a large amount of RAM, making them impractical to run on a standard personal computer.

This presents a clear dilemma: local setups are convenient and cost-effective, but often lack power, while more capable models demand significant hardware resources.

In any case, Ollama isn't limited to downloading models and querying them through a command-line prompt. It also exposes a built-in HTTP API that can be called from any client application. By default, this API listens on port 11434. Let's now take a look at how it works in practice.
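Before wiring it up to Semantic Kernel, we can quickly exercise this API with a plain HTTP call. The sketch below targets the /api/generate endpoint on the default port; the question itself is just an illustration.

PS >> Invoke-RestMethod -Method Post -Uri "http://localhost:11434/api/generate" `
        -ContentType "application/json" `
        -Body '{ "model": "llama3", "prompt": "Who won the 1988 Tour de France?", "stream": false }'

The JSON answer contains the generated text in its response field.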

Integrating Ollama with Semantic Kernel

Just like in the previous post, we'll set up a Console application to demonstrate how to integrate Semantic Kernel with Ollama. We'll start by creating a new solution called SemanticKernelOllama, and within it, we'll add a Console project named SemanticKernelOllama.Main.
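If we prefer working from the command line, the setup can be scaffolded roughly as follows. The package names are those available at the time of writing; the Ollama connector is still published as a prerelease package, hence the --prerelease flag.

PS >> dotnet new sln -n SemanticKernelOllama
PS >> dotnet new console -n SemanticKernelOllama.Main
PS >> dotnet sln add .\SemanticKernelOllama.Main\SemanticKernelOllama.Main.csproj
PS >> cd .\SemanticKernelOllama.Main
PS >> dotnet add package Microsoft.SemanticKernel
PS >> dotnet add package Microsoft.SemanticKernel.Connectors.Ollama --prerelease

Once the packages are restored, Program.cs contains the following code.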

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

// Depending on the connector version, the Ollama extension methods may be marked
// as experimental and require suppressing the corresponding diagnostic.
#pragma warning disable SKEXP0070

// Configure Semantic Kernel
var builder = Kernel.CreateBuilder();

// Add the Ollama chat completion service (local API, default port 11434)
builder.AddOllamaChatCompletion("llama3", new Uri("http://localhost:11434"));
var kernel = builder.Build();

// Get the chat completion service
var chatService = kernel.GetRequiredService<IChatCompletionService>();

// Initialize the chat history with a system prompt
var history = new ChatHistory();
history.AddSystemMessage("You are a helpful assistant.");

while (true)
{
    Console.Write("You: ");
    var userMessage = Console.ReadLine();

    // An empty line exits the loop
    if (string.IsNullOrWhiteSpace(userMessage))
    {
        break;
    }
    history.AddUserMessage(userMessage);

    // Send the whole conversation to the model and display its reply
    var response = await chatService.GetChatMessageContentsAsync(history);
    Console.WriteLine($"\nBot: {response[0].Content}\n");

    // Keep the assistant's reply in the history for the next turn
    history.AddMessage(response[0].Role, response[0].Content ?? string.Empty);
}
Information

We can observe that this code closely resembles the one used in the previous post with Azure OpenAI. This is no coincidence: Semantic Kernel fully leverages modularity, allowing us to switch from one model to another in a matter of seconds.

  • We're using a dedicated connector for Ollama (the Microsoft.SemanticKernel.Connectors.Ollama package), which simplifies the integration by providing a structured way to connect Semantic Kernel with Ollama's local API; the Microsoft.SemanticKernel.ChatCompletion namespace, for its part, provides the chat abstractions used in the code (IChatCompletionService, ChatHistory).

  • As mentioned earlier, we’re using the API provided by Ollama to interact with the locally running language model (http://localhost:11434).

  • We specified the model we want to use (LLaMA 3) through the dedicated method AddOllamaChatCompletion, which allows us to define the model name when configuring the Semantic Kernel services.
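To see this modularity in action, switching back to the cloud-hosted model from the previous post only requires changing the registration line; the rest of Program.cs stays untouched. The deployment name, endpoint, and key below are placeholders, not real values.

// Local model served by Ollama
builder.AddOllamaChatCompletion("llama3", new Uri("http://localhost:11434"));

// ...or an Azure OpenAI deployment, as in the previous post (placeholder values)
builder.AddAzureOpenAIChatCompletion(
    deploymentName: "gpt-4o",
    endpoint: "https://<our-resource>.openai.azure.com/",
    apiKey: "<our-api-key>");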

Running the program

Now it’s time to run the program.
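Assuming the Ollama service is running in the background and the llama3 model has already been pulled, we can launch the application from the solution directory:

PS >> ollama pull llama3
PS >> dotnet run --project SemanticKernelOllama.Main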

We can now see that we have a fully functional AI agent capable of responding to queries using a local language model.

To sum up, we’ve dedicated two posts to exploring Semantic Kernel in action. We’ve seen how straightforward it is to get started and connect to language models, whether they’re hosted in the cloud or running locally. At first glance, Semantic Kernel might seem like a simple orchestration tool designed to streamline the development of AI agents. On its own, it may appear to offer little added value, apart from saving us from writing repetitive, boilerplate code.

However, in reality, Semantic Kernel goes much further. It truly shines through its support for plugins, which enable powerful capabilities such as retrieval-augmented generation (RAG), allowing our AI agents to access external knowledge sources and provide more accurate, context-aware responses. Plugins will be the topic of the next post.

Understanding and leveraging Semantic Kernel - Discovering plugins and RAG