Learn how to use Ollama to run DeepSeek, Gemma 3, Llama 3, and other LLMs locally with Go's official client library.

Exploring Ollama: Running LLMs Locally with Go


Running large language models locally has become surprisingly accessible. Ollama makes it possible to run models like DeepSeek-R1, Gemma 3, Llama 3, and Mistral on your own machine. And since Ollama is written in Go, it comes with excellent Go support out of the box.

With over 158,000 GitHub stars, Ollama has become one of the most popular ways to run LLMs locally. Let’s explore how to use it with Go.

What is Ollama?

Ollama is a tool for running large language models locally. It handles model management, inference, and provides a simple API. You can run models like:

  • DeepSeek-R1 - Great for reasoning tasks
  • Gemma 3 and Gemma 3n - Google’s latest open models
  • Llama 3 - Meta’s powerful open-source model
  • Mistral - Fast and efficient
  • Phi-4 - Microsoft’s compact model
  • Qwen - Alibaba’s multilingual model
  • LLaVA - For vision tasks

The project includes an official Go client library. This means you get first-class support for building Go applications that use LLMs.

Setting Up Ollama

First, install Ollama from ollama.com. Then pull a model:

ollama pull llama3.2

Now you’re ready to use it from Go.

Using the Go Client Library

Install the official client:

go get github.com/ollama/ollama/api

Here’s a basic example that generates text:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/ollama/ollama/api"
)

func main() {
	client, err := api.ClientFromEnvironment()
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	
	req := &api.GenerateRequest{
		Model:  "llama3.2",
		Prompt: "Explain goroutines in one paragraph.",
	}

	var response string
	err = client.Generate(ctx, req, func(resp api.GenerateResponse) error {
		response += resp.Response
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(response)
}

Notice how the callback function receives chunks of the response. This enables streaming output. If you’re new to working with context in Go, it’s worth understanding how it helps manage request lifecycles.
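
You can also lean on the context to stop a generation early. The sketch below assumes the client honors context cancellation on the underlying request (it accepts ctx as its first argument); tying the context to an interrupt signal lets Ctrl-C stop a long generation cleanly:

package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"os/signal"

	"github.com/ollama/ollama/api"
)

func main() {
	client, err := api.ClientFromEnvironment()
	if err != nil {
		log.Fatal(err)
	}

	// The context is cancelled when the process receives an interrupt (Ctrl-C),
	// which aborts the in-flight generation instead of waiting for it to finish.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
	defer stop()

	req := &api.GenerateRequest{
		Model:  "llama3.2",
		Prompt: "Write a long explanation of Go's scheduler.",
	}

	err = client.Generate(ctx, req, func(resp api.GenerateResponse) error {
		fmt.Print(resp.Response)
		return nil
	})
	if err != nil {
		if ctx.Err() != nil {
			fmt.Println("\n(generation interrupted)")
			return
		}
		log.Fatal(err)
	}
	fmt.Println()
}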

Streaming Responses in Real-Time

For chat applications, you want to display text as it’s generated. Here’s how to stream responses to stdout:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/ollama/ollama/api"
)

func main() {
	client, err := api.ClientFromEnvironment()
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	
	messages := []api.Message{
		{
			Role:    "system",
			Content: "You are a helpful Go programming assistant.",
		},
		{
			Role:    "user",
			Content: "What's the difference between a slice and an array?",
		},
	}

	req := &api.ChatRequest{
		Model:    "gemma3",
		Messages: messages,
	}

	err = client.Chat(ctx, req, func(resp api.ChatResponse) error {
		fmt.Print(resp.Message.Content)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println()
}

The callback fires for each chunk of generated text, typically a token or two at a time. This creates that familiar typing effect you see in ChatGPT-like interfaces.
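
The final chunk also carries basic generation metrics. If api.ChatResponse exposes a Done flag plus the embedded metrics fields (EvalCount, EvalDuration) the way recent versions of the package do (treat the exact field names as an assumption and check the package docs), you can print a small summary once the stream finishes:

err = client.Chat(ctx, req, func(resp api.ChatResponse) error {
	fmt.Print(resp.Message.Content)

	// The last chunk has Done set; EvalCount and EvalDuration are assumed to
	// hold the number of generated tokens and the time spent generating them.
	if resp.Done && resp.EvalDuration > 0 {
		fmt.Printf("\n\n[%d tokens, %.1f tok/s]\n",
			resp.EvalCount, float64(resp.EvalCount)/resp.EvalDuration.Seconds())
	}
	return nil
})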

Building a Simple CLI Chat

Let’s build something practical. Here’s a minimal interactive chat:

package main

import (
	"bufio"
	"context"
	"fmt"
	"log"
	"os"
	"strings"

	"github.com/ollama/ollama/api"
)

func main() {
	client, err := api.ClientFromEnvironment()
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	scanner := bufio.NewScanner(os.Stdin)
	
	var history []api.Message
	history = append(history, api.Message{
		Role:    "system",
		Content: "You are a helpful assistant. Keep responses concise.",
	})

	fmt.Println("Chat with Ollama (type 'quit' to exit)")
	
	for {
		fmt.Print("\nYou: ")
		if !scanner.Scan() {
			break
		}
		
		input := strings.TrimSpace(scanner.Text())
		if input == "quit" {
			break
		}
		if input == "" {
			continue
		}

		history = append(history, api.Message{
			Role:    "user",
			Content: input,
		})

		req := &api.ChatRequest{
			Model:    "llama3.2",
			Messages: history,
		}

		fmt.Print("Assistant: ")
		var response strings.Builder
		
		err = client.Chat(ctx, req, func(resp api.ChatResponse) error {
			fmt.Print(resp.Message.Content)
			response.WriteString(resp.Message.Content)
			return nil
		})
		if err != nil {
			log.Printf("Error: %v\n", err)
			continue
		}
		fmt.Println()

		history = append(history, api.Message{
			Role:    "assistant",
			Content: response.String(),
		})
	}
}

This maintains conversation history. Each request includes previous messages. The model uses this context to give relevant responses.
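
Because every turn resends the full history, long sessions make each request bigger and slower. A small, hypothetical trimHistory helper that keeps the system prompt plus the most recent messages is one way to bound that growth:

// trimHistory keeps the leading system message (if any) plus the last
// maxMessages entries so requests don't grow without bound.
func trimHistory(history []api.Message, maxMessages int) []api.Message {
	if len(history) <= maxMessages+1 {
		return history
	}

	trimmed := make([]api.Message, 0, maxMessages+1)
	if history[0].Role == "system" {
		trimmed = append(trimmed, history[0])
	}
	return append(trimmed, history[len(history)-maxMessages:]...)
}

In the chat loop above, you would pass Messages: trimHistory(history, 20) when building the ChatRequest.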

Handling Errors Gracefully

When working with LLMs, things can go wrong. The model might not be downloaded, the server might be busy, or a request might take longer than expected. Handle these cases properly, and retry transient failures:

func generateWithRetry(ctx context.Context, client *api.Client, model, prompt string) (string, error) {
	req := &api.GenerateRequest{
		Model:  model,
		Prompt: prompt,
	}

	const maxAttempts = 3
	var lastErr error

	for attempt := 1; attempt <= maxAttempts; attempt++ {
		var result strings.Builder

		err := client.Generate(ctx, req, func(resp api.GenerateResponse) error {
			result.WriteString(resp.Response)
			return nil
		})
		if err == nil {
			return result.String(), nil
		}

		// Don't retry if the caller cancelled the request or the deadline passed.
		if ctx.Err() != nil {
			return "", fmt.Errorf("request cancelled: %w", ctx.Err())
		}

		lastErr = err
		// Simple linear backoff before the next attempt.
		time.Sleep(time.Duration(attempt) * time.Second)
	}

	return "", fmt.Errorf("generation failed after %d attempts: %w", maxAttempts, lastErr)
}

For robust error handling patterns in Go, consider wrapping errors with context about what operation failed.
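
One failure mode mentioned above, a model that hasn’t been pulled, can be caught before the first prompt. This sketch assumes the client exposes a List method that returns the locally installed models with a Name field per entry, as recent versions of the api package do; double-check against the package docs:

// modelAvailable reports whether a model with the given name has been pulled
// locally. client.List and the Models[].Name field are assumptions about the
// library's current API surface.
func modelAvailable(ctx context.Context, client *api.Client, name string) (bool, error) {
	resp, err := client.List(ctx)
	if err != nil {
		return false, fmt.Errorf("listing local models: %w", err)
	}
	for _, m := range resp.Models {
		// Installed models carry a tag, e.g. "llama3.2:latest".
		if m.Name == name || strings.HasPrefix(m.Name, name+":") {
			return true, nil
		}
	}
	return false, nil
}

If it returns false, you can fail fast with a clear hint ("run ollama pull llama3.2") instead of letting the chat call error out mid-request.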

Switching Between Models

One of Ollama’s strengths is easy model switching. You can use DeepSeek for complex reasoning, Gemma for general tasks, or LLaVA for image understanding:

type ModelConfig struct {
	Name        string
	Description string
}

var models = map[string]ModelConfig{
	"reasoning": {Name: "deepseek-r1", Description: "Best for complex reasoning"},
	"general":   {Name: "llama3.2", Description: "Good all-around model"},
	"fast":      {Name: "gemma3", Description: "Quick responses"},
	"code":      {Name: "qwen2.5-coder", Description: "Code generation"},
}

func getModel(task string) string {
	if cfg, ok := models[task]; ok {
		return cfg.Name
	}
	return models["general"].Name
}

This pattern lets you pick the right model for each task.
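
Wiring it into a request is then a one-liner. The "code" key comes from the map above, so adjust it to whatever tasks you define:

req := &api.ChatRequest{
	Model: getModel("code"),
	Messages: []api.Message{
		{Role: "user", Content: "Write a function that reverses a slice of ints."},
	},
}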

Performance Tips

Running LLMs locally uses significant resources. Here are some tips:

  1. Use smaller models when possible - Gemma 2B is faster than Llama 70B
  2. Set reasonable timeouts - Some queries take time
  3. Limit context length - Shorter prompts mean faster responses (see the Options sketch below)
  4. Consider quantized models - They use less memory

For the timeout tip, give every Ollama call a context with a deadline:

ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
defer cancel()

// Use the same context for all Ollama calls
err := client.Chat(ctx, req, callback)
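
For the context-length tip, per-request settings can be passed through the request’s Options map. The keys mirror Ollama’s model parameters (num_ctx for the context window, num_predict to cap the response length, temperature for sampling); the exact set of supported keys is worth confirming against the Ollama docs:

req := &api.ChatRequest{
	Model:    "llama3.2",
	Messages: messages,
	Options: map[string]interface{}{
		"num_ctx":     2048, // smaller context window
		"num_predict": 256,  // cap the response length
		"temperature": 0.7,
	},
}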

Wrapping Up

Ollama makes running LLMs locally straightforward. The Go client library is clean and well-designed. You can build chat applications, code assistants, or integrate AI into existing Go services.

Start with a simple model like Llama 3.2 or Gemma 3. Experiment with different models to find what works for your use case. The official Ollama GitHub repository has excellent documentation and a growing model library.

Running models locally means no API costs, no rate limits, and complete privacy. That’s a compelling combination for many applications.