Etcd is the backbone of Kubernetes. Here's how this Go project handles distributed consensus.

The distributed key-value store that powers Kubernetes


Every Kubernetes cluster relies on etcd. It stores all cluster state - pods, services, secrets, everything. If etcd goes down, the control plane can no longer read or update that state: running pods keep running, but nothing new can be scheduled or changed. But what exactly is etcd, and why is it such a critical piece of infrastructure?

Etcd is a distributed key-value store written in Go, designed to hold the most critical data in distributed systems. It's a graduated CNCF project, putting it in the same tier as Kubernetes itself.

What makes etcd special?

Etcd uses the Raft consensus algorithm to maintain consistency across nodes. As long as a majority of nodes (a quorum) is up, your data stays consistent and available, even if the remaining nodes fail.

The key features are:

  • Strong consistency through Raft
  • Watch functionality for real-time updates
  • Automatic leader election
  • Transactional operations (compare-and-swap)

If you’re building distributed systems in Go, understanding etcd is essential. The patterns it uses - consensus, leader election, distributed locking - appear everywhere in modern infrastructure.

Connecting to etcd with Go

Let’s start with a basic connection. You’ll need the official etcd client library:

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Create a client connection
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Put a key-value pair
	_, err = cli.Put(ctx, "service/config", "production")
	if err != nil {
		log.Fatal(err)
	}

	// Get the value back
	resp, err := cli.Get(ctx, "service/config")
	if err != nil {
		log.Fatal(err)
	}

	for _, kv := range resp.Kvs {
		fmt.Printf("%s : %s\n", kv.Key, kv.Value)
	}
}

Notice how context is used throughout. Etcd operations can hang if the cluster is unavailable, so timeouts are crucial.
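The feature list above mentioned transactional compare-and-swap, and it builds directly on Put and Get. A transaction compares a key's current state, then applies different operations depending on the result. Here's a sketch reusing the "service/config" key from the example above; casUpdate and the "staging" value are illustrative:

```go
// casUpdate moves "service/config" from "staging" to "production",
// but only if it still holds the expected value (compare-and-swap).
func casUpdate(ctx context.Context, cli *clientv3.Client) error {
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.Value("service/config"), "=", "staging")).
		Then(clientv3.OpPut("service/config", "production")).
		Commit()
	if err != nil {
		return err
	}
	if !resp.Succeeded {
		// The comparison failed: someone else changed the key first.
		fmt.Println("value changed underneath us; update skipped")
	}
	return nil
}
```

The whole If/Then/Commit runs atomically on the server, so two processes racing to update the same key can't both win.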

Watching for changes

One of etcd’s most powerful features is watching keys for changes. This is how Kubernetes controllers react to state changes:

func watchConfig(cli *clientv3.Client) {
	// Watch for changes to any key with the "service/" prefix
	watchChan := cli.Watch(context.Background(), "service/", clientv3.WithPrefix())

	for watchResp := range watchChan {
		for _, event := range watchResp.Events {
			switch event.Type {
			case clientv3.EventTypePut:
				fmt.Printf("Key updated: %s = %s\n", event.Kv.Key, event.Kv.Value)
			case clientv3.EventTypeDelete:
				fmt.Printf("Key deleted: %s\n", event.Kv.Key)
			}
		}
	}
}

The watch channel stays open and streams events as they happen. This is far more efficient than polling.
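Watches are also resumable. Every event carries a revision number, and you can ask the watch to start just after the last revision you processed, so a reconnecting client misses nothing. A minimal sketch (watchFrom is an illustrative helper):

```go
// watchFrom resumes watching the "service/" prefix starting just
// after lastRev, so events that occurred while we were disconnected
// are replayed instead of lost.
func watchFrom(cli *clientv3.Client, lastRev int64) {
	watchChan := cli.Watch(context.Background(), "service/",
		clientv3.WithPrefix(), clientv3.WithRev(lastRev+1))

	for watchResp := range watchChan {
		for _, event := range watchResp.Events {
			fmt.Printf("rev %d: %s %s\n",
				event.Kv.ModRevision, event.Type, event.Kv.Key)
			// Remember the last revision we saw, for the next resume.
			lastRev = event.Kv.ModRevision
		}
	}
}
```

Persist lastRev somewhere durable if you need exactly-once processing across restarts.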

Distributed locking

Etcd provides distributed locking through its concurrency package. This is useful when you need to ensure only one process runs a task:

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Create a session with a 10 second TTL
	session, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Create a mutex on the /my-lock/ prefix
	mutex := concurrency.NewMutex(session, "/my-lock/")

	ctx := context.Background()

	// Acquire the lock
	if err := mutex.Lock(ctx); err != nil {
		log.Fatal(err)
	}
	fmt.Println("Lock acquired, doing critical work...")

	// Simulate work
	time.Sleep(2 * time.Second)

	// Release the lock
	if err := mutex.Unlock(ctx); err != nil {
		log.Fatal(err)
	}
	fmt.Println("Lock released")
}

The session has a TTL. If your process crashes, the lock releases automatically after the TTL expires. This prevents deadlocks in distributed systems.
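Lock blocks until the lock is free, which isn't always what you want. Newer versions of the client also offer a non-blocking TryLock, which fails fast if another session holds the mutex. A sketch, assuming a client version with TryLock (tryCriticalWork is an illustrative helper):

```go
// tryCriticalWork attempts to take the lock without blocking; if
// another session holds it, we skip the work instead of waiting.
func tryCriticalWork(ctx context.Context, session *concurrency.Session) error {
	mutex := concurrency.NewMutex(session, "/my-lock/")

	err := mutex.TryLock(ctx)
	if err == concurrency.ErrLocked {
		fmt.Println("another process holds the lock; skipping")
		return nil
	}
	if err != nil {
		return err
	}
	defer mutex.Unlock(ctx)

	fmt.Println("lock acquired, doing critical work...")
	return nil
}
```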

Best practices

Use prefixes for organization. Structure your keys like a filesystem: /services/api/config, /services/worker/status. This makes watching and listing easier.
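With that layout, a single prefixed Get lists everything under a subtree, the same way the watch example used WithPrefix. A sketch (listServices and the /services/ prefix are illustrative):

```go
// listServices fetches every key under the /services/ prefix
// in one range request.
func listServices(ctx context.Context, cli *clientv3.Client) error {
	resp, err := cli.Get(ctx, "/services/", clientv3.WithPrefix())
	if err != nil {
		return err
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
	return nil
}
```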

Set reasonable timeouts. Network partitions happen. Always use contexts with timeouts to avoid hanging indefinitely.

Handle connection failures gracefully. Etcd clusters can become temporarily unavailable. Your application should retry with backoff.

Don’t store large values. Etcd is built for small values: requests are capped at 1.5 MB by default, and values in the low kilobytes work best. For larger data, store a reference and keep the actual data elsewhere.

When to use etcd

Etcd shines for:

  • Configuration management
  • Service discovery
  • Leader election
  • Distributed locking
  • Coordination between services

It’s not meant for high-throughput data storage. For that, look at traditional databases or purpose-built solutions.
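Leader election, listed above, comes from the same concurrency package as the mutex. Campaign blocks until this node wins the election, so only one campaigner at a time runs the leader-only work. A sketch (runForLeader and the /my-election/ prefix are illustrative):

```go
// runForLeader campaigns for leadership; Campaign blocks until this
// node becomes leader, then we do leader-only work and step down.
func runForLeader(ctx context.Context, cli *clientv3.Client, nodeID string) error {
	// If this process dies, the session TTL expires and leadership
	// passes to another campaigner automatically.
	session, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		return err
	}
	defer session.Close()

	election := concurrency.NewElection(session, "/my-election/")
	if err := election.Campaign(ctx, nodeID); err != nil {
		return err
	}
	fmt.Printf("%s is now the leader\n", nodeID)

	// ... leader-only work here ...

	return election.Resign(ctx)
}
```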

If you’re building infrastructure tools in Go, etcd is worth understanding deeply. The Raft consensus algorithm it implements is the same one used by many other distributed systems. Learning how etcd works teaches you fundamental patterns for building reliable distributed software.