Jun 10, 2026

How Wellcake structures its Kubernetes operator API in Go

Wellcake is a Kubernetes operator for Valkey, the open-source Redis fork. It handles standalone deployments, replication, sentinel setups, cluster mode, failover, rolling restarts, slot migration, and S3 backups. That’s a huge operational surface for one operator, and the Go code behind it is worth picking apart if you’re building your own.

The scaffolding will look familiar if you’ve used Kubebuilder, but a few details are worth a closer look: how it keeps two API versions alive at once, where it puts defaulting and validation, and how it splits controller logic by operational concern instead of dumping everything into one reconcile function.

CRD types: two API versions, two custom resources

Wellcake defines two custom resources: ValkeyCluster and ValkeyACL. Both exist in two API versions: v1alpha1 and v1beta1.

The type definitions live in:

api/v1alpha1/valkeycluster_types.go
api/v1alpha1/valkeyacl_types.go
api/v1beta1/valkeycluster_types.go
api/v1beta1/valkeyacl_types.go

That is the standard Kubebuilder layout. Each version directory has its own groupversion_info.go, and the generated deepcopy files provide the methods Kubernetes needs to move these objects through its API machinery.

The interesting part is that both versions are actively registered in cmd/main.go. v1beta1 is the storage version the controller works with, while v1alpha1 stays served as a compatibility layer. That is more work than many operators take on, but it gives the project room to evolve the API without forcing every user to migrate in lockstep.

API version conversion

Both version directories include a conversion.go file:

api/v1alpha1/conversion.go
api/v1beta1/conversion.go

When you support multiple API versions for the same CRD, the usual pattern is a hub-and-spoke model. One version acts as the canonical form, and the other versions implement ConvertTo() and ConvertFrom() around it. That is the model described in the Kubebuilder multi-version guide and the Kubernetes CRD versioning docs.

In Wellcake, v1beta1 is the hub. api/v1beta1/conversion.go just marks both types with Hub(). The real work sits in api/v1alpha1/conversion.go, where ValkeyCluster and ValkeyACL copy ObjectMeta and then JSON round-trip Spec and Status into the hub type.

That JSON round-trip is a tell. The two versions are structurally identical today, so Wellcake does not need hand-written field mapping yet. The comment in the file even says to replace that with explicit mappings once the schemas diverge. It is a practical middle ground: support multiple versions now, keep the conversion code cheap until the API actually changes.
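To make the pattern concrete, here is a minimal sketch of a hub-and-spoke conversion done through a JSON round-trip. It uses only the standard library. The Hub interface imitates the marker interface in controller-runtime's conversion package, and AlphaCluster, BetaCluster, their spec fields, and roundTrip are hypothetical names of mine, not Wellcake's actual types.

```go
package main

import (
	"encoding/json"
	"errors"
)

// Hub imitates the marker interface from controller-runtime's conversion
// package. The real interface also embeds runtime.Object.
type Hub interface{ Hub() }

// The two spec types are hypothetical and intentionally identical.
type AlphaSpec struct {
	Replicas int    `json:"replicas"`
	Image    string `json:"image,omitempty"`
}

type BetaSpec struct {
	Replicas int    `json:"replicas"`
	Image    string `json:"image,omitempty"`
}

// AlphaCluster plays the spoke (think v1alpha1).
type AlphaCluster struct {
	Name string
	Spec AlphaSpec
}

// BetaCluster plays the hub (think v1beta1).
type BetaCluster struct {
	Name string
	Spec BetaSpec
}

// Hub marks BetaCluster as the canonical version. The empty body is the point.
func (*BetaCluster) Hub() {}

// roundTrip copies src into dst by marshalling to JSON and back. It only
// works while both sides use the same JSON field names.
func roundTrip(src, dst any) error {
	raw, err := json.Marshal(src)
	if err != nil {
		return err
	}
	return json.Unmarshal(raw, dst)
}

// ConvertTo copies the spoke into the hub.
func (a *AlphaCluster) ConvertTo(dst Hub) error {
	hub, ok := dst.(*BetaCluster)
	if !ok {
		return errors.New("unexpected hub type")
	}
	hub.Name = a.Name
	return roundTrip(a.Spec, &hub.Spec)
}

// ConvertFrom copies the hub back into the spoke.
func (a *AlphaCluster) ConvertFrom(src Hub) error {
	hub, ok := src.(*BetaCluster)
	if !ok {
		return errors.New("unexpected hub type")
	}
	a.Name = hub.Name
	return roundTrip(hub.Spec, &a.Spec)
}
```

The trade-off is visible in roundTrip: it costs nothing to maintain while the schemas match, but it silently drops any field that gets renamed on one side, which is why explicit mappings become necessary once the versions diverge.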

Defaults and webhook validation

Both API version directories also include a defaults.go file:

api/v1alpha1/defaults.go
api/v1beta1/defaults.go

Those files do not register webhooks by themselves. They define a Default() method on the API types. In practice, the mutating webhook is wired in internal/webhook/v1beta1/valkeycluster_webhook.go, and that webhook calls vc.Default() on the v1beta1 type after conversion.

That split is clean. Default values live next to the type definition, while the admission wiring stays in the webhook package. Wellcake also calls this defaulting logic defensively from reconcile, so older objects or clusters with webhooks disabled still converge on the same defaults. The validation side follows the same pattern: internal/webhook/v1beta1/valkeycluster_webhook.go and valkeyacl_webhook.go enforce cross-resource checks that simple schema rules cannot cover, like referenced Secrets existing in the same namespace.
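A rough sketch of why that works: if Default() only fills in unset fields, it is idempotent, so calling it from both the webhook and the reconcile path is harmless. The fields, the default values, and the admit and reconcile helpers here are assumptions for illustration and are not taken from Wellcake's defaults.go.

```go
package main

// ValkeyClusterSpec holds a few hypothetical fields. Replicas is a pointer
// so that "unset" can be told apart from an explicit zero.
type ValkeyClusterSpec struct {
	Replicas *int32
	Image    string
	Port     int32
}

// ValkeyCluster is a stripped-down stand-in for the real API type.
type ValkeyCluster struct {
	Spec ValkeyClusterSpec
}

// Default fills in unset fields only. Values the user supplied are never
// overwritten, so repeated calls leave the object unchanged.
func (vc *ValkeyCluster) Default() {
	if vc.Spec.Replicas == nil {
		one := int32(1)
		vc.Spec.Replicas = &one
	}
	if vc.Spec.Image == "" {
		vc.Spec.Image = "valkey/valkey:8" // assumed default image
	}
	if vc.Spec.Port == 0 {
		vc.Spec.Port = 6379 // Valkey's standard port
	}
}

// admit stands in for the mutating webhook path.
func admit(vc *ValkeyCluster) {
	vc.Default()
}

// reconcile stands in for the top of the reconcile path. It defaults again
// in case the object never passed through the webhook, then reports the
// replica count it would act on.
func reconcile(vc *ValkeyCluster) int32 {
	vc.Default()
	return *vc.Spec.Replicas
}
```

An object created while webhooks were disabled reaches reconcile with empty fields and still ends up with the same values an admitted object would have.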

The controller layer

The reconciliation logic lives in internal/controller/. This is where the operator reacts to changes in ValkeyCluster and ValkeyACL resources. The directory structure tells you how much Wellcake is doing:

internal/controller/cluster.go — main cluster reconciliation
internal/controller/cluster_pershard.go — per-shard reconciliation for Valkey cluster mode
internal/controller/failover.go — operator-driven failover logic
internal/controller/rollout.go — proactive zero-downtime rolling restarts
internal/controller/backup.go — S3 backup orchestration
internal/controller/restore_assembly.go — restore from backup
internal/controller/password_rotation.go — credential rotation
internal/controller/metrics.go — Prometheus metrics
internal/controller/valkeycluster_controller.go — the main ValkeyClusterReconciler
internal/controller/valkeyacl_controller.go — ACL reconciliation

Those files feed into controller-runtime’s normal reconcile loop:

func (r *ValkeyClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // fetch the ValkeyCluster object
    // determine current state
    // determine desired state
    // reconcile the difference
    return ctrl.Result{}, nil
}

The part I like is the split by operational concern. Failover, backups, rollout orchestration, and restore assembly all have different failure modes. Keeping them in separate files makes the codebase easier to reason about than one oversized reconcile method with twenty branches and a pile of helper booleans.
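One way to get that shape is to treat each concern as a step and have the top-level reconcile walk an ordered list of them. This is a sketch of the general idea, not Wellcake's actual dispatch code. Result stands in for ctrl.Result, and the cluster struct, the step type, and the three step functions are all hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Result is a stand-in for controller-runtime's ctrl.Result.
type Result struct{ RequeueAfter time.Duration }

// cluster is a toy model of the state a reconcile pass looks at.
type cluster struct {
	failoverPending bool
	visited         []string // records which steps ran, for illustration
}

// step is one operational concern. stop=true ends the current pass early.
type step struct {
	name string
	run  func(ctx context.Context, c *cluster) (res Result, stop bool, err error)
}

func reconcileFailover(_ context.Context, c *cluster) (Result, bool, error) {
	c.visited = append(c.visited, "failover")
	if c.failoverPending {
		c.failoverPending = false // pretend a promotion was issued
		return Result{RequeueAfter: 5 * time.Second}, true, nil
	}
	return Result{}, false, nil
}

func reconcileRollout(_ context.Context, c *cluster) (Result, bool, error) {
	c.visited = append(c.visited, "rollout")
	return Result{}, false, nil
}

func reconcileBackup(_ context.Context, c *cluster) (Result, bool, error) {
	c.visited = append(c.visited, "backup")
	return Result{}, false, nil
}

// reconcileAll runs the steps in priority order and returns as soon as one
// of them fails or asks to stop, so later concerns never act on a cluster
// that is in the middle of a failover.
func reconcileAll(ctx context.Context, c *cluster) (Result, error) {
	steps := []step{
		{name: "failover", run: reconcileFailover},
		{name: "rollout", run: reconcileRollout},
		{name: "backup", run: reconcileBackup},
	}
	for _, s := range steps {
		res, stop, err := s.run(ctx, c)
		if err != nil {
			return Result{}, fmt.Errorf("%s: %w", s.name, err)
		}
		if stop {
			return res, nil
		}
	}
	return Result{}, nil
}

// runOnce is a small convenience wrapper for trying the sketch.
func runOnce(c *cluster) (Result, error) {
	return reconcileAll(context.Background(), c)
}
```

Each step can live in its own file and be tested on its own, and the ordering of concerns is stated in exactly one place.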

The reconciler code also does the expected Kubernetes ownership wiring with controllerutil.SetControllerReference(). You can see that pattern in places like ensureStatefulSet, ensureHeadlessService, ensureClientService, ensureBackupCronJob, and the restore/scale jobs. That means child resources point back to the parent ValkeyCluster through owner references, so the Kubernetes garbage collector removes them when the parent is deleted instead of forcing the operator to manually chase everything down later.

The kubectl plugin

Wellcake ships a kubectl plugin at cmd/kubectl-valkey/. It is a standalone Go binary with its own main.go, one file per group of subcommands, and a shared client setup file:

cmd/kubectl-valkey/status.go — cluster status reporting
cmd/kubectl-valkey/backup.go — trigger backups
cmd/kubectl-valkey/ops.go — operational commands
cmd/kubectl-valkey/cli.go — Valkey CLI passthrough
cmd/kubectl-valkey/report.go — diagnostic reports
cmd/kubectl-valkey/certificate.go — TLS certificate management
cmd/kubectl-valkey/client.go — Kubernetes client setup

This follows the kubectl plugin convention: an executable named kubectl-valkey becomes available as kubectl valkey.

The implementation is also a little cleaner than the usual “CLI wraps shell commands” approach. cmd/kubectl-valkey/main.go builds a Cobra root command and registers subcommands for status, CLI access, backup, restart, reshard, failover, hibernation, certificate handling, and report generation.

cmd/kubectl-valkey/client.go builds a controller-runtime client, registers the core Kubernetes scheme plus Wellcake’s v1beta1 API, and loads config from the ambient kubeconfig with ctrlconfig.GetConfig(). That matters because the plugin needs to work with custom resources like ValkeyCluster, not just built-in types. The backup subcommand is a good example of the style: it reads the cluster’s backup CronJob, materializes a one-off Job from the template, and submits it directly through the Kubernetes API.

The operator entry point

The main entry point at cmd/main.go is the usual controller-manager bootstrap, but it is worth reading because it shows the operator’s real boundaries. It registers both API versions, scopes watches with WATCH_NAMESPACE when requested, narrows the cached Pods to the operator’s own data-plane label set, configures the webhook server, then wires up ValkeyClusterReconciler and ValkeyACLReconciler.
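Namespace scoping is the easiest of those to sketch. The version here assumes WATCH_NAMESPACE holds a comma-separated list and that an unset or empty value means cluster-wide. Both are my assumptions, since I am not reproducing Wellcake's exact parsing, and the function names are hypothetical.

```go
package main

import (
	"os"
	"strings"
)

// watchNamespaces parses WATCH_NAMESPACE into a list of namespaces. A nil
// result means "watch the whole cluster". The lookup function is injected
// so the parsing can be tested without touching the process environment.
func watchNamespaces(lookup func(string) (string, bool)) []string {
	raw, ok := lookup("WATCH_NAMESPACE")
	if !ok {
		return nil
	}
	var out []string
	for _, ns := range strings.Split(raw, ",") {
		// Tolerate stray spaces and empty entries such as "a, ,b".
		if ns = strings.TrimSpace(ns); ns != "" {
			out = append(out, ns)
		}
	}
	return out
}

// watchNamespacesFromEnv reads the real process environment.
func watchNamespacesFromEnv() []string {
	return watchNamespaces(os.LookupEnv)
}
```

In a real manager, a list like this would typically be turned into the cache's per-namespace configuration before the manager is created, so the informers never list or watch objects outside those namespaces.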

That file is also where you see the project moving beyond pure scaffold code. There is explicit handling for secure metrics, webhook certificates, leader election, and a max-concurrent-reconciles flag. In other words, this is not just a generated operator skeleton. It is a generated skeleton that has been pushed toward production concerns.

What I’d take from this codebase

If you’re building your own operator, the first thing worth stealing is the file-per-concern split in internal/controller/. Most operators become hard to work on because every concern ends up welded into one reconcile path. Wellcake is not tiny, but it stays legible because failover, rollout, backup, restore, and ACL handling are clearly separated.

The multi-version API setup is the second useful lesson. Even though v1alpha1 and v1beta1 are still structurally identical, the project has already put the hub-and-spoke conversion path in place. That is exactly the kind of boring groundwork that saves pain later.

If you’re managing stateful infrastructure on Kubernetes, these are the problems that actually justify an operator: failover policy, rolling restarts, backup jobs, restore paths, TLS, and API evolution. Wellcake is worth reading because it tackles those directly instead of hiding them behind magic. For more on structuring Go projects around explicit boundaries, see how to structure a Go project.