NUMA-aware CPU scheduling in Go for 128-core Kubernetes nodes
A 128-thread server is not really a bag of 128 identical CPUs. On a modern EPYC or Xeon box those CPUs are split across sockets, NUMA nodes, memory controllers, and cache domains. Linux will keep the machine busy either way, but a process that keeps moving between distant CPUs, or allocates memory on one NUMA node while running on another, pays for it in cache misses and remote memory access.
NexusFlow is a Go project built around that seam between orchestration and hardware topology. It does not try to replace kube-scheduler or Slurm. It works one host at a time: read the local topology, choose a CPU set, pin a process, optionally ask numactl to bind memory to the same NUMA node, and expose a daemon path for cgroup-backed execution cells.
The repository is worth reading because the implementation is concrete. It is mostly ordinary Go wrapped around Linux interfaces: sysfs, cgroups v2, /proc, /dev/shm, sched_setaffinity, taskset, perf events, and a small gRPC API.
Starting from the machine Linux sees
The topology path is in nexusflow/pkg/topology. On Linux, discover_linux.go reads the same files you would inspect by hand when debugging a NUMA host:
/sys/devices/system/cpu/present
/sys/devices/system/cpu/cpu*/topology/physical_package_id
/sys/devices/system/cpu/cpu*/topology/core_id
/sys/devices/system/cpu/cpu*/topology/thread_siblings_list
/sys/devices/system/node/node*/cpulist
/sys/devices/system/node/node*/distance
That gives NexusFlow a Topology containing logical CPUs and NUMA nodes. CPUs carry their package ID, core ID, NUMA node, and sibling threads. NUMA nodes carry their CPU list and, when the kernel exposes it, a distance vector.
There is also an hwloc package for richer topology sources, but the sysfs implementation is the interesting baseline: no daemon, no kernel module, no vendor library. Just parse the kernel’s text files carefully and keep the result as plain data.
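To make the shape of that parsing concrete, here is a small standalone sketch, not the project's code, that expands one of those files, the per-node cpulist, into CPU IDs. The parseCPUList helper and the hard-coded node0 path are illustrative:

package main

import (
    "fmt"
    "os"
    "strconv"
    "strings"
)

// parseCPUList expands a kernel cpulist string such as "0-31,64-95"
// into individual CPU IDs.
func parseCPUList(s string) ([]int, error) {
    var cpus []int
    for _, part := range strings.Split(strings.TrimSpace(s), ",") {
        if part == "" {
            continue
        }
        lo, hi, isRange := strings.Cut(part, "-")
        if !isRange {
            hi = lo
        }
        start, err := strconv.Atoi(lo)
        if err != nil {
            return nil, err
        }
        end, err := strconv.Atoi(hi)
        if err != nil {
            return nil, err
        }
        for id := start; id <= end; id++ {
            cpus = append(cpus, id)
        }
    }
    return cpus, nil
}

func main() {
    raw, err := os.ReadFile("/sys/devices/system/node/node0/cpulist")
    if err != nil {
        panic(err)
    }
    cpus, err := parseCPUList(string(raw))
    if err != nil {
        panic(err)
    }
    fmt.Println("node0 CPUs:", cpus)
}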
The placement policy is small enough to trust
CPU selection happens in nexusflow/pkg/topology/placement.go. The only strategy today is same-numa, and the behaviour is intentionally conservative.
If the caller passes --numa N, NexusFlow takes CPUs from that NUMA node and fails if the node cannot satisfy the request. Without an explicit node, it looks for a NUMA node large enough for the requested CPU count, chooses the largest one, and breaks ties by lower node ID. If no single node is large enough, it spills across nodes in order until it has enough CPUs.
That is not pretending to be a full scheduling paper. It does not model every L3 cache group or try to score every possible placement. The useful part is that the policy is deterministic and easy to inspect. On a production host, that matters. If a job got CPUs 0-15, you can work backwards from the topology and understand why.
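A rough restatement of that decision order, compressed into one hypothetical function (the real SelectCPUs in placement.go has more structure; selectCPUs and nodeCPUs here are made up for illustration):

package placement

import (
    "fmt"
    "sort"
)

// selectCPUs restates the same-numa policy: an explicit node must satisfy
// the request outright; otherwise prefer the largest node that fits, with
// ties going to the lower node ID, and spill across nodes in ID order if
// no single node is big enough. nodeCPUs maps NUMA node ID to its CPU IDs.
func selectCPUs(nodeCPUs map[int][]int, want int, explicit *int) ([]int, error) {
    if explicit != nil {
        cpus := nodeCPUs[*explicit]
        if len(cpus) < want {
            return nil, fmt.Errorf("node %d has %d CPUs, need %d", *explicit, len(cpus), want)
        }
        return cpus[:want], nil
    }

    ids := make([]int, 0, len(nodeCPUs))
    for id := range nodeCPUs {
        ids = append(ids, id)
    }
    sort.Ints(ids) // deterministic order; ties break toward the lower node ID

    best := -1
    for _, id := range ids {
        if len(nodeCPUs[id]) >= want && (best == -1 || len(nodeCPUs[id]) > len(nodeCPUs[best])) {
            best = id
        }
    }
    if best != -1 {
        return nodeCPUs[best][:want], nil
    }

    // No single node is large enough: spill across nodes in ID order.
    var cpus []int
    for _, id := range ids {
        cpus = append(cpus, nodeCPUs[id]...)
        if len(cpus) >= want {
            return cpus[:want], nil
        }
    }
    return nil, fmt.Errorf("host has %d CPUs, need %d", len(cpus), want)
}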
run pins the thread that is about to become your workload
The most direct user path is nexusflow run:
nexusflow run --cpus 16 --numa 0 --priority normal --membind=true -- make -j16
The command discovers topology, calls SelectCPUs, wraps with numactl when memory binding is possible, then calls into pkg/affinity/run_linux.go.
The Linux code builds a unix.CPUSet, locks the current goroutine to its OS thread, optionally changes niceness, applies sched_setaffinity(0, &set), and finally calls syscall.Exec:
var set unix.CPUSet
set.Zero()
for _, id := range cpuIDs {
    set.Set(id)
}
// Lock this goroutine to its OS thread: the affinity call below applies to
// the calling thread, and that same thread is about to call execve.
runtime.LockOSThread()
if opts.Nice != nil {
    _ = unix.Setpriority(unix.PRIO_PROCESS, 0, *opts.Nice)
}
// pid 0 means "the calling thread"; the exec'd image inherits this mask.
_ = unix.SchedSetaffinity(0, &set)
return syscall.Exec(path, argv, os.Environ())
The LockOSThread call is doing real work here. NexusFlow pins the OS thread that will call execve; after exec, the replacement process inherits that affinity mask.
Memory binding is separate. If all selected CPUs belong to one NUMA node and numactl is available, the command becomes:
numactl --cpunodebind=N --membind=N -- <argv...>
If numactl is missing, NexusFlow still pins CPUs. It just leaves memory policy to the kernel default.
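That decision fits in a few lines. A hypothetical sketch of the wrapping step, not the project's exact code, with wrapWithNumactl as an invented name:

package affinity

import (
    "fmt"
    "os/exec"
)

// wrapWithNumactl prepends numactl binding flags when every selected CPU
// sits on one NUMA node and the binary is on PATH; otherwise argv runs
// unchanged and memory policy stays with the kernel default.
func wrapWithNumactl(argv []string, node int, sameNode bool) []string {
    if !sameNode {
        return argv
    }
    numactl, err := exec.LookPath("numactl")
    if err != nil {
        return argv // numactl missing: CPUs are still pinned, memory is not bound
    }
    return append([]string{
        numactl,
        fmt.Sprintf("--cpunodebind=%d", node),
        fmt.Sprintf("--membind=%d", node),
        "--",
    }, argv...)
}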
Build tags keep the syscall code contained
The Linux-specific parts are split into build-tagged files rather than guarded with runtime.GOOS checks. You see the pattern across the tree:
pkg/affinity/run_linux.go and run_stub.go
pkg/affinity/command_linux.go and command_stub.go
pkg/topology/discover_linux.go and discover_stub.go
pkg/daemon/grpc_linux.go and grpc_stub.go
pkg/hugepages/hugepages_linux.go and hugepages_stub.go
pkg/perf/perf_linux.go and perf_stub.go
pkg/shm/shm_linux.go and shm_stub.go
That keeps review focused. Linux files use Linux things: cgroups, /proc, perf, shared memory, hugepage sysfs writes, and sched_setaffinity. Non-Linux files provide stubs so the package graph remains buildable elsewhere without smuggling fake behaviour into the main path.
For a project like this, build tags are not just tidiness. They make the boundary between portable orchestration code and host-specific code obvious.
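The stub side of the pattern looks roughly like this; the function and error names are illustrative, and the Linux counterpart would carry //go:build linux above the real syscall code:

//go:build !linux

package affinity

import "errors"

// errUnsupported keeps non-Linux builds compiling and honest: nothing is
// pinned, and callers find out immediately.
var errUnsupported = errors.New("affinity: CPU pinning is only implemented on Linux")

// Pin mirrors the Linux implementation's shape without faking its behaviour.
func Pin(cpuIDs []int) error { return errUnsupported }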
DAG steps are launched under taskset
The nexusflow/pkg/dag package handles YAML pipelines. spec.go defines the node shape, load.go parses YAML, graph.go computes topological order, run.go executes the steps, and prom.go writes Prometheus textfile metrics.
The runner is sequential today. For each node, it selects CPUs from the host topology and asks pkg/affinity/command_linux.go for the command to run. On Linux, that command is wrapped with util-linux taskset -c.
That difference from nexusflow run is important. The DAG runner needs to stay alive so it can supervise multiple child processes, collect exit codes, and write metrics. taskset gives each child a CPU mask without replacing the runner itself.
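The wrapping amounts to building an argv prefix. A minimal sketch, assuming a helper shaped like the hypothetical tasksetCommand below (the real command construction lives in pkg/affinity/command_linux.go):

package dag

import (
    "context"
    "os"
    "os/exec"
    "strconv"
    "strings"
)

// tasksetCommand wraps a step's argv with util-linux taskset so the child
// gets the CPU mask while the runner stays alive to supervise it.
func tasksetCommand(ctx context.Context, cpuIDs []int, argv []string) *exec.Cmd {
    list := make([]string, len(cpuIDs))
    for i, id := range cpuIDs {
        list[i] = strconv.Itoa(id)
    }
    args := append([]string{"-c", strings.Join(list, ",")}, argv...)
    cmd := exec.CommandContext(ctx, "taskset", args...)
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    return cmd
}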
If a node fails, the runner records the exit code, writes any partial Prometheus output configured through --prom-file, and returns the error. There is nothing glamorous here, which is exactly what you want in a build or ETL pipeline runner.
The daemon exposes cgroup-backed cells
The gRPC service lives in nexusflow/api/v1/nexusflow.proto. It has RPCs for creating and destroying cells, attaching PIDs, running commands in a cell, watching L3 counters, setting hugepages, and evicting foreign processes from a CPU set.
On Linux, pkg/daemon/grpc_linux.go backs those cells with cgroups v2 through pkg/cgv2/manager_linux.go. A cell is a directory under /sys/fs/cgroup/nexusflow-daemon by default. Creating one writes cpuset.cpus and cpuset.mems; if exclusive is requested, NexusFlow also attempts to write isolated to cpuset.cpus.partition.
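Stripped of the manager's bookkeeping, cell creation comes down to a directory and a handful of file writes. A simplified sketch under those assumptions, with createCell as an invented name:

package cgv2

import (
    "os"
    "path/filepath"
)

// createCell makes a cgroup v2 directory and writes its cpuset files.
// cpus and mems use kernel list syntax, e.g. "0-15" and "0".
func createCell(name, cpus, mems string, exclusive bool) error {
    dir := filepath.Join("/sys/fs/cgroup/nexusflow-daemon", name)
    if err := os.MkdirAll(dir, 0o755); err != nil {
        return err
    }
    if err := os.WriteFile(filepath.Join(dir, "cpuset.cpus"), []byte(cpus), 0o644); err != nil {
        return err
    }
    if err := os.WriteFile(filepath.Join(dir, "cpuset.mems"), []byte(mems), 0o644); err != nil {
        return err
    }
    if exclusive {
        // Best-effort: the kernel can refuse the partition change.
        _ = os.WriteFile(filepath.Join(dir, "cpuset.cpus.partition"), []byte("isolated"), 0o644)
    }
    return nil
}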
RunInCell starts the process, writes its PID to cgroup.procs, and streams stdout and stderr back over gRPC. WatchL3 samples last-level-cache references and misses using the project’s perf wrapper. EvictForeign walks /proc, checks process affinity with SchedGetaffinity, and sends a signal to processes whose mask intersects the target CPUs. The default signal is SIGSTOP, unless the request asks for something else.
That is narrower and more accurate than calling it a Kubernetes scheduler. The daemon gives you a host API for CPU/memory cells and a few privileged operations around them.
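The eviction loop described above is short enough to restate. This is a simplified sketch of that behaviour; evictForeign is an invented name, and the real RPC handler presumably adds filtering that is not shown here:

package daemon

import (
    "os"
    "strconv"
    "syscall"

    "golang.org/x/sys/unix"
)

// evictForeign walks /proc, reads each process's affinity mask, and signals
// any process whose mask intersects the target CPUs. Callers pass
// syscall.SIGSTOP for the default behaviour.
func evictForeign(target []int, sig syscall.Signal) error {
    entries, err := os.ReadDir("/proc")
    if err != nil {
        return err
    }
    for _, e := range entries {
        pid, err := strconv.Atoi(e.Name())
        if err != nil {
            continue // not a PID directory
        }
        var mask unix.CPUSet
        if err := unix.SchedGetaffinity(pid, &mask); err != nil {
            continue // process likely exited
        }
        for _, cpu := range target {
            if mask.IsSet(cpu) {
                _ = syscall.Kill(pid, sig)
                break
            }
        }
    }
    return nil
}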
Shared memory, Plasma, and hugepages
NexusFlow also has a data-plane experiment under pkg/plasma. The coordinator listens on a Unix stream socket, accepts JSON-framed messages, supports dynamic DAG branch requests, records perf samples, and can pass file descriptors using SCM_RIGHTS.
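Descriptor passing itself is only a few lines. A minimal sketch of the SCM_RIGHTS side, with the coordinator's JSON framing left out and sendFD as an invented name:

package plasma

import (
    "net"

    "golang.org/x/sys/unix"
)

// sendFD transmits an open file descriptor over a connected Unix stream
// socket using SCM_RIGHTS ancillary data; msg carries whatever framing the
// protocol expects (not modelled here).
func sendFD(conn *net.UnixConn, fd int, msg []byte) error {
    oob := unix.UnixRights(fd)
    _, _, err := conn.WriteMsgUnix(msg, oob, nil)
    return err
}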
The shared-memory package underneath is straightforward Linux code. pkg/shm/shm_linux.go creates files under /dev/shm with names like nexusflow-{name}, uses O_CREATE|O_EXCL, sets mode 0600, calls ftruncate, and maps the region with mmap(MAP_SHARED).
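In sketch form, with createRegion as an invented name and cleanup on the partial-failure paths omitted:

package shm

import (
    "os"
    "path/filepath"

    "golang.org/x/sys/unix"
)

// createRegion creates an exclusive file under /dev/shm, sizes it, and maps
// it shared so other processes can open the same name and see the bytes.
func createRegion(name string, size int) ([]byte, error) {
    path := filepath.Join("/dev/shm", "nexusflow-"+name)
    f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE|os.O_EXCL, 0o600)
    if err != nil {
        return nil, err
    }
    defer f.Close()
    if err := f.Truncate(int64(size)); err != nil {
        return nil, err
    }
    return unix.Mmap(int(f.Fd()), 0, size, unix.PROT_READ|unix.PROT_WRITE, unix.MAP_SHARED)
}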
Hugepage management is even thinner. pkg/hugepages/hugepages_linux.go maps 2M and 1G to the relevant nr_hugepages sysfs files:
/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
Then it writes the requested count and reads the value back. For a privileged host operation, that simplicity is a feature. There is very little hidden behind the CLI.
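The whole operation fits in one function. A sketch, assuming a hypothetical helper keyed by page size in kB:

package hugepages

import (
    "fmt"
    "os"
    "strconv"
    "strings"
)

// setHugepages writes the requested pool size and reads back what the kernel
// actually granted; sizeKB is 2048 for 2M pages or 1048576 for 1G pages.
func setHugepages(sizeKB, count int) (int, error) {
    path := fmt.Sprintf("/sys/kernel/mm/hugepages/hugepages-%dkB/nr_hugepages", sizeKB)
    if err := os.WriteFile(path, []byte(strconv.Itoa(count)), 0o644); err != nil {
        return 0, err
    }
    raw, err := os.ReadFile(path)
    if err != nil {
        return 0, err
    }
    // The kernel may allocate fewer pages than requested, so report its value.
    return strconv.Atoi(strings.TrimSpace(string(raw)))
}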
One binary, including the dashboard
The dashboard follows the same pattern. pkg/dashboard/embed.go uses //go:embed static/*, so the static frontend ships inside the Go binary. The server exposes health endpoints and a command execution API; the README calls out TLS, bearer auth, and CIDR restrictions if you bind it anywhere beyond loopback.
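The embed pattern itself is a few lines. A minimal sketch of the shape, without the TLS, bearer auth, and CIDR checks the README describes; Handler and the routes here are illustrative:

package dashboard

import (
    "embed"
    "io/fs"
    "net/http"
)

//go:embed static/*
var staticFiles embed.FS

// Handler serves the embedded frontend plus a health endpoint from one mux,
// so the whole dashboard ships inside the binary.
func Handler() http.Handler {
    mux := http.NewServeMux()
    static, _ := fs.Sub(staticFiles, "static")
    mux.Handle("/", http.FileServer(http.FS(static)))
    mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    })
    return mux
}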
That single-binary shape fits the rest of NexusFlow. Operators can inspect topology, run a pinned command, start the daemon, reserve hugepages, create shared memory, or serve the dashboard without deploying a separate frontend or sidecar stack.
Why this codebase is useful to read
The practical lesson is that Go is comfortable at the boundary where orchestration turns into kernel interaction. NexusFlow does not hide that boundary. It reads sysfs directly. It writes cgroup files directly. It uses build tags for Linux-only implementations. It uses exec when the tool should become the workload, and taskset when it needs to supervise children.
That is the part I would copy into other systems projects: keep the host model explicit, make the placement rule boring and auditable, and put the privileged Linux writes in a small number of packages reviewers can reason about.
On high-core hosts, locality problems often arrive disguised as “the machine is busy but throughput is worse than expected”. NexusFlow is a useful example of attacking that problem with plain Go and the Linux interfaces already sitting on the box.