Cut your p99 latency in Go with hedged requests
Your p50 latency looks fine. Your p95 is acceptable. Then p99 ruins the user experience. This is the tail latency problem, and Google wrote the canonical paper on it: The Tail at Scale.
The basic idea behind hedged requests is simple: send one request, wait for a latency threshold, and if it has not returned yet, send a backup request. Use whichever response arrives first and cancel the loser. The hard part is choosing the threshold without doubling your backend load.
hedge is a Go library that implements this at the transport layer. The important detail: it does not expose a generic Do function. It wraps http.RoundTripper and provides a gRPC unary client interceptor.
HTTP usage: wrap the transport
For normal HTTP clients, the API is deliberately small:
client := &http.Client{
    Transport: hedge.New(
        http.DefaultTransport,
        hedge.WithPercentile(0.95),
    ),
}
req, err := http.NewRequestWithContext(ctx, "GET", "https://api.example.com/data", nil)
if err != nil {
return err
}
resp, err := client.Do(req)
That shape matters. Existing code keeps using http.Client; hedge sits underneath as a RoundTripper. If your code already accepts an *http.Client or an http.RoundTripper, you do not need to redesign it around a new request API.
WithPercentile takes a quantile such as 0.95, not 95. The default is p90.
How the adaptive threshold works
hedge learns a latency distribution per target host. Internally it uses a rolling WindowedSketch built from DDSketch rather than a slice of durations that gets sorted on every request.
That choice fits latency data. DDSketch guarantees relative-error quantile estimates, so accuracy scales with the values being measured: a 1% error means 0.1ms on a 10ms latency but 20ms on a 2s latency, and latency systems want exactly that scaling rather than a fixed absolute error.
The rolling window also matters. A threshold learned from last week’s traffic is not very helpful during today’s deploy or peak load. hedge keeps the estimate current by recording completed requests into the sketch.
Streaming responses measure first body byte
A subtle detail in the implementation is where latency gets recorded. For HTTP, hedge wraps the response body and records time-to-first-body-byte, not just time-to-headers.
That matters for streaming workloads. A server can return headers quickly and then delay the first useful byte. If you calibrate hedging from header receipt, the library may think the service is fast and hedge far too aggressively or at the wrong time.
Load amplification is controlled
Hedging can make tail latency better by spending extra work. That is also the danger. During an outage or overload, blindly duplicating requests can make the incident worse.
hedge includes a token-bucket budget so backup requests are capped. This is the part people often skip when they implement hedging by hand. The latency win only matters if the system remains stable under stress.
gRPC usage: unary interceptor
For gRPC clients, hedge provides a unary interceptor:
conn, err := grpc.NewClient(target,
    grpc.WithTransportCredentials(insecure.NewCredentials()),
    grpc.WithUnaryInterceptor(hedge.NewUnaryClientInterceptor(
        hedge.WithPercentile(0.95),
    )),
)
The interceptor applies the same idea to unary RPCs: start the primary call, wait for the learned threshold, then fire a backup if needed. The first successful result wins and the other context is cancelled.
This belongs on idempotent operations. Reads are usually fine. Payments, job creation, or anything else with side effects needs a careful idempotency-key scheme before you even consider hedging.
What to take from hedge
The Go lesson here is not a generic goroutine race helper. The real design choices are more practical:
- implement HTTP support as a RoundTripper
- implement gRPC support as an interceptor
- learn per-host latency using a rolling DDSketch
- record the latency signal that matches the workload
- cap backup requests with a budget
- rely on context cancellation to clean up losing attempts
That is the shape of a Go library that can be adopted without forcing a new application architecture.