xnew: deduplicating huge line-based files in Go
Recon workflows produce ugly files. Subdomains, URLs, paths, parameters, hosts that responded yesterday but not today. You run subfinder, Amass, httpx, waybackurls, and every run overlaps with the last one.
At some point you need a boring operation to be fast:
cat fresh.txt | xnew all-seen.txt
xnew reads lines from stdin, compares them with a target file, appends only new lines, and prints those new lines to stdout so the pipeline can continue. The repository is small enough to read in one sitting, but there are a few decisions in it that are more interesting than the size suggests.
The job xnew is built for
The common tool in this space is anew: append lines the target file has not seen before. xnew keeps the same basic shape but is tuned for much larger inputs. Its README benchmark measures adding a single new line to files holding from 1K up to 100M existing lines. On the listed 4 GB / 2-core instance, the 100M-line case is reported at about 30 seconds for xnew versus about 1m38s for anew.
xnew gets there by avoiding full-string storage for everything it has seen. It stores 64-bit hashes.
Storing hashes instead of strings
The core path looks roughly like this:
func loadExistingHashes(path string, trim bool, set *hashSet) error {
    f, err := os.Open(path)
    if err != nil {
        if os.IsNotExist(err) {
            return nil
        }
        return err
    }
    defer f.Close()

    sc := bufio.NewScanner(f)
    sc.Buffer(make([]byte, 0, 64<<10), defaultReadBuf)
    for sc.Scan() {
        line := sc.Bytes()
        if trim {
            line = bytes.TrimSpace(line)
        }
        if len(line) == 0 {
            continue
        }
        set.AddHash(hashLine(line))
    }
    return sc.Err()
}
Each existing line becomes a uint64 via XXH3:
func hashLine(line []byte) uint64 {
    h := xxh3.Hash(line)
    if h == 0 {
        return 1 // keep zero as empty-slot sentinel
    }
    return h
}
That 0 check matters because the hash set uses zero as its empty-slot marker. If the hash function returns zero, xnew remaps it to one so the table can keep using zero to mean “nothing here”. Small detail, but exactly the kind of thing custom data structures need.
This design has a real tradeoff. Two different lines can theoretically produce the same 64-bit hash, and xnew would treat the second line as already seen. For recon data, the probability is tiny enough to be acceptable for many users, and the memory savings are large. If you need mathematically exact deduplication, storing only hashes is the wrong trade. If you need to process very large line sets on a small machine, it is a reasonable one.
That is the real article here: xnew chooses bounded memory pressure over perfect identity.
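How tiny is tiny? A quick back-of-the-envelope using the standard birthday bound: with n uniform 64-bit hashes, the probability of at least one collision is approximately n(n-1)/2^65. For the benchmark's 100M lines, that works out to a few chances in ten thousand:

```go
package main

import "fmt"

func main() {
    // Birthday-bound approximation: with n uniform 64-bit hashes,
    // P(at least one collision) ≈ n(n-1) / 2^65.
    const n = 100_000_000.0 // 100M lines, as in the README benchmark
    p := n * (n - 1) / float64(1<<65)
    fmt.Printf("%.2e\n", p) // prints 2.71e-04
}
```

So across an entire 100M-line corpus the expected outcome is that roughly 1 run in 3,700 silently drops one line, which is the "acceptable for many users, wrong for exact deduplication" trade the article describes.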
The custom hash set
Instead of map[uint64]struct{}, xnew uses a compact open-addressed set:
type hashSet struct {
    keys []uint64 // 0 == empty slot
    used int
}
Insertion probes linearly until it finds either the same hash or an empty slot:
func (s *hashSet) AddHash(h uint64) bool {
    if (s.used+1)*10 >= len(s.keys)*7 {
        s.grow()
    }
    mask := uint64(len(s.keys) - 1)
    i := h & mask
    for {
        k := s.keys[i]
        if k == 0 {
            s.keys[i] = h
            s.used++
            return true
        }
        if k == h {
            return false
        }
        i = (i + 1) & mask
    }
}
A few choices are doing the work here.
The backing slice length is always a power of two, so modulo can be replaced with a bitmask. i := h & mask is cheap, and wrapping during probing is just another mask.
The set grows before it gets too full. The load factor is capped around 70%, which keeps probe chains short. Push an open-addressed table too close to full and performance collapses quickly.
The structure stores only the hash values: eight bytes per slot, with no map buckets, no string headers, and no retained byte arrays for every line. At the benchmark's 100M lines, a power-of-two table kept under 70% load needs 2^28 slots, roughly 2 GB, where storing the strings themselves would additionally pay for every byte of every line. That is why the approach scales better for enormous files.
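The excerpt above calls s.grow(), which the article does not show. A minimal self-contained reconstruction (hypothetical; the real xnew code may differ in initial size and details) doubles the power-of-two table and reinserts every non-zero hash under the new mask:

```go
package main

import "fmt"

// hashSet is reconstructed from the article's excerpt; grow() below
// is a hypothetical sketch consistent with AddHash, not repo code.
type hashSet struct {
    keys []uint64 // 0 == empty slot
    used int
}

func (s *hashSet) AddHash(h uint64) bool {
    if (s.used+1)*10 >= len(s.keys)*7 {
        s.grow()
    }
    mask := uint64(len(s.keys) - 1)
    i := h & mask
    for {
        k := s.keys[i]
        if k == 0 {
            s.keys[i] = h
            s.used++
            return true
        }
        if k == h {
            return false
        }
        i = (i + 1) & mask
    }
}

// grow doubles the power-of-two backing slice and reinserts every
// stored hash under the new mask. Zero slots are empty and skipped.
func (s *hashSet) grow() {
    newCap := 2 * len(s.keys)
    if newCap == 0 {
        newCap = 16 // initial size is an assumption
    }
    old := s.keys
    s.keys = make([]uint64, newCap)
    s.used = 0
    mask := uint64(newCap - 1)
    for _, h := range old {
        if h == 0 {
            continue
        }
        i := h & mask
        for s.keys[i] != 0 {
            i = (i + 1) & mask
        }
        s.keys[i] = h
        s.used++
    }
}

func main() {
    var s hashSet
    for h := uint64(1); h <= 100; h++ {
        s.AddHash(h)
    }
    // 100 entries; the table has doubled past 100/0.7 ≈ 143... slots
    // would be needed at 70%, so it sits at the next power of two.
    fmt.Println(s.used, len(s.keys)) // prints 100 256
}
```

Note that rehashing recomputes every slot position from scratch: probe sequences from the old, smaller table are meaningless under the new mask, which is why open-addressed tables reinsert rather than copy.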
This is the sort of code worth reading because the data structure is shaped by the workload.
Streaming input without retaining line bytes
For stdin, the loop is similarly direct:
sc := bufio.NewScanner(os.Stdin)
sc.Buffer(make([]byte, 0, 64<<10), defaultReadBuf)
for sc.Scan() {
    line := sc.Bytes()
    if trim {
        line = bytes.TrimSpace(line)
    }
    if len(line) == 0 {
        continue
    }
    if set.AddHash(hashLine(line)) {
        // write to stdout and output file
    }
}
Using Scanner.Bytes() avoids allocating a new string for every input line. The bytes are only valid until the next scan, but that is fine here because xnew hashes the line immediately and stores the hash, not the line itself.
The scanner buffer is also bumped to 8 MiB:
const defaultReadBuf = 8 << 20
That avoids the easy-to-miss bufio.Scanner limit where long lines fail with ErrTooLong. URLs and hostnames usually stay small, but large scraped payloads and odd recon output can surprise you. Raising the limit is cheap insurance.
Buffered writes and pipeline behaviour
xnew writes new lines to two places by default:
- stdout, so the next command in the pipeline receives only fresh values
- the target file, so future runs remember them
The file write goes through a large buffered writer:
const defaultWriteBuf = 4 << 20
w := bufio.NewWriterSize(out, defaultWriteBuf)
defer func() { _ = w.Flush() }()
That is the right level of optimization for this tool. The code does not introduce workers, locks, batching protocols, or a database. It just avoids turning every line into an immediate syscall.
The stdout path stays separate because stdout is part of the CLI contract:
cat new-subdomains.txt | xnew all-subdomains.txt | httpx -silent
That is useful in practice. New values get persisted and immediately flow into the next tool.
The flags are small but sensible
There are only a few flags:
- -o: write new lines to a separate output file instead of appending to the target file
- -trim: trim whitespace before comparing
- -q: quiet mode; suppress stdout
-o is handy when you want “things I have not seen before” as a separate artifact. -trim is useful for messy text sources, though it changes the definition of equality. -q helps when stdout is noise and the file side effect is all you care about.
None of these options complicate the core model. They sit at the edges, which is where CLI options belong.
What I would steal
The useful parts apply well beyond bug bounty tooling.
If you are building Go CLIs that process big line-oriented data, xnew is a good reminder to choose the representation first. Full strings give exactness but cost memory. Hashes are smaller but introduce collision risk. A custom open-addressed table can beat a general-purpose map when the data shape is simple. Scanner.Bytes() is a good fit when you can process a line immediately. Large buffered writes can do more for throughput than adding goroutines.
The code is short, and the choices are deliberate. That is what makes it worth reading.