Bumblebee inventories packages across ecosystems for supply-chain security. Here's how its Go architecture makes it fast.

How Bumblebee scans your dev machine fast with Go concurrency


Bumblebee is a read-only tool from Perplexity AI that inventories every package, browser extension, editor plugin, and MCP config on a developer’s machine. It’s aimed at supply-chain security: figure out what’s installed before something malicious slips through. What caught my attention as a Go developer is how it structures concurrent scanning across a dozen ecosystems. The patterns here are worth stealing.

The problem: scanning many ecosystems fast

Your average developer machine has npm lockfiles, Python dist-info directories, Go modules, Ruby gems, Composer packages, browser extensions, VS Code extensions, and probably more. Bumblebee needs to walk the filesystem, figure out which ecosystem a directory belongs to, parse the metadata, normalize it, and ship results. None of that can block on a single slow directory tree.

Do this sequentially on a machine with hundreds of thousands of files and you’re waiting minutes. Bumblebee’s answer: each ecosystem scanner runs independently, and the filesystem walk feeds them all in parallel.

How the scanner orchestrates work

The coordination happens in internal/scanner/scanner.go. This is the pipeline: kick off a filesystem walk, route discovered paths to ecosystem-specific parsers, collect results.
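To make the shape of that pipeline concrete, here is a minimal sketch of a walk feeding a pool of workers over channels. It is not Bumblebee’s code: Finding, classify, and runPipeline are names I made up, and the slice of paths stands in for a real filesystem walk.

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
	"sync"
)

// Finding is a hypothetical result record, not Bumblebee's real model type.
type Finding struct {
	Ecosystem string
	Path      string
}

// classify maps a manifest file name to an ecosystem label (illustrative only).
func classify(path string) (string, bool) {
	switch filepath.Base(path) {
	case "package-lock.json":
		return "npm", true
	case "METADATA":
		return "pypi", true
	case "go.mod":
		return "gomod", true
	}
	return "", false
}

// runPipeline fans paths out to a fixed pool of workers and collects findings.
func runPipeline(paths []string, workers int) []Finding {
	pathCh := make(chan string)
	resultCh := make(chan Finding)

	// Producer: stands in for the filesystem walk.
	go func() {
		defer close(pathCh)
		for _, p := range paths {
			pathCh <- p
		}
	}()

	// Workers: each one pulls paths until the producer closes the channel.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range pathCh {
				if eco, ok := classify(p); ok {
					resultCh <- Finding{Ecosystem: eco, Path: p}
				}
			}
		}()
	}

	// Close the result channel once every worker has finished.
	go func() {
		wg.Wait()
		close(resultCh)
	}()

	var out []Finding
	for f := range resultCh {
		out = append(out, f)
	}
	// Workers finish in arbitrary order, so sort for stable output.
	sort.Slice(out, func(i, j int) bool { return out[i].Path < out[j].Path })
	return out
}

func main() {
	found := runPipeline([]string{
		"/home/dev/app/package-lock.json",
		"/home/dev/app/README.md",
		"/home/dev/svc/go.mod",
	}, 4)
	for _, f := range found {
		fmt.Println(f.Ecosystem, f.Path)
	}
}
```

The unbuffered channels give you backpressure for free: a slow parser slows the producer instead of letting paths pile up in memory.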

Each ecosystem — npm, PyPI, Go modules, Ruby gems, browser extensions — gets its own package under internal/ecosystem/:

  • internal/ecosystem/npm/npm.go
  • internal/ecosystem/pypi/pypi.go
  • internal/ecosystem/gomod/gomod.go
  • internal/ecosystem/browserext/browserext.go
  • internal/ecosystem/mcp/mcp.go

They all implement the same interface pattern. The scanner doesn’t care how npm lockfiles differ from Python METADATA files. It hands off paths. It gets results back. This is Go’s interface model doing exactly what it’s supposed to do: small interfaces, many implementations, callers that stay ignorant of details. It’s the same philosophy as functional options: self-contained, composable components.
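A sketch of what such an interface could look like follows. The Scanner interface, its method names, and both implementations are my assumptions, not Bumblebee’s actual API; the npm scanner is a deliberate stub, and the PyPI one only reads the name-version.dist-info directory name.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Package is a simplified stand-in for the shared result type.
type Package struct {
	Ecosystem, Name, Version, Path string
}

// Scanner is a hypothetical interface; the method names are assumptions.
type Scanner interface {
	Match(path string) bool
	Scan(path string) ([]Package, error)
}

type pypiScanner struct{}

func (pypiScanner) Match(path string) bool {
	return strings.HasSuffix(path, ".dist-info")
}

// Scan derives name and version from a "<name>-<version>.dist-info" directory name.
func (pypiScanner) Scan(path string) ([]Package, error) {
	base := strings.TrimSuffix(filepath.Base(path), ".dist-info")
	i := strings.LastIndex(base, "-")
	if i <= 0 {
		return nil, fmt.Errorf("unexpected dist-info name %q", base)
	}
	return []Package{{Ecosystem: "pypi", Name: base[:i], Version: base[i+1:], Path: path}}, nil
}

type npmScanner struct{}

func (npmScanner) Match(path string) bool {
	return filepath.Base(path) == "package-lock.json"
}

// Scan is a stub: real lockfile parsing is out of scope for this sketch.
func (npmScanner) Scan(path string) ([]Package, error) { return nil, nil }

// scanAll offers every path to every scanner without knowing any format details.
func scanAll(scanners []Scanner, paths []string) ([]Package, error) {
	var out []Package
	for _, p := range paths {
		for _, s := range scanners {
			if !s.Match(p) {
				continue
			}
			pkgs, err := s.Scan(p)
			if err != nil {
				return nil, err
			}
			out = append(out, pkgs...)
		}
	}
	return out, nil
}

func main() {
	pkgs, err := scanAll(
		[]Scanner{npmScanner{}, pypiScanner{}},
		[]string{"/venv/site-packages/bumblebee_selftest_evil-0.0.0.dist-info"},
	)
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	for _, p := range pkgs {
		fmt.Println(p.Ecosystem, p.Name, p.Version)
	}
}
```

Adding an ecosystem in this arrangement means writing one new type and appending it to the slice; scanAll never changes.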

Filesystem walking with platform awareness

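The internal/walk/ package handles directory traversal, and two files tell you something right away: dirkey_unix.go and dirkey_other.go. That split is Go’s build constraint system at work. One detail worth knowing: _unix is not a filename suffix the go tool recognizes (only GOOS and GOARCH values such as _linux or _darwin are), so the selection has to come from a //go:build line inside each file rather than from the name alone. On Unix (macOS, Linux), Bumblebee uses inode-based directory identification to avoid rescanning a directory it has already reached through a symlink or a second mount point. On other platforms, it falls back to a different way of identifying directories.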
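Here is a minimal sketch of inode-based identification on Unix, assuming a (device, inode) pair as the key. The dirKey, keyFor, and visited names are hypothetical, and this is my guess at the technique rather than the contents of dirkey_unix.go.

```go
//go:build unix

package main

import (
	"fmt"
	"os"
	"syscall"
)

// dirKey identifies a directory by device and inode rather than by path.
type dirKey struct {
	dev, ino uint64
}

// keyFor stats the path (following symlinks) and extracts the device/inode pair.
func keyFor(path string) (dirKey, bool) {
	info, err := os.Stat(path)
	if err != nil {
		return dirKey{}, false
	}
	st, ok := info.Sys().(*syscall.Stat_t)
	if !ok {
		return dirKey{}, false
	}
	return dirKey{dev: uint64(st.Dev), ino: uint64(st.Ino)}, true
}

// visited remembers which directories have already been scanned.
type visited map[dirKey]struct{}

// seen reports whether the directory was visited before, recording it if not.
// Paths that cannot be identified are treated as unseen.
func (v visited) seen(path string) bool {
	k, ok := keyFor(path)
	if !ok {
		return false
	}
	if _, dup := v[k]; dup {
		return true
	}
	v[k] = struct{}{}
	return false
}

func main() {
	dir, err := os.MkdirTemp("", "dirkey")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer os.RemoveAll(dir)

	link := dir + "-link"
	if err := os.Symlink(dir, link); err != nil {
		fmt.Println(err)
		return
	}
	defer os.Remove(link)

	v := visited{}
	fmt.Println(v.seen(dir))  // first visit to the directory
	fmt.Println(v.seen(link)) // same directory, reached through the symlink
}
```

Because os.Stat follows symlinks, both paths resolve to the same key, which is the whole point: path strings differ, the underlying directory does not.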

The main walk logic lives in internal/walk/walk.go. Filesystem walking looks simple until you run it on real machines. Developer home directories are full of node_modules trees nested many levels deep, .git directories with thousands of objects, and virtual filesystems that hang on read. A good walker needs to:

  1. Skip known-unproductive directories early
  2. Handle permission errors without crashing
  3. Not follow symlink loops

Go’s filepath.WalkDir (added in Go 1.16) helps because it hands your callback an fs.DirEntry instead of an fs.FileInfo, which avoids the os.Lstat call that the older filepath.Walk makes on every file. That’s one fewer syscall per file, which adds up across hundreds of thousands of entries. Bumblebee builds on that with its own filtering logic.
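The three requirements above map onto filepath.WalkDir fairly directly. The sketch below is my own illustration, not Bumblebee’s walker: the skip list, findManifests, and makeSampleTree are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// skipDirs is an assumed list of directory names not worth descending into.
var skipDirs = map[string]bool{".git": true, ".cache": true}

// findManifests returns slash-separated paths, relative to root, of every
// package-lock.json outside the skipped directories.
func findManifests(root string) ([]string, error) {
	var found []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			// Requirement 2: an unreadable entry is skipped, not fatal.
			if errors.Is(err, fs.ErrPermission) {
				return nil
			}
			return err
		}
		if d.IsDir() {
			// Requirement 1: prune unproductive subtrees before reading them.
			if skipDirs[d.Name()] {
				return fs.SkipDir
			}
			return nil
		}
		// Requirement 3 comes for free: WalkDir does not follow symbolic links.
		if d.Name() == "package-lock.json" {
			rel, relErr := filepath.Rel(root, path)
			if relErr != nil {
				return relErr
			}
			found = append(found, filepath.ToSlash(rel))
		}
		return nil
	})
	return found, err
}

// makeSampleTree builds a small temporary tree for the demo.
func makeSampleTree() (root string, cleanup func(), err error) {
	root, err = os.MkdirTemp("", "walk-sketch")
	if err != nil {
		return "", nil, err
	}
	cleanup = func() { os.RemoveAll(root) }
	files := []string{"app/package-lock.json", "app/README.md", ".git/package-lock.json"}
	for _, rel := range files {
		full := filepath.Join(root, filepath.FromSlash(rel))
		if err = os.MkdirAll(filepath.Dir(full), 0o755); err != nil {
			cleanup()
			return "", nil, err
		}
		if err = os.WriteFile(full, []byte("{}"), 0o644); err != nil {
			cleanup()
			return "", nil, err
		}
	}
	return root, cleanup, nil
}

func main() {
	root, cleanup, err := makeSampleTree()
	if err != nil {
		fmt.Println(err)
		return
	}
	defer cleanup()

	found, err := findManifests(root)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(found)
}
```

Returning fs.SkipDir from the directory entry itself is what makes the pruning cheap: the subtree is never read at all.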

The output pipeline

Once ecosystem scanners produce results, those results need to go somewhere. Bumblebee supports multiple output sinks, configured from cmd/bumblebee/sink.go. The internal/output/ package has at least two: a standard output writer in output.go and an HTTP sink in httpsink.go.

The HTTP sink (internal/output/httpsink.go) is the one I find most interesting for real-world use. Point Bumblebee at an internal API that collects package inventories from every developer machine in your org. The sink presumably uses Go’s net/http client with proper timeouts and error handling, which is non-negotiable when you’re running this across a fleet. One hung connection can’t be allowed to stall everything.

This separation of scanning from output is worth copying. The scanner doesn’t know or care whether results go to stdout, a file, or an HTTP endpoint. cmd/bumblebee/sink.go wires up the right output based on CLI flags, and the scanning logic stays clean.
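A sink abstraction along these lines can be sketched in a few dozen lines. Everything here is assumed for illustration: the Sink interface, the newline-delimited JSON encoding, the 10-second timeout, and the newSink selection rule are mine, not taken from sink.go or httpsink.go.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// Package is a simplified result record with assumed JSON field names.
type Package struct {
	Ecosystem string `json:"ecosystem"`
	Name      string `json:"name"`
	Version   string `json:"version"`
}

// Sink is a hypothetical output interface; scanners never see what is behind it.
type Sink interface {
	Write(ctx context.Context, pkgs []Package) error
}

// encodeJSONLines renders one JSON object per line.
func encodeJSONLines(pkgs []Package) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	for _, p := range pkgs {
		if err := enc.Encode(p); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

type writerSink struct{ w io.Writer }

func (s *writerSink) Write(_ context.Context, pkgs []Package) error {
	body, err := encodeJSONLines(pkgs)
	if err != nil {
		return err
	}
	_, err = s.w.Write(body)
	return err
}

type httpSink struct {
	url    string
	client *http.Client
}

func (s *httpSink) Write(ctx context.Context, pkgs []Package) error {
	body, err := encodeJSONLines(pkgs)
	if err != nil {
		return err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, s.url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/x-ndjson")
	resp, err := s.client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("sink: unexpected status %s", resp.Status)
	}
	return nil
}

// newSink picks the HTTP sink when a URL is given, otherwise writes to w.
func newSink(url string, w io.Writer) Sink {
	if url != "" {
		// The client-level timeout bounds the whole request, so a hung
		// collector cannot stall the scan indefinitely.
		return &httpSink{url: url, client: &http.Client{Timeout: 10 * time.Second}}
	}
	return &writerSink{w: w}
}

func main() {
	sink := newSink("", os.Stdout)
	pkgs := []Package{{Ecosystem: "npm", Name: "lodash", Version: "4.17.21"}}
	if err := sink.Write(context.Background(), pkgs); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
	}
}
```

The only place that knows which sink is in play is newSink, which is the role the article attributes to cmd/bumblebee/sink.go.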

The data model and normalization

internal/model/model.go defines the structures that flow through the pipeline. Every ecosystem parser produces the same normalized output: package name, version, ecosystem identifier, file path. The internal/normalize package (normalize.go) folds ecosystem-specific quirks such as version formats and name casing into one consistent shape.

This matters because you need to match discovered packages against vulnerability databases. If your npm scanner reports lodash@4.17.21 and your PyPI scanner reports requests 2.28.0 in a completely different format, downstream tools have to deal with that mess. Normalizing at the scan layer means nobody else has to.
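As one concrete example of an ecosystem quirk, PyPI treats names as equivalent regardless of case and of runs of "-", "_", and ".", so Typing_Extensions and typing-extensions are the same project. The sketch below applies that rule and builds a uniform key; the Package type, Normalize, and the key format are my assumptions, not what normalize.go does.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Package is a simplified stand-in for the normalized model.
type Package struct {
	Ecosystem, Name, Version string
}

// pypiSeparators matches runs of the characters PyPI treats as equivalent.
var pypiSeparators = regexp.MustCompile(`[-_.]+`)

// Normalize trims whitespace everywhere and applies PyPI's name rule
// (lowercase, separator runs collapsed to "-") to PyPI packages only.
func Normalize(p Package) Package {
	p.Ecosystem = strings.ToLower(strings.TrimSpace(p.Ecosystem))
	p.Name = strings.TrimSpace(p.Name)
	p.Version = strings.TrimSpace(p.Version)
	if p.Ecosystem == "pypi" {
		p.Name = strings.ToLower(pypiSeparators.ReplaceAllString(p.Name, "-"))
	}
	return p
}

// Key renders a package in one hypothetical format for downstream matching.
func (p Package) Key() string {
	return p.Ecosystem + ":" + p.Name + "@" + p.Version
}

func main() {
	raw := []Package{
		{Ecosystem: "npm", Name: "lodash", Version: "4.17.21"},
		{Ecosystem: "PyPI", Name: "Typing_Extensions", Version: " 4.9.0"},
	}
	for _, p := range raw {
		fmt.Println(Normalize(p).Key())
	}
}
```

Other ecosystems have their own rules, so a real normalizer would grow one branch per ecosystem; the point is that the branching lives in one place.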

Exposure checks

internal/exposure/exposure.go is where Bumblebee connects package inventory to actual risk. After scanning, it checks discovered packages against known supply-chain threats. This is the gap between “here’s what’s installed” and “here’s what’s dangerous” — and it’s the whole reason this tool exists.
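In its simplest form, an exposure check is a lookup of the normalized inventory against a catalog of known-bad packages. The sketch below shows that idea only; Advisory, Hit, and Check are hypothetical, and I am assuming exact-version matching, which may well be simpler than what exposure.go does.

```go
package main

import "fmt"

// Package is a simplified inventory entry.
type Package struct {
	Ecosystem, Name, Version string
}

// Advisory is a hypothetical catalog entry for a known-bad package.
type Advisory struct {
	Ecosystem, Name string
	Versions        []string // empty means every version is affected
	Reason          string
}

// Hit pairs an installed package with the reason it was flagged.
type Hit struct {
	Package Package
	Reason  string
}

// affects reports whether the advisory covers the given version.
func affects(a Advisory, version string) bool {
	if len(a.Versions) == 0 {
		return true
	}
	for _, v := range a.Versions {
		if v == version {
			return true
		}
	}
	return false
}

// Check indexes advisories by ecosystem and name, then scans the inventory once.
func Check(inventory []Package, advisories []Advisory) []Hit {
	index := make(map[string][]Advisory)
	for _, a := range advisories {
		k := a.Ecosystem + ":" + a.Name
		index[k] = append(index[k], a)
	}
	var hits []Hit
	for _, p := range inventory {
		for _, a := range index[p.Ecosystem+":"+p.Name] {
			if affects(a, p.Version) {
				hits = append(hits, Hit{Package: p, Reason: a.Reason})
			}
		}
	}
	return hits
}

func main() {
	inventory := []Package{
		{Ecosystem: "npm", Name: "lodash", Version: "4.17.21"},
		{Ecosystem: "pypi", Name: "bumblebee_selftest_evil", Version: "0.0.0"},
	}
	advisories := []Advisory{
		{Ecosystem: "pypi", Name: "bumblebee_selftest_evil", Reason: "self-test marker package"},
	}
	for _, h := range Check(inventory, advisories) {
		fmt.Printf("%s %s@%s: %s\n", h.Package.Ecosystem, h.Package.Name, h.Package.Version, h.Reason)
	}
}
```

This lookup only works if both sides use the same names and version strings, which is why normalization has to happen before it.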

Self-testing with real fixtures

Bumblebee includes a self-test system rooted in cmd/bumblebee/selftest.go with fixtures in cmd/bumblebee/selftest/fixtures/. There are test directories for npm (npm-fixture/package-lock.json), PyPI (pypi-fixture/bumblebee_selftest_evil-0.0.0.dist-info/METADATA), and MCP (mcp-fixture/mcp.json).

I like this approach a lot. Instead of mocking filesystem calls, Bumblebee ships real fixture files and runs its actual scanner against them. selftest/catalog.json likely defines expected results so the test can verify end-to-end correctness. If you’re building tools that parse many different file formats, this catches integration bugs that unit tests miss entirely. We’ve talked about testing strategies in Go before — fixtures like these are one of the most reliable approaches for parser-heavy code.
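Here is a minimal sketch of the fixture idea: write a real package-lock.json to disk, run a real parser over it, and compare against expected results. The fixture content, parseLockfile, and selfTest are my own; a project would normally check the fixture file into the repository rather than generate it, and I write it to a temp directory only to keep the example self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// lockFixture is a tiny lockfile in the "packages" layout used by
// lockfileVersion 2 and 3.
const lockFixture = `{
  "name": "npm-fixture",
  "lockfileVersion": 3,
  "packages": {
    "": {"name": "npm-fixture"},
    "node_modules/lodash": {"version": "4.17.21"},
    "node_modules/@scope/tool": {"version": "1.2.3"}
  }
}`

// parseLockfile reads a lockfile from disk and returns sorted name@version strings.
func parseLockfile(path string) ([]string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var lock struct {
		Packages map[string]struct {
			Version string `json:"version"`
		} `json:"packages"`
	}
	if err := json.Unmarshal(data, &lock); err != nil {
		return nil, err
	}
	var out []string
	for key, entry := range lock.Packages {
		i := strings.LastIndex(key, "node_modules/")
		if i < 0 {
			continue // the "" key describes the root project itself
		}
		out = append(out, key[i+len("node_modules/"):]+"@"+entry.Version)
	}
	sort.Strings(out)
	return out, nil
}

// selfTest runs the parser against the on-disk fixture and checks the result.
func selfTest() error {
	dir, err := os.MkdirTemp("", "selftest")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "package-lock.json")
	if err := os.WriteFile(path, []byte(lockFixture), 0o644); err != nil {
		return err
	}
	got, err := parseLockfile(path)
	if err != nil {
		return err
	}
	want := []string{"@scope/tool@1.2.3", "lodash@4.17.21"}
	if strings.Join(got, ",") != strings.Join(want, ",") {
		return fmt.Errorf("selftest: got %v, want %v", got, want)
	}
	return nil
}

func main() {
	if err := selfTest(); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("selftest passed")
}
```

Nothing is mocked here: the file is read with os.ReadFile and decoded with encoding/json, so a change that breaks real-world parsing breaks the test too.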

Performance lessons for your own Go tools

A few patterns from Bumblebee worth lifting:

Use build constraints for platform-specific code. The dirkey_unix.go / dirkey_other.go split keeps platform logic isolated without runtime.GOOS checks scattered through the code. The go tool picks the right file at compile time, and your code reads cleaner for it.

Separate scanning from output. Bumblebee’s sink abstraction means you can add new output formats without touching any scanner code. This is plain separation of concerns behind a small interface, and it’s surprising how many tools skip it.

Normalize early. Converting ecosystem-specific formats into a common model at parse time (in internal/normalize/normalize.go) means downstream code never has to think about format differences. Every consumer gets clean data.

Ship fixtures, not mocks. Real fixture files catch bugs that mocks hide. Bumblebee’s selftest/fixtures/ directory is a good model to follow.

If you’re drawn to Go tools that handle file operations with clean architecture, Bumblebee’s codebase is worth a read. It’s compact, the package boundaries are clear, and there’s very little ceremony getting in the way of the actual work.