Scrutineer: scanning open source without flooding maintainers

Scrutineer scans open source repositories for security vulnerabilities and then handles everything that follows: verifying each one, working out who to contact, drafting a fix, and tracking it through to a published advisory. I’ve been building it for Alpha-Omega for the past couple of months.

Large language models have made finding vulnerabilities in open source code much easier. Point one at a codebase and it turns up real bugs alongside invented ones, faster and cheaper than the fuzzers and scanners that came before. The bottleneck hasn’t moved, though: every finding still has to be read, confirmed, and fixed by a maintainer, whose time and attention are a finite resource the whole ecosystem depends on. Trying to secure everything by firing machine-generated reports at maintainers would burn out the people the effort relies on.

When I pointed a couple of AI scanners at curl back in May, most of the output collapsed against the project’s own disclosure policy, and the findings worth having were buried in the rest. Scrutineer is built so the volume a model can generate never lands directly on a maintainer.

You add a repo by URL, and Scrutineer runs a pipeline of skills against the code and presents the results in a web UI for triage. It’s already in the hands of ecosystem security engineers and several of the teams Alpha-Omega funds, and between us a fair number of vulnerabilities have been found, reported, fixed, and shipped in a release with its help.

How a scan runs

Every scan is a skill on disk: a SKILL.md file, a JSON schema for its output, and any scripts it needs. When you add a repo the triage skill runs first and enqueues the rest of the pipeline in parallel. What comes back is a set of structured findings, each carrying a severity, a CWE, a location linked back to the source line, the affected versions, and a six-step trace of how it was reached.
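As a rough illustration of the finding shape described above, here is how one might be modelled and sanity-checked. The field names, the severity vocabulary, and the validate helper are assumptions made for this sketch; the authoritative definition is the JSON schema that ships with each skill.

```python
from dataclasses import dataclass, field

# Assumed severity vocabulary; the real schema may use different labels.
SEVERITIES = ("low", "medium", "high", "critical")


@dataclass
class Finding:
    severity: str
    cwe: str                      # e.g. "CWE-78"
    file: str                     # location: path relative to the repo root
    line: int                     # location: 1-based source line
    affected_versions: list[str]
    trace: list[str] = field(default_factory=list)  # ordered steps from input to sink


def validate(finding: Finding) -> list[str]:
    """Return a list of problems; an empty list means the finding is well-formed."""
    problems = []
    if finding.severity not in SEVERITIES:
        problems.append(f"unknown severity: {finding.severity!r}")
    if not finding.cwe.startswith("CWE-") or not finding.cwe[4:].isdigit():
        problems.append(f"malformed CWE id: {finding.cwe!r}")
    if finding.line < 1:
        problems.append("line must be 1 or greater")
    if not finding.trace:
        problems.append("trace is empty")
    return problems
```

Rejecting malformed output at this point means a skill that returns free-form prose instead of structured data fails loudly rather than producing a finding nobody can triage.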

Because skills are just files in a directory, changing what runs is editing markdown rather than recompiling a scanner. The default set lives in skills/, the triage skill’s SKILL.md lists what to trigger, and dropping a new directory in adds a scan type with no code changes.
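To make the "skills are just directories" idea concrete, here is a minimal sketch of discovering them on disk. The discover_skills function is hypothetical and is not how Scrutineer itself decides what to run (that is the triage skill’s SKILL.md); it only shows why a new directory is enough to add a scan type.

```python
from pathlib import Path


def discover_skills(root: Path) -> list[str]:
    """Return the names of subdirectories of `root` that contain a SKILL.md file."""
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and (entry / "SKILL.md").is_file()
    )
```

Calling discover_skills(Path("skills")) before and after dropping in a new directory would show the new skill in the second result, with no other change.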

The skills

Each skill is a directory in the skills folder on GitHub. triage runs first and gathers the context the audit feeds on, and a supporting cast of static-analysis, dedup, and export skills fills in around the edges. The ones that shape how the tool works:

security-deep-dive is the model-backed audit that produces the findings, and by a wide margin the skill that matters most; everything else either feeds it context or acts on what it returns. It runs in two phases. The first builds an inventory of every sink in the codebase, each place that executes code, shells out, or touches a path that could be hostile, without judging any of them yet. The second works through that inventory one entry at a time, tracing each sink back to a trust boundary and deciding whether hostile input can reach it. The inventory is part of the report rather than scratch work, so two runs against the same commit land on the same list. It audits the project’s own code, not its dependencies’ known CVEs: a finding counts only if the vulnerable logic lives in the repo.
threat-model derives the project’s security contract before any auditing happens: what it assumes about its callers, the properties it guarantees under those assumptions, what it leaves to the integrator, and which code is out of scope. Every claim is tagged documented, with a file and line or a closed issue behind it, or inferred, reasoned from the code and flagged for a human to confirm. It lifts whatever SECURITY.md already says about scope verbatim, so the model is a superset of the project’s own stated position rather than a competing one. The deep-dive loads this instead of re-deriving boundaries on every run, which keeps it on the parts of the code the project claims to defend.
maintainers works out who to contact about a disclosure and sorts the people it finds into active leads, regular maintainers, one-off contributors, and bots. It pulls commit history, issue and PR activity, and registry ownership from ecosyste.ms, and reads SECURITY.md and CODEOWNERS for a named security contact, rather than mailing whoever appears first in the git log. The output names a disclosure channel to go with the people: private vulnerability reporting where the repo has it enabled, a published contact where there is one.
patch proposes a fix for a confirmed finding as a unified diff against the scanned ref. It is held to a minimal change in place at the sink, matching the existing code style and reusing whatever sanitiser or validator the project already has, with a regression test when the suite makes one practical. If it cannot tell where the dangerous path diverges from legitimate use, it refuses rather than guess. A diff that parses, targets files that exist, and passes git apply --check is stored on the finding as its suggested fix and downloadable as a .patch; nothing is pushed, so the analyst reviews and opens the PR by hand.
breaking-change reads that proposed fix and works out whether shipping it would break the library’s top dependents. It classifies what the diff changes in the public API surface, a pure addition, a tightened input contract, a changed signature, or a same-shape change in behaviour, then checks the most-depended-on packages for whether they plausibly call the affected symbols. It is static analysis on the diff and the dependent metadata, never running anyone’s code, and it returns unknown rather than a confident wrong call when a package name is all it has to reason from. That verdict is often the difference between a fix that ships and one that sits.
release-watch handles the part that is easy to lose track of once a patch lands. A finding reaches fixed when a commit merges upstream, but consumers cannot pin to a commit; they need a tagged release. So once a finding is fixed the skill polls the upstream’s releases, maps each tag back to its commit, and checks whether the fix is reachable from it. When a release carrying the fix appears it records the tag, URL, and timestamp on the finding; until then it reports the latest release and checks again on the next run.
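The reachability check that release-watch performs can be sketched over an in-memory commit graph. This is an illustration, not the skill’s code: the parents mapping, the tag list, and both function names are assumptions. Against a real checkout, git merge-base --is-ancestor answers the same question.

```python
def is_ancestor(parents: dict[str, list[str]], ancestor: str, descendant: str) -> bool:
    """True if `ancestor` is reachable from `descendant` by following parent links."""
    stack, seen = [descendant], set()
    while stack:
        commit = stack.pop()
        if commit == ancestor:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return False


def first_release_with_fix(parents, tags, fix_commit):
    """Return the first tag whose commit contains the fix, else None.

    `tags` is a list of (tag_name, commit) pairs, oldest first.
    """
    for name, commit in tags:
        if is_ancestor(parents, fix_commit, commit):
            return name
    return None
```

A tag cut from a branch that forked before the fix merged is correctly skipped, which is the case that comparing release dates alone would get wrong.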

There are more than thirty skills bundled by default, and the list keeps growing as we hit cases the existing ones don’t cover.
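The two-phase shape of security-deep-dive, inventory first and judgement second, can be shown with a deliberately crude toy. The real skill is model-backed; the regex patterns, function names, and the reachable callback here are all assumptions standing in for it.

```python
import re
from typing import Callable

# Illustrative sink patterns only; the real skill uses a model, not regexes.
SINK_PATTERNS = {
    "exec": re.compile(r"\b(eval|exec)\s*\("),
    "shell": re.compile(r"\b(os\.system|subprocess\.\w+)\s*\("),
    "path": re.compile(r"\bopen\s*\("),
}


def build_inventory(files: dict[str, str]) -> list[tuple[str, int, str]]:
    """Phase 1: list every candidate sink as (path, line, kind), without judging it."""
    inventory = []
    for path, source in files.items():
        for number, text in enumerate(source.splitlines(), start=1):
            for kind, pattern in SINK_PATTERNS.items():
                if pattern.search(text):
                    inventory.append((path, number, kind))
    return sorted(inventory)  # stable order, so repeat runs yield the same list


def audit(inventory, reachable: Callable[[tuple[str, int, str]], bool]):
    """Phase 2: keep only the sinks that hostile input can be traced to."""
    return [sink for sink in inventory if reachable(sink)]
```

Keeping the inventory as its own artefact is what makes the audit checkable: a reviewer can see which sinks were considered and dismissed, not only the ones that became findings.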

The data underneath

A lot of what Scrutineer has on a project before any code is read comes from ecosyste.ms. The metadata, packages, advisories, and dependents skills all query its APIs for repo metadata, every published package with its download and dependent counts, advisories already filed, and the downstream projects a vulnerability would affect. The maintainer analysis leans on registry ownership data from the same place. The dependency side is read locally rather than fetched: the dependencies and sbom skills run git-pkgs over the checkout to index every manifest in the tree and emit a CycloneDX SBOM, so the dependency graph reflects what the repo declares. A finding then arrives with context attached, how widely the package is used and who depends on it, instead of a bare line number.

Verification before disclosure

The instinct running through these skills is to rule a finding out rather than to collect it. The deep-dive keeps a sink only when it can trace hostile input to it. breaking-change and patch return unknown or refuse outright rather than commit to a wrong call, and threat-model writes down the project’s documented non-issues so they are never raised as bugs in the first place. Putting the burden of proof on the finding keeps most of the noise a model generates inside the tool rather than in someone’s inbox. And a project that documents in its SECURITY.md or threat model what it does not count as a vulnerability gives any scanner, not just this one, grounds to drop those reports at source.

The reason Scrutineer has a whole workflow rather than a report button is that a raw model finding is not something to send anyone. Every finding starts at new and moves through verification, triage, disclosure draft, and reporting, with a human gate at each step. High and critical findings first auto-enqueue a cheap read-only classifier that sorts them into true positive, false positive, already-fixed, or uncertain before any expensive work happens. A true positive on a serious finding then chains into an independent verification pass. Nothing reaches a maintainer until a person has looked at it, and when it does it arrives through GitHub’s private vulnerability reporting as a verified bug with a proposed patch attached, rather than another plausible-sounding maybe for them to disprove on a weekend.
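The workflow above amounts to a small state machine with a human gate, plus a routing rule for the automated jobs. The sketch below is one way to express it; the stage identifiers, the job names classify and verify, and both functions are assumptions rather than Scrutineer’s actual internals.

```python
from typing import Optional

STAGES = ["new", "verification", "triage", "disclosure_draft", "reporting"]


def advance(stage: str, approved_by: Optional[str]) -> str:
    """Move a finding to the next stage, but only with a named human approver."""
    if not approved_by:
        raise PermissionError("a person must approve each step")
    position = STAGES.index(stage)
    if position == len(STAGES) - 1:
        raise ValueError("finding is already at the final stage")
    return STAGES[position + 1]


def next_job(severity: str, verdict: Optional[str]) -> Optional[str]:
    """Pick the automated job to enqueue for a finding, if any.

    `verdict` is None before the classifier has run.
    """
    if severity not in ("high", "critical"):
        return None
    if verdict is None:
        return "classify"      # cheap read-only pass comes first
    if verdict == "true_positive":
        return "verify"        # independent verification pass
    return None                # false positive, already-fixed, uncertain: no automatic follow-up
```

The automation only ever enqueues work; moving a finding between stages always goes through advance, which is where the human gate lives.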

Findings in and out

Findings don’t have to originate in Scrutineer to go through its workflow. POST another scanner’s output or a pentest report to it as SARIF, CSV, markdown, or a minimal JSON shape, and each finding lands in the same triage and disclosure flow as a native one, deduplicated by content fingerprint against what’s already there. Upload a CycloneDX or SPDX SBOM and each component is resolved to a source repository and queued for scanning. What leaves is in shapes a coordinator or a registry will take: a finding exports as an OSV record or a CSAF 2.0 advisory, and a disclosure bundle packs the OSV, the CSAF, the markdown report, and the patch into one tarball for when GitHub’s private reporting isn’t the route.
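Deduplicating by content fingerprint could look something like the following. Which fields go into the fingerprint and how they are normalised are assumptions made for this sketch; the post does not say what Scrutineer actually hashes.

```python
import hashlib


def fingerprint(finding: dict) -> str:
    """Hash the fields that identify a finding, ignoring case and surrounding whitespace."""
    parts = [
        str(finding.get(key, "")).strip().lower()
        for key in ("repo", "file", "cwe", "title")  # assumed identifying fields
    ]
    # Join with a separator that cannot appear in normal text, so fields never run together.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def merge_findings(existing: list[dict], incoming: list[dict]) -> list[dict]:
    """Append incoming findings whose fingerprint is not already present."""
    seen = {fingerprint(f) for f in existing}
    merged = list(existing)
    for finding in incoming:
        digest = fingerprint(finding)
        if digest not in seen:
            seen.add(digest)
            merged.append(finding)
    return merged
```

Because the hash covers content rather than source format, the same bug reported once natively and once in an imported SARIF file collapses to a single entry.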

Try it across your ecosystem

Scrutineer is MIT licensed and the code is on GitHub. It runs locally. By default each scan runs in an ephemeral Docker container with a read-only source mount and an egress allowlist, and you point it at Claude with either a Claude Code subscription token or an Anthropic API key.

What I’d most like now is for people to run it against repositories in ecosystems I’m not living in. The pipeline leans on ecosyste.ms and the major registries, so it already reaches across npm, PyPI, RubyGems, crates.io, Go, Packagist, Hex, and NuGet, but the skills were shaped by the projects we happened to scan first. If you work in an ecosystem with its own conventions and the default skills come up short, the fix is usually a new skill file, about as small a contribution as open source asks for, and the issue tracker is the place to tell me what broke.