Downstream Testing

The information about how a library is actually used lives in the dependents’ code, not in the library’s own tests or docs. Someone downstream is parsing your error messages with a regex, or relying on the iteration order of a result set you never documented, or depending on a method you consider internal because it wasn’t marked private in a language that doesn’t enforce visibility. Hyrum’s Law says all of these implicit contracts exist once you have enough users, and semver can’t help because a version number declares what the maintainer intended, not what downstream code actually depends on. A 2023 study of Maven found that 11.58% of dependency updates contain breaking changes that impact clients, and nearly half arrived in non-major version bumps. Most library maintainers have no way to validate their version number before publishing, so the feedback loop is reactive: release, wait for bug reports, and hope the breakage wasn’t too widespread before you can cut a patch.
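To make the first of those implicit contracts concrete, here is a minimal Python sketch. Both functions are hypothetical and invented for illustration: parse_port stands in for a library whose error wording was never documented, and rejected_port stands in for downstream code that parses that wording with a regex.

```python
import re
from typing import Optional


# Hypothetical library function. The exception type is part of the
# documented API; the exact wording of the message is not.
def parse_port(value: str) -> int:
    port = int(value)
    if not 0 < port < 65536:
        raise ValueError(f"invalid port: {port}")
    return port


# Hypothetical downstream code that treats the wording as a contract.
PORT_ERROR = re.compile(r"^invalid port: (-?\d+)$")


def rejected_port(exc: ValueError) -> Optional[int]:
    """Recover the offending port number by parsing the error message."""
    match = PORT_ERROR.match(str(exc))
    return int(match.group(1)) if match else None
```

If the library rewords the message to something like "port out of range: 70000", the signature and exception type of parse_port are unchanged, yet rejected_port starts returning None. Nothing in the library's own test suite would notice; only the downstream project's tests can.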

Distributions

Debian packages declare test suites following the DEP-8 specification, and when a package is a candidate for migration from unstable to testing, the migration tool Britney triggers autopkgtest for the package and all of its reverse dependencies. A regression blocks migration, so an Expat update that causes test failures in its dependents sits in unstable until someone resolves them, and a Coq update that broke mathcomp-analysis and mathcomp-finmap did the same. The maintainer finds out who they broke and how before the change reaches anyone who didn’t opt into unstable.

Autopkgtest doesn’t check API compatibility. It runs actual test suites of actual consumers, which encode whatever implicit contracts those consumers have built against, including ones the upstream maintainer has never heard of. If library Y changes the iteration order of a hash table in a patch release and package X’s tests assumed that order was stable, migration blocks until someone decides whose assumption was wrong.

Fedora’s recent work with tmt, Packit, and Testing Farm runs downstream tests in the PR, before anything is released. The Cockpit project configured it so that opening a PR on their core library automatically runs the test suites of cockpit-podman and other dependents against the proposed change, with results showing up as status checks before merge. As they put it, “it is too late at the distro level anyway: at that point the new upstream release which includes the regression was already done, and the culprit landed possibly weeks ago already.”

When a maintainer discovers breakage in a PR, they’re still inside the change. They remember why they restructured that error path, they know which tests they considered, and the diff is right in front of them. The cost of responding to a downstream failure at this point is a few minutes of thought and maybe a revised approach. When the same breakage surfaces as an issue filed three weeks after release, the maintainer has to reload the context of the change, understand the downstream project’s usage well enough to see why it broke, decide whether to fix forward or revert, cut a new release, and hope that consumers who already pinned away will unpin. The information is the same in both cases, a downstream test failed, but the cost of acting on it scales with the distance from the change that caused it.

Debian’s autopkgtest catches breakage before migration to testing, which is better than catching it after, but the change has already been released upstream by that point. The Fedora approach catches it before the upstream release happens at all, which means the maintainer can fix it before anyone outside their own CI ever encounters it. Downstream feedback that arrives while you’re still writing the code changes how you think about the change itself. František Lachman and Cristian Le presented the PTE project at FOSDEM.

Language ecosystems

Distributions can do this because they have structural properties that language ecosystems lack: a single canonical dependency graph, a standardized test interface (DEP-8 in Debian’s case), a shared execution environment where every package builds and runs the same way, and the authority to block a release based on downstream results. npm, PyPI, and RubyGems have fragmented tooling, no standard way to invoke a package’s tests from outside its own repo, heterogeneous execution environments, and no mechanism to gate a publish on anything other than the maintainer’s own judgement. A few language ecosystems have built partial versions of downstream testing anyway, though they tend to belong to compiler teams with the resources to work around these gaps.

Rust’s crater compiles and tests every crate on crates.io against both the current and proposed compiler, then diffs the results. A recent PR adding impl From<f16> for f32 to the standard library broke 3,143 crates out of 650,587 tested. Adding a trait implementation is unambiguously backwards-compatible by semver’s rules, but it broke type inference in thousands of downstream projects because existing code depended on there being exactly one conversion path between those types. Crater caught it before it shipped, during a run that took five to six days on Linux x86_64. Without it, the Rust team would have discovered the breakage from 3,143 individual bug reports.
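The "run both, then diff" step can be sketched in a few lines of Python. This is not crater's implementation: the Status levels, the diff_runs function, and the report categories are assumptions chosen for illustration, and the inputs are plain dictionaries mapping a package name to its outcome under each toolchain.

```python
from enum import Enum
from typing import Dict, List


class Status(Enum):
    # Assumed outcome levels, ordered from worst to best.
    BUILD_FAIL = 0
    TEST_FAIL = 1
    PASS = 2


def diff_runs(
    baseline: Dict[str, Status], proposed: Dict[str, Status]
) -> Dict[str, List[str]]:
    """Compare per-package outcomes from two runs of the same package set."""
    report: Dict[str, List[str]] = {"regressed": [], "fixed": [], "unchanged": []}
    # Only packages present in both runs can be compared.
    for name in sorted(baseline.keys() & proposed.keys()):
        before, after = baseline[name], proposed[name]
        if after.value < before.value:
            report["regressed"].append(name)
        elif after.value > before.value:
            report["fixed"].append(name)
        else:
            report["unchanged"].append(name)
    return report
```

The point of diffing rather than just running the proposed version is that packages already failing on the baseline land in "unchanged" and do not pollute the regression list.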

Crater also benefits from Rust being compiled: a type inference failure shows up at build time, before any tests run. In Python, Ruby, or JavaScript, the equivalent breakage only surfaces at runtime, so you need downstream test suites that actually exercise the affected code paths, and those code paths need to be covered in the first place. The case for downstream testing is stronger in dynamic ecosystems because there’s no compile step to catch the easy ones, even though the signal is harder to get.

Node.js runs CITGM (Canary in the Goldmine), which tests about 80 curated npm packages against proposed Node versions. A refactor in Node 12 moved isFile from Stats.prototype to StatsBase.prototype, changing nothing about the public API but breaking the esm module because it walked the prototype chain directly. In a separate release, a change to the timing of a readable event on EOF broke the dicer module, which depended on that event firing synchronously.
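A rough Python analogue of the first failure shows the shape of the problem. This is not the Node.js or esm code: the class and function names are hypothetical, and Python's class __dict__ stands in for a JavaScript prototype object.

```python
class StatsBase:
    # After the refactor, the method lives on the base class.
    def is_file(self) -> bool:
        return True


class Stats(StatsBase):
    pass


def has_is_file_public(obj: object) -> bool:
    """Normal attribute lookup follows inheritance, so it is unaffected."""
    return callable(getattr(obj, "is_file", None))


def has_is_file_fragile(obj: object) -> bool:
    """Looks only at the object's own class, ignoring base classes."""
    return "is_file" in vars(type(obj))
```

Every caller that goes through ordinary lookup sees no difference after the move, while the consumer that inspected one specific level of the hierarchy stops finding the method. That second kind of dependency is invisible from the library's side.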

All of these were built by teams with dedicated infrastructure budgets and release processes. An individual library maintainer who publishes a widely-used package on npm, PyPI, or RubyGems has nothing comparable, even though they face the same problem at a different scale.

Merge confidence

Renovate’s Merge Confidence aggregates data from millions of update PRs to tell consumers whether an update is safe: how old the release is, what percentage of Renovate users have adopted it, and what percentage of updates result in passing tests. The signal comes from real test results across real projects, but it arrives after the release and flows to consumers, never back to the maintainer who shipped the change. The algorithm is private, and the underlying dataset of which updates broke which projects’ tests stays behind Mend’s paywall. Dependabot shows a compatibility score on security update PRs, calculated from CI results across other public repos that made the same update, but only when at least five candidate updates exist, and the data doesn’t flow back to the maintainer either. I’ve started indexing Dependabot PRs at dependabot.ecosyste.ms to build an open version of this signal. It doesn’t have CI data yet, but it already tracks merge percentages per update, which gives a rough proxy for how much trouble a particular version bump is causing across the ecosystem.
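A merge percentage per update is simple to compute once the PRs are indexed. The sketch below is only an illustration under stated assumptions: the UpdatePR record and merge_rates function are hypothetical and do not reflect the schema of dependabot.ecosyste.ms, and the min_samples cutoff borrows the "at least five" idea from Dependabot's compatibility score rather than describing any existing service.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class UpdatePR(NamedTuple):
    # Hypothetical record for one automated update PR.
    package: str
    from_version: str
    to_version: str
    merged: bool


UpdateKey = Tuple[str, str, str]


def merge_rates(
    prs: Iterable[UpdatePR], min_samples: int = 5
) -> Dict[UpdateKey, Optional[float]]:
    """Fraction of PRs merged per (package, from, to) version bump.

    Bumps with fewer than min_samples PRs get None rather than a
    percentage that would look more certain than it is.
    """
    counts: Dict[UpdateKey, list] = defaultdict(lambda: [0, 0])  # [merged, total]
    for pr in prs:
        key = (pr.package, pr.from_version, pr.to_version)
        counts[key][0] += int(pr.merged)
        counts[key][1] += 1
    return {
        key: (merged / total if total >= min_samples else None)
        for key, (merged, total) in counts.items()
    }
```

A low merge rate does not prove an update is broken, since PRs go unmerged for many reasons, which is why it is only a rough proxy until CI results are attached.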

Discovery

Registries track which packages declare dependencies on other packages, but applications that consume libraries are mostly invisible: a Rails app that depends on a gem won’t show up in RubyGems’ reverse dependency list, and a company’s internal service using an npm package won’t appear on npmjs.com. The maintainer’s view of their dependents is limited to whatever the registry can see, which skews heavily toward other libraries and misses the applications, which are where the stranger usage patterns and more surprising implicit contracts show up.

ecosyste.ms tracks dependents across both packages and open source repositories, scanning millions of repos on GitHub, GitLab, and other forges for manifest files that declare dependencies. A maintainer can see which applications actually use their library, which is the view you’d need to build a downstream testing system on.

Building it

This is something I want to build on top of ecosyste.ms. A maintainer connects the service to their CI, and on every PR or pre-release branch it queries ecosyste.ms for the top N dependents of the package, both libraries and applications, ranked by some combination of dependent count, download volume, and recency of commits. It clones each one, installs the proposed version of the library in place of the current release, and runs their test suite in an isolated environment. The results come back as a report on the PR: which dependents were tested, which ones regressed, what the stack traces look like, which of the maintainer’s changes likely caused each failure.
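The selection and reporting steps might look something like the following Python sketch. Everything here is an assumption made for illustration: the Dependent fields, the scoring formula, and the function names are invented, none of it reflects an existing ecosyste.ms API, and the clone, install, and test steps are hidden behind a run_tests callable supplied by the caller.

```python
from dataclasses import dataclass
from datetime import date
from math import log10
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Dependent:
    # Hypothetical metadata for one dependent package or application.
    name: str
    dependent_count: int
    downloads: int
    last_commit: date


def score(dep: Dependent, today: date) -> float:
    """One possible ranking: log-scaled popularity, discounted by staleness."""
    age_days = max((today - dep.last_commit).days, 0)
    recency = 1.0 / (1.0 + age_days / 365)
    popularity = log10(1 + dep.dependent_count) + log10(1 + dep.downloads)
    return popularity * recency


def top_dependents(deps: Iterable[Dependent], n: int, today: date) -> List[Dependent]:
    """Pick the n highest-scoring dependents to test against."""
    return sorted(deps, key=lambda d: score(d, today), reverse=True)[:n]


def downstream_report(
    deps: Iterable[Dependent], run_tests: Callable[[Dependent], bool]
) -> Dict[str, str]:
    """Run each dependent's suite against the proposed version.

    run_tests stands in for clone + install + test in an isolated
    environment and returns True when the suite passes.
    """
    return {dep.name: "pass" if run_tests(dep) else "fail" for dep in deps}
```

Log-scaling keeps a handful of enormously popular dependents from crowding out everything else, and the recency factor pushes down projects whose tests are unlikely to still run. Both choices are guesses that a real service would need to tune.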

A maintainer looking at that report before tagging a release would see things that are currently invisible to them. They’d see that popular applications parse their error messages with regex and will break if the wording changes, that a widely-used wrapper library calls a method they considered internal and were about to remove, that their optimisation to batch database calls changed the callback order in a way that two downstream projects’ integration tests depend on.

Michal Gorny’s catalogue of problems with downstream testing Python packages lays out the failure modes: test suites that modify installed files assuming they’re in a disposable container, pytest plugins in the environment causing unexpected test collection, tests requiring network access or Docker, timing-dependent assertions, floating-point precision differences across architectures, source distributions that omit test files entirely. Any service trying this across a registry would need to handle all of these gracefully, distinguishing genuine regressions from environmental noise, which is a hard problem that Debian has spent years refining with autopkgtest and still hasn’t fully solved.
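One common way to separate regressions from noise is to run each dependent's suite several times against both the current release and the proposed one. The Python sketch below assumes that approach; it is not how autopkgtest works, and the classify function, its labels, and the "baseline"/"proposed" arguments are hypothetical.

```python
from typing import Callable


def classify(run: Callable[[str], bool], attempts: int = 3) -> str:
    """Classify one dependent from repeated runs on both versions.

    run(version) executes the dependent's test suite against either
    "baseline" or "proposed" and returns True when it passes.
    """
    baseline = [run("baseline") for _ in range(attempts)]
    if not all(baseline):
        # The suite does not pass reliably even before the change, so a
        # failure on the proposed version would tell us nothing.
        return "unreliable-baseline"

    proposed = [run("proposed") for _ in range(attempts)]
    if all(proposed):
        return "pass"
    if not any(proposed):
        return "regression"
    # Mixed results on the proposed version: likely timing or network noise.
    return "flaky"
```

Repeated runs multiply the compute cost and still miss noise that is deterministic within one environment, such as floating-point differences tied to an architecture, so this covers only part of the catalogue above.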

Developer tools usually fund themselves by selling an enterprise version, but large companies facing similar coordination problems between internal teams already solved them with monorepos. When all your code lives in one tree, downstream testing is just CI: you run every affected test before merging, with no special infrastructure needed. Google, Meta, and Microsoft have invested heavily in making that work. Nobody’s going to buy an enterprise version of downstream testing when their codebase doesn’t have a “downstream,” which leaves open source maintainers as the only audience for a tool like this, and they can’t fund it.

ecosyste.ms already provides the dependent discovery, source repositories are linked from package metadata, test suites follow ecosystem conventions that are well-understood enough to automate, and container infrastructure makes isolated environments cheap. Crater and autopkgtest have proven the approach works at ecosystem scale.